
Scaling out for analytics


In a post announcing the open sourcing of Crane (more on Crane later), Bradford Cross writes

A big concern with the modern JVM languages like Scala and Clojure is the ability to scale out from the single JVM address space into distributed environments. Different approaches include a distributed JVM (terracotta), distributed actors (akka), message queues (AMQP/rabbitmq), or solutions for specific computational models, like hadoop

He also writes

For those of us who are researchers, all our jobs are ad hoc transient clusters, and the problem is even more pronounced unless we have a dedicated research cluster, for the cluster must be brought up and torn down for every job. Researchers become “babysitters” to clusters and jobs; the master serves the servant.

I have covered these areas in many of my talks and discussions with people in the scientific computing community, especially informatics, in recent months. As our data sizes grow, and with them the need to scale out in distributed environments, I find a lot of resistance to letting go of the old models, and a lack of knowledge of how to scale in general. Additionally, our usage of resources is inefficient, and too many people spend unnecessary time either babysitting infrastructure and schedulers or just waiting for resources to become available (yes, I think even grad student time is valuable).

The solution? Well, I like to think of this as a set of trends, or a movement in a particular direction. On-demand, high-scale cloud services give you the capability of using task-based clusters against single dataspaces. Even a single dataspace need only exist in the context of a project, but the general trend is, and needs to be for Big Data, to keep the data always available and then make ad hoc compute resources available when required.

The second key trend, and this one is still in its infancy in the life science community, is the move to write scale-out applications. Perhaps more important is the development of abstractions that allow the non-engineers, i.e. the informaticians and biologists, to run jobs (the former will run more ad hoc, experimental jobs) without necessarily worrying about the underlying scaling problems. We’ve talked about Hive, Pig and Cascading before. These abstract away Hadoop and allow you to use non-map-reduce programming constructs to do analytical work. Libraries like Wukong make it easier for programmers in languages like Ruby to leverage the power of Hadoop without writing Java code. Perhaps most interestingly, a whole new family of tools is coming on board that makes this space fascinating. If we can build similar platforms and abstractions for the life sciences, we’ll see much more rapid adoption. All we need is a few people who get scale.
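To make the abstraction point concrete, here is a minimal sketch in Python. It is a toy, single-process stand-in for the map, shuffle and reduce phases, not Hadoop, Hive, Pig or Cascading code. The names `map_reduce` and `count_by`, and the sample reads, are all made up for illustration. The idea is only that the person asking "how many reads per sample?" should be able to call something like `count_by` and never see a mapper or a reducer.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy in-process stand-in for the map, shuffle, and reduce phases.

    Hypothetical helper for illustration; a real framework would spread
    these phases across many machines.
    """
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record emits zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: collect values under their key.
            groups[key].append(value)
    # Reduce phase: collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_by(records, key_fn):
    """Higher-level helper: the caller never writes a mapper or reducer."""
    return map_reduce(
        records,
        mapper=lambda record: [(key_fn(record), 1)],
        reducer=lambda _key, values: sum(values),
    )


# Made-up sequencing reads, each tagged with the sample it came from.
reads = [
    {"sample": "A", "seq": "ACGT"},
    {"sample": "B", "seq": "TTGA"},
    {"sample": "A", "seq": "GGCA"},
]

reads_per_sample = count_by(reads, key_fn=lambda read: read["sample"])
print(reads_per_sample)  # {'A': 2, 'B': 1}
```

The split matters more than the code: an engineer who gets scale owns `map_reduce` and can swap in a distributed backend, while the informatician only ever touches `count_by`.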

Which brings us to Crane. Crane by itself is interesting, as it provides a Clojure-based end-to-end Hadoop solution that can be looked at as a simpler pipeline/functional take on deployment tools like Chef and Puppet, but with a focus on distributed systems on the JVM. Crane, however, is more interesting as the Hadoop backend for much of the new functionality in Incanter, which hopes to be an R-like environment built using Clojure and leveraging the power of the JVM. Earlier today there was a blog post announcing the merger of a bunch of statistical learning code from Flightcaster into Incanter, code that leverages Hadoop via Crane. You can now embed Incanter in statistical and machine learning apps. These tools are very focused on the JVM and Clojure, but I can see similar setups in other languages, targeted at life science applications. They might even embed Incanter. The key is that we need to think with a distributed mindset and also about how we can make powerful tools available to the non-engineers, or platforms that can be leveraged by experts at scale.

Fun times!!!



2 Comments

  1. Posted December 4, 2009 at 23:13 | Permalink

    I concur and think you really hit the nail on the head when you mentioned the need for domain-specific tooling. Non-engineers (and I'd even argue most engineers) will not speak raw map reduce natively. Pig, Hive, and Cascading get the job done in an elegant fashion, but in my opinion, the greatest potential for broad adoption in informatics hinges on the need for further abstraction. Rethinking algorithms that take advantage of such infrastructures out of the box (think CrossBow and Bowtie) is a step in the right direction. Bundling and sharing AMIs is another step in the right direction. Baby steps, but steps nonetheless.


