
More musings on MapReduce and bioinformatics

Jeff Dean and Sanjay Ghemawat have an updated MapReduce paper (doi) in the Communications of the ACM. The paper is a pretty strong rebuttal to some claims by Mike Stonebraker and others about the value of the MapReduce model. I am going to let you read the paper (as well as the original papers) yourself. What I want to talk about here are some of the key aspects of the MapReduce model and how this way of thinking is relevant to the life sciences.

The first point that Dean and Ghemawat talk about is heterogeneous systems. The way I see it, the entire field of bioinformatics is full of heterogeneous systems. Even data we generate in internal systems needs to be combined with data from other systems. In fact, I am pretty sure that as we improve delivery models and APIs for life science data resources, we will see more and more combining and aggregation being done without the need to maintain all the data resources internally. As the authors point out, MapReduce provides a very simple model for analyzing data in heterogeneous systems. Adding a new storage system is a relatively simple exercise, which makes it easy to process and combine data from a variety of storage systems. The other area of interest (and one I cover often in my life science talks) is the ability to aggregate data and push it into a purpose-built warehouse for offline analysis (although with systems like Hive you can do that in Hadoop as well). I suspect that's where we will see Hadoop and other MapReduce methods being used first, but eventually I see large-scale heterogeneous processing becoming increasingly popular.
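To make the heterogeneous-sources point concrete, here is a minimal in-memory sketch of the map/reduce pattern in Python. It is not Hadoop or Google's implementation. The two data sources, the field names, and every function name (`read_csv_source`, `read_api_source`, `run_mapreduce`, and so on) are assumptions invented for illustration. The idea it shows is that each storage system only needs its own reader and mapper, while the grouping and reducing step stays the same.

```python
import csv
import io
from collections import defaultdict

# Hypothetical source 1: CSV text exported from an internal system.
INTERNAL_CSV = "gene,expression\nBRCA1,2.5\nTP53,1.0\n"

# Hypothetical source 2: records as an external API might return them.
EXTERNAL_RECORDS = [
    {"symbol": "BRCA1", "variants": 3},
    {"symbol": "TP53", "variants": 7},
    {"symbol": "EGFR", "variants": 1},
]


def read_csv_source(text):
    """Reader for the CSV-backed source: yields one dict per row."""
    yield from csv.DictReader(io.StringIO(text))


def read_api_source(records):
    """Reader for the API-backed source: yields the records unchanged."""
    yield from records


def map_internal(row):
    """Map a CSV row to (gene, (field, value))."""
    yield row["gene"], ("expression", float(row["expression"]))


def map_external(record):
    """Map an API record to the same key space as the CSV rows."""
    yield record["symbol"], ("variants", record["variants"])


def reduce_gene(gene, values):
    """Merge every (field, value) pair seen for one gene."""
    return gene, dict(values)


def run_mapreduce(inputs, reducer):
    """Run each (reader, mapper) pair, group by key, then reduce."""
    groups = defaultdict(list)
    for reader, mapper in inputs:
        for record in reader:
            for key, value in mapper(record):
                groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


combined = run_mapreduce(
    [
        (read_csv_source(INTERNAL_CSV), map_internal),
        (read_api_source(EXTERNAL_RECORDS), map_external),
    ],
    reduce_gene,
)
print(combined)
```

Supporting a third storage system in this sketch would mean writing one more reader and mapper pair; `run_mapreduce` and `reduce_gene` would not change.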

Another thing I talk about a lot is fault tolerance. At scale, failure is inevitable (see slides 63-86 of my SC09 talk), and handling it is one of the core principles of the MapReduce paradigm. As we work with larger and larger data sets, we can no longer afford to restart a failed computation from scratch (I have actually seen people lose several thousand hours of compute time doing just that). Hadoop may not have all the tricks that the Google implementation has (yet, since the framework is hardly at the end of its development), but the ability to process data at scale is a critical principle and should outweigh some of its current limitations.
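The difference between restarting from scratch and re-running only what failed can be sketched in a few lines of Python. This is a toy, single-process illustration under stated assumptions, not how Hadoop or Google's scheduler actually works: `run_with_task_retries`, `flaky_worker`, and the simulated failure are all hypothetical names made up for the example.

```python
def run_with_task_retries(tasks, worker, max_attempts=3):
    """Run each task independently; re-queue only the tasks that fail.

    Results of completed tasks are kept, so one failure never forces
    the whole job to start over.
    """
    results = {}
    attempts = {}
    pending = list(tasks)
    while pending:
        task = pending.pop(0)
        attempts[task] = attempts.get(task, 0) + 1
        try:
            results[task] = worker(task)
        except Exception:
            if attempts[task] >= max_attempts:
                raise  # give up on a task that keeps failing
            pending.append(task)  # retry later; finished work is untouched
    return results, attempts


_failed_once = set()


def flaky_worker(chunk_id):
    """Hypothetical worker: chunk 2 fails on its first attempt only."""
    if chunk_id == 2 and chunk_id not in _failed_once:
        _failed_once.add(chunk_id)
        raise RuntimeError("simulated node failure")
    return chunk_id * chunk_id


results, attempts = run_with_task_retries(range(4), flaky_worker)
print(results)
print(attempts)
```

In this run, four chunks cost five task executions in total. A restart-from-scratch policy would have repeated the chunks that had already finished before the failure.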

One area that I need to look into more is serialization. Google apparently makes extensive use of protocol buffers as part of its MapReduce stack. In the GWAS and sequencing world there are formats like HDF that are not uncommon. In the Hadoop world you have Avro. I wonder if there is a way to combine Avro with the SAM/BAM formats, especially for large scale data processing and application integration.
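As a rough idea of what an Avro view of SAM data might look like, the sketch below describes the eleven mandatory SAM alignment columns as an Avro-style record schema and parses one SAM line into a matching Python dict. The schema name `SamAlignment`, the lowercase field names, and the helper `sam_line_to_record` are assumptions made for illustration, not an established mapping. The code uses only the standard library, so it does not actually produce Avro binary output; that would need an Avro library, which is deliberately left out here.

```python
import json

# Avro-style record schema for the 11 mandatory SAM alignment columns.
# The record and field names are hypothetical choices for this sketch.
SAM_RECORD_SCHEMA = {
    "type": "record",
    "name": "SamAlignment",
    "fields": [
        {"name": "qname", "type": "string"},
        {"name": "flag", "type": "int"},
        {"name": "rname", "type": "string"},
        {"name": "pos", "type": "int"},
        {"name": "mapq", "type": "int"},
        {"name": "cigar", "type": "string"},
        {"name": "rnext", "type": "string"},
        {"name": "pnext", "type": "int"},
        {"name": "tlen", "type": "int"},
        {"name": "seq", "type": "string"},
        {"name": "qual", "type": "string"},
    ],
}

# Python converters for the two Avro primitive types used above.
_CONVERTERS = {"string": str, "int": int}


def sam_line_to_record(line):
    """Parse one SAM alignment line into a dict shaped like the schema.

    Optional tag columns after the 11th are ignored in this sketch.
    """
    columns = line.rstrip("\n").split("\t")
    fields = SAM_RECORD_SCHEMA["fields"]
    if len(columns) < len(fields):
        raise ValueError("expected at least 11 tab-separated columns")
    return {
        spec["name"]: _CONVERTERS[spec["type"]](raw)
        for spec, raw in zip(fields, columns)
    }


example_line = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"
record = sam_line_to_record(example_line)
print(json.dumps(SAM_RECORD_SCHEMA["fields"][0]))
print(record)
```

A real mapping would also have to decide how to represent the header and the optional tags, which is probably where most of the design work lies.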

I still maintain that for MapReduce (specifically Hadoop) to get more popular in the life science community, you'll need the ability to abstract out the MapReduce part itself. That is why Pig, Cascading, Incanter/Crane, or life science specific frameworks in the mould of Incanter/Crane will be critical.

Please read this disclaimer

  • Disclaimer

    All opinions on this blog are my own and do not reflect those of my employers, past or present