
To handle lots of data, we need to think differently

In a recent editorial (subscription might be required) talking about next-gen sequencing and cloud computing, Nature Biotech makes an all too familiar error.

It remains unclear, however, whether the cost of routinely renting time on the cloud would be cost effective in the long term, particularly if a user intends to analyze billions of base pairs of genome sequence on a regular basis. What’s more, if the wide uptake of sequence analysis on clouds depends on the availability of user-friendly, debugged software, bioinformaticians might not be willing to spend the time to familiarize themselves with hadoop, the open source program needed to process large data sets on a cloud—especially when their jobs focus on developing algorithms for their own local computer clusters.

The context for that paragraph is the recent publication of the Crossbow paper, but the error is in the following line

bioinformaticians might not be willing to spend the time to familiarize themselves with hadoop, the open source program needed to process large data sets on a cloud—especially when their jobs focus on developing algorithms for their own local computer clusters

Using Hadoop is not about the cloud (sure, it fits the paradigm really well, especially for access to resources), but about new ways to analyze large data sets, regardless of where you are running them. The community should be moving away from a “let us throw a bigger box with more RAM” mentality towards more robust, failure-tolerant frameworks for large-scale distributed computing [1][2][3]. These were developed for a reason, to handle lots of data, so let’s use them, wherever we are operating.
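To make the "regardless of where you are running them" point concrete, here is a minimal sketch of the map/shuffle/reduce pattern that Hadoop is built around, written in plain Python and run on a single machine. It is an illustration only: the k-mer counting task, the function names and the toy reads are all assumptions of mine, not anything from the editorial or the Crossbow paper, and nothing here uses Hadoop's actual API.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)


def count_kmers(reads, k=3):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_read(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical toy reads, just to exercise the functions.
reads = ["ACGTAC", "GTACGT"]
counts = count_kmers(reads)
print(counts)
```

Each map call looks at one read and each reduce call looks at one key, so the work can be split across many machines and a failed piece can simply be rerun. That property comes from the shape of the program, not from where it is hosted, which is why the same approach applies on a local cluster as much as on a rented one.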

The next paragraph has the right approach though (emphasis mine)

Thus, for next-generation sequencing to move out of genome centers, more effort must focus on creating software compatible for use in a cloud or better still, infrastructure software (similar to Apache for web servers) that would allow community-generated software for all types of sequence analysis to be plugged into it. This approach is likely to be particularly valuable for smaller laboratories lacking software development resources. And although it will not solve all the data management and analysis problems associated with next-generation platforms, it could give many the opportunity to adopt a powerful and rapidly advancing technology that would otherwise remain out of reach

That platform approach, one that makes it easy for those outside the genome centers to access and process large quantities of data, is the right way to go.

Please read the usual disclaimer

1. Talks from SC09
2. Data platforms for science
3. Matt’s Scidata manifesto


  • Disclaimer

    All opinions on this blog are my own and do not reflect those of my employers, past or present