Facebook
36 PB of uncompressed data
2250 machines
23,000 cores
32 GB of RAM per machine
processing 80-90TB/day
Yahoo
70 PB of data in HDFS
170 PB spread across the globe
34000 servers
Processing 3 PB per day
120 TB flow through Hadoop every day
Twitter
7 TB/day into HDFS
LinkedIn
120 Billion relationships
82 Hadoop jobs daily (IIRC)
16 TB of intermediate data
2 engineers
These are just some examples from Hadoop Summit. Many of these are production systems; others are research systems. Also discussed were massive graphs (trillions of edges), insights from TBs of data ingested daily, etc. All are held together by a common thread, the Hadoop ecosystem (Hadoop is a lot more now than just an implementation of MapReduce). The next time I hear life science people complain about data volumes, shared storage, etc., I am just going to ignore them. If our data is important, we should be jumping all in and rethinking our approach to computing and data exploration.
Is Hadoop the right solution to every problem? No. Are large-scale key-value stores the right solution to every problem? No. But they are for a number of large data problems, and there is a thriving community that seems to realize that problems in computational biology are tractable and will be willing to help. I hope funding agencies, and just perhaps some innovative startup, will take the initiative here and drive forward an approach to computing that is fundamentally different from our previous approaches. It will require a lot of code rewriting, but it’s worth it.
Update: Jimmy Lin puts things in perspective with this tweet: Google was processing 20 PB of data/day in MapReduce in 2008.

10 Comments
Before you ignore the concerns of “life science people” I think you should consider the fact that all of the “data” rich resources you listed are from for-profit companies, and that “life science people” are working with biological data, rather than simply more structured relationships or status updates. Imagine trying to define all those relationships and status updates de novo. Also, I think a quick PB/profit calculation will reveal that “life science people” are doing a lot more useful work with far fewer resources.
As a “life science people” I buy that, but only to a degree. I had more resources at my disposal when I was a graduate student than I did working at a startup. And I know many startups, often bootstrapped and without funding, that know where the value lies.
Having said that, I do understand that the funding needs to come from somewhere and on the academic side, funding agencies need to recognize that.
Last but not least, all I hear are excuses, and at every workshop/conference I go to, all I hear is why we can't. No one talks about “this is how we solve this problem”.
Hear hear! Numbers that would put any traditional Enterprise Architect to shame. And this is all open source.
The only excuse they have is that Enterprises use transactions, more complex data schemas and provide *ahem* low latency.
No one I know in the life sciences has at their disposal Yahoo-like funding! If all you hear is negativity at conferences, perhaps you are going to the wrong ones.
There are a lot of enthusiastic life scientists who want MORE data, please. Take the TCGA or the Human Genome Project as examples.
Key word, useful….
Yahoo yes, but not two person bootstrapped startups (and there are plenty). In the end, even the 1000 genomes project is not that much data.
I don't understand your comparison to two person “bootstrapped” startups. If you are a startup, your job is to make money, plain and simple. Life scientists are for the most part trying to improve human health. So rather than “ignoring” the growing pains of the funding-strapped life scientists, why not engage them?
I don't think we're on the same page here, and I apologize for digressing as well. My primary complaint is that analyzing and processing large quantities of data is a solved problem, but we are not trying to do it at scale, not due to a lack of funding, but due to a lack of effort or a failure to recognize battle-tested methods, just because they come from a different industry. This was true when I was in the life science industry, but it stands out a lot more now that I am out of it.
I think it's unfair to lump all life scientists into one bucket. Certainly there are those who are not learning from other industries (perhaps because we are being “ignored”), but there are many many many other scientists who are benefiting from multi-modal engagement of multiple disciplines, and rather than ignoring the problem, we ARE engaging and making an effort to learn from others in other industries and fields. My point is, rather than ignoring (as you suggest) those who have not been brought up in the PB culture, who are trying their best to do the best with what limited resources they've been given, why not reach out and engage these people? It is certainly shortsighted to say that DNA sequencing data or ANY life science data is or can be analyzed in the same manner in which Facebook updates are “analyzed”. That's not to say that we can't learn a thing or two from computer scientists.
For the most part, I think we ARE on the same page, but the direction from which we are looking at the issue is different. That's the point! The only way we (both life scientists and computer scientists) will be successful is through cross-pollination of ideas and engagement.
Agree completely with your final point. That's one reason I switched sides to the CS side of things and will continue to evangelize till I go hoarse (to both sides). The part that frustrates me is an apparent unwillingness to take risks, to throw out the past and start afresh. It's happening, but far too slowly as we fight our own legacy.
Life science is not just academia, so I hope to see some for-profits and perhaps non-profits push the envelope and take the risks while life science funding models and priorities catch up with the needs of the community. The one thing that always makes me envious is seeing the smartest people in distributed systems spending their time on social applications or GIS (another community we can learn a lot from).