This week’s Bio-IT World meeting was all about data storage. Driven by the need to integrate complex, heterogeneous data, and most of all by next-gen sequencing, it’s amazing how much data the life sciences are generating and how poorly prepared we are. I won’t necessarily mention names, but there are places which have data hitting the petabytes AFTER throwing away most of it. How do you access this data? How do you back it up? What kind of data centers do you need? What kind of power do you need? When people are worried about whether the city can handle their power needs, there is cause for concern.
It is also why I think we need to approach scientific data generation the way Google and similar companies approach data, infrastructure and data access. What if we had a Bigtable-like distributed storage system where all this data could be uploaded? What data would be uploaded there? How would we access it? Ideally data from public genome projects would be made available as Open Data, available to everyone for downstream analysis under a CC0 or similar license. Of course there is a lot more to these data than just whole genome sequencing. There is also the challenge of the pipes the data have to travel through. These are really large files.
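To put a rough number on the pipes problem, here is a back-of-the-envelope sketch in Python. The one-petabyte dataset size and the three link speeds are illustrative assumptions, not figures from the meeting, and the calculation assumes the link is fully saturated the whole time, which real transfers rarely manage.

```python
# Back-of-the-envelope transfer times for a large sequencing dataset.
# All sizes and link speeds below are assumed values for illustration.

PETABYTE = 10**15        # decimal petabyte, in bytes
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, link_bits_per_sec: float) -> float:
    """Days needed to move size_bytes over a fully saturated link."""
    seconds = size_bytes * 8 / link_bits_per_sec
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical link speeds, in bits per second.
    links = [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]
    for label, rate in links:
        print(f"{label:>10}: {transfer_days(PETABYTE, rate):8.1f} days")
```

Under these assumptions, a single petabyte takes roughly three months to move at a sustained 1 Gbit/s, and still more than a week at 10 Gbit/s.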
Whatever the solution(s), next-gen sequencing and the resultant data glut were top of mind. And this is just the start. Personally, I think that those in small labs who want sequencers of their own really need to reconsider. Their utilization rates are unlikely to justify the cost, and they will almost certainly run into data storage, access and archiving issues, especially when something like PacBio comes online. A utility model works best here: one where people get time on machines, or access to machines hosted at core facilities.
The more I think about these issues, the more I am convinced that the life sciences really need to embrace something like CDNs. With the sheer volume and variety of data, we need people who can step up to the plate and provide the infrastructure, instead of depending on a few people who aren’t necessarily thinking about data the way Google or Yahoo do on a daily basis (although, from the looks of it, some of them are doing just that). I am especially worried about smaller groups and labs who might just get left out if we don’t develop the appropriate ecosystem.
The economics of all this? That’s another issue for another day.
Mike, it's true that we tend to default to Google, but there are dedicated CDNs out there (BitGravity, Limelight, etc.). I think what it will take here is a combination of technology and distribution models. Most of these data are going to be public. Even for the ones that are private, just moving the data around is going to be a challenge.
That said, Google has an interest in this space, so I will be very curious to see if they do something to help out at least some of the big genome centers.
I absolutely agree that there needs to be some massive, distributed data repository that is infinitely scalable and accessible at gigabit speeds to large companies and research centers. Google seems like the only place that would have the technology to do this, with their new Bigtable technology like you mentioned. It's pretty sad that Google is the first and last company we think of for this task, when there are so many other large tech companies with billions of dollars that the public just doesn't think of as being as advanced as Google.
I still remember reading a research paper by a Google engineer about PageRank, and they said that to accurately calculate it on the fly for each page load, Google essentially has a snapshot of the entire Internet in RAM across their servers. That was unbelievable to me, and is why I think Google is the one that can provide this service.
Tons of data everywhere. Do we need life science CDNs?
Further reading
Chris Dwan’s Bio-IT World presentation
The DNA Data Deluge