Scientists spend years collecting and generating ever-increasing amounts of data. The data ranges from raw instrument data, to “finished” data (e.g. a genome sequence which is constructed after aligning all the short reads from a next-gen sequencer), to annotated data, which has been marked up with additional information. We have repositories where a lot of this data goes: RCSB, NCBI, etc. In many cases these destinations are clear, and for the most part resources like RCSB and NCBI are well funded and long lived (although I am always nervous about RCSB). However, many data repositories depend on funding that comes with no guarantee of renewal. Given the size of some of these data resources, shouldn’t we be thinking of a more sustainable funding model? This is a general problem for infrastructure resources, given their cost and the fact that you shouldn’t be looking at them from a 3-5 year perspective. This especially baffles me when libraries come into play. Shouldn’t the timescale there be in the tens of years?
I’d love to hear from those in the thick of funding infrastructure resources, especially data repositories, about their concerns in this space.



2 Comments
I guess part of the problem is with the grants from the funding agencies, which include support for data generation but minimal or no long-term support for the people and hardware needed to maintain the generated data. Of course this problem did not exist 5-7 years ago, when nobody was cranking out 1000 Affy chips from a soybean population or generating metagenomic data.
There's GenBank as a repository, but please read the word “repository” aloud again. I say that because when a small to medium-sized lab gets a grant to generate a bunch of data (1000 Affy chips; I keep insisting on that example because of personal experience), it's not only about placing the data on an FTP site; it's about the analytics built around the data. For example, the lab generating the data might create a website as part of the grant, offering a mini-portal with some viewers, or algorithms to run on the data via a small computational back-end.
Now when the grant runs out, nobody will maintain that website. NCBI is only a repository, so it's up to the users to grab the deposited data and find a way to analyze it.
What is a possible solution? Well, to praise my own craft: put your Affy expression measurements or sequences on a data cloud (NCBI could become a data cloud; they have good hardware, but their API is not easy to work with and may not scale), and compute on your data using a SaaS approach. This involves machine images on the Amazon (or any other) compute cloud, with the appropriate software installed, that pull the data from the repository and do the computation as needed.
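To make the "pull the data, then compute" idea concrete, here is a minimal sketch of what the software on such a machine image might do. Everything in it is an assumption for illustration: `fetch_dataset` is a hypothetical stand-in that returns a tiny in-memory table instead of contacting a real repository, and the per-probe mean is just a placeholder for whatever analysis the lab actually cares about.

```python
import csv
import io
import statistics
from typing import Callable, Dict


def fetch_dataset(dataset_id: str) -> str:
    """Hypothetical stand-in for a repository download.

    A real worker would fetch the file for dataset_id over HTTP or FTP;
    here we return a small CSV so the sketch runs without a network.
    """
    return "probe,sample_a,sample_b\np1,2.0,4.0\np2,6.0,8.0\n"


def mean_expression(
    dataset_id: str,
    fetch: Callable[[str], str] = fetch_dataset,
) -> Dict[str, float]:
    """Pull a dataset by ID and return the mean value per probe."""
    rows = csv.DictReader(io.StringIO(fetch(dataset_id)))
    result: Dict[str, float] = {}
    for row in rows:
        # Every column except the probe identifier is treated as a sample.
        values = [float(v) for k, v in row.items() if k != "probe"]
        result[row["probe"]] = statistics.mean(values)
    return result


if __name__ == "__main__":
    print(mean_expression("demo-dataset"))
```

Passing the fetch function in as a parameter keeps the compute step independent of where the data lives, so the same worker could point at a different repository without changing the analysis code.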
As William Gibson said, “The future is already here, it's not evenly distributed yet”.
One Trackback
[...] Deepak Singh: Scientists spend years collecting and generating increasing amounts data. The data ranges from raw [...]