At SC09, in my Systems Biology talk, I spoke about platforms for data. The idea is hardly original, since I’ve written about this before, and my ideas borrow heavily from Jeff Hammerbacher and Matt Wood among others. But I wanted to add some more meat to it in writing.
Today we live in a world where we generate data from instruments, experiments, and simulations. These data can give us insights, and we want to capture those insights in the context of the data they represent, then keep track of both the data and the metadata as they change. We do this in a world where one group of people generates the data, different people care about different pieces of the follow-on insights and information, and perhaps a third set tries to put it all together. What we need are data platforms. Not one platform, but platforms that can communicate via APIs or standards (these days I lean towards being pragmatic and consistent, and using APIs and serialization), along with the ability to act upon that data and then make the new insights available. This work will be distributed, even if the data are not.
The following figure borrows from Matt and shows a data platform with a set of services and APIs that make it possible to do “work”, where work is anything that can be measured (by publications, monetary success, a drug, etc.).
As a user you want to pull in data from all these sources (the filter step, as I call it) and then take some actions. You need a dataspace where you can perform those actions with some intention. That’s where a platform like Hadoop really shines (more on that in a separate post). It doesn’t have to be Hadoop, though; any platform that allows a set of people to process and query a dataspace will do.
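To make the filter-then-act idea a bit more concrete, here is a minimal sketch in plain Python. It is not how Hadoop or any particular platform does it; the `Record` type and the `filter_step`, `annotate`, and `serialize` helpers are all names I made up for illustration, and the "dataspace" is just an in-memory list.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class Record:
    """One piece of data plus the insights attached to it (hypothetical shape)."""
    source: str                       # e.g. "instrument", "simulation"
    payload: dict                     # the raw data itself
    annotations: List[str] = field(default_factory=list)  # insights added later


def filter_step(sources: Iterable[Iterable[Record]],
                keep: Callable[[Record], bool]) -> List[Record]:
    """Pull records from every source and keep the ones the predicate accepts."""
    return [rec for src in sources for rec in src if keep(rec)]


def annotate(dataspace: List[Record],
             insight_fn: Callable[[Record], Optional[str]]) -> List[Record]:
    """Act on the dataspace: attach an insight to each record that yields one."""
    for rec in dataspace:
        insight = insight_fn(rec)
        if insight is not None:
            rec.annotations.append(insight)
    return dataspace


def serialize(dataspace: List[Record]) -> str:
    """Serialize data and insights together so another platform can consume them."""
    return json.dumps([
        {"source": r.source, "payload": r.payload, "annotations": r.annotations}
        for r in dataspace
    ])
```

With made-up sample values, usage might look like `dataspace = filter_step([instrument_records, simulation_records], lambda r: r.payload["value"] > 1.0)`, followed by `annotate(dataspace, ...)` and `serialize(dataspace)`. The point is only that the insights travel with the data they describe, and that the result leaves the dataspace in a form someone else's platform can read.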
What I’d like the community to think about is data availability and looking at data as a platform for bigger and better things. Part of that has to do with opening up that data. The other part has to do with the technical side. We don’t really have to look too far for examples. Matt already has a manifesto, and Chempedia is an excellent example of a data platform.


