
The importance of data portability, microformats and distribution

I chose the source for this post with some reservations, since I have a decent level of involvement in LCMS software as part of my day job (please see this disclaimer), but the quote is purely illustrative. The quote comes from a Bio-IT World article and is from Herbert Thiele.

The ideal would be to handle, distribute, and archive proteomics data in a data repository and incorporate the publishers of science journals to set up specific guidelines. In the past, all manufacturers had their own file formats, with software running just on the vendor’s machine. Nowadays, the vendors are participating with consortia to support initiatives in data standardization

It became very clear to me some years ago that customers of scientific software absolutely loathed proprietary data formats, and for good reason. So over the years a number of formats, either by accident or via a deliberate effort, have ended up becoming the standard for a particular field. Repositories for all the data being generated have also grown at the same time, although in some cases there is competition. All good news.

So why am I still unhappy? The title of the section is Machine-Readable Experiments. I still feel that life scientists tend to go for the “more is more” approach rather than the “less is more” approach that I believe works better. Sure, the data are extremely complex, but in an increasingly distributed world, should our efforts be focussed on centralization and complex data formats? It almost seems to me that we are missing both some simple solutions and some long-term vision.

As you have heard here before and will hear again and again, over the last few years the internet has finally been reaching a level of scale that makes the web a platform rather than a repository. Unfortunately, while the nature of the web as a platform is only going to grow, our efforts as scientists seem to be, by and large, focussed on making the web a repository. In the past, when storage was expensive and bandwidth limited, I could understand the need to store data in one or two central repositories, but is that still a valid argument? Yes, one could still have a central repository, perhaps one that has some quality standards, but along with discussions about comprehensive data formats and complex standards (which are also required), why aren’t we talking about microformats and indexing? You could have a site that indexes the web, reading microformats or other semantic information and capturing metadata. To use an analogy that I can relate to, the PDB could be an index of protein structures on the web. The structures would be hosted wherever, but some of the key metadata could be submitted to the PDB either manually or scraped by the PDB as it scours the web for microformats (a rel="hpdb" tag, just as an example). Since the data are supposed to be in a particular open format, the PDB bots could evaluate each structure for basic quality and then generate all additional data associated with the structure somewhere in the cloud or on some dedicated hardware.
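To make the indexing idea a little more concrete, here is a minimal sketch of what such a bot might do with a single page. Everything in it is an assumption made for illustration: the rel="hpdb" token is just the made-up example from above (not an actual microformat), the URLs are placeholders, and the record fields are invented. It only extracts tagged links from an HTML string using Python's standard library; fetching pages and checking structure quality are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StructureLinkParser(HTMLParser):
    """Collect <a> links whose rel attribute contains a given token.

    The default token "hpdb" is hypothetical; it is only the example
    name used in the post, not a registered microformat.
    """

    def __init__(self, base_url, rel_token="hpdb"):
        super().__init__()
        self.base_url = base_url
        self.rel_token = rel_token
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # rel can hold several space-separated tokens, e.g. "nofollow hpdb"
        rel_tokens = (attr.get("rel") or "").split()
        href = attr.get("href")
        if self.rel_token in rel_tokens and href:
            self.records.append({
                # resolve relative links against the page they were found on
                "url": urljoin(self.base_url, href),
                "title": attr.get("title") or "",
                "source_page": self.base_url,
            })


def index_page(html, base_url):
    """Return one metadata record per tagged structure link in the page."""
    parser = StructureLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.records


# Placeholder page standing in for a lab site that hosts its own structures.
SAMPLE_PAGE = """
<p>Our lab's structures:</p>
<a rel="hpdb" href="/structures/kinase.pdb" title="Example kinase">kinase</a>
<a rel="nofollow hpdb" href="https://example.org/data/protease.pdb">protease</a>
<a href="/about.html">about</a>
"""

index = index_page(SAMPLE_PAGE, "https://lab.example.edu/data/")
for record in index:
    print(record["url"], "|", record["title"])
```

The point of the sketch is that the index stores only pointers and a little metadata, while the structures themselves stay wherever their owners host them.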

Do we need this right now? Not necessarily, but I wish the great minds out there were thinking about indexing, search, distributed file systems, etc. A true scientific grid might be pure utopia, but leveraging the distributed nature of the web and focusing on aggregation and information might be a better use of some of the time spent in building gargantuan data repositories and ultra-complex data standards.

What does this bring us? It improves data distribution, since all you need is for the data to meet certain openly published guidelines. It gives some creative types the ability to develop cool ways of accessing and interacting with the data, and perhaps helps foster a culture of data sharing and data portability. It also removes the need for an organization to spend a lot of time thinking about large IT infrastructure, letting it focus instead on more core solutions (data indexing, visualization, search, etc.).

Some of you deal with very large datasets on a routine basis. What would your preference be?


This entry was posted in Informatics, Open Science, Software & Internet.

One Trackback

  1. [...] has stated his intention to post the data from his PhD online, conforming to MIAPE standards. The topic of standards and their role in a culture of data sharing was discussed by Deepak, where the web should move from a simple repository of data, to a platform [...]


  • Disclaimer

    All opinions on this blog are my own and do not reflect those of my employers, past or present