Will data be our undoing?

No, not our favorite android from Star Trek: TNG, but all those letters and numbers (and, increasingly, images) that so dominate the life of anyone involved with the biological sciences, whether as a scientist or from the business end.

In recent days there have been a number of posts around the blogosphere on data in bioinformatics. The one that especially caught my attention was Duncan’s post on nodalpoint, where he talks about the questions that Peter Norvig asked Tim Berners-Lee at the AAAI meeting (I confess to being somewhat envious of those who got to watch this Q&A session). Peter is not a big fan of the semantic web, citing a number of reasons for his negative opinion (for more, read Duncan’s post). Those objections, Berners-Lee’s response, and a couple of other posts on nodalpoint and notes from the biomass (good to see you back, Roland) got me thinking about data, information interchange, and standards.

It is no secret that many scientists and software developers have been driven batty every so often by the inability to read certain file formats in their favorite application. Since the sequencing of the genome, the number and variety of databases has proliferated. Unfortunately, the main result has been a mushrooming of data formats, and of databases with similar or complementary data types that don’t really speak the same language. I would wager that a number of the life science projects undertaken by integration service providers like IBM and SAIC deal with data-integration issues. Similarly, novel data pipelining applications have found a lot of use in trying to integrate disparate data sources in real time. The question that keeps coming up is … why?

It would seem rather obvious that using open data and communication standards (I still like the concept of semantic web approaches) would make bioinformaticians (and cheminformaticians) the world over a lot more productive. The value of data lies in the interpretations one can make, and by not adopting standards, I feel that the entire field is selling itself short. As we add more “omics” to our lexicon, it becomes even more important to develop a universal ontology for biological (and chemical) data, or at least a small set of core ontologies. Is there room for a non-profit, non-partisan organization that takes over the responsibility of maintaining these standards for the field at large? I have heard some ideas being tossed around for a while, but nothing has really caught on. In some cases, a format becomes the de facto standard, e.g. the FASTA format. In other cases, organizations can agree to merge standards, or some of the leaders in the field form an organization to facilitate sharing and establish standards. There have also been cases where the standard that an organization wants to adopt runs into difficulty. As the PDB has found out, moving from an existing, familiar standard to a new one (PDB to mmCIF) is not as simple as it sounds, since in the end it is the user community that decides which format to support.

I fear that the future will only be more confusing. Biosimulation, pharmacogenomics, systems biology, etc. are only going to muddy the waters further. What does this mean? The time to act is now. Data standards and communication are obviously on the community’s radar, but the field is still a long way from making life easy for its own members. For our own sakes, we need to sit down as a community and decide to make communication between different data sets, experiment types, and applications as easy and logical as possible.

My personal biases are towards XML-based formats and approaches, which can be developed in collaboration with the W3C. Biology is becoming increasingly web-based and communicating in the language of the web would appear to be the most logical approach. Whatever we do, we should try our best to avoid binary data formats.
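
To make the preference for text-based, XML-style formats a little more concrete, here is a minimal sketch in Python that reads FASTA-formatted text and re-expresses it as XML using only the standard library. The element names (`sequences`, `sequence`, `description`, `residues`) and the two example records are made up for illustration; they do not come from any existing standard or W3C effort, and a real format would need a schema agreed on by the community.

```python
import xml.etree.ElementTree as ET


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_xml(text):
    """Return an XML string for the records in FASTA text.

    Element and attribute names are hypothetical, chosen for this sketch.
    """
    root = ET.Element("sequences")
    for header, residues in parse_fasta(text):
        # Treat the first word of the header as the id, the rest as free text.
        identifier, _, description = header.partition(" ")
        record = ET.SubElement(root, "sequence", id=identifier)
        ET.SubElement(record, "description").text = description
        ET.SubElement(record, "residues").text = residues
    return ET.tostring(root, encoding="unicode")


# Made-up records, for illustration only.
EXAMPLE = """>seq1 hypothetical example protein
MKTAYIAKQR
QISFVKSHFS
>seq2 another made-up record
ACGTACGT
"""

if __name__ == "__main__":
    print(fasta_to_xml(EXAMPLE))
```

The point of the sketch is only that any XML-aware tool can read the result back without knowing anything about FASTA; whether XML is the right vehicle is, of course, the debate itself.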

Footnote: I have a few words to add to the whole document vs. database discussion. I agree with one of the comments on nodalpoint that part of the problem is an unfamiliarity with databases. It is a different mindset, which might be easier to reach with more usable tools. With my business hat on, I find that I work best when I query a database to return the rows and columns I want, which I can then pull into Excel for further analysis. More recently, I have automated some of these processes, but it is not too much of a leap to imagine how intimidating this might be for many people.
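
For readers who have not worked this way, the query-then-spreadsheet workflow described in the footnote can be sketched in a few lines of Python with the standard `sqlite3` and `csv` modules. Everything specific here is an assumption made for the example: the `compounds` table, its columns, and the values in it are invented, and an in-memory SQLite database stands in for whatever database is actually being queried.

```python
import csv
import io
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, made-up table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE compounds (name TEXT, target TEXT, ic50_nm REAL)"
    )
    conn.executemany(
        "INSERT INTO compounds VALUES (?, ?, ?)",
        [
            ("cmpd_a", "kinase_x", 12.5),
            ("cmpd_b", "kinase_x", 480.0),
            ("cmpd_c", "protease_y", 3.2),
        ],
    )
    return conn


def query_to_csv(conn, sql, params=()):
    """Run a query and return the result as CSV text with a header row."""
    cursor = conn.execute(sql, params)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # cursor.description holds one tuple per column; item 0 is the name.
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


if __name__ == "__main__":
    conn = build_demo_db()
    # Ask only for the rows and columns of interest.
    print(
        query_to_csv(
            conn,
            "SELECT name, ic50_nm FROM compounds "
            "WHERE target = ? ORDER BY ic50_nm",
            ("kinase_x",),
        )
    )
```

Saving the returned text to a `.csv` file gives something a spreadsheet can open directly, which is roughly the manual process described above with the copy-and-paste step removed.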

Further Reading:
The Gene Ontology
An ontology for macromolecular structure

Are the current ontologies in biology good ontologies?
An ontology for bioinformatics applications

This entry was posted in Admin, BioIT, Informatics, Life Science.

4 Trackbacks

  1. [...] Tim O’Reilly had a very provocative headline on the O’Reilly blog recently, entitled “Open Source Licenses are Obsolete.” I am still not sure I get everything that Tim was trying to say (as my comments show), but in general the argument being made is that traditional open source licenses, such as the GPL, were written for a different distribution era. Today, one has access to high-quality web services, both free and paid, that may even be developed using open source software. Tim’s call is for a clear definition of open services. Two areas that have strong implications for the bioinformatics world are the concepts of open data (which I blogged about earlier) and open APIs. The latter is especially intriguing. Many services, even paid ones like Backpack, provide open APIs, which can be used by developers for a number of purposes, e.g. website integration. I would like to see an unambiguous license for access to and use of open APIs; otherwise, I can see things getting very complicated at some point. Bioinformaticians are used to taking advantage of web services at sites like NCBI and Ensembl, so how the community handles open APIs will definitely have an impact. [...]

  2. [...] ISMB included a New Frontiers session at this year’s meeting, and Bioinform has a recap of the events. The speakers included the aforementioned Janet Thornton, Chris Sander, Amos Bairoch, Phil Bourne and Søren Brunak. While a number of subjects were discussed at the event, I was somewhat surprised by the lack of anything substantially new. I got the impression that the field is at a crossroads. I am biased as well, but it is indeed time that experimental biology and computational biology became inextricably linked, a topic covered by many of the speakers. Bairoch gets credit for the boldest move, suggesting that all grant proposals for large-scale experimental projects include a portion for data storage and management. Any such requirement, once again, cries out for the need to standardize data. [...]

  3. By business|bytes|genes|molecules on March 22, 2007 at 08:11

    [...] Further reading: Matthew Ingram; Inspiration in the keys of JG and TBL; Will data be our undoing? [...]

  4. [...] Further reading: Will data be our undoing; Biology, computing and web services; Biological content, access and monetization [...]

  • Disclaimer

    All opinions on this blog are my own and do not reflect those of my employers, past or present