Will data be our undoing?

No, not our favorite android from Star Trek: TNG, but all those letters and numbers (and increasingly images) that so dominate the life of anyone involved with the biological sciences, whether as a scientist or on the business end.

In recent days there have been a number of posts around the blogosphere on data in bioinformatics. The one that especially caught my attention was Duncan’s post on nodalpoint, where he talks about the questions that Peter Norvig asked Tim Berners-Lee at the AAAI meeting (I confess to being somewhat envious of those who got to watch this Q&A session). Peter is not a big fan of the semantic web, citing a number of reasons for his negative opinion (for more, read Duncan’s post). Those objections, Sir Tim’s response, and a couple of other posts on nodalpoint and notes from the biomass (good to see you back, Roland) got me thinking about data, information interchange, and standards.

It is no secret that many scientists and software developers have been driven batty every so often by the inability to read certain file formats in their favorite application. Since the sequencing of the genome, the number and variety of databases has proliferated. Unfortunately, all this has done is produce a mushrooming of data formats, and of databases with similar or complementary data types that don’t really speak the same language. I would wager that a number of the life science projects undertaken by integration service providers like IBM and SAIC deal with data integration issues. Similarly, novel data pipelining applications have found a lot of use in trying to integrate disparate data sources in real time. The question that keeps coming up is … why?

It would seem rather obvious that using open data and communication standards (I still like the concept of semantic web approaches) would make bioinformaticians (and cheminformaticians) the world over a lot more productive. The value of data lies in the interpretations one can make, and by not adopting standards, I feel that the entire field is selling itself short. As we add more “omics” to our lexicon, it becomes even more important to develop a universal ontology for biological (and chemical) data (or a small set of core ontologies). Is there room for a non-profit, non-partisan organization that takes over the responsibility of maintaining these standards for the field at large? I have heard some ideas being tossed around for a while, but nothing has really caught on. In some cases, a format becomes the de facto standard, e.g. the FASTA format. In other cases organizations can agree to merge standards, or some of the leaders in the field form an organization to facilitate sharing and establish standards. There have also been cases where the standard that an organization wants to adopt runs into difficulty. As the PDB has found out, moving from an existing, familiar standard to a new one (PDB to mmCIF) is not as simple as it sounds, since in the end it is the user community that decides to support one format over the other.

I fear that the future will only be more confusing. Biosimulation, pharmacogenomics, systems biology, etc. are only going to muddy the waters further. What does this mean? It means the time to act is now. Data standards and communication are obviously on the community’s radar, but the field is still a long way from making life easy for its own members. For our own sakes we need to sit down as a community and decide to take action to make communication between different data, experiment types and applications as easy and logical as possible.

My personal biases are towards XML-based formats and approaches, which can be developed in collaboration with the W3C. Biology is becoming increasingly web-based and communicating in the language of the web would appear to be the most logical approach. Whatever we do, we should try our best to avoid binary data formats.
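
To make the idea a little more concrete, here is a minimal sketch in Python of what moving from a de facto flat format to an XML representation might look like. It reads FASTA text and wraps each record in XML using only the standard library. The element names (sequences, sequence, description, residues) and the sample records are my own inventions for illustration; they are not taken from any W3C or community standard, and a real schema would need far more thought.

```python
import xml.etree.ElementTree as ET


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_xml(text):
    """Serialize FASTA records as XML (hypothetical element names)."""
    root = ET.Element("sequences")
    for header, seq in parse_fasta(text):
        # First token of the header is treated as the ID, the rest as free text.
        parts = header.split(None, 1)
        record = ET.SubElement(root, "sequence", id=parts[0] if parts else "")
        ET.SubElement(record, "description").text = parts[1] if len(parts) > 1 else ""
        ET.SubElement(record, "residues").text = seq
    return ET.tostring(root, encoding="unicode")


# Made-up example input, not real sequence data.
example = ">seq1 hypothetical example protein\nMKTAYIAK\nQRQISFVK\n>seq2\nACGT\n"
xml_text = fasta_to_xml(example)
print(xml_text)
```

The point is not this particular layout but that once the data is in a self-describing text format, any XML-aware tool can read it back without a custom parser.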

Footnote: I have a few words to add to the whole document vs. database discussion. I agree with one of the comments on nodalpoint that part of the problem is an unfamiliarity with databases. It is a different mindset, which might be easier to reach with more usable tools. With my business hat on, I find that I work best when I query a database to return the rows and columns I want, which I can then pull into Excel for further analysis. More recently, I have automated some of these processes, but it is not too much of a leap to imagine how intimidating that might be for many people.
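
For the curious, the kind of automation I have in mind can be sketched roughly as follows: run a query, then write the resulting rows and columns out as CSV, which Excel opens directly. This uses an in-memory SQLite database; the table name, column names and values are all invented for the example, and a real setup would point at whatever database you actually use.

```python
import csv
import io
import sqlite3


def query_to_csv(conn, sql, params=()):
    """Run a query and return the result as CSV text with a header row."""
    cur = conn.execute(sql, params)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # cursor.description holds one tuple per column; the name is the first item.
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)
    return buf.getvalue()


# Hypothetical table and made-up values, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assays (compound TEXT, target TEXT, ic50_nm REAL)")
conn.executemany(
    "INSERT INTO assays VALUES (?, ?, ?)",
    [
        ("cmpd-1", "kinase-A", 12.5),
        ("cmpd-2", "kinase-A", 480.0),
        ("cmpd-3", "kinase-B", 33.0),
    ],
)

csv_text = query_to_csv(
    conn,
    "SELECT compound, ic50_nm FROM assays WHERE target = ? ORDER BY ic50_nm",
    ("kinase-A",),
)
print(csv_text)
```

Writing csv_text to a file ending in .csv is all it takes to get the result into a spreadsheet.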

Further Reading:
The Gene Ontology
An ontology for macromolecular structure

Are the current ontologies in biology good ontologies?
An ontology for bioinformatics applications
