
What’s scientific data all about?

Interesting post by Frank Gibson on the Triumvirate of Scientific Data. He postulates that there are three properties of scientific data that are fundamental to the understanding of the curation, standardization and representation of scientific data, specifically Content, Syntax and Semantics.

I think Frank encapsulates many of the points that we have been talking about for a while. I am not sure I agree with the labels as they stand, but the key points are the problems we are trying to address, and I believe that to solve them there are a few things we should not forget:

1. Context. A key for any piece of scientific data is context. What does a particular protein mean in the context of the problem we are trying to solve, or the question we are trying to answer? I’ve seen many cases where the data and analysis centered on a particular technology. We need to move away from that. This is where semantics comes into play.

2. Data standards. We need our vocabularies and standards. These don’t have to be dictated by committees, but we need to come up with some loose set of community rules that we can all agree on. Scientists tend to come up with their own solution to a problem, and I think it’s because we are looking at science inside out. I’ve heard people talk about a particular piece of software as their window into their research. In other words, they did not look at it from a long-term perspective. We need to change that mindset.

3. Data services. As we go forward, how we access this data will be critical. In a world of data distributed across data centers, especially as we get fatter pipes to move bits around, we can perhaps take advantage of peer-to-peer networks (personally, I think once we get the data to the edge locations, p2p is not that critical for biological data sets). Data can then be delivered where it needs to be via the appropriate services, and even better, we can take our compute right to the data based on various criteria, instead of trying to move huge datasets around. We aren’t there yet, but if we want global data access, we need to move towards data services.

One topic where I often disagree with people is curation. Many believe we need humans to curate data. I don’t have an answer, but I don’t believe that humans scale, and we will run into scaling issues at some point.


This entry was posted in Informatics, Life Science.
  • davetribbett
    Good post. Here are a couple of posts that link digital content to the library and discuss the importance of curating in a digital age. The librarians need to somehow get control of the important content, otherwise it's the wild west! Here are the links:
    Curating content
    and
    Information and Content topics
  • cariaso
    "One topic where I often disagree with people is curation. Many believe we need humans to curate data. I don’t have an answer, but don’t believe that humans scale and we will run into scale issues at some point."

    Where is there a need to choose? SNPedia looks like it's for humans, but much of the content is written by external bots. It isn't a choice between them -- software needs to do the heavy lifting AND make it easy for people to curate.
  • Great point. I've run into many people who think that humans have to examine results to trust the quality. SNPedia is actually a great example of what I am talking about, since it does take care of much of the heavy lifting as you say, while making human curation easier.
  • Disclaimer

    All opinions on this blog are my own and do not reflect those of my employers, past or present