Data produced, analyzed and consumed. The impact of big science

When genome centers have to start thinking about large-scale data center operations, you know something is different. In Science Big, Science Connected, I talked about how the availability of high-throughput instruments has fundamentally changed our approach to science. On Coast to Coast Bio, Hari and I often argue about whether this is for the better (I like big science, he isn’t as fond of it). In the end, those differences boil down to funding priorities.
Triumvirate
The fact remains that today we are moving towards a clear separation between data producers, data consumers and methods developers. There was a time when a small group of people could cover all that ground, but with the industrialization of data production (microarrays are already there, mass specs and sequencers not quite yet), traditional roles, even in an academic setting, are no longer efficient. That is why core facilities and service providers have become so common. In the case of microarray gene expression data, the day will come when we will have all the data we need in databases, and our challenge will be that of querying, analysis, integration and interpretation. The question I keep asking myself is how this impacts collaboration. Will one lab with a sequencer be willing to let other labs get at the raw data? By themselves, they will only be able to make use of a portion of all the data generated. The model I’d like to see is the one that many dislike: genome centers and core facilities churning out data; a set of data consumers (individual labs, research centers) supplementing that data with some of their own, but mostly trying to get biological meaning out of it; and methods developers bringing it all together by pushing the boundaries of current methodology. Is it possible to develop systems that allow us to design experiments after all the data have been collected? Would it be possible to do good science, even if such systems existed? We’ll just have to see.

One thing this does not do is take us away from the scientific method. All that data improves our statistical scores, but what it really does is allow us to ask new questions. Otherwise, it’s just not very useful. That part never changes.

Update: Neil says

Next time that tedious “what is bioinformatics, is it a field?” issue comes up, I’ll point the offender to this. Bioinformatician = professional data consumer. The time when a biologist armed only with the student t-test and an Excel spreadsheet could analyse their data is gone. Give it to us instead :-)

The way I look at it, data consumers are of two kinds: the power consumer, which is what Neil is referring to and whose output is what biologists should be using, and the casual consumer, i.e. the biologist who can use well-written tools to glean some insights as well, potentially working with their friendly neighborhood power consumer, aka the informatics guru. Regardless, one should not overlook the importance of good software, both to make it easy for biologists to ask questions and to allow the bioinformatician to be more effective, since you don’t want them doing random glue work.

This entry was posted in Big Data, Informatics, Life Science.

9 Comments

  1. Denubis
    Posted March 31, 2009 at 23:56 | Permalink

    Vinge in Rainbows End actually addresses it. How do his “predictions of the present” correlate with these observations?

  2. Gary Stiehr
    Posted April 1, 2009 at 00:35 | Permalink

    There is no doubt that things have changed. At The Genome Center at Washington University, we've seen tremendous growth over the last two years or so as the “next-gen” sequencers came online. The dramatically increased disk and computational capacity requirements quickly filled our existing data center and drove us to build our current large scale data center. Originally, we were a big data producer but quickly recognized the benefit of also consuming that data and doing further analysis on it (which resulted in some interesting findings about cancer genomics published in Nature last year). Interestingly, we and other centers may also be able to share that data out through efforts such as caBIG and other data grids. Of course there are a lot of questions to answer before we can do that.

  3. Posted April 1, 2009 at 12:52 | Permalink

    Interesting article – I like the new 'definition' of bioinformaticians. We've always been consumers of data, haven't we? Even when the genome project was in its infancy, we were still the ones who wrote the code that allowed our end users to ask interesting questions. We have always relied on high quality data to get to those interesting questions, but … where's the curation?

  4. Posted April 1, 2009 at 17:54 | Permalink

    Somehow I long for the days when scientists were all of the above: method developers, data producers and data consumers.
    When I think data producer, I think of projects like structural genomics and the HapMap project. My personal biases prompt me to label structural genomics as a bad example of data production and the HapMap as a great example of data production.
    All of my opposition to big science stems from the fact that if you are spending 10 million dollars, I would rather the money be given to 10 small labs that do all of the above than to one lab that does one of the above and does it badly.
    I am convinced that true innovation and leaps in our understanding come from small labs that enjoy a good level of funding. The big projects are useful, but should be invested in only after a lot of debate and only if a whole bunch of small labs can put together a good reason to form a consortium that produces data.
    Since funding is tight, I would reduce or even do away with big science entirely and instead spread the wealth and see the magic happen.

  5. Posted April 1, 2009 at 19:35 | Permalink

    Hari

    A lot of what it takes to get big data production done is grunt work. It's industrial-strength production. The MUCK, as we would call it in these parts. Small labs should not be spending their time learning how to run production systems that need to be optimized for cost and speed. It's just a waste of resources. The consortium idea doesn't work either. Too many egos, not enough efficiencies.

  6. Posted April 3, 2009 at 11:34 | Permalink

    I have to echo Gary's comments. As someone who has been doing arrays since they popped into existence and now doing deep sequencing, it's moved away from a “one person can do it all” era. In my mind the data is much better served if I can collaborate with someone who specializes in writing scripts and computational biology rather than me learning that as well as the molecular biology I already specialize in.

    Patrick

  7. Anonymous
    Posted April 3, 2009 at 23:51 | Permalink

    i’ve tried to leave this comment twice before; maybe third time’s the charm. i agree with the sentiment. managing data is no longer something that every practitioner/data producer should have to do. not only is it inefficient, it also increases the likelihood that domain-specific data silos will be unable to interact successfully. bioinformatics should be concerned with creating the infrastructure to support data storage, manipulation, and analysis both within and across domains, as well as the software for data producers and users to interact with that infrastructure. i diagram the ecology here (http://www.flavourcountryfeedlot.com/2009/04/data-ecologies.html) and would appreciate thoughts.

  8. deirelbahri
    Posted April 3, 2009 at 19:27 | Permalink

    i agree with the sentiment. data management is too complicated for everyone to be a specialist at it but, more than that, having domain specialists manage their own data makes it more likely that the datasets never talk across domains. i think a strong case can be made to think of data management as a set of infrastructure services, and data producers and data consumers as conceptually separate groups that require different sets of software tools to interact appropriately with the data management infrastructure. (this is not to say that data producers cannot also be data consumers, only that their software needs are different depending on how they interact with any data management system.) i've got a diagram of this data ecology that i'd love thoughts on: http://www.flavourcountryfeedlot.com/2009/04/da...


  • Disclaimer

    All opinions on this blog are my own and do not reflect those of my employers, past or present