Tons of data everywhere. Do we need life science CDNs?

May 3, 2008

This week’s Bio-IT World meeting was all about data storage. Driven by the need to integrate complex, heterogeneous data and, most of all, by next gen sequencing, the life sciences are generating an amazing amount of data, and we are poorly prepared for it. I won’t mention names, but there are places with data hitting the petabytes AFTER throwing most of it away. How do you access this data? How do you back it up? What kind of data centers do you need? What kind of power do you need? When people worry about whether the city can handle their power needs, there is cause for concern.

It is also why I think the future of scientific data generation needs to be thought about the way Google and others think about data, infrastructure and data access. What if we had a Bigtable-like distributed storage system where all this data could be uploaded? What data would go there? How would we access it? Ideally, data from public genome projects would be made available as Open Data, open to everyone for downstream analysis under a CC0 or similar license. Of course, there is a lot more to these data than just whole genome sequencing. There is also the challenge of the pipes the data needs to travel through. These are really large files.
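To get a rough feel for the pipes problem, here is a back-of-the-envelope sketch in Python. The function name, the 80% efficiency default, and the example sizes and link speeds are all illustrative assumptions, not figures from the meeting.

```python
def transfer_hours(size_terabytes: float,
                   link_megabits_per_s: float,
                   efficiency: float = 0.8) -> float:
    """Estimate hours needed to move a data set over a network link.

    size_terabytes      -- data set size in decimal terabytes (1 TB = 1e12 bytes)
    link_megabits_per_s -- nominal link speed in megabits per second
    efficiency          -- assumed fraction of nominal speed actually sustained
    """
    if size_terabytes < 0:
        raise ValueError("size_terabytes must be non-negative")
    if link_megabits_per_s <= 0:
        raise ValueError("link_megabits_per_s must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")

    total_bits = size_terabytes * 1e12 * 8
    effective_bits_per_s = link_megabits_per_s * 1e6 * efficiency
    return total_bits / effective_bits_per_s / 3600


if __name__ == "__main__":
    # Hypothetical scenarios: (label, size in TB, link speed in Mbit/s)
    scenarios = [
        ("1 TB over 100 Mbit/s", 1, 100),
        ("1 TB over 1 Gbit/s", 1, 1000),
        ("1 PB over 1 Gbit/s", 1000, 1000),
    ]
    for label, size_tb, mbps in scenarios:
        hours = transfer_hours(size_tb, mbps)
        print(f"{label}: {hours:,.1f} hours ({hours / 24:,.1f} days)")
```

Even at full efficiency (efficiency=1.0), a single terabyte over a dedicated 1 Gbit/s link takes a little over two hours, and a petabyte takes roughly three months. Real numbers will depend on the actual link, protocol overhead and how much of the pipe you get to yourself.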

Whatever the solution(s), next gen sequencing and the resultant data glut were top of mind. And this is just the start. Personally, I think that small labs who want their own sequencers really need to reconsider. Their utilization rates are unlikely to justify the cost, and they will almost certainly run into data storage, access and archiving issues, especially when something like PacBio comes online. A utility model works best here: a model where people get time on machines, or access to machines hosted at core facilities, etc.

The more I think about these issues, the more I am convinced that the life sciences really need to embrace something like CDNs. With the sheer volume and variety of data, we need people who can step up to the plate and provide the infrastructure, instead of depending on a few people who aren’t necessarily thinking about data the way Google or Yahoo do on a daily basis (although it looks like some of them are starting to do just that). I am especially worried about smaller groups and labs who might get left out if we don’t develop the appropriate ecosystem.

The economics of all this? That’s another issue for another day.

Further reading
Chris Dwan’s Bio-IT World presentation
The DNA Data Deluge

Bio-IT World day 2 - iPhones, Virtualization, EC2 and the Semantic Web

April 30, 2008

A quick report on Day 2 of Bio-IT World.

The day started with a keynote by Josh Boger, founder and CEO of Vertex. His talk spanned several real-world examples and offered some food for thought. Highlights:

  • Vertex has made active use of a MedChem ELN, which has been extended to their entire MedChem community, including external partners. In his own words the goal was “enabling the virtual research organization”
  • Metric of success was user adoption and there were some good analytics supporting uptake
  • He spoke at length about the HCV program, where they have used extensive predictive modeling and simulation
  • Clinical data has backed up their predictive modeling (they’re in Phase III now)
  • They have avoided some experiments (carried out by competitors in one case) that their models suggested they avoid
  • He ended by talking a lot about communication and how technology can impact the healthcare system. Much of this section of his talk was around the iPhone. For example how the iPhone can be used to track RFID tagged pill bottles, patient exercise regimens, carry patient records, monitor weight, etc. They’re actually implementing some of these ideas

There were many other talks to attend, and I won’t bore you with the details, but I will mention one: a talk by Chris Dagdigian of The BioTeam, a small boutique consulting shop that readers of this blog will know via mentions of Michael Cariaso. Chris spent a lot of his talk discussing the economics of storage, the kinds of storage available these days, and trends in storage and computing. Perhaps it shows how much of a geek I am, but this was a dream talk, one full of hardware specs, pictures of data centers, etc. It is clear that virtualization is big, with Xen being Chris’ preference. There was a cool slide on meta-virtualization (a virtual machine inside a virtual machine inside a virtual machine). Two thoughts really resonated with me. The first was his distaste for classical Grid Computing, which I have long considered impractical for most companies. The second was his strong support for Amazon Web Services, especially EC2. Apparently, every single BioTeam consultant has independently deployed an EC2 solution, i.e. they’ve all come to the same conclusion. Can’t wait to see this talk next year to find out where they’ve gone with AWS. Something else that resonated was his point about the death of the small cluster: today and in the future, we will either have multicore (8-16 cores) on our desktops or dial up cloud resources. His slides will be available somewhere. Can’t wait to get my hands on them. This was a GREAT talk.

One of the highlights for me was attending the W3C Semantic Web HCLSIG lunch. I got to meet people I know (Eric Neumann), people I have interacted with online (Vipul Kashyap) and followed (John Wilbanks from Science Commons). And I got to say hello to Sir Tim Berners-Lee, who needs no introduction.

Another highlight for me: I finally got to meet Joe Landman, whose JackRabbit got a good plug in the BioTeam talk as well. It was great to meet Joe, with whom I’ve been having a conversation via our respective blogs for quite a while now.

Met several former colleagues and customers as well. Bio-IT World has definitely been one of the better conferences I have had a chance to attend in terms of interest and people.

Bio-IT World Day 1 - Visualization, the cloud and people

April 29, 2008

Detailed blog posts will follow when I have some additional cycles, but I thought I’d share some quick thoughts on day 1 of Bio-IT World. My conference started with a workshop on data visualization, which was mostly about the importance of visualization for making sense of multidimensional data sets and what kinds of visualizations could be done. My takeaways from the talks:

  • There was a distinction made between statistical methods and data mining and presenting information to humans.
  • Life science data is inherently multiscalar and reducing dimensions without losing information or creating artifacts is not trivial
  • It is important to create systems that can help scientists go through a workflow and predict visualizations, and help guide the user to the most appropriate visualization for the relevant questions
  • APIs are important for Pfizer. If a full API is not available, they are not interested in a visualization package
  • And last but not least, as I Twittered during the workshop, they need to invite Ben Fry to give a talk on visualization. I am sure he would have a lot to contribute

Perhaps the highlight was the keynote by John Reynders from Johnson and Johnson PRD. He gave us a tour of his experiences through his career, including his time at Los Alamos. The talk was not in any great depth, but I left it very encouraged. Encouraged that the head of an IT organization at a large pharma company understood the value of collaboration, understood that innovation happens everywhere and needs to be tapped appropriately, and understood that a lot of information is pre-competitive and should be shared across companies. Other things he talked about:

  1. The cloud :). There was a slide on how to dial up storage and cycles, with AWS prominently mentioned
  2. Collective intelligence. He spent a lot of time on this, from knowledge and innovation networks to connecting people internally and finding new ways to make tools available to them. There was a suitable amount of web 2.0 jargon and frequent mention of the Semantic Web as essential to the life sciences.
  3. We have the compute power, but the gap comes from the software.
  4. He also warned about getting too caught up in the technology and losing sight of the problem

Would have been nice to have open data mentioned explicitly, but he clearly said that pharma needs to appreciate data and information sharing.

Bio-IT World means meeting old friends, especially from my Accelrys days, as well as finally meeting people I admire from my online life, with a special shoutout to Michael Cariaso.

On tap on Day 2 - Electronic Data Capture, high throughput data management, supercomputing and a W3C lunch

Viral forecasting

February 18, 2008

Nathan Wolfe’s work on disease surveillance is fascinating. I wonder what kind of role informatics can play here.

My brain is mashing up iPhones, stream querying, triple stores and maps

Further reading
Streambase: Query your streaming data
The CDC embraces the Semantic Web

Required: A transparent science machine

December 16, 2007

As we build collective intelligence applications, which are the heart of Web 2.0, we need to make sure that their inner workings remain open and transparent, or we may fall into the same trap that ended up bedeviling Wall Street in this past year, in which no one understood any longer just how the machine they’d built was going to perform, and the Golem was out of control.

The above quote comes from a post by Tim O’Reilly where he discusses politics and Warren Buffett. It struck me as telling that the peer review process and the academic model also form an analogous “machine”. I am not an academic, so I could be wrong, but from what I read and hear, and from my memories as a graduate student, there appears to be a lack of transparency in many processes. What are the criteria for tenure? What types of publications are expected? What are the metrics behind peer review?

I fear that once science loses credibility, getting it back is going to be very, very hard. We don’t want that, do we?
