Streambase: Query your streaming data
February 15, 2008
Over at the Money:Tech conference (O’Reilly does organize some of the coolest conferences), a couple of talks caught my eye, notably one by Michael Stonebraker on Streambase and another by Steve Skiena on using computer simulations and mathematical modeling to make bets (you can find the slides on the web site). After going through the slides, I wondered: is there anything there that would be useful for the life sciences? My first reaction was, not really. Michael’s talk on processing streams of data does not really fit the timescales that scientific data usually falls into (his other company, Vertica, on the other hand, fits right in). Steve’s models also don’t really fit anywhere in the drug hunting world (at least in my brain).
But how about post-marketing? We track epidemics, we track adverse events, but we don’t really do a good job of it. We also try to predict the success of a particular drug, both from a scientific/medical perspective and a financial one. I am sure pharma companies use advanced financial modeling to try to predict drug success, so I won’t talk about that too much here, but I would like to ruminate over stream processing.

The above figure (taken from Michael’s talk) shows that in stream processing, you essentially have a stream of input data (event data) coming in on one side and coming out the other side as alerts and actions based on certain queries. I feel that in a hospital, emergency, or pandemic situation such a system might be very useful. In a safety situation, where the flow of data is not necessarily that fast, would such a system be overkill? We are constantly collecting data globally, and real-time results might be critical to surfacing an adverse event signal.
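To make the idea a bit more concrete, here is a minimal sketch of a standing query over an event stream, written in plain Python rather than in Streambase's own tooling. Everything in it is an assumption for illustration: the Event and Alert records, the adverse_event_alerts function, the window length, and the threshold are all hypothetical, and the sketch assumes reports arrive in timestamp order.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    """A hypothetical adverse event report arriving on the stream."""
    timestamp: float  # seconds since an arbitrary start
    drug: str
    reaction: str


class Alert(NamedTuple):
    """Emitted when a drug collects too many reports within the window."""
    timestamp: float
    drug: str
    count: int


def adverse_event_alerts(
    events: Iterable[Event], window: float, threshold: int
) -> Iterator[Alert]:
    """Yield an alert whenever a drug has at least `threshold` reports
    within the last `window` seconds. Assumes events are time-ordered."""
    recent: Dict[str, Deque[float]] = defaultdict(deque)
    for event in events:
        times = recent[event.drug]
        times.append(event.timestamp)
        # Discard reports that have fallen out of the sliding window.
        while times and times[0] <= event.timestamp - window:
            times.popleft()
        if len(times) >= threshold:
            yield Alert(event.timestamp, event.drug, len(times))


# Made-up example stream: three reports for drug_a within one minute.
stream = [
    Event(0, "drug_a", "rash"),
    Event(10, "drug_b", "nausea"),
    Event(20, "drug_a", "rash"),
    Event(30, "drug_a", "fever"),
    Event(500, "drug_a", "rash"),
]
alerts = list(adverse_event_alerts(stream, window=60, threshold=3))
```

The point of the sketch is only that the query stays fixed while the data flows past it, the reverse of a database where the data sits still and queries come and go. Because the function is a generator, it would work just as well on an unbounded feed as on this short list.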
Just some thoughts. We have been using the same old methods for a very long time and they, at least from my vantage point, need a refresh with innovative, nimble methods. It’s fascinating to see what people are thinking about in other fields. I am sure they can learn a lot from the life sciences as well.
Technorati Tags: Stream Processing, Data Analysis, Biomedical informatics
The value of data version 45445
February 11, 2008
As I have mentioned before, I often do stuff with the Talks with Talis podcasts playing in the background. In one of them, Jamie Taylor from Metaweb talks at length about Freebase. Something he said while discussing data as intellectual property (all data on Freebase is public) stuck with me:
It is the services that you deliver around data that is of value. That is where the value lies
He went on to talk about companies/organizations needing to analyze their data assets within the appropriate context. Most data is only contextual, and does not differentiate you from the market, nor is it your core value. Needless to say, I couldn’t agree more. Different people look at that data in different contexts, thus deriving different types of value. While this is not universally true (some data must be kept hidden, but that’s a small and shrinking amount), by having data in the open, more people and organizations derive greater value, i.e. the overall value pool actually increases. For scientific data that’s especially true. Can you imagine where we’d be today without access to all the publicly available datasets that all life scientists use routinely without even thinking twice? The value comes from the questions being asked and the additional data being generated within the companies, which bring that public data to life.
Aside: Is it sad that one of the best arguments for open science and interoperability comes from a non-scientist?
Technorati Tags: Freebase, Open Data
Network bandwidths
January 5, 2008
In Google and large scientific datasets, we talked about Google shipping a drive array to scientific labs where they could slurp terabytes of data and send it off to Google. During the talk Jon Trowbridge made the statement that “Moore’s law is for wimps” compared to the exponential increase in storage capacity. The problem in this whole argument? Network bandwidth sucks! Which is why Google resorts to the mailman to get data from source to the cloud.
Today, Vinnie Mirchandani asks
Can our networks keep up with the storage growth?
The context is information that private sector data is expected to go from 2,500 petabytes to 27,000 petabytes in 2010! We are already aware of the sheer size of many scientific datasets, especially when images are involved. And as next-gen instruments come online, as more companies adopt an array of high throughput methods, and especially as microfluidics and multiplexed automated analysis instruments and devices become more common, the data glut is going to become a data deluge. If we are to make the web the hub of information exchange and application distribution, network bandwidth, or lack thereof, is going to be the rate-limiting step. I am pretty sure that a number of companies and organizations are thinking long and hard about this problem, but we need to get network bandwidths way beyond where they are right now, and available inexpensively. What kind of bandwidths can we generally expect over the next decade? Maybe I should take a harder look at the state of network infrastructure.
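To put rough numbers on the mailman argument, here is a back-of-the-envelope sketch of how long a dataset takes to move over a link. The transfer_seconds helper, the 1 terabyte dataset, and the link speeds are all illustrative assumptions, not measurements, and the calculation ignores protocol overhead unless you lower the efficiency parameter.

```python
def transfer_seconds(size_bytes, link_bits_per_second, efficiency=1.0):
    """Idealized time to push size_bytes through a link.

    efficiency is the fraction of the nominal link speed actually
    achieved (1.0 means a perfect, uncontended link).
    """
    if link_bits_per_second <= 0:
        raise ValueError("link speed must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return size_bytes * 8 / (link_bits_per_second * efficiency)


TERABYTE = 1e12  # decimal terabyte, in bytes

# Hypothetical link speeds, in bits per second.
links = {
    "10 Mbit/s": 10e6,
    "100 Mbit/s": 100e6,
    "1 Gbit/s": 1e9,
}

for name, speed in links.items():
    hours = transfer_seconds(TERABYTE, speed) / 3600
    print(f"1 TB over {name}: about {hours:.1f} hours")
```

Even under these ideal assumptions, a single terabyte ties up a 100 Mbit/s link for the better part of a day, and the time scales linearly with the size of the dataset. That is the arithmetic that makes a box of drives in the mail look reasonable.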
Technorati Tags: Networking, Data Storage
The value of data version 223435
December 17, 2007
I have often talked about the value of data coming from what we do with it, rather than from the raw data itself. It would appear that Google sees the incredibly useful 1-800-GOOG-411 service as a resource for gathering data to improve speech recognition and speech-to-text solutions (which most of us pretty much assumed anyway, so I don’t understand the paranoia).
To me, that is another validation of my argument. The data (our voices) are not the value here. The value comes from information that results in better speech recognition software, which will be far easier to monetize.
Update: A post by Tim O’Reilly on data being the Intel Inside might seem to contradict that, but I believe that if you look at it the way I wrote earlier, it doesn’t. Google is making no attempt to lock up the data. All they are doing is improving their data collection.
Technorati Tags: Google, Data, Voice Recognition, Speech to Text
All about information and the web
December 12, 2007
Two great posts for you to read, by two rather wise men:
Jon Udell on Discovering vs. teaching principles of social information management
Neil Saunders on the web as science communication platform
If I might add a note: the kind of metadata displayed in the Nature paper that Neil mentions should be happening ALL THE TIME.
Technorati Tags: Jon Udell, Neil Saunders, Data, Information


