Big data and a big blogger
September 30, 2008
It’s kind of nice having people whose research you’ve followed for years start blogging. I believe Shirley has something to do with this, which means she gets thanks for the fact that Russ Altman is a blogger, even if he finds it hard.
Anyway, in recent days/months the whole concept of big data has been top of mind, partly due to personal interest and partly driven by professional interest at both past and current places of work. So it was nice to see Russ cover the subject in a blog post. In calling big data an informatician’s best friend, he talks about what big data is and the impact it will have on informatics. Specifically, he points to the need to collect all the data we can and, equally importantly, to figure out what will make that data useful and valuable. I think we aren’t there yet, but we’ll get there.
He also talks about the market that big data will generate for informatics tools, algorithms, and solutions from the computer industry. I remember sitting in a talk by Lee Hood some years ago where he talked about how the mathematics for deriving useful information from the millions of data points collected over a variety of analytes across various high-throughput technologies wasn’t there yet. That’s the really hard part. Even with our current methods, we can really push the boundaries. Most importantly though, I believe big data in the life sciences will really make us think about data collection, data management, data analysis and data distribution at an industrial scale. This is even more true for the derived data. The days of hacked-out code, a server on a grad student’s computer, and thinking about instruments as personal lab property are gone. We need to think about capacity, content delivery, knowledge management and a lot of topics that so far only a few have had to worry about, and we need to do so as a community.
Of course, computing will play a big role in all this, which is one reason I made the move out of the life sciences into the heart of virtualized computing. I can’t wait to see the life science community, both industry and academia, begin to take computing more seriously, from both the programming and the architecture points of view. We can’t just be casual consumers anymore; we have to be active about leveraging the technologies and paradigms of data-intensive computing that the web has spawned, and add to them the compute-intensive needs and requirements often unique to science.
So where was I … oh yes, just read Russ’ darn post and ignore the paragraphs above.
Thinking about “thinkism”
September 30, 2008
Say what you like about Kevin Kelly, he has the ability to write material that makes you think. In a (no pun intended) post called Thinkism, Kelly makes a very effective argument related to the Singularity, one I try to make, but not this effectively.
Let’s start with his definition of thinkism (emphasis mine)
Setting aside the Maes-Garreau effect, the major trouble with this scenario is a confusion between intelligence and work. The notion of an instant Singularity rests upon the misguided idea that intelligence alone can solve problems. As an essay called Why Work Toward the Singularity lets slip: “Even humans could probably solve those difficulties given hundreds of years to think about it.” In this approach one only has to think about problems smartly enough to solve them. I call that “thinkism.”
Here are some other choice lines, ones that fit my own world view
Let’s take curing cancer or prolonging longevity. These are problems that thinking alone cannot solve.
No intelligence, no matter how super duper, can figure out how human body works simply by reading all the known scientific literature in the world and then contemplating it.
But it’s what he says after all this that really hits the nail on the head. He says that “Between not knowing how things work and knowing how they work is a lot more than thinkism.” So true. I just wrote a post about how we have so many gaps in our data (something that has come up a lot lately). Our hypotheses are only as good as the data that we can collect. As Kelly said, just thinking about the potential data will not yield the correct data. We have our working models, but as we collect all the data that we can, these models have to be refined, until at some point we can’t correct them any further (you know, that thing they call the Scientific Method). We need to do a lot of experiments, collect a lot of data, and build a lot more hypotheses before we can come close to addressing the kinds of problems that Singularitarians talk about.
To end, here is the last line of Kelly’s post
Since we did not see them coming, we look back and say, yes, that was the Singularity.
When more is easier
September 29, 2008
More goodness from Jeff Jonas. In The Fast Last Puzzle Piece, he talks about how the notion that more data = slower system is not true. The analogy he uses is that of a jigsaw puzzle, which starts easy, gets harder and eventually gets easier again as pieces can only fit in certain specific positions (fewer degrees of freedom, in the language we are used to). He goes on to add that such behavior needs to fulfill a set of requirements, and that’s what caught my eye. Essentially, any set of observations must:
- Belong to the same universe
- Have enough features to enable contextualization
- Be such that the features can be extracted, enhanced and classified
- Sufficiently saturate observational space
He adds that you need to have enough smarts to stitch everything together.
As I read that list, I kept thinking of the data we are used to seeing as life scientists. One would think it satisfied all the criteria above, so why are things getting harder? I think it has to do with the point about saturation. In many cases, we don’t have saturation, which is why we can’t get the required results. In others (structure prediction comes to mind), we do have sufficient saturation and we are able to get meaningful results as our body of work grows. However, for a lot of data types, we haven’t yet hit the tipping point where the system gets “faster” and easier to solve.
What do you think?
Publishing workflows
September 29, 2008
Great presentation by Carole Goble that I found via Richard Akerman. In the presentation, Carole hits upon many of my favorite themes including data driven research, the importance of open data, etc.
I will point to slide 24, which really drives home a point that most people do not talk about enough: methods are scientific commodities. From the beginning of publication, we have had a “methods” section in our papers. So why hasn’t electronically capturing our workflows, and making them available in some common format, become the norm? This could be done by depositing experimental workflows at a place like OpenWetWare, and computational workflows in a place like MyExperiment. Yes, there are cases when your workflow is somewhat novel, but you are unlikely to publish that anyway, so it’s a somewhat moot point.
The web as platform: We can do more
September 28, 2008
I have had my differences with Tim O’Reilly over the years on certain issues, but his keynote at Web 2.0 Expo resonated with me on many levels. I am writing this post several days later, so hopefully my notes suitably reflect my thoughts at the time. A lot of people have written about what he had to say, but hopefully this post provides a different perspective.
He spent a lot of time talking about a move towards doing something more meaningful and touched upon many themes. Here are some of the phrases that are worth mentioning
“Web meets world”
“Do stuff that needs to be done”
“Create more value than you capture”
“Pascal’s wager”
“We have to assume that the world is going to go to hell in a handbasket unless we do something about it”
The message once again placed an emphasis on tools and services that make our lives better, not just the next viral Facebook app. Something that many of us have been saying for a while, but Tim wields a bigger stick.
That’s where I think the sciences come in. Science needs to be done. The web needs to be brought to science, and not just by big research labs, or by larger companies, but by forward thinking people who “get it”. It’s why I think the value in scientific web apps lies not in the next social network to foster collaboration, especially when there are so many around, but in projects and efforts that really do something to further science. That could come from open source tools, things like SNPedia, or the kinds of efforts that Cameron and Jean-Claude are pushing.
The question Tim asked was “are we working on the right things?”. That’s a question we all need to ask. I am not saying we are not allowed to have silly fun, and utility can be found in the strangest of places (e.g. Twitter or Friendfeed), but if we believe in something, we can do something about it. We can’t wait for the system to change, and there is the reality of making a living, but some of us are in a position to make a difference and we should. Whether it is by building awareness, or even better, building tools and services that enable the community and others around us, we should do so.


