At breakfast today I had a nice chat about bioinformatics, software, research and the entire ecosystem. In between bemoaning the lack of data architects and the lack of appreciation for software and informatics, we talked about interesting ways to educate people about both. We ended up talking about GitHub and virtual machines, and probably my favorite use case, Pete Skomoroch's great work on building data intensive apps. Here are some of those tutorials
The point of the discussion was this. If we could have more tutorials like this, essentially end-to-end tutorials with code and reference sites like trendingtopics.org but in a bioinformatics/genomics context, I am sure a whole bunch of people would get interested in developing apps and learning new trends and paradigms. Yes, the incentive structure is still wrong, but I’d love to see more educational materials developed along those lines. The key is to identify the people who can do this and help them and encourage them to make these materials available, and let the viral effects take over from there.
Building Data Intensive Apps with Hadoop and EC2 from Cloudera on Vimeo.
5 Comments
Deepak,
Can you think of a tightly framed problem in bioinformatics, like predicting movie ratings for the Netflix Prize?
If people are looking to do one of these examples, I'd suggest building an open source app that manages contest submission and evaluation like Netflix, but for bioinformatics. Basically, the user would upload algorithm predictions for genes and they would be evaluated against a gold standard test set (that is the minimal data crunching piece). The app could show some metrics about where the algorithm performed well or poorly, along with some visualizations and comparisons to other submissions.
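A minimal sketch of that evaluation piece, assuming a submission is simply a set of predicted gene IDs and the gold standard is another set. The gene names, the `Score` class and the `evaluate` function are hypothetical illustrations, not part of any existing contest app, and a real contest would likely need richer metrics.

```python
from dataclasses import dataclass


@dataclass
class Score:
    precision: float
    recall: float
    f1: float


def evaluate(predicted: set[str], gold: set[str]) -> Score:
    """Score a set of predicted gene IDs against a gold-standard set."""
    true_positives = len(predicted & gold)
    # Guard against empty inputs so an empty submission scores 0 instead of crashing.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Score(precision, recall, f1)


# Hypothetical gold standard and submission, for illustration only.
gold = {"BRCA1", "TP53", "EGFR", "MYC"}
submission = {"BRCA1", "TP53", "KRAS"}
score = evaluate(submission, gold)
print(f"precision={score.precision:.2f} recall={score.recall:.2f} f1={score.f1:.2f}")
```

Ranking submissions by `f1` would give a first-cut leaderboard; the per-gene hits and misses are what the visualizations would be built from.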
In terms of running the actual algorithms in the cloud, I'd like to see a marketplace that makes input data available on S3 and allows people to upload black-box Pig scripts and associated cache files to process the data and generate output using EMR.
If those workflows could be shared black-box style (no source code, just selected like an AMI from a drop-down menu), along with certified benchmark runtimes and accuracy metrics, then you would have a nice way for “hired gun” analysts to make cash and the platform provider to take a cut. Kind of like an Amazon used book marketplace, but for algorithms.
-Pete
Pete,
I've opened up this question to the community. Let's see what they come up with.
There are automated benchmarks in the area of protein structure prediction (list: http://meta.bioinfo.pl/3dbench.pl ). Similar benchmarks exist for gene prediction and very likely for protein function prediction (I saw one a couple of years ago, not sure if it's still active). I would expect that any knowledge that people try to extract from metagenomes will also be benchmarkable soon.