Greg Linden (a super smart fellow by all accounts) has a really nice article on the ACM blog. In it, he writes about machine learning and data in the context of the Netflix Prize. He concludes with the following:
There are a lot of lessons that can be taken from the Netflix contest, but a big one should be the importance of constant experimentation and learning. By competing algorithms against each other, by looking carefully at the data, by thinking about what people want and why they do what they do, and by continuous testing and experimentation, you can reap big gains.
Data is peculiar. Throwing off-the-shelf algorithms at data only gets you so far; it usually gives you reasonable results, but in the end you really need to understand your data and its peculiarities. Biological data in particular is nothing if not peculiar. In the context of scientific discovery, especially biomarker or drug discovery, we need clear metrics for success, and I’ve seen quite a few projects where those metrics are very poorly defined. That means we aren’t always getting optimal results, even if we throw non-standard algorithms at the problem. In this era of big (and somewhat noisy) data, working against testable metrics (not always possible in a scientific discovery context) should be something we try to do whenever we can. I am not saying you always need to write a new algorithm, but a general awareness of your data, its quirks, and your goals as you try out new methods is important.