All data are not the same

Greg Linden (a super smart fellow by all accounts) has a really nice article on the ACM blog. In it, he writes about machine learning and data in the context of the Netflix Prize. He concludes his article with the following:

There are a lot of lessons that can be taken from the Netflix contest, but a big one should be the importance of constant experimentation and learning. By competing algorithms against each other, by looking carefully at the data, by thinking about what people want and why they do what they do, and by continuous testing and experimentation, you can reap big gains.

Data is peculiar. Throwing off-the-shelf algorithms at it gets you only so far: you usually get reasonable results, but in the end you really need to understand your data and its peculiarities. Biological data in particular is nothing if not peculiar. In the context of scientific discovery, especially biomarker or drug discovery, we need clear metrics for success, and I’ve seen quite a few projects where those metrics are very poorly defined. That means you aren’t always getting optimal results, even if you throw non-standard algorithms at the problem. In this era of big (and somewhat noisy) data, working against testable metrics should be something we try to do whenever possible, even though it is not always possible in a scientific discovery context. I am not saying you always need to write a new algorithm, but a general awareness of your data, its quirks, and your goals as you try out new methods is important.

This entry was posted in Big Data, Informatics, Modeling & Simulation, Programming, Software & Internet.


