
Data distribution and versioning

"Sharing your changes" is a great post on some of the advantages of using Git (or any distributed version control system). Rich Apodaca has an even more interesting post on using GitHub for chemistry, particularly in the context of revision-controlled datasets.

In general, we are getting increasingly interested in leveraging public data resources. Even in pharma, there are people with a great interest in combining internal data with public data to get more relevant results. But perhaps the biggest trend going forward will be the development of mechanisms that allow you to fork and remix data, much as we already do with code and media. The same paradigms apply, although the mechanisms might vary. The comment thread on Rich's post is a must-read as well. My particular favorite is one by Rajarshi Guha:

I definitely like the idea of mashing up databases. It saves a lot of hassle related to hosting, managing, updating the datasets on my own. If everybody had clean, well documented APIs that would make life much easier
