TrendingTopics.org: A reference site for data analytics in Hadoop and Hive

In episode 21 of Coast to Coast Bio (not yet released) I talk about Hive. For those who may not know, Hive is a data warehouse infrastructure built on top of Hadoop.

One of the most recent Amazon Public Data Sets is a sample of Wikipedia page traffic statistics by Peter Skomoroch. The full data set powers trendingtopics.org.

What is TrendingTopics?

This site was built by Data Wrangling to demonstrate how Hadoop can power a simple data-driven website. The trend statistics and time series data that run the site are updated periodically by launching a temporary EC2 cluster running the Cloudera Hadoop Distribution. Our initial seed data includes the content of Wikipedia and hourly article traffic logs from the Wikipedia squid proxy collected by Domas Mituzas.

Why do I like this so much? Apart from the fact that it is a website for data visualization and analytics, it hits a lot of points that can be of huge value to the informatics community.

  • It uses EC2 to compute on data as needed
  • Uses Hadoop and Hive
  • It is a reference architecture, and you can find the source on GitHub and an example dataset on AWS Public Data Sets

That last bit is important. Peter demonstrates how you can use Hadoop, Hive, Ruby on Rails and EC2 as a data crunching and data visualization resource. I would love to see more such sites, with biological data sets.
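
To make the idea of crunching hourly traffic logs a little more concrete, here is a minimal Python sketch of the kind of aggregation involved: summing page views per article across several hourly log lines and picking the most viewed pages. This is not the TrendingTopics code, which does the work at scale in Hadoop and Hive. I am assuming a whitespace-separated, four-field line layout (project, page title, request count, bytes transferred), and the function names and sample lines are made up purely for illustration.

```python
from collections import defaultdict


def parse_pagecount_line(line):
    """Parse one 'project title views bytes' line; return None if malformed.

    The four-field layout is an assumption about the log format.
    """
    parts = line.split()
    if len(parts) != 4:
        return None
    project, title, views, _bytes = parts
    try:
        return project, title, int(views)
    except ValueError:
        return None


def total_views(lines, project="en"):
    """Sum view counts per page title for a single project."""
    totals = defaultdict(int)
    for line in lines:
        record = parse_pagecount_line(line)
        if record is None or record[0] != project:
            continue  # skip malformed lines and other projects
        totals[record[1]] += record[2]
    return dict(totals)


def top_pages(totals, n=10):
    """Return the n most viewed (title, views) pairs, ties broken by title."""
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical sample lines, not real traffic data.
sample = [
    "en Hadoop 12 34567",
    "en Apache_Hive 5 12345",
    "de Hadoop 3 2345",
    "en Hadoop 8 23456",
    "garbage line",
]

if __name__ == "__main__":
    for title, views in top_pages(total_views(sample)):
        print(title, views)
```

On a real cluster the same grouping and summing would be expressed as a MapReduce job or a Hive query over the full logs; the sketch only shows the shape of the computation on a handful of lines.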

Please see this disclaimer

  • Disclaimer

    All opinions on this blog are my own and do not reflect those of my employers, past or present