In episode 21 of Coast to Coast Bio (not yet released) I talk about Hive. For those who may not know, Hive is a data warehouse infrastructure built on top of Hadoop.
One of the most recent Amazon Public Data Sets is a sample of Wikipedia page statistics by Peter Skomoroch. The full data set powers trendingtopics.org.
This site was built by Data Wrangling to demonstrate how Hadoop can power a simple data driven website. The trend statistics and time series data that run the site are updated periodically by launching a temporary EC2 cluster running the Cloudera Hadoop Distribution. Our initial seed data includes the content of wikipedia and hourly article traffic logs from the wikipedia squid proxy collected by Domas Mituzas.
Why do I like this so much? Apart from the fact that it is a website for data visualization and analytics, it hits a lot of points that can be of huge value to the informatics community.
- It uses EC2 to compute on data as needed
- It uses Hadoop and Hive
- It is a reference architecture: you can find the source on GitHub and an example dataset on AWS Public Data Sets
That last bit is important. Peter demonstrates how you can use Hadoop, Hive, Ruby on Rails and EC2 as a data crunching and data visualization resource. I would love to see more such sites, built on biological data sets.
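To give a rough feel for the kind of crunching involved, here is a minimal Python sketch that sums hourly page view counts per article. It is not Peter's code, and the real site does this at scale with Hive on Hadoop. I am assuming each log line has four space-separated fields (project, page title, view count, bytes); the function name `top_articles` and the sample rows are made up for illustration.

```python
from collections import Counter


def top_articles(lines, project="en", n=3):
    """Sum view counts per article for one project and return the n most viewed.

    Assumes each line looks like: "<project> <title> <views> <bytes>".
    """
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip rows that do not match the assumed layout
        proj, title, views, _size = parts
        if proj != project or not views.isdigit():
            continue  # skip other projects and non-numeric counts
        totals[title] += int(views)
    return totals.most_common(n)


# Made-up sample rows standing in for two hourly log files.
sample = [
    "en Hadoop 120 345678",
    "en Apache_Hive 45 123456",
    "de Hadoop 10 23456",
    "en Hadoop 80 234567",
    "malformed line",
]

print(top_articles(sample))
```

The same idea, a group-by on the title with a sum over the counts, is what a Hive query would express over the full logs.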