Staying on my massive data processing theme, here is a more practical post. In the world of large-scale distributed processing, the original MapReduce paper probably holds the most important position. Hadoop remains the best known of the MapReduce implementations and is now a proven, battle-tested commodity.
Tom White’s book is a great place to start if you are interested in the framework itself, but the book I want to point out is Jimmy Lin’s Data-Intensive Text Processing with MapReduce (a pre-production PDF is available from the book's homepage), which is a great dive into algorithm design. It covers general algorithm design, indexing, and graphs, and it has a fabulous section on expectation maximization that is a must-read for bioinformaticians interested in analyzing and processing large data sets.