
Mapping and reducing MD trajectories with HiMach

I have covered D.E. Shaw in the past here at bbgm. I have always been fascinated by a research organization funded by someone who has money, is interested in science without necessarily feeling the need to commercialize it, and is by all accounts very intelligent. That D.E. Shaw Research was represented at Supercomputing 08 was not surprising, but the topic of the talk was not what I expected. That it was one of the more fun talks at the conference made it that much better.

Tiankai Tu (at least I think it was him) gave a talk about HiMach, a framework for the analysis of very long (millisecond scale) trajectories from molecular dynamics simulations. As most of you probably know, D.E. Shaw Research has been building a special-purpose computer, Anton, for millisecond scale molecular dynamics simulations. The trajectories generated by those simulations create problems of their own: they are so large that traditional, sequential trajectory analysis methods become essentially untenable. That’s the problem HiMach seeks to address. It is not designed to be run on special-purpose machines, but rather on a commodity cluster, and takes advantage of that increasingly ubiquitous distributed computing method of our times, MapReduce. HiMach allows users to write trajectory analysis programs sequentially, and then carries out the parallel execution of the programs automatically.

Trajectory Analysis with MapReduce
Under this model, the map phase corresponds to per-frame data acquisition, and the reduce phase to cross-frame data analysis. In addition to providing a MapReduce-style API and a parallel runtime that allow users to write trajectory analysis programs in Python, HiMach also extends the MapReduce paradigm, providing an MD trajectory definition and chained reduce capabilities (similar to Cascading as far as I can tell). The framework takes care of key-value data management, storing intermediate key-value pairs in a local file system on each compute node. Prudently, they also support automatic loading of frames into VMD. The following figure is an overview of the HiMach runtime on a single processor.
[Figure: HiMach runtime]
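To make the map/reduce split described above a bit more concrete, here is a minimal sketch in plain Python. It is not HiMach's API (I have not seen that beyond the talk); the names `map_frame`, `reduce_mean`, `run`, and the toy `trajectory` are all hypothetical, and the driver runs sequentially on one machine, where a real framework would distribute the map calls across a cluster.

```python
from collections import defaultdict
from math import dist  # Python 3.8+

# Hypothetical toy trajectory: each frame is a list of (x, y, z) atom positions.
trajectory = [
    [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)],
    [(0.0, 0.0, 0.0), (6.0, 8.0, 0.0)],
    [(1.0, 1.0, 1.0), (1.0, 1.0, 10.0)],
]

def map_frame(frame_index, frame):
    """Map phase: per-frame data acquisition, emitted as (key, value) pairs."""
    # Distance between atom 0 and atom 1 in this frame.
    yield "dist_0_1", (frame_index, dist(frame[0], frame[1]))

def reduce_mean(key, values):
    """Reduce phase: cross-frame analysis, here a simple mean over frames."""
    distances = [d for _, d in values]
    return key, sum(distances) / len(distances)

def run(trajectory, mapper, reducer):
    """Sequential stand-in for a parallel runtime (an assumption, not HiMach)."""
    groups = defaultdict(list)
    for i, frame in enumerate(trajectory):
        for key, value in mapper(i, frame):
            groups[key].append(value)  # group intermediate values by key
    return dict(reducer(key, values) for key, values in groups.items())

result = run(trajectory, map_frame, reduce_mean)
print(result)
```

The point of the split is that `map_frame` only ever sees one frame, so the map calls are independent of each other and can be farmed out, while `reduce_mean` only sees the grouped key-value pairs.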
The cool architecture wouldn’t mean much if the performance were poor. Using HiMach, they were able to analyze a 1 TB trajectory in 15 minutes on a 512 node cluster. The paper (accessible from the SC08 abstract page) goes into a lot more detail. This is fascinating stuff and shows how modern distributed programming paradigms can be applied to some classic scientific problems at scale. I am looking forward to these key-value paradigms being extended to data storage and retrieval, e.g. using something like CouchDB as a way to analyze structural and trajectory information (a database of trajectory data would be cool).

This entry was posted in Computing, Modeling & Simulation, Programming, Software & Internet.

  • Disclaimer

    All opinions on this blog are my own and do not reflect those of my employers, past or present