Protein simulation: At a crossroads

Almost ten years ago, I ran my first molecular dynamics simulation of a protein (bacteriorhodopsin). In these ten years the field has changed a lot. Back then, simulations were routinely run for a few hundred picoseconds, in vacuo. Today, a few nanoseconds in a box of water molecules is typical, using methods such as Particle Mesh Ewald for better treatment of the electrostatics.

There are many reasons why this has happened: new features, improved programming, and competition between programs like CHARMM, AMBER, NAMD, GROMACS, etc. The most significant change, however, has been the commoditization of high performance computing. Moore’s law, steadily decreasing costs, and the Linux cluster have allowed scientists to pursue new methods and theories that would have been far too costly to run in the past. However, despite all these advances, I sometimes wonder if the field has stagnated.

It is true that academics today are running larger and larger simulations to try to understand the structure and function of biomolecules, but in industry many scientists are still far too comfortable using lower levels of theory to get results faster. There has been some acceptance of higher order methods, but not to the extent that one might have expected. The field of structure-based drug design is fertile ground for scientists to develop new theories and methods, and to some extent this has happened. There is a rich body of work published in recent years on “physics-based” methods to evaluate protein-ligand interactions. However, this has not translated into success at the industrial level, where such methods can really make a difference.

Strangely enough, I think this is because computers are not fast enough. The kind of throughput required by pharma forces some shortcuts to be taken, compromising the quality of the results. This means that the best techniques are still not being developed, and more approximate methods are being pursued instead. Are these methods useful? Absolutely! I spent quite a bit of time looking at MM-PB(GB)SA and LIE, two of the more commonly used techniques. These methods take a number of shortcuts and just scratch the surface of what higher order methods can offer, but they demonstrate how more expensive methods can improve the results from in silico approaches, if used appropriately.

However, to change the name of the game, we need to rethink how we are taking advantage of modern hardware. Multicore CPUs, FPGAs, and efforts like Blue Gene or the computer being built by D.E. Shaw are but steps towards bringing a new generation of scientific computing to researchers. These are not commodity machines, but perhaps it is necessary to spend some money on special resources to get special results.
Researchers, developers and hardware vendors need to work hand-in-hand to identify core needs and develop the appropriate hardware and software. The costs that the market can bear will be critical to these developments, which leads me to believe that there will be two “camps”:

Ultra-specialized hardware: Machines like Blue Gene, and machines from Fujitsu and D.E. Shaw, come to mind. All have specialized or modified software that takes advantage of the architecture of the machines. These machines are (or will be, in the case of the Shaw machine) expensive, but will find a niche for special projects, especially if on-demand computing catches on in the community. This would be somewhat of a return to the expensive supercomputers of the 90s, where users had to purchase time to run longer simulations.

Plug n’ play hardware: This term applies to the kinds of hardware manufactured by companies like ClearSpeed, and to the MDGRAPE card from RIKEN. While these have yet to find common use, perhaps it is time that such hardware became more prevalent, as it can be combined with commodity hardware to create superfast machines for specific applications. Of course, I am still waiting for someone to figure out how to use graphics cards for MD simulations.
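
As an aside, for readers who have not met the LIE (linear interaction energy) method mentioned above: the estimate is a weighted difference of average ligand-surroundings interaction energies between the bound and free simulations. The sketch below is only an illustration of that general form. The function name, the default coefficients (values often quoted as starting points, but normally refit per system) and the sample energies are all assumptions, not anything taken from a specific program.

```python
from statistics import fmean


def lie_binding_energy(vdw_bound, vdw_free, elec_bound, elec_free,
                       alpha=0.18, beta=0.5, gamma=0.0):
    """Illustrative LIE-style estimate (hypothetical helper).

    dG ~= alpha * d<V_vdw> + beta * d<V_elec> + gamma

    Each argument is a sequence of ligand-surroundings interaction
    energies sampled from a simulation of the bound or free ligand.
    alpha, beta and gamma are empirical; the defaults are assumptions.
    """
    # Difference of ensemble averages: bound state minus free state.
    d_vdw = fmean(vdw_bound) - fmean(vdw_free)
    d_elec = fmean(elec_bound) - fmean(elec_free)
    return alpha * d_vdw + beta * d_elec + gamma


# Made-up sample energies (think kcal/mol), purely for illustration.
estimate = lie_binding_energy(
    vdw_bound=[-40.0, -42.0], vdw_free=[-20.0, -22.0],
    elec_bound=[-30.0, -34.0], elec_free=[-25.0, -27.0],
)
print(f"LIE-style estimate: {estimate:.2f}")
```

The shortcut is visible in the code: only two end-state simulations are averaged, which is what makes the approach cheap enough for higher throughput and also what limits its accuracy.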

This entry was posted in Admin, BioIT, Computing, Life Science, Physical Science.

7 Comments

  1. Posted July 23, 2006 at 13:42 | Permalink

    Although the subject of the post is rather interesting, I feel I’m a few steps off the subject, and therefore didn’t really grasp the whole picture.

    Anyhow, I would like to alert you to some small spelling mistakes here and there. Namely “imroved” at the end of the first paragraph, and “ulra-specialized” in the penultimate paragraph.

    Keep up the great posts.

    Cheers,
    Rick

  2. Posted July 23, 2006 at 14:39 | Permalink

    Oops … corrected!!! One of the negatives of writing posts in HTML mode without a spellchecker.

    As to the subject itself … based on what I have seen happening over the past few years, I think scientists have reached the limits of the kinds of computations they can do with current commodity hardware. My concern is that while we wait for the next quantum leap in computing performance, there are options available today that we are not really utilizing to get real answers to real problems.

    Thanks for reading!!!!

  3. Posted July 23, 2006 at 14:59 | Permalink

    Wow, that was a quick fix!

    I happen to have a big interest in bioinformatics but it seems to be non-existent at my university, so I’m reading up on anything I can.
    I have attended some protein folding seminars that always point to CPU power as their main enemy.
    What other areas are there that would not need more power but something else, to find out more about folding prediction and other sorts of protein interactions?

    Sorry if my questions seem too basic, I’m new to this stuff :)

  4. Posted July 23, 2006 at 15:14 | Permalink

    I don’t have enough of a life to be away from my computer for too long on the weekends.

    Protein folding is one aspect, and the problems there go beyond just computer power. The physics of protein folding is an unsolved problem. The current potentials and descriptions that we are using for folding are not correct, although a lot of good work has been done in the past decade (especially the late 90s).

    The part that does not need special compute power is genomic-scale comparative modeling. It’s not too hard, given enough structural coverage, to run threading or homology modeling programs to predict the structure of proteins from their primary sequence. I think that there, too, the algorithms for sidechain and loop modeling can be improved (as well as alignment). The CASP “competition” has shown that there has not been enough improvement over the past 4-5 years.

    In general, one needs to develop algorithms that fit current computing capabilities. IMHO, with current commodity hardware, we are close to doing the best we can. To take the next step, we should be using new hardware architectures that allow us to write software differently or solve problems in a different way. Most FPGA-type work that I have read about focuses on new ways to run BLAST. Useful, but only a small piece of what the community should be doing.

4 Trackbacks

  1. By Nonoscience / Pante Rei - First Edition on July 24, 2006 at 11:02

    [...] Deepak Singh presents Protein simulation: At a crossroads posted at business|bytes|genes|molecules, saying, “This is definitely a little outside of the carnival core topics, but after all I did learn all my protein simulation theory from a book on the simulation of fluids” We don’t complain, as long as the science thoughts flow… [...]

  2. [...] In the introduction, Stevens writes that the most significant trend in modern biology is the “increasing availability of high-throughput data”. With the sequencing of numerous genomes and the development of new “-omics” techniques, there has been an explosion in the amount of biological information. As the article points out, the challenges of generating integrated datasets of suitable quality are critical and are here to stay for the foreseeable future. I also find it refreshing that he talks about simulation and modeling as part of the whole challenge of computational biology (an aspect often overlooked). While truly predictive modeling is still some years away, computer models at the protein structure level are used every day in academic and commercial research to make informed decisions, often with extremely high quality results. In a few years, assuming computers become more powerful, organism level modeling at multiple temporal and spatial scales will become increasingly predictive and more prevalent (Stevens predicts a 10-20 year timeframe for complex eukaryotes, a number that seems reasonable). [...]

  3. [...] Anyway, to get back to the topic at hand, one thing that should change scientific computing is what Joe calls AC. When it becomes possible to get TFlops performance out of something that fits under my desk or in a little corner, does not need special facilities, etc, the productivity of the scientific community, from those who crunch numbers for a living to those almost unaware of the resources that help them run BLAST again and again and again, will change dramatically. The question remains, why isn’t adoption higher? Economies of scale should be driving the cost of scientific computing down, but it seems to me that other than the cluster (a personal favorite), this is not happening. One should not need Koolaid to make the value of accelerated computing obvious. In the past, I have argued that computers are not fast enough. I stand by that argument, but on the flip side, I don’t think people are leveraging the resources available to them either. [...]

  4. [...] of problems that most of us are interested in. I have also argued that molecular simulation was stuck in a rut. Anton is a special purpose machine and essentially is designed to solve problems that are not [...]

  • Disclaimer

    All opinions on this blog are my own and do not reflect those of my employers, past or present