Programming HPC for the domain

May 11, 2008

[Image: Cray designed many supercomputers that used multiprocessing heavily.]

At Accelrys, a lot of the software I managed was in-licensed from academia. That approach allowed the company to tap into the intellectual resources of some of the smartest academic researchers in the world, but it also created problems. One was the difference in software development practices. Some of the academic code barely had version control. But that’s the obvious one. In a new post at Computing at Scale, Bill McColl writes about Domain-specific parallel programming. Translating code parallelized for an academic setting, often under the assumption that huge clusters might be available, to an industrial setting where scaling and fault tolerance become critical, where resource availability varies widely, and where speed matters, is always a challenge. This is especially true when you’re trying to shrink-wrap software and build interactive interfaces.

So in an era with more scale available, clouds to tap into, accelerators, and new data and distribution models, are we going to see a shift? I still feel that the underlying scientific research has to come from academia. Academics have the resources, time and incentive to do it, but I think industrial think tanks and expertise can contribute back by working with academia on advanced problems of relevance, e.g. in the area of computing. Will we, as a scientific community, tap into some of the new domain-specific development being done today? It can’t be done by one side or the other. Rather, we need to identify approaches as a community and understand what works best, without duplicating effort. Of course, we need people who understand these new methods and paradigms to implement them.

There will always be a tension between academic research efforts and commercial need. In the life sciences the economics make it especially tough to develop industry-specific apps, which is why I believe it will have to be a joint effort.

Your thoughts?

Image via Wikipedia


Gamers, get your folding on

May 9, 2008

[Image: Protein before and after folding.]

Technology Review was the first place I saw it, then someone put it up on Friendfeed and now Andrew Perry has a great post on Foldit. Foldit comes out of the lab of a bbgm favorite, David Baker, right here at the University of Washington.

Foldit combines gaming with protein structure prediction. It’s an interesting approach to spreading scientific problems. Folding@home built upon the success of Seti@home and the geek cred of running on gaming consoles, and has built quite a following. Foldit presents a simple, fun interface to get people interested in protein structure, and the existence of Folding@home makes the idea somewhat familiar to geeks everywhere. Will it be an example of how we can leverage crowdsourcing? Andrew makes some interesting points (which I agree with) on weighting crowdsourced contributions. That’s always a hard thing to do, but I’d like to see karma, etc. come into play here.

It’s good to see protein structure getting some attention and the field continuing to be creative. It’s always been my favorite scientific subject. The field lends itself to “pretty pictures”, so getting non-experts involved is a possibility.

The site and server have had connectivity issues whenever I’ve tried them, so perhaps they need help with web resources, because lots of people seem to be interested.

Here is a list of people supporting the project: UW Animation Research Labs, UW Baker Lab, DARPA, Microsoft, and Adobe. Nice list.

Image via Wikipedia


HPC and structure-based drug design

May 5, 2008

[Image: Angiotensin-converting enzyme 2]

Here is the abstract of a paper in Hypertension entitled Structure-based identification of small-molecule angiotensin-converting enzyme 2 activators as novel antihypertensive agents.

Angiotensin-converting enzyme 2 (ACE2) is a key renin-angiotensin system enzyme involved in balancing the adverse effects of angiotensin II on the cardiovascular system, and its overexpression by gene transfer is beneficial in cardiovascular disease. Therefore, our objectives were 2-fold: to identify compounds that enhance ACE2 activity using a novel conformation-based rational drug discovery strategy and to evaluate whether such compounds reverse hypertension-induced pathophysiologies. We used a unique virtual screening approach. In vitro assays revealed 2 compounds (a xanthenone and resorcinolnaphthalein) that enhanced ACE2 activity in a dose-dependent manner. Acute in vivo administration of the xanthenone resulted in a dose-dependent transient and robust decrease in blood pressure (at 10 mg/kg, spontaneously hypertensive rats decreased 71+/-9 mm Hg and Wistar-Kyoto rats decreased 21+/-8 mm Hg; P<0.05). Chronic infusion of the xanthenone (120 microg/day) resulted in a modest decrease in the spontaneously hypertensive rat blood pressure (17 mm Hg; 2-way ANOVA; P<0.05), whereas it had no effect in Wistar-Kyoto rats. Strikingly, the decrease in blood pressure was also associated with improvements in cardiac function and reversal of myocardial, perivascular, and renal fibrosis in the spontaneously hypertensive rats. We conclude that structure-based screening can help identify compounds that activate ACE2, decrease blood pressure, and reverse tissue remodeling. Administration of ACE2 activators may be a valid strategy for antihypertensive therapy.

Here’s the HPCwire story, which doesn’t tell me much beyond the fact that this was very high-throughput docking, but they use words like:

That in itself is a significant accomplishment because no one has ever specifically identified a compound that enhances the activity of an enzyme using a rational structure-based approach

Anyone have a subscription to Hypertension? I am really curious because nothing I read screams “unique” to me. Of course, I can just wait till tomorrow and try to get to the paper from work.

Update: Got the paper, and still don’t get the fuss. It’s an elegant virtual screening strategy, but I wouldn’t say it’s revolutionary. I was hoping to see something more advanced, e.g. protein flexibility, better energy functions, etc.

Image via Wikipedia


Andrew releases the final Golem beta and other cool stuff

May 4, 2008

Another resurrected post

The announcement

The protagonist. What is it?

Golem is a set of tools, and an ontology language, for processing data written in CML, the Chemical Markup Language. The Golem language is XML, and the tools and libraries are written in Python.
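To give a feel for what "processing CML in Python" involves, here is a minimal sketch that reads atoms out of a small CML fragment using only the Python standard library. It does not use Golem and is not Golem's API; the function name `read_atoms` and the sample water molecule (with made-up, illustrative coordinates) are assumptions for the sake of the example.

```python
import xml.etree.ElementTree as ET

# Namespace used by CML documents.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Hypothetical sample document; the coordinates are illustrative only.
SAMPLE_CML = """\
<molecule xmlns="http://www.xml-cml.org/schema" id="water">
  <atomArray>
    <atom id="a1" elementType="O" x3="0.000" y3="0.000" z3="0.117"/>
    <atom id="a2" elementType="H" x3="0.000" y3="0.757" z3="-0.469"/>
    <atom id="a3" elementType="H" x3="0.000" y3="-0.757" z3="-0.469"/>
  </atomArray>
</molecule>
"""


def read_atoms(cml_text):
    """Return a list of (element, x, y, z) tuples from a CML molecule."""
    root = ET.fromstring(cml_text)
    atoms = []
    # Walk every <atom> inside the molecule's <atomArray>.
    for atom in root.findall("cml:atomArray/cml:atom", CML_NS):
        atoms.append(
            (
                atom.get("elementType"),
                float(atom.get("x3")),
                float(atom.get("y3")),
                float(atom.get("z3")),
            )
        )
    return atoms


if __name__ == "__main__":
    for element, x, y, z in read_atoms(SAMPLE_CML):
        print(f"{element:2s} {x:8.3f} {y:8.3f} {z:8.3f}")
```

A tool like Golem sits a level above this kind of hand-written extraction, but the sketch shows the sort of XML plumbing it is meant to save you from.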

A shout out to the MaterialsGrid, which my former employer, Accelrys, is involved with. I’ve talked about some of the cool stuff Andrew does before. That’s only the tip of the iceberg. He writes and records cool electronica and creates mashups like the one below (using RDF).


Crystallography, 2000-2007 from Andrew Walkingshaw on Vimeo.

Video via Andrew under a Creative Commons license


Tons of data everywhere. Do we need life science CDNs?

May 3, 2008

This week’s Bio-IT World meeting was all about data storage. Driven by the needs of integrating complex, heterogeneous data and most of all by next gen sequencing, it’s amazing how much data the life sciences are generating and how poorly prepared we are. I won’t necessarily mention names, but there are places which have data hitting the petabytes AFTER throwing away most of it. How do you access this data? How do you back it up? What kind of data centers do you need? What kind of power do you need? When people are worried about the city being able to handle their power needs, there is cause for concern.

It is also why I think we need to approach the future of scientific data generation the way Google and others approach data, infrastructure and data access. What if we had a BigTable-like distributed file system that all this data could be uploaded to? What data would be uploaded there? How would we access it? Ideally data from public genome projects would be made available as Open Data, available to everyone for downstream analysis under a CC0 or similar license. Of course there is a lot more to these data than just whole genome sequencing. There is also the challenge of the pipes that the data needs to travel through. These are really large files.
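To put the "pipes" problem in rough numbers, here is a back-of-the-envelope sketch. The link speed, the data sizes and the function name `transfer_hours` are assumptions picked for illustration, not figures from the meeting, and the calculation ignores protocol overhead unless you lower `efficiency`.

```python
TB = 10**12    # bytes in a (decimal) terabyte
PB = 10**15    # bytes in a (decimal) petabyte
GBIT = 10**9   # bits per second on a 1 Gbit/s link


def transfer_hours(size_bytes, link_bits_per_second, efficiency=1.0):
    """Estimate the hours needed to move size_bytes over a network link.

    efficiency is the assumed fraction of nominal bandwidth actually
    achieved (1.0 means the ideal, overhead-free case).
    """
    if link_bits_per_second <= 0:
        raise ValueError("link speed must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    print(f"1 TB over 1 Gbit/s: {transfer_hours(TB, GBIT):.1f} hours")
    print(f"1 PB over 1 Gbit/s: {transfer_hours(PB, GBIT) / 24:.1f} days")
```

Under these assumptions a terabyte takes a couple of hours and a petabyte takes about three months, which is why where the data lives relative to the compute matters so much.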

Whatever the solution(s), next gen sequencing and the resultant data glut were top of mind. And this is just the start. Personally, I think that those in small labs who want access to sequencers for their own work really need to reconsider. Their utilization rates are unlikely to justify the cost and they will almost certainly run into data storage, access and archiving issues, especially when something like PacBio comes online. A utility model works best here, a model where people get access to time on machines or access to machines hosted at core facilities, etc.

The more I think about these issues, the more I am convinced that the life sciences really need to embrace something like CDNs. With the sheer volume and variety of data, we need people who can step up to the plate and provide the infrastructure instead of depending on a few people who aren’t necessarily thinking about data the way Google or Yahoo do on a daily basis (although, from the look of it, some of them are doing just that). I am especially worried about smaller groups and labs who might just get left out if we don’t develop the appropriate ecosystem.

The economics of all this? That’s another issue for another day.

Further reading
Chris Dwan’s Bio-IT World presentation
The DNA Data Deluge
