The big machines

The latest Top 500 list is out, so why doesn’t it excite me as much as it used to? Partly because many of those machines are not easily accessible, while other computing resources are within reach. And partly because, for a lot of the work I am interested in doing, you don’t really need a machine in the top 500. Of course, having access to a machine there allows you to address some problems you couldn’t any other way, and IMO they should only be used for such problems.

One of the better posts about this year’s list comes from Chris Peters at Intel. He presents a different perspective on the list and notes some trends. For example, the 10th fastest machine on this year’s list delivers more FLOPS than all 500 machines on the 2000 list combined.

While the post has a definite Intel angle to it, Chris notes the point I made earlier. Today, massive computing resources are available a lot more easily, and you have new software stacks, whether for clustering or for massive data-intensive computing. Personally, I think how we consume computing and the nature of our compute codes are going to go through a transformation in the next decade, and more people are going to be doing large scale computing and solving interesting problems.

Will the Top 500 list become meaningless? Not really. There is always room for massive floating point performance and certain problems for which you just need the kind of raw horsepower that the big iron provides. For others, we have a lot of resources that we can get our hands on.

Posted in BioIT

Freerisk – An open platform for risk modeling

I’ve been meaning to write about Freerisk.org for a while now, but only got reminded yesterday as I read the Wired article about Toby Segaran’s (and Jesper Anderson’s) new project.

Freerisk.org sucks in financial data from the SEC using the XBRL format, allows the community to add additional annotations, and then makes that data available to standard risk analysis algorithms and, this is the best part, available for others to apply their own algorithms. My first reaction was, this is what we want to be able to do in bioinformatics. Keep the data available, add annotations, and have this sandbox in which algorithms can be applied and developed.

The finance geek part of it is interesting enough, but I got interested in Freerisk for the general idea, especially coming from a field where there is a lot of data publicly available but not necessarily sandboxes/platforms for analysis and testing out new algorithms, although there is a lot of intent. From the about page of Freerisk.org:

Freerisk is a project with the goal of making freely available the data, algorithms and tools necessary to perform risk modeling. We believe that risk management is too important to society to be an arcane subject or competitive advantage.

You could easily replace “risk management” with biology or genomics, or something similar.

The pieces that Freerisk contains are even more interesting:

  • An open repository of financial data, including financial statements for public companies
  • A standards-based API for querying financial data
  • A distributed method for designing and running risk models
  • Open-source tools for parsing and handling financial data
  • Educational materials on risk-management

This is a hacker’s playground. We need something like this in the informatics community, especially as our data volumes grow. It’s an ethos that we seem to lack in general. Part of that is due to the fact that we need to publish our data, but there is a broader community of analysts and developers this could appeal to. Resources like these are needed, not just for finance, but in many other areas. The key is to find enough interested people to contribute. We have some of these pieces in the bioinformatics space, but they are somewhat fragmented, and the analytics part is the weakness at this point.

Posted in Big Data, Computing, Software & Internet

Hundred nanoseconds a day

100 nanoseconds a day. 100 nanoseconds a day. 100 nanoseconds a day

That is amazing. I used to get supercomputing time to do 100 ns simulations during my PhD and those used to last days, but that’s exactly what NAMD has achieved recently. A recent review article by the folks at D.E. Shaw Research lays down the state of protein simulations.

To put the 100 ns in context: that simulation was done on 300 cores. Given that you can get 1000 cores increasingly easily, that’s 1000 ns in about 3 days, assuming linear scaling. So when D.E. Shaw and co write that microsecond simulations are getting practical (increasingly feasible would be a better statement), they’re not just saying that. I think if access to 3000 cores and these compute scales becomes commoditized (not difficult looking at the kinds of trends I am seeing), then we are in business and it is indeed practical.
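As a quick sanity check on that arithmetic, here is a minimal sketch that takes the 100 ns/day on 300 cores figure as the reference point and assumes perfectly linear scaling, which real MD codes only approximate. The function names are my own, purely for illustration.

```python
# Back-of-the-envelope MD throughput, assuming perfectly linear scaling
# from the reference point quoted in the post (100 ns/day on 300 cores).
# Real codes lose efficiency at scale, so treat these as upper bounds.

def ns_per_day(cores, ref_cores=300, ref_ns_per_day=100.0):
    """Simulated nanoseconds per day at a given core count."""
    return ref_ns_per_day * cores / ref_cores

def days_for(target_ns, cores):
    """Wall-clock days needed to simulate target_ns nanoseconds."""
    return target_ns / ns_per_day(cores)

if __name__ == "__main__":
    for cores in (300, 1000, 3000):
        rate = ns_per_day(cores)
        days = days_for(1000, cores)  # 1000 ns = 1 microsecond
        print(f"{cores} cores: {rate:.0f} ns/day, {days:.1f} days per microsecond")
```

Under that assumption, 1000 cores give roughly 333 ns/day, so a microsecond takes about 3 days, and 3000 cores bring it down to about a day.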

NAMD, Gromacs, Desmond. For the first time in a long time, I really want to do MD again. Now to make the entire MD ecosystem more practical. I would love to see services around such codes that make it easier to run large jobs, include system preparation, and perhaps even analysis.

Posted in Computing, Life Science, Modeling & Simulation

Write heavy file system workloads

In a blog post last year, James Hamilton wrote about workloads in large scale network file systems. In his summary of a study on the subject, he writes:

Some of the important points that spring out for me: the percentage of random access is increasing; for those accesses that are sequential, the runs are longer; file sizes are increasing, data is getting colder; file lifetimes are increasing; and client usage has very high skew.

Those patterns sound a lot like some of the patterns I have seen in the life sciences recently, especially as we have to handle increasingly large data volumes with varying access patterns and usage. Some of the data challenges that people close to home have been seeing, especially significantly higher write to read ratios, which make caching of limited use, make one realize that the scale challenges aren’t always the same as the ones you typically see on the web. The study authors conclude that since metadata is accessed far more regularly, larger metadata caches are beneficial. Again, a typical access pattern for a lot of ‘omics’ data.
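To make the point about write to read ratios concrete, here is a rough sketch of how much traffic still reaches backing storage behind a read cache. It assumes a simple write-through model in which every write goes to storage and only read misses do. The 80% hit rate and the read fractions are made-up illustrative numbers, not figures from the study.

```python
# Toy model: fraction of operations that still hit backing storage
# behind a read cache. Assumes write-through (every write goes to
# storage). The hit rate and workload mixes are hypothetical.

def storage_load(read_fraction, hit_rate):
    """Fraction of all operations that reach backing storage."""
    write_fraction = 1.0 - read_fraction
    read_misses = read_fraction * (1.0 - hit_rate)
    return write_fraction + read_misses

if __name__ == "__main__":
    for read_fraction in (0.9, 0.5, 0.2):
        load = storage_load(read_fraction, hit_rate=0.8)
        print(f"reads {read_fraction:.0%} of ops: {load:.0%} of ops reach storage")
```

In this model a read-heavy workload sees most of its traffic absorbed by the cache, while a write-heavy one barely benefits even with the same hit rate.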

Does it make sense for us to start sharing design patterns for scale in the life sciences? Even in the world of the web and other high scale industries, those design patterns are not well understood, but I think the challenges in the life science world are a little greater since we typically try and make do without people who understand scale and systems, with a few notable exceptions.

Posted in BioIT, Computing

Supercomputing Masterclass – A request for information

I have been invited to give a Masterworks talk on Data Challenges in Genomics for Supercomputing 09. I would like to dive into the details of the technical and scientific challenges of high throughput genomics, from microarrays to next gen sequencing and beyond, and how we need to manage these data more efficiently. While part of my talk will be about my day job, I want it to be informed by the challenges we face today and will face tomorrow as a scientific community. So to try and capture many of these challenges and gather facts and information, I have started a wiki page which I have made public. Please feel free to add to that page with ideas and topics that interest you. To do so, you will have to log in as user sc09 with password computing. I request that you add your name to any major input. If I can figure out an alternative authentication mechanism, I will update this post.

Posted in BioIT, Computing, Event, Informatics, Omics