Some of the important points that spring out for me: the percentage of random access is increasing; for those accesses that are sequential, the runs are longer; file sizes are increasing, data is getting colder; file lifetimes are increasing; and client usage has very high skew.
Those patterns sound a lot like some of the patterns I have seen in the life sciences recently, especially as we have to handle increasingly larger data volumes, which have varying levels of access patterns and usage. Seeing some of the data challenges that people close to home have been seeing, esp significantly higher write to read ratios, which makes caching of limited use, makes one realize that the scale challenges aren’t always the same as the ones you typically see on the web. The study authors actually make a conclusion that since metadata is accessed far more regularly, larger metadata caches are beneficial. Again, a typical access pattern for a lot of ‘omics’ data.
Does it make sense for us to start sharing design patterns for scale in the life sciences? Even in the world of the web and other high scale industries, those design patterns are not well understood, but I think the challenges in the life science world are a little greater since we typically try and make do without people who understand scale and systems, with a few notable exceptions.
Write heavy file system workloads
In a blog post last year James Hamilton wrote about workloads in large scale network file systems. In his summary about of study on the subject he writes
Those patterns sound a lot like some of the patterns I have seen in the life sciences recently, especially as we have to handle increasingly larger data volumes, which have varying levels of access patterns and usage. Seeing some of the data challenges that people close to home have been seeing, esp significantly higher write to read ratios, which makes caching of limited use, makes one realize that the scale challenges aren’t always the same as the ones you typically see on the web. The study authors actually make a conclusion that since metadata is accessed far more regularly, larger metadata caches are beneficial. Again, a typical access pattern for a lot of ‘omics’ data.
Does it make sense for us to start sharing design patterns for scale in the life sciences? Even in the world of the web and other high scale industries, those design patterns are not well understood, but I think the challenges in the life science world are a little greater since we typically try and make do without people who understand scale and systems, with a few notable exceptions.