Making sense of all that data: Integrating and extracting information from dataspaces
I have written previously about Trendingtopics.org as a reference site for data analytics using Hadoop and Hive.
Pete Skomoroch, who developed the site, has written a great follow-up article on the Cloudera blog that anyone in the scientific informatics space needs to read. Those same people need to read Chapter 5 in Beautiful Data. There Jeff Hammerbacher writes about the role of the Data Scientist at Facebook.
So why should people in the scientific community read those two resources? Big Data is now a fact of life, in science and elsewhere. Our systems, from both the infrastructure and the computation standpoint, were written for data sets of a different size. At the same time, we are moving towards a world where the need to combine data resources is greater than ever. For example, as I talk about in Science Big, Science Connected, we need to move away from looking at data from the point of view of the technology (gene expression or genetics or some other technique) and towards a more holistic view. I want to know everything about my subject, and there are data sets available across technologies, with variable quality and information content, that need to be mashed up together. Many organizations that seek to do that have built large data warehouses, or at least have tried to, with varying degrees of success. What happens when you start collecting data that’s more diverse and even higher throughput? Data interrogation and data modeling become even more critical, an essential part of the discovery and development process. If we start bringing pharmacovigilance into play, we add even further complexity to our data.
To appropriately interrogate these large volumes of data we need to think along the lines that Jeff talks about in his chapter. To my knowledge no life science company is there yet, although some are beginning to move in that direction. We need to build systems that listen in to data streams, e.g. Resolver or GeneSpring for expression data, or some system for assay or tox data. These data can then be sent to a Hadoop cluster (it doesn’t have to be Hadoop, but it’s a system that will grow due to the ecosystem around it), which does the processing, transformation and extraction. The results can then either be fed into a dedicated warehouse for offline querying, or you could use HDFS itself and higher-level abstractions like Hive or Pig to query that data.
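To make the processing step concrete, here is a minimal sketch of the map/shuffle/reduce pattern that a Hadoop job follows, written in plain Python so it runs without a cluster. It is an illustration only, not how any particular system does it: the record fields (`source`, `gene_id`, `value`), the sample records, and all function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical records, as they might arrive from different source systems.
# All field names and values are invented for illustration.
RECORDS = [
    {"source": "expression", "gene_id": "G1", "value": 2.5},
    {"source": "expression", "gene_id": "G2", "value": 0.5},
    {"source": "assay", "gene_id": "G1", "value": 7.0},
    {"source": "tox", "gene_id": "G2", "value": 1.0},
    {"source": "expression", "gene_id": "G1", "value": 3.5},
]


def map_record(record):
    """Map step: emit one ((gene_id, source), value) pair per record."""
    yield (record["gene_id"], record["source"]), record["value"]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_values(key, values):
    """Reduce step: summarize all values seen for one key."""
    return key, {"n": len(values), "mean": sum(values) / len(values)}


def run_job(records):
    """Run map, shuffle, and reduce in-process over an iterable of records."""
    pairs = (pair for record in records for pair in map_record(record))
    return dict(reduce_values(k, v) for k, v in shuffle(pairs).items())


summary = run_job(RECORDS)
for key in sorted(summary):
    print(key, summary[key])
```

The resulting summary table is the kind of thing you would either load into a warehouse or leave in place and query through a higher-level layer. On a real cluster the map and reduce functions would run in parallel across many machines, but their shape stays the same.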
In the language of the chapter, we go from a dataspace to a database. Each data type is its own dataspace. I like the concept of the dataspace, especially in the life sciences, where the technologies generating the data and the data types themselves evolve so rapidly. How do we get from these dataspaces to meaningful scientific information and knowledge?
Which brings us to Pete’s blog post. It’s a great, in-depth tutorial on building a data-intensive web application and, as I mentioned in my previous write-up, serves as a reference architecture for how you can use Hadoop and Hive to build such a system. The way I see it, informaticians are the epitome of the Data Scientist as Jeff describes them. We have access to some of the most challenging, yet important and fun, data problems out there. The web community has done some great work in recent years on developing methods that are robust, scalable, and, importantly, open source. Not only can we use them, we can also contribute back enhancements designed to support the challenges we face as a community. I’d love to see Pig and similar tools adapted, or science-aware data flow languages (e.g. one that is gene ID aware) developed as well.
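As a rough idea of what "gene ID aware" could mean in a data flow step, the sketch below resolves source-specific identifiers to one canonical ID before grouping, so rows keyed by different aliases end up together. Everything here is hypothetical: the alias table, the identifiers, and the function names are invented for illustration and do not correspond to any existing Pig or Hive feature.

```python
from collections import defaultdict

# Hypothetical alias table: source-specific identifiers -> canonical gene ID.
ALIASES = {
    "probe_001": "GENE_A",
    "gene_a": "GENE_A",
    "probe_002": "GENE_B",
}


def canonical_gene_id(raw_id, aliases=ALIASES):
    """Resolve a raw identifier to its canonical gene ID, or None if unknown."""
    if raw_id in aliases.values():
        return raw_id  # already canonical
    # Try an exact match first, then a case-insensitive one.
    return aliases.get(raw_id) or aliases.get(raw_id.lower())


def group_by_gene(rows):
    """Gene-aware grouping: rows keyed by different aliases share one group."""
    groups = defaultdict(list)
    unresolved = []
    for raw_id, payload in rows:
        gene = canonical_gene_id(raw_id)
        if gene is None:
            unresolved.append((raw_id, payload))  # keep for later curation
        else:
            groups[gene].append(payload)
    return dict(groups), unresolved


rows = [
    ("probe_001", "expression: up"),
    ("Gene_A", "assay: active"),
    ("GENE_B", "tox: flagged"),
    ("mystery_id", "assay: inactive"),
]
groups, unresolved = group_by_gene(rows)
print(groups)
print(unresolved)
```

Keeping unresolved identifiers in a separate list rather than dropping them is a deliberate choice in this sketch, since identifier mappings in the life sciences change often and unmatched rows are usually worth a second look.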
You know what I think breaks down? Workflow and pipelining systems. As we get more programmers in the life sciences/informatics space who use existing frameworks to develop and deploy algorithms, and as messaging infrastructures see more use, the market for pipelining and workflow software seems limited to me, at least in the long run. In the past, these systems were great because they allowed part-time programmers to develop protocols and experts to focus on developing complex workflows. As we move towards a world, one that Neil will love, where informaticians don’t spend their time just doing data munging, I am not so sure these systems have as much value unless we think of them strictly as containers, i.e. middleware to deploy data analysis pipelines.
Enough overlap with the day job, so please see this disclaimer.