Let us say you are a researcher doing a gene expression study on some tissue. Today, the chances are that you will run some microarrays, look at the expression profile, and then try to correlate the expression profiles of a number of samples with associated data.
Fast forward a few years. I am convinced that a lot of such data will be available via search engines or data portals. Already you are beginning to see a number of commercial and public engines come to life (NextBio, Oncomine, etc.). Earlier this week I read an announcement (subscription required) by the NCI to create a Cancer Molecular Analysis Portal, which will integrate data sets from the Cancer Genome Atlas project and other cancer genomics studies.
The key here is that we already have a body of work using microarrays and other molecular profiling systems, and in many cases, people are just repeating experiments which someone, somewhere has already carried out. Unless there is something inherently proprietary in those studies (e.g. specific dose-response studies), there is no reason to repeat that experiment, especially for technologies that are relatively stable and don’t have too much cross-platform/cross-lab variation (one of the goals of the MAQC projects has been to understand these variations). The second key, and perhaps the more important one, is how these data are made available. Personally, I really like the NextBio interface. Will the business model work? I am not sure, but the idea and concept definitely make a lot of sense.
It’s a sign of maturity in many ways, accelerated by the way the web has advanced in the past few years. If we trust data not generated internally enough to make key decisions based on it, then a scenario where data and analysis results are served up via web services, allowing users to mash up different sources, including internal ones, and develop relevant scientific intelligence, is a distinct possibility. Personally, I would like to think that the value and the user's expertise come from how they integrate all these resources in a manner that makes the result a unique asset to the user, i.e. the value of the results comes from the way the data are brought together and not from any individual data source.




5 Comments
I fully agree with you. However, such databases, listed below, are already available for the scientific community –
GEO – http://www.ncbi.nlm.nih.gov/geo/index.cgi?qb=pro
ArrayExpress – http://www.ebi.ac.uk/microarray-as/ae/
Phenogen Informatics – http://phenogen.uchsc.edu/
Stanford Microarray Database – http://genome-www5.stanford.edu/
Users can select any number of experiments/samples to carry out the meta-analysis using the available tools.
Indeed, and growing, but most companies still run their own arrays and integrate public information as required. The future will be the opposite: you run your own arrays as required, but primarily use public (or private) resources. Additionally, these won't just be databases, but web services, where the combination of resources and how they are mashed up will be the key value.
While I generally agree with your thoughts, there are a number of issues with simply mashing up different gene expression data sources: platform, sample quality, normalization, reference and annotation are just a few that I can come up with off the top of my head.
I've been involved with a microarray consortium project that is one of a small number of studies (that I'm aware of) that included a technical reference sample in each array processing batch to align the data. Alignment is a HUGE problem with expression data. We further performed extensive statistical analysis to identify potential outlier arrays prior to data analysis. Most gene expression profiling studies don't take these quality control steps. I think pooling data sets that aren't preprocessed in a similar manner is risky.
In my experience, it's easy to show statistical significance with microarray data, especially with large data sets. However, statistical significance doesn't mean it's biologically meaningful.
Walter, very real concerns, and the kinds my previous company spent a lot of its time thinking about.
Like most data types, microarray results will become more commoditized, i.e. the reliability and alignment issues will get resolved. The reference, annotation, and similar information will become metadata that accompanies your data sources. It's a question of making sure all the experimental details are included with the results being provided. The key is that there will be a body of work available that should make it less necessary to do your own microarray experiments except when absolutely necessary.
nice article! nice site. you're in my rss feed now
keep it up