Great post by Sandra Porter on GenBank as an information resource. Over the years the NCBI’s database resources have mushroomed, but they continue to be generally regarded as the primary resources for genes and gene-related information (along with the EBI’s resources).
I’m going to focus on one question: centralization and data quality. In general, a central repository that accepts deposits should have the following:
A minimal set of criteria that need to be met prior to deposition
A robust version control system, which lets you trace changes back through time and lets scientists compare their results as the underlying genomic information changes
APIs that are used both internally and externally, that provide consistent programmatic access to the underlying data model
Come to think of it, a good data model
Powerful search capabilities
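To make the first two items a little more concrete, here is a minimal sketch of a repository that checks a record against a set of deposition criteria and keeps every version so changes can be traced. Everything in it is an assumption for illustration: the `Repository` class, the `REQUIRED_FIELDS` list, and the record fields are hypothetical and do not describe how GenBank or any real repository works.

```python
# Hypothetical minimal deposition criteria; a real repository defines its own.
REQUIRED_FIELDS = ("organism", "sequence", "submitter")


class Repository:
    """Toy in-memory store that validates deposits and keeps every version."""

    def __init__(self):
        # accession -> list of record versions, oldest first
        self._history = {}

    def deposit(self, accession, record):
        """Validate a record, store it as the next version, return its number."""
        missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        versions = self._history.setdefault(accession, [])
        versions.append(dict(record))  # copy so later edits cannot alter history
        return len(versions)

    def get(self, accession, version=None):
        """Return a 1-based version, or the latest one when version is None."""
        versions = self._history[accession]
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise KeyError(f"{accession} has no version {version}")
        return versions[version - 1]

    def diff(self, accession, old, new):
        """Return {field: (old_value, new_value)} for every field that changed."""
        a = self.get(accession, old)
        b = self.get(accession, new)
        return {
            key: (a.get(key), b.get(key))
            for key in sorted(set(a) | set(b))
            if a.get(key) != b.get(key)
        }
```

Depositing the same accession twice and calling `diff(accession, 1, 2)` reports which fields changed between the two versions, which is the kind of traceability the version control item is asking for.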
But those are not necessarily sufficient. Part of the problem with NCBI is the UI, which I’ve never liked: too many databases and somewhat confusing interoperability. There are also a couple of fundamental philosophical questions:
Should the community be allowed to edit GenBank? To me, many of the efforts to develop WikiProteins, SNPedia, the Genes Wiki project, etc. reflect the community’s feeling that it wants to be able to make such changes.
Is centralization a good thing? Personally, the ideal situation for me would be for NCBI and EBI to join hands (probably not politically possible), perhaps via some form of non-profit foundation or consortium that essentially develops a platform and resources in collaboration with the community, with powerful APIs, and thinks about this problem not just as a scientific problem but as a distributed data problem. Sure, the data can be deposited in a central repository, but how the data are used, presented, etc. should not be a one-stop shop. Essentially, move from being a site to being a full-fledged service.
The problem with such consortia is that they tend not to work in practice, although there are examples that do. Distribution is more important than centralization. The question should not be “how do we store our data”, but “how do we enable scientists to do more with all the data that we have”. The “how do we store” then becomes one of many questions that need to be asked to answer the primary question.
The quality and reliability of biological data are paramount. Being able to access the data, both via search UIs and via web service APIs, is critical. We tend to think of these problems primarily from the scientist’s point of view, which we need to continue to do, although I am not sure how well we are doing. However, in developing and organizing these resources, I feel we need to bring in some gurus in distributed data and architectures. The data and information lying with NCBI, EBI, etc. have the potential to impact our future, and we are going to continue to see data volumes grow exponentially. Which is why I wonder, and hope, whether there will ever be something like a Wikipedia, or perhaps a Google (not just a search engine, but an organization that makes organizing and indexing life science information its mission), that will revolutionize what we do with all the genomes and other associated data at our disposal.
So I was talking to my girlfriend a while back about medical records... which are having a similar problem right now. You are stuck in a vendor situation... and people are trying to centralize the data (Microsoft / Google)...
The things that *should* be centralized are data formats and messages... so for example, no matter what medical record or genome database you talk to, you can send particular messages and expect that the data sent in and returned is well known...
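As a rough illustration of what a shared message format could look like, here is a minimal sketch of a JSON envelope that any store could emit and any client could check. The field names (`schema_version`, `type`, `id`, `payload`) and the version string are made up for this example; they are not taken from any existing medical-record or genome standard.

```python
import json

# Hypothetical version tag and envelope keys for the shared format.
SCHEMA_VERSION = "1.0"
REQUIRED_KEYS = {"schema_version", "type", "id", "payload"}


def encode_message(msg_type, record_id, payload):
    """Wrap a payload in the shared envelope and serialize it to JSON text."""
    envelope = {
        "schema_version": SCHEMA_VERSION,
        "type": msg_type,
        "id": record_id,
        "payload": payload,
    }
    return json.dumps(envelope, sort_keys=True)


def decode_message(text):
    """Parse JSON text and reject anything that is not a valid envelope."""
    message = json.loads(text)
    if not isinstance(message, dict):
        raise ValueError("message must be a JSON object")
    missing = REQUIRED_KEYS - message.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if message["schema_version"] != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema version: {message['schema_version']}")
    return message
```

The point of the sketch is that the client only depends on the envelope, not on who stores the record behind it.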
Who actually stores the data? I could care less... I just want things to be portable (both message passing and the data)...
I am with you. I could care less about the actual storage but, and it's a big but, there is something to be said for trust. Even I am likely to trust resource A more than resource B, all else being equal.
Regardless of where you choose to store your data, or parts of it, you need to be able to make your resources and your data talk to each other. That's probably going to be a big deal as these services mature, and they had better make sure their stuff is interoperable.
Hi Deepak, Barend Mons here (Wikiproteins). I know the name used in the Genome Biology paper may be a little confusing. During the review process we already moved to wikiprofessional, and what you present as a wish is at least what we are trying to make a reality (long way to go). One comment says that scientists will initially look down their noses. Yes, sure, but we already see quite a few registrants and at least the discussion is starting.... Well, I believe that separating (but interconnecting) curated and community data (the authoritative source principle) and making annotation immediately useful for the annotators themselves is one way forward. We need a strong discussion forum to improve on what we have, and to make this happen a group of enthusiasts should lock arms, I think. There is a lot of talk about wikiproteins and the paper is well viewed, but what we really need is constructive criticism like you give, not the (rare, I am happy to say) dismissive blurbs that do not get us anywhere. So, please keep the definition going and give us some directions on how to improve the GUI and content. We obviously think that the Knowlet approach can capture and present most of the really useful information from all databases (presently working on expression data), and the ability to drill down to the original source is there. We will be very open as a consortium to listen to people like you and make changes. That is what a good Wiki approach is all about: not just technology, but something that is really community-owned.
Thinking about biological resources
Technorati Tags: GenBank, Biological Information, Web Services, APIs, Data Management