Thinking about biological resources
June 20, 2008
Great post by Sandra Porter on GenBank as an information resource. Over the years the NCBI's database resources have mushroomed, but they continue to be generally regarded as the primary resources for genes and gene-related information (along with the EBI's resources).
I'm going to focus on one question: centralization and data quality. In general, a central repository that accepts deposits should have the following:
- A minimal set of criteria that need to be met prior to deposition
- A robust version control system, which allows you to trace changes back and lets scientists compare their results as the underlying genomic information changes
- APIs that are used both internally and externally, that provide consistent programmatic access to the underlying data model
- Come to think of it, a good data model
- Powerful search capabilities
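To make the first two points concrete, here is a minimal sketch of what deposition criteria plus version tracking could look like. Everything in it is an assumption for illustration: the class names, the `DEMO0001` accession, and the "only A, C, G, T or N" rule are hypothetical and are not how GenBank or any NCBI/EBI system is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RecordVersion:
    """One immutable revision of a deposited record (hypothetical model)."""
    accession: str
    version: int
    sequence: str
    note: str = ""


@dataclass
class VersionedRepository:
    """Toy append-only store: revisions are added, never overwritten."""
    _history: dict = field(default_factory=dict)

    def deposit(self, accession, sequence, note=""):
        # Minimal deposition criterion (an assumption for this sketch):
        # the sequence must be non-empty and use only A, C, G, T or N.
        if not sequence or set(sequence.upper()) - set("ACGTN"):
            raise ValueError("sequence fails minimal deposition criteria")
        revisions = self._history.setdefault(accession, [])
        record = RecordVersion(accession, len(revisions) + 1,
                               sequence.upper(), note)
        revisions.append(record)
        return record

    def get(self, accession, version=None):
        # Latest revision by default; any earlier revision on request.
        revisions = self._history[accession]
        if version is None:
            return revisions[-1]
        if not 1 <= version <= len(revisions):
            raise KeyError(f"{accession} has no version {version}")
        return revisions[version - 1]

    def changed_positions(self, accession, old, new):
        # 0-based positions at which two revisions differ, including
        # positions present in only one of them (a length change).
        a = self.get(accession, old).sequence
        b = self.get(accession, new).sequence
        diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
        diffs.extend(range(min(len(a), len(b)), max(len(a), len(b))))
        return diffs


repo = VersionedRepository()
repo.deposit("DEMO0001", "ACGTAC")
repo.deposit("DEMO0001", "ACGTTCA", note="corrected one base, extended")
print(repo.changed_positions("DEMO0001", 1, 2))  # [4, 6]
```

The point of the append-only design is that an analysis run against version 1 can always be reproduced and compared with version 2, rather than silently breaking when the record changes.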
But those are not necessarily sufficient. Part of the problem with NCBI is the UI, which I've never liked: there are too many databases, and the interoperability between them is somewhat confusing. There are also a couple of fundamental philosophical questions:
- Should the community be allowed to edit GenBank? To me, many of the efforts to develop WikiProteins, SNPedia, the Genes Wiki project, etc. reflect the community's desire to be able to make such changes
- Is centralization a good thing? Personally, the ideal situation for me would be for NCBI and EBI to join hands (probably not politically possible), perhaps via some form of non-profit foundation or consortium that develops a platform and resources in collaboration with the community, provides powerful APIs, and treats this not just as a scientific problem but as a distributed data problem. Sure, the data can be deposited in a central repository, but how the data are used, presented, etc. should not be a one-stop shop. Essentially, move from being a site to being a full-fledged service
The problem with the latter is that such consortia tend not to work in practice, although there are examples that do. Distribution is more important than centralization. The question should not be "how do we store our data", but "how do we enable scientists to do more with all the data that we have". The "how do we store" question then becomes one of many that need to be asked in order to answer the primary question.
The quality and reliability of biological data are paramount. Being able to access the data, both via search UIs and via web service APIs, is critical. We tend to think of these problems primarily from the scientist's point of view, which we need to continue to do, although I am not sure how well we are doing. However, in developing and organizing these resources, I feel we need to bring in some gurus in distributed data and architectures. The data and information held by NCBI, EBI, etc. have the potential to impact our future, and we are going to continue to see data volumes grow exponentially. That is why I wonder, and hope, whether there will ever be something like a Wikipedia, or perhaps a Google (not just a search engine, but an organization whose mission is organizing and indexing life science information), that will revolutionize what we do with all the genomes and other associated data at our disposal.
Technorati Tags: GenBank, Biological Information, Web Services, APIs, Data Management



