I think my bullishness for FriendFeed just went up a notch after reading Bret Taylor’s blog. For those who don’t know, Bret is one of the co-founders of FriendFeed and an ex-Googler. The other day he started his first blog, and guess how he did so. His last project at Google was App Engine, and his first project after App Engine was released was to develop a blog platform deployed there (hence the appspot.com address). Anyway, apart from being impressed by his coding skills and experience, I was equally intrigued by his latest post, We need a Wikipedia for data. In it, Bret writes (all emphasis mine):
I think all of these barriers to data are holding back innovation at a scale that few people realize. The most important part of an environment that encourages innovation is low barriers to entry. The moment a contract and lawyers are involved, you inherently restrict the set of people who can work on a problem to well-funded companies with a profitable product. Likewise, companies that sell data have to protect their investments, so permitted uses for the data are almost always explicitly enumerated in contracts. The entire system is designed to restrict the data to be used in product categories that already exist.
He continues:
The interesting thing is, almost every internet company would benefit if this data were freely available. Most internet companies have embraced open source operating systems because every company needs an operating system, and no company wants their OS to be a competitive advantage – they just want it to work. I would argue we are all in the same boat with these factual data sources. No one really wants factual data accuracy and completeness to be their competitive advantage; we all want the best data possible to build the best products possible, and discrepancies in data quality are artifacts of the extremely inefficient economy of buying and selling data we currently live in. If everyone had the same, high quality data, all of our products would be better for it.
I could end this post here and say “I rest my case”, but there is one area where I differ. He argues that we should have a Wikipedia for data, a global database of data sources that anyone can use. I disagree. I believe that we should have a web of data, to be precise, linked data, with each data point and data set an addressable resource. In the comments, some people, including me, mention Freebase and DBpedia. There is also Swivel, where you can upload datasets. Fellow scifoo Aaron Swartz has theinfo.org, a resource for really large datasets. That’s all fine, but do we really want a centralized repository of data? Shouldn’t genomic data stay in GenBank and structural data stay at the PDB? If instead we just made our data open, in formats that can be slurped up and used for the kinds of innovations that Bret talks about, that would be the ideal situation. Then we could use the data our way, act upon it, apply algorithms, etc.
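To make "each data point an addressable resource" a little more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the URIs are made-up placeholders under example.org (not real GenBank or PDB addresses), the `WEB` dictionary stands in for the actual web, and `dereference` and `follow` are hypothetical helper names. The point is simply that each record stays where it lives and records point at each other by address.

```python
# A toy, in-memory "web of data". Every URI below is a made-up placeholder
# under example.org, not a real GenBank or PDB address.
WEB = {
    "http://example.org/genbank/gene/1": {
        "type": "gene",
        "name": "hypothetical gene",
        # A link to a record that lives in a different (hypothetical) database.
        "encodes": "http://example.org/pdb/structure/1",
    },
    "http://example.org/pdb/structure/1": {
        "type": "structure",
        "name": "hypothetical structure",
    },
}


def dereference(uri):
    """Look up the record that lives at a URI (stands in for an HTTP GET)."""
    return WEB[uri]


def follow(uri, link):
    """Follow a named link from one record to the record it points at."""
    target_uri = dereference(uri)[link]
    return dereference(target_uri)


# Start at the gene record and hop to the structure it links to.
structure = follow("http://example.org/genbank/gene/1", "encodes")
print(structure["name"])
```

Nothing here requires the two records to sit in the same repository. All a consumer needs is the address and an agreed way to read what comes back.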
What do you think? Do we need a Wikipedia of data? Or do you think that the web itself should be our open data commons?
Further Reading
Using data for better results
Data should be set free
The value of information
Technorati Tags: Open Data, Linked Data, Data Commons




7 Comments
What about http://www.numberzoom.com/ for reverse phone number lookup wiki and caller ID listings for known telemarketers and collection agencies?
I've started a project called http://infochimps.org/ as exactly this kind of wikipedia for data — where wikipedia will tell you something about everything, we can help you find everything about something.
It doesn't matter where the download finally comes from. What matters is that the data be centrally discoverable, and more importantly that the data be connected fluidly to the real-world concepts and across datasets and knowledge domains. The genomic data will always come from GenBank, and the baseball data will always come from Retrosheet.org, and the Lunar Phase and Solar Eclipse data will always come from NASA. What you need is a large, community effort to curate this data; to interconnect it; and to present a uniform metadata facade.
You're correct in the long term, but we're just not there yet. It would be great if all this data were offered in open formats and with clear metadata. But the census department can't even tell you how to load their summary files into a modern database program; the different government agencies seem to engage in a bizarre competition for “who can implement the most novel and opaque file format.” For the time being there's going to be a lot more taking than asking, which is why there's centralization of data as well as metadata.
But eventually, people will start building tools on top of this — tools that don't just understand “float” and “string” but know “lat/long, a type of location” or “ISBN number” — and demanding that the data match. Once this consensus evolves, the “Wikipedia for data” will be a discovery and interconnection mechanism, and not a distribution one.
Certainly, if there is an open database supporting a particular file format (like PDB), then it should be used. But many file formats don't have such databases available. The key to Open Science is to get those raw data points in SOMEWHERE right away. That way others can make use of the information. Down the road, as new services come online, we can use them as redundant repositories.
MrFlip … thanks. I hadn't heard of infochimps before. It's good to see sites dedicated to datasets coming out. I don't disagree on the near-term need as you describe it, but one can always hope and dream.
As a life scientist, I am all too familiar with opaque formats, especially in commercial software. You should look up LSIDs. Those are the kind of addressable identifiers that should become a lot more common over time.
I agree with the idea of making the web itself one huge database. Here is how I view this issue.
When I was discussing the problems with Wikipedia itself, I argued that we would be better served by a huge collection of vertipedias (vertical wikis specializing in narrow fields) instead of one centralized repository of knowledge. Such specialized vertical wikipedias will have much higher levels of accuracy because the participants will be specialists in the niche area.
I would apply similar logic when it comes to warehouses of data. The web should serve as a single database which, in turn, sources from specialized 'mini databases'. Even though I agree with Bret's idea of open data, which has been my belief for a long time, I think I disagree with his idea of how it should be implemented.
2 Trackbacks
[...] Deepak Singh: Web as platform: Bret Taylor on Open Data [...]
[...] from John Cameron Neylon Egon Willighaghen More from Egon Web as platform: Bret Taylor on Open Data Open Science and licensing Protocol for implementing open access data bbgm post on protocol for [...]