The Open Data licensing issue
May 11, 2008
A little tied up this weekend, so will keep it brief. I have added a number of comments on Friendfeed to posts I have shared from Google Reader about what the licensing of data should be.
The whole thing started with Antony Williams announcing CC support for data on ChemSpider. That was followed by a chain of events and a ton of confusion. Let me add my voice to this debate, since Open Data is near and dear to my heart.
I classify scientific data into the following categories:
- Raw data: This is the kind of data deposited in Tranche, or RCSB, or GenBank. Sequence data, structural data, raw proteomics data. There are associated metadata that are required for quality and reproducibility.
- Processed data: These are the results of doing something with the raw data, e.g. molecular simulation results from a PDB structure, and they form a continuum.
I can’t help but agree with John Wilbanks. Here is the part that all of us should read again and again:
The public domain is not an “unlicensed commons”. The public domain does not equal the BSD. It is not a licensing option.
It is the natural legal state of data.
It is a damn shame that we no longer think of the public domain as an option that is attractive. It’s a sign of the victory of the content holders that the free licensing movements work against that something without a license – something that is truly free, not just free “as in” – is somehow thought to be worse. We’ve bought into their games if we allow the public domain to be defined as the BSD. The idea of the public domain has been subjected to continuous erosion thanks to both the big content companies and our own movements, to the point where we think freedom only comes in a contract.
The public domain is not contractually constructed. It just is. It cannot be made more free, only less free. And if we start a culture of licensing and enclosing the public domain (stuff that is actually already free, like the human genome) in the name of “freedom” we’re playing a dangerous game.
The public domain is the natural place for raw scientific data. That’s where it belongs and always has been. We, myself included, have been guilty of making things more complicated than they need to be. There is a data commons already. Our goal should be to make sure people respect it, and make data available in ways that we can take advantage of it.
Our discussion on content licensing should be limited to processed data, i.e. what we do with data in the public domain. There, we need to allow people to make choices, but keep the raw data unfettered. Those who want to associate copyleft licenses with raw data are being dogmatic. Scientific data doesn’t have to be viral or anything like that; it’s there for the greater scientific good, and there’s only one logical mechanism for it. In fact, I would argue that putting copyleft on it (a sequenced genome doesn’t belong to anyone) is as wrong as full-on copy protection. You may have some embargo on making it publicly available, especially with things like structures where you might want to do something with it before anyone else, but in the end the data belong in the public domain.
I would like to thank John for putting this down so emphatically and clearly. A lot of us have been saying the same thing for a while, but this is the clearest distillation that I’ve read yet.
That does not mean we don’t have to have a discussion around how we make content (not raw data, but follow-on content) available and the implications. Antony was confused for good reason.
Further reading
More from John
Cameron Neylon
Egon Willighagen
More from Egon
Bill Hooker
Web as platform: Bret Taylor on Open Data
Open Science and licensing
Protocol for implementing open access data
bbgm post on protocol for open data
Programming HPC for the domain
May 11, 2008
At Accelrys, a lot of the software I managed was in-licensed from academia. That approach allowed the company to tap into the intellectual resources of some of the smartest academic researchers in the world, but it also created problems. One was the difference in software development practices: some of the academic code barely had version control. But that’s the obvious one. In a new post at Computing at Scale, Bill McColl writes about Domain-specific parallel programming. Translating code parallelized for an academic setting, often under the assumption that huge clusters might be available, to an industrial setting where scaling and fault tolerance become critical, where resource availability varies widely, and where speed matters, is always a challenge. This is especially true when you’re trying to shrink-wrap software and build interactive interfaces.
So in an era with more scale available, clouds to tap into, accelerators, and new data and distribution models, are we going to see a shift? I still feel that the underlying scientific research has to come from academia. They have the resources, time and incentive to do so, but I think industrial think tanks and expertise can contribute back by working with academia on advanced problems of relevance, e.g. in the area of computing. Will we, as a scientific community, tap into some of the new domain-specific development being done today? It can’t be done by one side or the other; rather, we need to identify approaches as a community and understand what works best, without duplicating efforts. Of course, we need people who understand these new methods and paradigms to implement them.
There will always be a tension between academic research efforts and commercial need. In the life sciences it is especially tough, from an economic point of view, for industry-specific apps to be developed, which is why I believe it will have to be a joint effort.
Your thoughts?
Technorati Tags: High Performance Computing, Parallel Programming
Around the Web - May 10, 2008
May 10, 2008
Linkfest
- NASA workshop on massively parallel supercomputers
- Aviary - I am out of accounts, but this is sweet
- NY Times - Pursuing the next level of AI
- McKinsey surveys the new software landscape
- Yahoo Design Pattern library
- hackystat - “A framework for collection, analysis, visualization, interpretation, annotation, and dissemination of software development process and product data”
- From the NY Times’ brilliant OSS blog - dbslayer (github repository)
- Erlang vs. MPI
Multimedia & Presentations
- Andrew’s presentation from XTech
- Is it time to throw away your servers
- Abstractions for handling large datasets
Blogspotting
- Greg Linden - This one is from the archives and is for all of you interested in computer science, personalized search, etc.
Self Assembly
Once again, life is very hectic, so not much to report. Follow me on Friendfeed or Twitter, or check out the Tumblelog, where I have been putting up some cool stuff lately.
Gamers, get your folding on
May 9, 2008
Technology Review was the first place I saw it, then someone put it up on Friendfeed and now Andrew Perry has a great post on Foldit. Foldit comes out of the lab of a bbgm favorite, David Baker, right here at the University of Washington.
Foldit combines gaming with protein structure prediction. It’s an interesting approach to spreading scientific problems. Folding@home built upon the success of SETI@home and the geek cred of running on gaming consoles, and has built quite a following. Will Foldit, which presents a simple, fun interface to get people interested in protein structure (and the existence of Folding@home makes this somewhat familiar to geeks everywhere), be an example of how we can leverage crowdsourcing? Andrew makes some interesting points (which I agree with) on weighting crowdsourcing. That’s always a hard thing to do, but I’d like to see karma, etc. come into play here.
It’s good to see protein structure getting some attention and the field continuing to be creative. It’s always been my favorite scientific subject. The field lends itself to “pretty pictures”, so getting non-experts involved is a possibility.
The site and server have had connectivity issues ever since I started trying, so perhaps they need help with web resources, because lots of people seem to be interested.
Here is a list of people supporting the project: UW Animation Research Labs, UW Baker Lab, DARPA, Microsoft, and Adobe. Nice list.
Technorati Tags: David Baker, Foldit, Protein Folding, Protein Structure Prediction, Gaming, Crowdsourcing
Does anyone have a clue who this could be?
May 7, 2008
You don’t get to see job descriptions like this too often in the life sciences. Have to love the “What you get” section.
What does the job description tell us? It’s a web-based, consumer-focused company with a focus on healthcare and an informatics backend. It comes out of Stanford and has a Nobel prize winner advising it, which sounds very much like Andy Fire (based on the Stanford angle).
Let’s start the speculation.
Guess where I found this position? By tracking ‘bioinformatics’ on Twitter.
Technorati Tags: Andy Fire, Healthcare, Stealth Startup, Stanford, Xooglers





