The Open Data licensing issue

May 11, 2008

A little tied up this weekend, so will keep it brief. I have added a number of comments on Friendfeed to posts I have shared from Google Reader about what the licensing of data should be.

The whole thing started with Antony Williams announcing CC support for data on ChemSpider. That was followed by a chain of events and a ton of confusion. Let me add my voice to this debate, since Open Data is near and dear to my heart.

I classify scientific data into the following categories:

  • Raw data: This is the kind of data deposited in Tranche, or RCSB, or GenBank. Sequence data, structural data, raw proteomics data. There are associated metadata that are required for quality and reproducibility.
  • Processed data: These are the results of doing something with the raw data, e.g. molecular simulation results derived from a PDB structure. These form a continuum.

I can’t help but agree with John Wilbanks. Here is the part that all of us should read again and again:

The public domain is not an “unlicensed commons”. The public domain does not equal the BSD. It is not a licensing option.

It is the natural legal state of data.

It is a damn shame that we no longer think of the public domain as an option that is attractive. It’s a sign of the victory of the content holders that the free licensing movements work against that something without a license – something that is truly free, not just free “as in” – is somehow thought to be worse. We’ve bought into their games if we allow the public domain to be defined as the BSD. The idea of the public domain has been subjected to continuous erosion thanks to both the big content companies and our own movements, to the point where we think freedom only comes in a contract.

The public domain is not contractually constructed. It just is. It cannot be made more free, only less free. And if we start a culture of licensing and enclosing the public domain (stuff that is actually already free, like the human genome) in the name of “freedom” we’re playing a dangerous game.

The public domain is the natural place for raw scientific data. That’s where it belongs and always has been. We, myself included, have been guilty of making things more complicated than they need to be. There is a data commons already. Our goal should be to make sure people respect it, and make data available in ways that we can take advantage of it.

Our discussion on content licensing should be limited to processed data, i.e. what we do with data in the public domain. There, we need to allow people to make choices, but keep the raw data unfettered. Those who want to associate copyleft licenses with raw data are being dogmatic. Scientific data doesn’t have to be viral or anything like that; it’s there for the greater scientific good, and there’s only one logical mechanism for it. In fact, I would argue that putting copyleft on it (a sequenced genome doesn’t belong to anyone) is as wrong as full-on copy protection. You may have some embargo on making it publicly available, especially with things like structures where you might want to do something with it before anyone else, but in the end the data belong in the public domain.

I would like to thank John for putting this down so emphatically and clearly. A lot of us have been saying the same thing for a while, but this is the clearest distillation that I’ve read yet.

That does not mean we don’t have to have a discussion around how we make content (not raw data, but follow-on content) available, and what the implications are. Antony was confused for good reason.

Further reading

More from John
Cameron Neylon
Egon Willighagen
More from Egon
Bill Hooker
Web as platform: Bret Taylor on Open Data
Open Science and licensing
Protocol for implementing open access data
bbgm post on protocol for open data

Programming HPC for the domain

May 11, 2008

At Accelrys, a lot of the software I managed was in-licensed from academia. That approach allowed the company to tap into the intellectual resources of some of the smartest academic researchers in the world, but it also created problems. One was the difference in software development practices: some of the academic code barely had version control. But that’s the obvious one. In a new post at Computing at Scale, Bill McColl writes about Domain-specific parallel programming. Translating code parallelized for an academic setting, often under the assumption that huge clusters might be available, to an industrial setting, where scaling and fault tolerance become critical, resource availability varies widely, and speed matters, is always a challenge. This is especially true when you’re trying to shrink-wrap software and build interactive interfaces.

So in an era with more scale available, clouds to tap into, accelerators, and new data and distribution models, are we going to see a shift? I still feel that the underlying scientific research has to come from academia. They have the resources, time and incentive to do so, but I think industrial think tanks and expertise can contribute back by working with academia on advanced problems of relevance, e.g. in the area of computing. Will we, as a scientific community, tap into some of the new domain-specific development being done today? It can’t be done by one side or the other; rather, we need to identify approaches as a community and understand what works best, without duplicating efforts. Of course, we need people who understand these new methods and paradigms to implement them.

There will always be a tension between academic research efforts and commercial need. In the life sciences it is especially tough, from an economic point of view, for industry-specific apps to be developed, which is why I believe it will have to be a joint effort.

Your thoughts?


Gamers, get your folding on

May 9, 2008

Technology Review was the first place I saw it, then someone put it up on Friendfeed, and now Andrew Perry has a great post on Foldit. Foldit comes out of the lab of a bbgm favorite, David Baker, right here at the University of Washington.

Foldit combines gaming with protein structure prediction. It’s an interesting approach to spreading scientific problems to a wider audience. Folding@home built upon the success of SETI@home and the geek cred of running on gaming consoles, and has gained quite a following. Will Foldit, which presents a simple, fun interface to get people interested in protein structure (the existence of Folding@home makes this somewhat familiar to geeks everywhere), be an example of how we can leverage crowdsourcing? Andrew makes some interesting points (which I agree with) on weighting crowdsourced contributions. That’s always a hard thing to do, but I’d like to see karma, etc. come into play here.

It’s good to see protein structure getting some attention and the field continuing to be creative. It’s always been my favorite scientific subject. The field lends itself to “pretty pictures”, so getting non-experts involved is a possibility.

The site and server have had connectivity issues whenever I’ve tried, so perhaps they need help with web resources, because lots of people seem to be interested.

Here is a list of people supporting the project: UW Animation Research Labs, UW Baker Lab, DARPA, Microsoft, and Adobe. Nice list.


Harvard Law faculty votes for ‘open access’ to scholarly articles

May 7, 2008

From an email I received earlier today. I would not normally pay this much attention, but this is the Berkman Center, and Open Access is always a good thing.

Good afternoon,

The Berkman Center for Internet & Society is pleased to announce that the faculty of Harvard Law School has unanimously approved a motion for open access: articles will be made freely available in an online repository. With the success of this motion, Harvard Law becomes the first law school to make an institutional commitment to open access to its faculty’s scholarly publications.

In February, Harvard University’s Faculty of Arts and Sciences unanimously passed an open access motion spearheaded by computer science professor and Berkman faculty co-director Stuart Shieber. Professor Shieber’s work and leadership, along with that of Harvard library director Robert Darnton, paved the way for Berkman faculty director William Fisher and executive director John Palfrey to bring an open access proposal to Harvard Law School.

The Berkman community is tremendously proud and excited about the success of these important initiatives.

The full release from Harvard Law School can be found online at http://www.law.harvard.edu/news/2008/05/07_openaccess.php.
Contact: Harvard Law School Office of Communications

The Berkman Center’s announcement, including a link to the full text of the open access motion, can be found online at http://cyber.law.harvard.edu/node/4273.


Discussion on business models around Open Data is building up

May 6, 2008

This post got deleted during a blog snafu. Reposting.

Many months ago, I started talking about the monetization of biological data, a theme that’s been present throughout the history of bbgm. In general, I have maintained that for the most part, the value lies not in the raw data, but in what we can do with the data. It looks like there is an interesting discussion brewing on the web around some of these ideas. Here are a couple of posts, I think in chronological order:

Peter Murray-Rust. The comment from Rich Apodaca is a must read. There is a follow up post from Antony Williams as well.

I will just reiterate a few generalizations, because I am only peripherally familiar with the specifics. On the web, data should be available as an addressable resource. The fact that data is available as RDF is great (and I wish more data was available as such). However, my personal preference is that data, especially open data, be accompanied by APIs that allow the data to be accessed in a number of formats (not a dump per se). I think over time the acceptable formats will be established. The key aspect here is the business model. Is the business in providing a service on top of the data? For example, for more than X number of API calls, there could be a fee associated.
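To make that a little more concrete, here is a minimal sketch in Python of what I have in mind: records addressed by an identifier, served in more than one format, with a free quota of calls before a fee kicks in. Everything in it is hypothetical. The record store, the `DataService` class, the quota of 3 calls and the fee of 0.01 per call are assumptions made up for illustration, and JSON and CSV are only stand-ins for whatever formats (RDF included) end up being the accepted ones.

```python
import csv
import io
import json

# Hypothetical record store: each record is addressable by an identifier.
RECORDS = {
    "mol-1": {"id": "mol-1", "name": "water", "formula": "H2O"},
    "mol-2": {"id": "mol-2", "name": "methane", "formula": "CH4"},
}

FREE_CALLS = 3        # assumed free quota per API key (the "X" in the text)
FEE_PER_CALL = 0.01   # assumed fee for each call beyond the free quota


def to_json(record):
    """Serialize one record as a JSON object with sorted keys."""
    return json.dumps(record, sort_keys=True)


def to_csv(record):
    """Serialize one record as a header row plus one data row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=sorted(record), lineterminator="\n")
    writer.writeheader()
    writer.writerow(record)
    return buffer.getvalue()


# Supported output formats; adding a new one means adding a serializer here.
SERIALIZERS = {"json": to_json, "csv": to_csv}


class DataService:
    """Serves individual records and counts successful calls per API key."""

    def __init__(self):
        self.calls = {}  # API key -> number of successful calls so far

    def get(self, api_key, record_id, fmt="json"):
        if fmt not in SERIALIZERS:
            raise ValueError(f"unsupported format: {fmt}")
        record = RECORDS[record_id]  # KeyError for an unknown identifier
        # Only calls that will succeed are counted against the quota.
        self.calls[api_key] = self.calls.get(api_key, 0) + 1
        return SERIALIZERS[fmt](record)

    def amount_due(self, api_key):
        """Fee owed for calls made beyond the free quota."""
        billable = max(0, self.calls.get(api_key, 0) - FREE_CALLS)
        return round(billable * FEE_PER_CALL, 2)
```

A client would call something like `DataService().get("my-key", "mol-1", fmt="csv")`. The point of the sketch is that the data itself stays freely addressable, and the metering sits in the service layered on top of it.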

These business models are going to be the key. Just as Open Source has found business models, as have some web services, Open Data needs models that allow people to build upon it.
