The Open Data licensing issue

May 11, 2008

A little tied up this weekend, so will keep it brief. I have added a number of comments on Friendfeed to posts I have shared from Google Reader about what the licensing of data should be.

The whole thing started with Antony Williams announcing CC support for data on ChemSpider. That was followed by a chain of events and a ton of confusion. Let me add my voice to this debate, since Open Data is near and dear to my heart.

I classify scientific data into the following categories

  • Raw data: This is the kind of data deposited in Tranche, or RCSB, or GenBank. Sequence data, structural data, raw proteomics data. There are associated metadata that are required for quality and reproducibility.
  • Processed data: These are the results of doing something with the raw data, e.g. molecular simulation results from a PDB structure. Processed data form a continuum.

I can’t help but agree with John Wilbanks. Here is the part that all of us should read again and again:

The public domain is not an “unlicensed commons”. The public domain does not equal the BSD. It is not a licensing option.

It is the natural legal state of data.

It is a damn shame that we no longer think of the public domain as an option that is attractive. It’s a sign of the victory of the content holders that the free licensing movements work against that something without a license – something that is truly free, not just free “as in” – is somehow thought to be worse. We’ve bought into their games if we allow the public domain to be defined as the BSD. The idea of the public domain has been subjected to continuous erosion thanks to both the big content companies and our own movements, to the point where we think freedom only comes in a contract.

The public domain is not contractually constructed. It just is. It cannot be made more free, only less free. And if we start a culture of licensing and enclosing the public domain (stuff that is actually already free, like the human genome) in the name of “freedom” we’re playing a dangerous game.

The public domain is the natural place for raw scientific data. That’s where it belongs and always has been. We, myself included, have been guilty of making things more complicated than they need to be. There is a data commons already. Our goal should be to make sure people respect it, and make data available in ways that we can take advantage of it.

Our discussion on content licensing should be limited to processed data, i.e. what we do with data in the public domain. There, we need to allow people to make choices, but keep the raw data unfettered. Those who want to associate copyleft licenses with raw data are being dogmatic. Scientific data doesn’t have to be viral or anything like that; it’s there for the greater scientific good, and there’s only one logical mechanism for it. In fact, I would argue that putting copyleft on it (a sequenced genome doesn’t belong to anyone) is as wrong as full-on copy protection. You may have some embargo on making it publicly available, especially with things like structures where you might want to do something with them before anyone else does, but in the end the data belong in the public domain.

I would like to thank John for putting this down so emphatically and clearly. A lot of us have been saying the same thing for a while, but this is the clearest distillation that I’ve read yet.

That does not mean we don’t have to have a discussion around how we make content (not raw data, but follow-on content) available, and what the implications are. Antony was confused for good reason.

Further reading

More from John
Cameron Neylon
Egon Willighagen
More from Egon
Bill Hooker
Web as platform: Bret Taylor on Open Data
Open Science and licensing
Protocol for implementing open access data
bbgm post on protocol for open data

Discussion on business models around Open Data is building up

May 6, 2008

This post got deleted during a blog snafu. Reposting.

Many months ago, I started talking about the monetization of biological data, a theme that’s been present throughout the history of bbgm. In general, I have maintained that for the most part, the value lies not in the raw data, but in what we can do with the data. It looks like there is an interesting discussion brewing on the web around some of these ideas. Here are a couple of posts, I think in chronological order:

Peter Murray-Rust. The comment from Rich Apodaca is a must read. There is a follow up post from Antony Williams as well.

I will just reiterate a few generalizations, because I am only peripherally familiar with the specifics. On the web, data should be available as an addressable resource. The fact that data is available as RDF is great (and I wish more data was available as such). However, my personal preference is that data, especially open data, needs to be accompanied by APIs that allow the data to be accessed in a number of formats (not a dump per se). I think over time the acceptable formats will be established. The key aspect here is the business model. Is the business in providing a service on top of the data? For example, for more than X number of API calls, there could be a fee associated.

These business models are going to be the key. Just as Open Source has found business models, as have some web services, the models that allow people to build upon Open Data are the key.
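To make the idea of an API in front of open data a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the record, the example.org URLs, the format names, the free quota and the fee are all made up, and the RDF output is a simplified N-Triples-style rendering, not what any real service emits. It only shows the two points made above: the same addressable record served in several formats, and a fee that kicks in after a number of API calls.

```python
import csv
import io
import json
from collections import Counter

# Hypothetical record and URLs, invented purely for illustration.
RECORD = {"id": "cmpd-001", "name": "caffeine", "formula": "C8H10N4O2"}
RESOURCE_URL = "http://example.org/compound/"
VOCAB_URL = "http://example.org/vocab#"

FREE_CALLS = 1000          # assumed free quota per API key
FEE_PER_EXTRA_CALL = 0.01  # assumed fee for each call beyond the quota


def as_json(record):
    return json.dumps(record, sort_keys=True)


def as_csv(record):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(record), lineterminator="\n")
    writer.writeheader()
    writer.writerow(record)
    return buf.getvalue()


def as_ntriples(record):
    # Simplified N-Triples-style output: one triple per field, with the
    # record's URL as the subject. json.dumps supplies the quoted literal.
    subject = f"<{RESOURCE_URL}{record['id']}>"
    lines = [
        f"{subject} <{VOCAB_URL}{key}> {json.dumps(value)} ."
        for key, value in record.items()
        if key != "id"
    ]
    return "\n".join(lines) + "\n"


FORMATTERS = {"json": as_json, "csv": as_csv, "nt": as_ntriples}


class MeteredApi:
    """Serves one record in several formats and counts calls per API key."""

    def __init__(self, free_calls=FREE_CALLS, fee=FEE_PER_EXTRA_CALL):
        self.free_calls = free_calls
        self.fee = fee
        self.calls = Counter()

    def get(self, api_key, fmt="json"):
        if fmt not in FORMATTERS:
            raise ValueError(f"unsupported format: {fmt}")
        self.calls[api_key] += 1
        return FORMATTERS[fmt](RECORD)

    def amount_due(self, api_key):
        # Only calls beyond the free quota are billed.
        extra = max(0, self.calls[api_key] - self.free_calls)
        return round(extra * self.fee, 2)


if __name__ == "__main__":
    api = MeteredApi(free_calls=2)
    for fmt in ("json", "csv", "nt"):
        print(api.get("demo-key", fmt))
    print("amount due:", api.amount_due("demo-key"))
```

The data itself stays the same whichever format is requested; the only thing being charged for in this sketch is the service of delivering it.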


Sun and Amazon jump into the pool together

May 5, 2008

At JavaOne, one of the big announcements was a hookup between Amazon, specifically EC2, and OpenSolaris (finally generally released as a full open source OS). The collaboration between Amazon and OpenSolaris will give customers access to OpenSolaris (for free) and MySQL premium technical support, and more. The key selling points are ZFS and DTrace. Now, I am a big Linux guy, but options are always good, and enterprise relationships/partnerships are just a sign of the maturing and relevance of cloud computing.

Aside: It’s interesting that Sun talks about OpenSolaris as the OpenSolaris community.

Biobootcamp 2008

May 1, 2008

Perhaps I was premature in bemoaning the lack of a startup school for life scientists. Adam Rubenstein points to biobootcamp 2008. Not exactly what I had in mind, but knowing some of the people involved, I suspect it will be quite useful to people.


Bio-IT World Day 1 - Visualization, the cloud and people

April 29, 2008

Detailed blog posts will follow when I have some additional cycles, but I thought I’d share some quick thoughts on day 1 of Bio-IT World. My conference started with a workshop on data visualization, which was mostly about the importance of visualization for making sense of multidimensional data sets and what kind of visualizations could be done. My takeaways from the talks:

  • A distinction was made between statistical methods and data mining on the one hand, and presenting information to humans on the other.
  • Life science data is inherently multiscalar, and reducing dimensions without losing information or creating artifacts is not trivial.
  • It is important to create systems that can help scientists go through a workflow, predict visualizations, and guide the user to the most appropriate visualization for the relevant questions.
  • APIs are important for Pfizer. If a full API is not available, they are not interested in a visualization package
  • And last but not least, as I Twittered during the workshop, they need to invite Ben Fry to give a talk on visualization. I am sure he would have a lot to contribute.

Perhaps the highlight was the keynote by John Reynders from Johnson and Johnson PRD. He gave us a tour of his experiences through his career, including his time at Los Alamos. The talk was not in any great depth, but I left it very encouraged. Encouraged that the head of an IT organization at a large pharma company understood the value of collaboration, understood that innovation happens everywhere and needs to be tapped appropriately, and understood that a lot of information is pre-competitive and should be shared across companies. Other things he talked about:

  1. The cloud :). There was a slide on how to dial up storage and cycles, with AWS prominently mentioned
  2. Collective intelligence. He spent a lot of time on collective intelligence, from knowledge and innovation networks, to connecting people internally and talking about using new ways to make tools available and connecting people together. There was a suitable amount of web 2.0 jargon and frequent mention of the Semantic Web as essential to the life sciences.
  3. We have the compute power, but the gap comes from the software.
  4. He also warned about getting too caught up in the technology and losing sight of the problem

Would have been nice to have open data mentioned explicitly, but he clearly said that pharma needs to appreciate data and information sharing.

Bio-IT World means meeting old friends, especially from my Accelrys days, as well as finally meeting people I admire from my online life, with a special shoutout to Michael Cariaso.

On tap on Day 2 - Electronic Data Capture, high throughput data management, supercomputing and a W3C lunch

