We need to change the system

August 10, 2008

I return to one of my favorite topics, open data and data ownership. Discussions with some very smart people over time (including a recent one on Friendfeed) have convinced me that our problem does not necessarily lie in scientists being inherently protective of their data, but rather in a system that encourages them to be so.

First, let me throw out some oft-repeated mantras that drive my philosophy, mostly stolen from other, wiser people.

Raw data by itself is not the value center. Value comes from the interpretation of these data.

Data finds the data (then people find the people) (via Jeff Jonas and Jon Udell)

Wherever you are, there is someone smarter somewhere else (Via Tim Bray, channeling Bill Joy)

Now that we have got those thoughts out of the way, let's assume that most people involved in science do care about science in general, and acknowledge that as humans we need recognition in some manner. The challenge then lies not in trying to fit our desires into an existing, broken system, but rather in taking this system, which is very long in the tooth, and changing it.

The science blogosphere, The BioGang, etc. are but a small part of the scientific community. Some of us have the ability to make change from within, some of us have a bigger pulpit than others, and some of us can only write about the changes we would like to see. So it’s going to take a while, but if pharma companies can agree to share pre-competitive biomarker data, then academics can change as well.

I still maintain that raw data should be made public within a reasonable time. You might want to re-check the data quality, or perhaps your data was collected to support a hypothesis, and you have every right to test it out. But you can’t sit on that data. Complete your analysis and make it available. And if the data are collected for the sake of data collection (genome study, high throughput structure determination) then you must make them available ASAP. There is enough in there to keep many, many people busy.

The other aspect is data ownership. Large data sets of fundamental data belong in the public domain. Supporting data, data that supports a paper, or some hypothesis or discovery, I am not 100% sure about. I think there needs to be some form of attribution, especially if you don’t plan to publish the data in a paper. How do we manage that? I don’t know. Others have studied this for a longer time. How does this protect long term monetization prospects? Actually I think that’s the easy part, and I’ve covered it many times before.

Sometimes I feel that it’s pointless to write about this subject, one I care about more than most. Then I remember how much I care.


Comments

Viewing 13 Comments

    • Is there no platform comparable to Delicious that supports open science, e.g. just doing your job (like bookmarking on Delicious) while doing it in public, and in a way that is somehow more accessible/searchable than blogs?

    If not, someone should make something like that.
    • Maybe such a tool should default to keeping things private, while allowing you to share things either publicly or with your network (semi-public) in a granular fashion. In any case, I think research findings should be stored granularly, so they can be shared and retrieved in small bits.
    • The first step is to understand what the source of the data was. Is it basic data (e.g. a genome)? That must remain in the public domain. Derived data (analysis results) belongs to the researcher, and how it is made available is where I think we should be spending most of our efforts. IMO making analysis data/datasets available, e.g. as part of a publication, is critical, because it leads to better, more reproducible science.
    • I'm writing this paper about a new algorithm to simulate dense granular matter.

    In the paper we put 4 graphs of the results of our simulations. I ran them, and at least they were made with the same svn version of the code, but I'm not sure whether the 20 (approx.) parameters we use are exactly the same across the simulations.

    In each simulation, I care about the energy of the system and measure it every 0.2 time units, in simulations that run from 1,000 to 10,000 time units.

    So I've lost a lot of data: I would like to have snapshots of the system, say every 10 time units, but they are not necessary for the results we want, so we don't keep them.

    I would like, when the paper is ready, to be able to put online the following:
    a) the svn version of the program used to get the results.
    b) the files that allow you to run the same experiments.
    c) "Pictures" of the system every time unit (so people can see the evolution of the system, check that there are no errors, and use that data for whatever they want), together with the data that allowed me to make my figures for the paper and all the code I used to process them.
    d) The paper itself, with an open license.

    Because I want my results to be reproducible by anyone, anywhere. If I don't do it that way, I am being unprofessional: I could change the results for my convenience, or my code could have a bug that I cannot see and that changes the results.

    But that is not going to happen. We are sending it to PRL and they don't ask for that. Because I'm lazy and didn't save the pictures. Because I have to run my simulations in the terminal of my computer and I don't know how to log that. Because no one in the world has said what kind of data is needed, and in what standard, so that I could submit my results.

    So, if anyone wants to write standards to make science open, I can help with the standards for simulations, which is what I know.

    If somebody thinks it is a good idea to implement an API to run different simulation programs, present the results, and organize everything in a webpage, I would like to help and make my code compatible with that.

    Best,
    Sebastian
    tsuresuregusa [at] gmail
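
On the logging problem Sebastian describes above, one low-effort option is to write a small manifest file next to each run's output. What follows is only a minimal sketch in Python, under loud assumptions: the function names, the `manifest.json` file name, and the parameter names in the usage note are all hypothetical, and the svn revision is passed in as a plain string rather than read from the repository.

```python
import hashlib
import json
import time
from pathlib import Path


def write_run_manifest(out_dir, code_revision, parameters, sample_interval):
    """Record what is needed to rerun one simulation, next to its output.

    All names here are illustrative; nothing in the original thread
    prescribes this layout.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Canonical JSON (sorted keys) so identical parameter sets hash identically.
    canonical = json.dumps(parameters, sort_keys=True)
    manifest = {
        "code_revision": code_revision,  # e.g. the svn revision, supplied by the caller
        "parameters": parameters,
        "parameters_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "sample_interval": sample_interval,
        "created_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    (out_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2, sort_keys=True)
    )
    return manifest


def same_parameters(manifest_a, manifest_b):
    """True when two runs were made with exactly the same parameter set."""
    return manifest_a["parameters_sha256"] == manifest_b["parameters_sha256"]
```

Calling `write_run_manifest("runs/fig1", "r100", {"dt": 0.2, "particles": 500}, 0.2)` once per run, and comparing manifests with `same_parameters`, would answer the question of whether the figures in a paper share one parameter set. It does nothing about the missing snapshots; those still have to be saved by the simulation itself.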
    • Meryn, not quite sure I understand the question. Could you elaborate?
    • I may have been thinking a little ahead of you here. I was thinking about how we could empower scientists who already have the "open" mindset. I think a kind of generic tool would be best, but I'm personally not fond of wikis. The advantage of "personal" tools like blogs and Delicious is that each bit of data is always linked to its creator, which gives each user the ability to view purely their own data, and do completely their own thing, while still enabling them to share the content.

    I personally don't have a clue what typical science workflows look like, but let's say some social scientist does statistical analysis. First, you'd want the dataset in the cloud. It doesn't have to be public, but at least online. Then there could be a web-based alternative to SPSS, which enables you to share each "test" you do on the data, but doesn't force you to. Having a web-based tool would make sharing the procedure you followed very easy. It would be shared directly, together with the results.

    The reason I named Delicious is that I think it's the perfect model for online cooperation. It doesn't take too much effort. You could use it in complete isolation if you wish, while still contributing to the "collective intelligence".

    To have privacy options in such a tool would be natural. Even Delicious has a "private" bookmark option. An online tool which forces you to share every piece of data you produce wouldn't gain much traction in the scientific community, I guess.
    • Swivel shows a bit of what's possible, though it's only for simple charts.
    I'm imagining a really bad-ass tool. It would take a lot of effort to produce it. But I think it'd be worth it.
    • Tools like that are inevitable, e.g. Spotfire could come up with a SaaS offering (they might already have one). NextBio also falls into that category. Hopefully there will be more over time.
    • I just saw an earlier post of yours which confirms that you've already been thinking about how to bootstrap scientific communities. Ever read The Del.icio.us Lesson?
    • I hadn't. As a long-term Delicious user (and also a believer that Yahoo has not maximized its potential), I find it an interesting read.
    • I'm curious. What do you think Yahoo should have done differently? Do you still have improvements in mind over the new version?
    • I rarely use the Delicious website (pretty much just the Firefox extension). One of the better uses of Delicious (and it's indirect) is via Lijit, where my Delicious network becomes part of a search engine.

    Delicious has a large tagspace. Yahoo could easily have used that tagspace to provide users with additional interesting material, especially since using multiple tags provides appropriate contextual information. In other words, there were a lot of possibilities to make Delicious a better discovery engine. I still love it, but it shows a serious lack of vision on their part.
    • I think the younger generation of scientists is more open to sharing data. I have seen tenured Senior Investigators whose (bad) attitude is "It's MY data and I can do with it what I want and keep it to myself as long as I want." Wrong, honcho-breath. In most cases of research done with NIH or NSF grants, that data actually belongs to the taxpayers underwriting your grant, using the same reasoning as for Open Access journals. Don't like that concept? OK, fund your research entirely with private money, and you can keep the data away from others forever.

    Sure, for publicly funded research you should get first crack at analyzing it. But I sense a fear that someone will find something in that data that Dr. Honcho missed, which is what motivates him or her to hang onto it and not put it out, much less putting it out in a usable form like a database. And he's right. All of us collectively are usually smarter than any one of us - but if (more likely when) someone does find something else valuable in the data, and publishes it WITH ATTRIBUTION TO THE DATA GENERATOR, is that not a Good Thing for Science?

    This would be easier if NIH would not only publish Data Sharing rules, but ENFORCE them. All it would take is one well-publicized case of Dr. Honcho actually losing his grant because he wouldn't share the data, and a lot of these problems would go away for good. Unfortunately, most NIH grant and program officers lack the gonads to do this...
 
