Open Data, Open Visualization and a new blog

August 30, 2008

I discovered a new blog today, FlowingData; at least, I don’t recall having seen it before. The blog is all about the meaning of data. How did I find it? One of my Google alerts took me to a post on How Open Should Open Source Data Visualization Be. The part that I went straight to was the one on the three aspects of open source data visualization: Open Tools, Open Code, Open Data.

I don’t necessarily agree with some of the discussion in the post, at least from a scientific perspective, where data visualization is a key to data interpretation, although it’s possible I am misinterpreting the author, who seems to be in favor of openness. It’s also not always possible to satisfy all three. R does achieve that, but you can’t always use R (it has its performance limitations).

In science, code, data, and data interpretation all go together. The value lies in the interpretation of the data, the hypotheses underlying the interpretation, and the presentation of the interpretation. The openness is important, not because it is our duty to give back, but because it is good science. By the way, if you include links to your MyExperiment workflows, you get super extra brownie points. Andrea, in the comments, makes some interesting points. In particular:

However, I believe there’s more potential benefit than risk in sharing my so-called intellectual property. Open science ideals (as exemplified by sharing data, analysis, and results) are highly congruent with the values of the open source communities that I study, and I can’t help but conclude that the institutionalized incentive systems for academics that make us hesitate to share knowledge are overdue for revision.

and

I share my work on principle; science is supposed to be about truth and knowledge, not hoarding data and hiding tools from others for our own personal benefit, to the potential detriment of the greater community.

What we need to spend time figuring out as a field is how to take this discussion beyond academia. How can we allow people to make money from good ideas, on top of an open science backbone? Too much of the discussion centers around publishing, peer review, etc., but perhaps that’s where the initial discussion should take place.

The good news is that I think most scientists are predisposed to open science; it’s just that the system discourages them from going in that direction.


Bioinformatics as mashup

August 29, 2008

bioinformatics: acquiring, collating and rearranging information already available elsewhere?

That is from a Tweet by Neil. My reaction was something along the lines of “boy, that sounds like the definition of a mashup”.

Bioinformatics is a broad field, but a good part of what a bioinformatician does is exactly what Neil describes. The work of a bioinformatician is built on data collected by many people around the world and deposited in a variety of databases. A lot of what we do is take information from one source and try to match it up to information from a second source, presumably with the goal of getting additional insights. It might sound crude to call it that, but I think if we start thinking of bioinformatics as a mashup, we could start thinking about making those mashups available to others, and perhaps even about new ways to present the information.
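To make the mashup idea concrete, here is a minimal sketch of matching records from one source against a second source on a shared identifier. The record fields, accession values and the `mash_up` helper are all hypothetical stand-ins, not entries from any real database.

```python
from collections import defaultdict


def mash_up(primary, secondary, key):
    """Join two lists of dict records on a shared key (a simple inner join)."""
    # Index the second source by the shared key so lookups are cheap.
    index = defaultdict(list)
    for record in secondary:
        index[record[key]].append(record)

    merged = []
    for record in primary:
        # Only records present in both sources make it into the mashup.
        for match in index.get(record[key], []):
            merged.append({**match, **record})
    return merged


# Hypothetical records standing in for entries pulled from two databases.
sequences = [
    {"accession": "ACC001", "organism": "Halobacterium salinarum"},
    {"accession": "ACC002", "organism": "Escherichia coli"},
]
structures = [
    {"accession": "ACC001", "structure_id": "STRUCT-A"},
    {"accession": "ACC003", "structure_id": "STRUCT-B"},
]

combined = mash_up(sequences, structures, "accession")
print(combined)
```

Only the accession found in both lists survives the join. In real work the hard part is usually agreeing on the shared identifier, not the join itself.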

Disclaimer: This post was written early in the morning before any intake of caffeine


Scientific Identity

August 28, 2008

I have been thinking a lot about distributed identity lately and what it means for scientists. This was fueled by a bunch of things, including the recent news about OAuth, and discussions around social networks in science.

We keep talking about how to connect information together. In the general web world, you have various services that, with varying degrees of success, bring things together into a common namespace. What we need to do in the scientific space is something similar. We have standards in place to make sites and services talk to each other. If we could figure out how to move our scientific identity, i.e. our collaborators, our communications (formal and informal, peer reviewed or otherwise), and our interests across services, while maintaining control over the communications, we would be in a very good place as we redefine how we communicate and practice science.

Personally, I’d like to see journals and scientific “networks” adopt OpenID, OAuth, and other web standards, along with DOIs and perhaps something like SciFOAF. Another paradigm to look at is laconica, which allows you to communicate across communities (and makes good use of OAuth), in essence giving you a distributed identity. In true internet meta-fashion, you can sign up using OpenID.

On a parting note, a key success factor will be abstraction. We need to have the tools, etc. in place so that all the underlying complexity is abstracted away; otherwise there will always be too much friction to get started.


The ‘Ubiquitous’ web

August 27, 2008

All of you know about it already, but I shall happily add to the noise. Last evening I had one of those “Holy S**t” moments. I was sitting in a coffee shop, catching up with the day’s news, when I saw a flurry of activity on Friendfeed around Ubiquity. Turns out Ubiquity is a new project by Mozilla Labs, which for want of a better description is like Quicksilver for the browser: a mini command line available with an Alt-space.


[Video: Ubiquity for Firefox, from Aza Raskin on Vimeo]

Ubiquity is still young and may never catch on, although I have a feeling it will, at least among the geek crowd; being a Firefox plugin means low friction. But you can see the promise right away. Using simple commands, you can very quickly access search, Wikipedia, and maps, insert material into documents, send email, etc.

Here are some examples:

[Screenshots: Ubiquity-1, Ubiquity-2]

But that’s just a start. It doesn’t take a leap of faith to see entire vocabularies being created to support certain data types and activities. An early example comes from Rajarshi Guha, who very quickly rustled up a couple of commands. Maybe someday we can have a repository somewhere for a set of standardized commands in bioinformatics/cheminformatics.

[Screenshot: Ubiquity-SMILES]

The example above isn’t quite working perfectly, but you can see what we can do, and this is just the tip of the iceberg. There are a lot of mashups possible, including getting related papers, targets, compounds, and structures, and being able to package them up and email them or put them in a document. As our web gets a little more structured, I can imagine myself sitting inside that command line and following an entire graph of information that streams through, ready to be manipulated and used for something even more interesting.

The web is a moving target, our browsers are moving targets, our capabilities to manipulate are moving targets, and efforts like Ubiquity give us a glimpse into the future. It will be fun being part of that future (and blogging about it).

Update: I should have known Pawel would take a stab. An even cooler example (Sorry Rajarshi, I like proteins better)


Peering into PLoS One comment stats

August 27, 2008

I was one of the lucky few who were given access to a dump of “social” statistics for PLoS One (my term). The data were given to us to analyze as we please, to glean from them what we may (I don’t really know who all the others were).

To give some context, we need to look back at Euan’s post on commenting. He does a great job of slicing and dicing the data from BMC. My first instinct was to do a similar analysis of data from PLoS One, but in the end I decided to go in a slightly different direction: look at some trends that might give some semi-quantitative insights into the scientific mind and commenting, and provide some commentary on what this really means, if anything.

In the time since the first comments on PLoS One in December 2006 (I show up as an early commenter, which felt kinda nice), over 710 people have commented, with an average of ~2 comments a person (about half have left more than one). Given the diversity in the kinds of papers people publish, that number is higher than I expected. However, it could (should?) be a lot better. Not surprisingly, as the following figure shows, there is a very spiky distribution among those who do comment, with a few commenters, including Björn Brembs, commenting a lot more than others.
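For anyone who wants to run the same kind of summary on their own data, here is a rough sketch of how the per-commenter numbers can be computed. It assumes the dump has been reduced to a flat list with one commenter identifier per comment. The `commenter_summary` function and the sample values are made up for illustration and are not the actual PLoS One data or its format.

```python
from collections import Counter


def commenter_summary(comment_authors):
    """Summarize how comments are spread across the people who left them."""
    per_person = Counter(comment_authors)  # commenter -> number of comments
    n_people = len(per_person)
    n_comments = sum(per_person.values())
    repeat = sum(1 for count in per_person.values() if count > 1)
    return {
        "commenters": n_people,
        "comments": n_comments,
        "mean_per_person": n_comments / n_people if n_people else 0.0,
        "share_with_repeat_comments": repeat / n_people if n_people else 0.0,
        # The head of the "spiky" distribution: the most active commenters.
        "most_active": per_person.most_common(3),
    }


# Made-up sample: one entry per comment, identified by commenter.
sample = ["a", "b", "a", "c", "a", "d", "b"]
summary = commenter_summary(sample)
print(summary)
```

The mean alone hides the shape of the distribution, which is why the sketch also reports the share of repeat commenters and the most active few.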



What’s a little more interesting is the number of people who have not left a single comment. I would love to know the ratio of people who engage with a paper (spend a certain amount of time on it) to people who end up commenting. My guess is that the percentage of people who do leave a comment is fairly small, given that the total number of people visiting PLoS One is probably significantly higher than 710. As a blogger, I find interacting with others through a comment stream (either on the blog or on sites like Friendfeed) one of the more rewarding aspects. The stats tell us that we are a long way from publishing platforms essentially becoming micro-communities. Let’s say you have a particular lab, e.g. a group publishing papers on the photophysics of bacteriorhodopsin. If the group published 2-3 papers a year at PLoS One, each paper could become a discussion board, with authors and others in the field having a discussion. In a perfect world, all these people would comment on each other’s papers and, via cross-linking, etc., you’d get a vibrant bacteriorhodopsin community. This is essentially an extension of the now infamous “data finds data, people get people” meme that the whole world should latch on to.

Alright, enough flights of fancy, let’s look at some more numbers. The one thing I could not find was any correlation between ratings and commenting, which did surprise me a little bit. As you can see going from left to right (which essentially is a function of time), there seems to be a burst of activity right in the beginning, but other than that you get a nice little skyline with a fairly steady output. Depending on your point of view, that’s a good thing or a bad thing. If you want to be a naysayer, you can say that things have not progressed as they should, with increasing reader engagement. On the positive side, you could note that there has been no drop-off: people continue to remain engaged, and every now and then you get a paper which draws more interest than others. If you presume that the 2008 numbers will hold for the rest of the year, you essentially get some growth, not much, but at least there is no drop-off.

When PLoS launched trackbacks I remember being quite excited, but if there was one area that disappointed me, it was the lack of trackbacks. The numbers are loud and clear here. If you take out trackbacks from Bora and other PLoS staff, the number is fewer than 100 for all PLoS One papers, with a maximum of 4 for any paper. This is a combination of flaws in the trackback system in general (I could write a whole blog post on that) and perhaps in the PLoS implementation. The folks at PLoS really need to think about how they could leverage trackbacks, and perhaps could take the lead in integrating trackbacks with DOIs, to try to resolve the various links that point to papers published on PLoS One.
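As a rough illustration of what resolving links by DOI could look like, here is a small sketch that pulls a DOI out of each incoming link and groups the links by paper. The regular expression is a common heuristic rather than an official DOI grammar, and the URLs and the DOI with the all-zero suffix are hypothetical placeholders. This is not a description of how PLoS handles trackbacks.

```python
import re
from collections import defaultdict
from urllib.parse import unquote

# Heuristic pattern: "10.", a registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>?#&]+")


def extract_doi(url):
    """Return the first DOI-like string found in a URL, or None."""
    match = DOI_PATTERN.search(unquote(url))  # decode %2F and friends first
    if match is None:
        return None
    # Trim trailing punctuation and normalize case so variants group together.
    return match.group(0).rstrip(".,;)").lower()


def group_links_by_doi(links):
    """Group incoming links under the DOI of the paper they point to."""
    grouped = defaultdict(list)
    for link in links:
        doi = extract_doi(link)
        if doi is not None:
            grouped[doi].append(link)
    return dict(grouped)


# Hypothetical links; the DOI suffix is a placeholder, not a real paper.
links = [
    "http://example.org/article/info:doi/10.1371/journal.pone.0000000",
    "http://dx.doi.org/10.1371/journal.pone.0000000",
    "http://example.org/blog/some-post?ref=10.1371%2Fjournal.pone.0000000",
    "http://example.org/unrelated",
]
grouped = group_links_by_doi(links)
print(grouped)
```

Three differently shaped links end up under one key, which is the point: the DOI, not the URL, identifies the paper.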

Earlier I had talked about microcommunities. There I had compared a paper to a blog post (hold your horses, it was just an analogy). A different analogy would be the Life Scientists room at Friendfeed or a site like Hacker News. There people post a link and a whole discussion erupts (not always, but often enough). I would throw out this challenge (see the DOI suggestion above): why should discussion be localized to PLoS One itself? If a paper published in PLoS One is discussed in 20 other places, it would be considered a success. In other words, we shouldn’t limit our thinking to just on-site commenting. Perhaps within the site, we should be focussed on ratings and perhaps tagging and notes.

I’d like to end this post with another figure. The quantity of comments might not be quite what some of us had hoped for, but it would seem that the recent trend is somewhat encouraging as seen in the following figure (ignoring that last data point).

So what have we learned from this exercise? Quite frankly, I am not sure. Is the commenting on PLoS One at a level that we hoped it would be? Not quite. Is it as bad as some might like to believe? Not quite. What we have is a very very nascent (no pun intended) effort on the part of the scientific community to use web publishing platforms as a communication medium. I’d like to ask those same scientists to think about newsgroups. Most scientists are fairly comfortable participating in newsgroups, and here you essentially have one, with very clearly defined thread titles.

