I was one of the lucky few who were given access to a dump of “social” statistics (my term) for PLoS One. The data were given to us to analyze as we pleased and to glean from them what we could (I don’t really know who all the others were).
To give some context, we need to look back at Euan’s post on commenting. He does a great job of slicing and dicing the data from BMC. My first instinct was to do a similar analysis of the data from PLoS One, but in the end I decided to go in a slightly different direction: look at some trends that might give semi-quantitative insights into the scientific mind and commenting, and provide some commentary on what this really means, if anything.
In the time since the first comments on PLoS One, in December 2006 (I show up as an early commenter, which felt kinda nice), over 710 people have commented, with an average of ~2 comments a person (about half have left more than one). Given the diversity in the kinds of papers people publish, that number is higher than I expected. However, it could (should?) be a lot better. Not surprisingly, as the following figure shows, there is a very spiky distribution among those who do comment, with a few commenters, including Björn Brembs, commenting a lot more than others.

What’s a little more interesting is the number of people who have not left a single comment. I would love to know the ratio of people who engage with a paper (spend a certain amount of time on it) to people who end up commenting. My guess is that the percentage of people who do leave a comment is fairly small, given that the total number of people visiting PLoS One is probably significantly higher than 710. As a blogger, I find interacting with others through a comment stream (either on the blog or on sites like Friendfeed) one of the more rewarding aspects of blogging. The stats tell us that we are a long way away from publishing platforms essentially becoming micro-communities. Let’s say you have a particular lab, e.g. a group publishing papers on the photophysics of bacteriorhodopsin. If the group published 2-3 papers a year at PLoS One, each paper could become a discussion board, with authors and others in the field having a discussion. In a perfect world, all these people would comment on each other’s papers and, via cross-linking etc., you’d get a vibrant bacteriorhodopsin community. This is essentially an extension of the now infamous data finds data, people get people meme that the whole world should latch on to.
Alright, enough flights of fancy, let’s look at some more numbers. The one thing I could not find was any correlation between ratings and commenting, which did surprise me a little bit. As you can see going from left to right (which essentially is a function of time), there seems to be a burst of activity right in the beginning, but other than that you get a nice little skyline with a fairly steady output. Depending on your point of view, that’s a good thing or a bad thing. If you want to be a naysayer, you can say that things have not progressed as they should have, with reader engagement increasing. On the positive side, you could note that there has been no drop-off: people continue to remain engaged, and every now and then you get a paper that draws more interest than others. If you presume that the 2008 numbers will hold for the rest of the year, you get some growth, though not much, and at least there is no drop-off.

When PLoS launched trackbacks I remember being quite excited, but if there was one area that disappointed me, it was the lack of trackbacks. The numbers are loud and clear here. If you take out trackbacks from Bora and other PLoS staff, the number is fewer than 100 for all PLoS One papers, with a maximum of 4 for any paper. This is a combination of flaws in the trackback system in general (I could write a whole blog post on that) and perhaps in the PLoS implementation. The folks at PLoS really need to think about how they could leverage trackbacks, and perhaps could take the lead in integrating trackbacks with DOIs, to try and resolve the various links that point to papers published on PLoS One.
Earlier I had talked about microcommunities. There I had compared a paper to a blog post (hold your horses, it was just an analogy). A different analogy would be the Life Scientists room at Friendfeed or a site like Hacker News. There people post a link and a whole discussion erupts (not always, but often enough). I would throw out this challenge (see the DOI suggestion above): why should discussion be localized to PLoS One itself? If a paper published in PLoS One is discussed in 20 other places, it would be considered a success. In other words, we shouldn’t limit our thinking to just on-site commenting. Perhaps within the site, we should be focussed on ratings and perhaps tagging and notes.
I’d like to end this post with another figure. The quantity of comments might not be quite what some of us had hoped for, but it would seem that the recent trend is somewhat encouraging as seen in the following figure (ignoring that last data point).

So what have we learned from this exercise? Quite frankly, I am not sure. Is the commenting on PLoS One at a level that we hoped it would be? Not quite. Is it as bad as some might like to believe? Not quite. What we have is a very, very nascent (no pun intended) effort on the part of the scientific community to use web publishing platforms as a communication medium. I’d like to ask those same scientists to think about newsgroups. Most scientists are fairly comfortable participating in newsgroups, and here you essentially have one, with very clearly defined thread titles.



4 Comments
Deepak,
Thanks for the analysis. I have a couple of clarifications on your second and third graphs which may help people interpret what they are seeing there.
These two graphs are effectively showing the same data (but with differences in the Y axis), but you may want to confirm my understanding below.
First some background:
People can leave Notes on an article (literally highlighting an area of text and making a note on it); they can leave general Comments (discursive text about the entire article); or they can make a Rating (and when making a Rating they also have the chance to leave a text comment, and some people have done this).
In addition, PLoS itself sometimes leaves Comments/Notes (e.g. to post reviewer comments, or to make corrections). Therefore, to get a picture of real user activity, we should remove PLoS-originated comments from any counts (which both of your graphs have done).
Since launch, we have numbered our articles sequentially from #1 to #2782 (the highest number in the data set we provided – July 23rd 2008).
Given the above, I believe that your 2nd graph plots:
(number of Comments+Notes+Ratings that included a text comment EXCLUDING any such comments left by PLoS itself) vs (the actual article numbers)
You make the point that article number (the x axis) is a proxy for time. It is a proxy for time, of course, but in fact the number of articles we have published has steadily increased each month as per the data I paste below.
Therefore if the graph were charted against actual time (rather than article number) you would see a different pattern (basically the right hand side of the graph would be rising).
This is effectively what your third graph is showing, except the Y axis is slightly differently defined from the prior graph. Your X axis is 'month' (Jan 2007 through to July 2008) and the Y axis is the number of Comments and Notes (EXCLUDING any text comments left within a Rating event, and EXCLUDING any Comments left by PLoS itself).
The data for number of articles published per month is as follows (note that July data is not a full month, as the data supplied only went through Jul 23rd):
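| Month (YYYYMM) | # of articles published |
|---|---|
| 200612 | 138 |
| 200701 | 48 |
| 200702 | 79 |
| 200703 | 73 |
| 200704 | 60 |
| 200705 | 93 |
| 200706 | 87 |
| 200707 | 84 |
| 200708 | 153 |
| 200709 | 146 |
| 200710 | 153 |
| 200711 | 136 |
| 200712 | 118 |
| 200801 | 153 |
| 200802 | 177 |
| 200803 | 167 |
| 200804 | 205 |
| 200805 | 223 |
| 200806 | 234 |
| 200807 | 246 |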
Also, the fact that July 2008 was not a full month of output would explain why the final data point in your third graph looks low.
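To make the article-number-versus-time point concrete, here is a minimal Python sketch built only on the monthly counts listed above. It assumes article numbers were assigned strictly in publication order, which is implied here but not guaranteed, and the helper names (`month_boundaries`, `month_for_article`, `share_of_axis`) are made up for illustration. Note that the monthly counts sum to 2773, slightly short of the highest article number quoted (#2782), so the mapping can only be approximate.

```python
from itertools import accumulate

# Monthly article counts copied from the list above (YYYYMM -> articles).
# 200807 is a partial month (data through Jul 23rd).
ARTICLES_PER_MONTH = {
    "200612": 138, "200701": 48, "200702": 79, "200703": 73,
    "200704": 60, "200705": 93, "200706": 87, "200707": 84,
    "200708": 153, "200709": 146, "200710": 153, "200711": 136,
    "200712": 118, "200801": 153, "200802": 177, "200803": 167,
    "200804": 205, "200805": 223, "200806": 234, "200807": 246,
}


def month_boundaries(counts):
    """Return (month, last_article_number) pairs.

    Assumption: articles are numbered sequentially in publication order,
    so the running total of monthly counts marks where each month ends
    on an article-number axis.
    """
    return list(zip(counts.keys(), accumulate(counts.values())))


def month_for_article(article_number, counts):
    """Approximate the publication month for a given article number."""
    if article_number < 1:
        return None
    for month, last_number in month_boundaries(counts):
        if article_number <= last_number:
            return month
    return None  # beyond the range covered by the monthly counts


def share_of_axis(counts, year):
    """Fraction of the article-number axis taken up by one calendar year."""
    total = sum(counts.values())
    in_year = sum(n for month, n in counts.items() if month.startswith(year))
    return in_year / total


if __name__ == "__main__":
    for month, last_number in month_boundaries(ARTICLES_PER_MONTH):
        print(month, last_number)
    print(share_of_axis(ARTICLES_PER_MONTH, "2008"))
```

Under these assumptions, `share_of_axis(ARTICLES_PER_MONTH, "2008")` comes out at roughly 0.51: the seven (partly incomplete) months of 2008 occupy about half of an article-number axis, which is why a flat-looking plot against article number can still correspond to rising activity per month.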
Pete Binfield (Managing Editor of PLoS ONE)
Peter
Your understanding of the axes of the graphs and what data they represent is correct. Longer response later this evening.