Scifoo: Google and large scientific datasets
August 9, 2007
GIVE US YOUR DATA!!!
Yes, that was the title of a session led by Google-ites, including Chris DiBona (I forget the name of the actual session leader). The following are notes from that session, somewhat cleaned up.
Google has a mission: organizing the world’s information and making it universally accessible and useful. In keeping with that mission, it should come as little surprise that they have a tremendous interest in the sciences. Their current project has the following goals:
1. Archive interesting scientific data
2. Distribute data to the people who need it
At this point, they would essentially like to solve engineering infrastructure problems, in a world where the cost of storage is going down by 78% every year. (“Moore’s law is for wimps” was the title of that slide, I believe.)
At the current time, the following are not what Google is trying to achieve:
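To get a feel for what a 78% annual drop means, here is a rough back-of-the-envelope sketch. It assumes the decline compounds at a steady rate, which is my simplification and not something stated in the session; the function names are just for illustration.

```python
import math


def relative_cost(years: float, annual_decline: float = 0.78) -> float:
    """Fraction of today's storage cost left after `years`,
    assuming the same percentage decline compounds every year."""
    return (1.0 - annual_decline) ** years


def halving_time_years(annual_decline: float = 0.78) -> float:
    """Years needed for the cost to fall by half at a steady decline."""
    return math.log(0.5) / math.log(1.0 - annual_decline)


if __name__ == "__main__":
    for n in range(1, 4):
        print(f"after {n} year(s): {relative_cost(n):.4f} of today's cost")
    print(f"cost halves roughly every {halving_time_years() * 12:.1f} months")
```

Under that assumption the cost halves in well under a year, compared with the 18 to 24 months usually quoted when people paraphrase Moore's law, which is presumably the point of the slide title.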
1. Access controls
2. Supporting non-open data (They will support public domain data/CC data)
3. Building domain specific tools
4. In Situ computation
5. Profit
How are they getting there? Well, they are providing a 3TB drive array (Linux RAID5). The array is provided in a “suitcase” and shipped to anyone who wants to send their data to Google. Anyone interested gives Google the file tree, and they SLURP the data off the drive. I believe they can extend this to a larger array (my memory says 20TB).
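Why ship drives at all? A quick sketch of the arithmetic helps. Everything here beyond the 3TB and 20TB figures is my own assumption: I treat 3TB as usable capacity in decimal terabytes, the transit time is a made-up number, and the time spent copying data on and off the array is ignored.

```python
import math


def arrays_needed(dataset_tb: float, array_capacity_tb: float = 3.0) -> int:
    """Number of suitcase arrays needed to carry a dataset,
    treating the quoted capacity as fully usable (an assumption)."""
    return math.ceil(dataset_tb / array_capacity_tb)


def effective_mbit_per_s(payload_tb: float, transit_days: float) -> float:
    """Average throughput of a shipped array in megabits per second,
    counting only the time in transit (hypothetical figure)."""
    bits = payload_tb * 1e12 * 8          # decimal terabytes to bits
    seconds = transit_days * 86400
    return bits / seconds / 1e6


if __name__ == "__main__":
    print(arrays_needed(20))                   # arrays for a 20TB dataset
    print(round(effective_mbit_per_s(3, 7), 2))  # one array, one week in the mail
```

With these assumptions, one 3TB array that spends a week in transit works out to an average of a bit under 40 Mbit/s, sustained for the whole week, which is more than many labs could push over their network connection in 2007.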
Challenges
1. Collecting vs. crawling
2. Culture of proprietorship/exclusivity
3. Licenses
4. International shipping is hard
5. What do you do with all the Metadata?
6. What does it mean to index scientific data?
Chris DiBona wants to make all data open source. Go Chris!!! Remember, the data is not knowledge, although for too many scientists, the data are their research.
The data will probably be provided on a Google Code-like page, and anyone should be able to get access to the data. There was talk of allowing people to build applications on top of the data. As Peter Murray-Rust noted, putting the data in the cloud is definitely enticing to some (I would add Amazon to his list as well). Like many others, I am curious to see where this goes. Quite a few people, and not just the astrophysics variety, were very interested in what Google has to offer.
Comments
20 Responses to “Scifoo: Google and large scientific datasets”
[…] Deepak has a very good post on Google’s plans for scientific data. It is heartening to know that Google is working towards a future of open scientific data. Check out the post for more information. […]
This is really interesting. I remember hearing somewhere that they were storing scientific data for NASA, or something of the sort, but I did not know they were planning to expand this and make it available to all.
psst. session leader’s name was jon trowbridge, fyi
Jon Trowbridge … I apologize to you for forgetting your name.
kthaney … thank you for having a memory far better than mine
I am just curious how they want to provide fast querying access to the data?
That’s an open question. They are planning to index it, but in terms of querying the data, they feel at this time that researchers know more, so it’s up to you to figure that out. I have a feeling that approach will change.
Would it have anything to do with Google’s purchase of gapminder.org’s very cool Flash-based data display engine called Trendalyzer?
I have wanted to see NASA’s/Goddard’s collection of global ground station temperature readings brought into Trendalyzer for some time. It’s good that Noel Gorelick (formerly of NASA) is involved in the project. He can demonstrate how this works by getting his former employers to lead the way.
Great question, and I wish I knew the answer. Given the type of data that was discussed at Scifoo, it would seem that Gapminder is not a driving force behind the exercise today, but I am pretty certain that at some point in time, we will begin to see Trendalyzer become a standard part of Google’s offerings.
[…] But it was Deepak, who later shared his experience on the presentation in detail: Scifoo: Google and large scientific datasets […]
[…] Other link: Scifoo: Google and large scientific datasets […]
[…] I have written previously about Google’s interest in large scientific datasets. Attila Csordás, who earlier wrote about dark data, followed up with Jon Trowbridge at Google about Google’s efforts in this direction. While the talk from Scifoo is not available, Attila got permission to upload Jon’s talk up on Slideshare. The presentation is quite similar to the talk at Scifoo, and should give you a good idea about Google’s efforts, specifically the drive array that I talked about in my original post. […]
[…] Google and large scientific datasets, we talked about Google shipping a drive array to scientific labs where they could slurp terabytes […]
[…] major issue with science’s huge datasets is how to get them to Google. In this post by a SciFoo attendee over at business|bytes|genes|molecules, the collection plan was described: (Google people) are […]
[…] article from earlier today reports that research.google.com will soon provide a home for all that scientific data that we have talked about earlier. That part is not a surprise since it had been alluded to at […]
[…] There is more information (including about why Google intend to import data by shipping RAID arrays around the world) here and (more up to date) here. […]
[…] Chris di Bona ran a session called ‘Give us the Data’. Now I wasn’t there but Deepak Singh wrote the session up in a blog post which is my main inspiration. The idea here seemed to be to send in big data sets and that Google […]
[…] do you ship a large dataset to google? Well, they send you hard drives in a suitcase!: (Google people) are providing a 3TB drive array (Linux RAID5). The array is provided in […]