Scifoo: Google and large scientific datasets

GIVE US YOUR DATA!!!

Yes, that was the title of a session led by Googlers, including Chris DiBona (I forget the name of the actual session leader). The following are my notes from that session, somewhat cleaned up.

Google has a mission: organizing the world's information and making it universally accessible and useful. In keeping with that mission, it should come as little surprise that they have a tremendous interest in the sciences. Their current project has the following goals:

1. Archive interesting scientific data
2. Distribute data to the people who need it

For now, they essentially want to solve the engineering and infrastructure problems, in a world where the cost of storage is going down by 78% every year. (“Moore’s law is for wimps” was the title of that slide, I believe.)
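To get a feel for what a 78% yearly drop means, here is a quick back-of-the-envelope sketch. It assumes the decline compounds year over year, which is an assumption of this sketch rather than something stated in the session; the function name is made up for illustration.

```python
def remaining_cost_fraction(annual_decline: float, years: int) -> float:
    """Fraction of the original per-unit storage cost left after `years`,
    assuming the same percentage decline compounds every year (an assumption)."""
    return (1.0 - annual_decline) ** years


if __name__ == "__main__":
    # 0.78 is the 78% per-year figure quoted in the session notes.
    for years in range(1, 6):
        fraction = remaining_cost_fraction(0.78, years)
        print(f"after {years} year(s): {fraction:.4%} of the original cost")
```

Under that assumption, each year leaves 22% of the previous year's cost, so the cost per unit of storage shrinks very quickly.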

At present, Google is not trying to achieve the following:

1. Access controls
2. Supporting non-open data (They will support public domain data/CC data)
3. Building domain specific tools
4. In situ computation
5. Profit

How are they getting there? Well, they are providing a 3TB drive array (Linux RAID5). The array comes in a “suitcase” and is shipped to anyone who wants to send their data to Google. Anyone interested gives Google the file tree, and they SLURP the data off the drive. I believe they can extend this to a larger array (my memory says 20TB).

Challenges
1. Collecting vs. crawling
2. Culture of proprietorship/exclusivity
3. Licenses
4. International shipping is hard
5. What do you do with all the metadata?
6. What does it mean to index scientific data?

Chris DiBona wants to make all data open source. Go Chris!!! Remember, the data is not knowledge, although for too many scientists, the data are their research.

The data will probably be provided on a Google Code-like page, and anyone should be able to get access to it. There was talk of allowing people to build applications on top of the data. As Peter Murray-Rust noted, putting the data in the cloud is definitely enticing to some (I would add Amazon to his list as well). Like many others, I am curious to see where this goes. Quite a few people, and not just the astrophysics variety, were very interested in what Google has to offer.

  • This is really interesting. I remember hearing about it somewhere about how they were storing scientific data for NASA or something of this sort but I did not know they were planning to expand this and make it available to all.
  • psst. session leader's name was jon trowbridge, fyi :)
  • Jon Trowbridge ... I apologize to you for forgetting your name.

    kthaney ... thank you for having a memory far better than mine :)
  • I am just curious how they want to provide fast querying access to the data?
  • That's an open question. They are planning to index it, but in terms of querying the data, they feel at this time that researchers know more, so it's up to you to figure that out. I have a feeling that approach will change.
  • ventana
    Would it have anything to do with Google's purchase of gapminder.org's very cool flash based data display engine called trendalyzer?

    I have wanted to see NASA's/Goddard's collection of global ground station temperature readings brought into Trendalyzer for some time. It's good that Noel Gorelick (formerly of NASA) is involved in the project. He can demonstrate how this works by getting his former employers to lead the way.
  • Great question, and I wish I knew the answer. Given the type of data that was discussed at Scifoo, it would seem that Gapminder is not a driving force behind the exercise today, but I am pretty certain that at some point in time, we will begin to see Trendalyzer become a standard part of Google's offerings.