Scifoo: Google and large scientific datasets

GIVE US YOUR DATA!!!

Yes, that was the title of a session led by Google-ites, including Chris DiBona (I forget the name of the actual session leader). The following are my notes from that session, somewhat cleaned up.

Google has a mission: organizing the world’s information and making it universally accessible and useful. In keeping with that mission, it should come as little surprise that they have a tremendous interest in the sciences. Their current project has the following goals:

1. Archive interesting scientific data
2. Distribute data to the people who need it

At this point, they would essentially like to solve engineering infrastructure problems in a world where the cost of storage is going down by 78% every year. (“Moore’s law is for wimps” was the title of that slide, I believe.)
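To get a feel for why the slide poked fun at Moore’s law, here is a rough back-of-the-envelope sketch in Python. The 78% figure is from the talk; reading Moore’s law as “cost halves every 18 months” is my own assumption for comparison, and the function name is just for illustration.

```python
def cost_after(years, annual_decline, start_cost=1.0):
    """Relative cost after `years` of a compounding annual decline."""
    return start_cost * (1.0 - annual_decline) ** years


# Figure quoted in the session: storage cost drops 78% per year.
STORAGE_ANNUAL_DECLINE = 0.78

# Assumption (not from the talk): Moore's law taken as a halving every
# 18 months, converted to an equivalent annual decline (roughly 37%).
moore_annual_decline = 1.0 - 0.5 ** (12 / 18)

for years in (1, 3, 5):
    storage = cost_after(years, STORAGE_ANNUAL_DECLINE)
    moore = cost_after(years, moore_annual_decline)
    print(f"after {years} yr: storage x{storage:.4f}, Moore-style x{moore:.4f}")
```

Under these assumptions, three years of a 78% annual decline leaves about 1% of the original cost (0.22 cubed), while the Moore-style curve is still at a quarter.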

At the current time, the following are not what Google is trying to achieve:

1. Access controls
2. Supporting non-open data (They will support public domain data/CC data)
3. Building domain specific tools
4. In Situ computation
5. Profit

How are they getting there? Well, they are providing a 3TB drive array (Linux RAID5). The array comes in a “suitcase” and is shipped to anyone who wants to send their data to Google. Anyone interested gives Google the file tree, and they SLURP the data off the drive. I believe they can extend this to a larger array (my memory says 20TB).
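As an aside on the RAID5 part: RAID5 spends one drive’s worth of space on parity, so usable capacity is (number of drives minus one) times the drive size. The sketch below just does that arithmetic. The drive counts and sizes are hypothetical; the talk only gave the 3TB total, not how the array was built.

```python
def raid5_usable_tb(num_drives, drive_tb):
    """Usable capacity of a RAID5 array: one drive's worth goes to parity."""
    if num_drives < 3:
        raise ValueError("RAID5 needs at least three drives")
    return (num_drives - 1) * drive_tb


# Hypothetical layouts that would each yield 3 TB usable.
# The actual drive count and size in Google's array were not stated.
for num_drives, drive_tb in ((7, 0.5), (4, 1.0)):
    usable = raid5_usable_tb(num_drives, drive_tb)
    print(f"{num_drives} x {drive_tb} TB drives -> {usable} TB usable")
```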

Challenges
1. Collecting vs. crawling
2. Culture of proprietorship/exclusivity
3. Licenses
4. International shipping is hard
5. What do you do with all the metadata?
6. What does it mean to index scientific data?

Chris DiBona wants to make all data open source. Go Chris!!! Remember, the data is not knowledge, although for too many scientists, the data are their research.

The data will probably be provided on a Google Code-like page, and anyone should be able to get access to it. There was talk of allowing people to build applications on top of the data. As Peter Murray-Rust noted, putting the data in the cloud is definitely enticing to some (I would add Amazon to his list as well). Like many others, I am curious to see where this goes. Quite a few people, and not just the astrophysics variety, were very interested in what Google has to offer.

This entry was posted in Admin, BioIT, Search, scifoo.

8 Comments

  1. Posted August 10, 2007 at 01:06 | Permalink

    This is really interesting. I remember hearing somewhere that they were storing scientific data for NASA or something of this sort, but I did not know they were planning to expand this and make it available to all.

  2. Posted August 10, 2007 at 17:00 | Permalink

    psst. session leader’s name was jon trowbridge, fyi :)

  3. Posted August 10, 2007 at 22:01 | Permalink

    Jon Trowbridge … I apologize to you for forgetting your name.

    kthaney … thank you for having a memory far better than mine :)

  4. Posted August 13, 2007 at 13:37 | Permalink

    I am just curious how they want to provide fast querying access to the data?

  5. Posted August 13, 2007 at 15:47 | Permalink

    That’s an open question. They are planning to index it, but in terms of querying the data, they feel at this time that researchers know more, so it’s up to you to figure that out. I have a feeling that approach will change.

  6. ventana
    Posted August 14, 2007 at 11:06 | Permalink

    Would it have anything to do with Google’s purchase of gapminder.org’s very cool Flash-based data display engine called Trendalyzer?

    I have wanted to see NASA’s/Goddard’s collection of global ground station temperature readings brought into Trendalyzer for some time. It’s good that Noel Gorelick (formerly of NASA) is involved in the project. He can demonstrate how this works by getting his former employers to lead the way.

  7. Posted August 14, 2007 at 11:13 | Permalink

    Great question, and I wish I knew the answer. Given the type of data that was discussed at Scifoo, it would seem that Gapminder is not a driving force behind the exercise today, but I am pretty certain that at some point in time, we will begin to see Trendalyzer become a standard part of Google’s offerings.

  8. Posted December 2, 2008 at 21:30 | Permalink

    I agree with you, now Google is showing tremendous interest in the sciences, especially in solving engineering infrastructure problems. The downtrend of 'costs of storage' is alarming. Nice Article, thanks.

13 Trackbacks

  1. [...] Deepak has a very good post on Google’s plans for scientific data. It is heartening to know that Google is working towards a future of open scientific data. Check out the post for more information. [...]

  2. [...] Scifoo: Google and large scientific datasets (bbgm) [...]

  3. [...] But it was Deepak, who later shared his experience on the presentation in details: Scifoo: Google and large scientific datasets [...]

  4. [...] Other link: Scifoo: Google and large scientific datasets [...]

  5. [...] I have written previously about Google’s interest in large scientific datasets. Attila Csordás, who earlier wrote about dark data, followed up with Jon Trowbridge at Google about Google’s efforts in this direction. While the talk from Scifoo is not available, Attila got permission to upload Jon’s talk up on Slideshare. The presentation is quite similar to the talk at Scifoo, and should give you a good idea about Google’s efforts, specifically the drive array that I talked about in my original post. [...]

  6. [...] ? Scifoo: Google and large scientific datasets ? business|bytes|genes|molecules [...]

  7. [...] Google and large scientific datasets, we talked about Google shipping a drive array to scientific labs where they could slurp terrabytes [...]

  8. [...] major issue with science’s huge datasets is how to get them to Google. In this post by a SciFoo attendee over at business|bytes|genes|molecules, the collection plan was described: (Google people) are [...]

  9. [...] article from earlier today reports that research.google.com will soon provide a home for all that scientific data that we have talked about earlier. That part is not a surprise since it had been alluded to at [...]

  10. [...] There is more information (including about why Google intend to import data by shipping RAID arrays around the world) here and (more up to date) here. [...]

  11. [...] Chris di Bona ran a session called ‘Give us the Data’. Now I wasn’t there but Deepak Singh wrote the session up in a blog post which is my main inspiration. The idea here seemed to be to send in big data sets and that Google [...]

  12. [...] do you ship a large dataset to google? Well, they send you hard drives in a suitcase!: (Google people) are providing a 3TB drive array (Linux RAID5). The array is provided in [...]

  13. [...] Scifoo: Google and large scientific datasets : business|bytes|genes|molecules [...]

  • Disclaimer

    All opinions on this blog are my own and do not reflect those of my employers, past or present