Web as platform: Bret Taylor on Open Data

April 9, 2008

Image: Datasets in the Linking Open Data project, as of September 2007

I think my bullishness for FriendFeed just went up a notch after reading Bret Taylor’s blog. For those who don’t know, Bret is one of the co-founders of FriendFeed and an ex-Googler. The other day he started his first blog, and guess how he did so. His last project at Google was App Engine, and his first project after App Engine was released was to develop a blog platform deployed there (hence the appspot.com address). Anyway, apart from being impressed by his coding skills and experience, I was equally intrigued by his latest post, We need a Wikipedia for data. In it, Bret writes (all emphasis mine):

I think all of these barriers to data are holding back innovation at a scale that few people realize. The most important part of an environment that encourages innovation is low barriers to entry. The moment a contract and lawyers are involved, you inherently restrict the set of people who can work on a problem to well-funded companies with a profitable product. Likewise, companies that sell data have to protect their investments, so permitted uses for the data are almost always explicitly enumerated in contracts. The entire system is designed to restrict the data to be used in product categories that already exist.

He continues

The interesting thing is, almost every internet company would benefit if this data were freely available. Most internet companies have embraced open source operating systems because every company needs an operating system, and no company wants their OS to be a competitive advantage - they just want it to work. I would argue we are all in the same boat with these factual data sources. No one really wants factual data accuracy and completeness to be their competitive advantage; we all want the best data possible to build the best products possible, and discrepancies in data quality are artifacts of the extremely inefficient economy of buying and selling data we currently live in. If everyone had the same, high quality data, all of our products would be better for it.

I could end this post here and say “I rest my case”, but there is one area where I differ. He argues that we should have a Wikipedia for data, a global database of data sources that anyone can use. I disagree. I believe that we should have a web of data, to be precise linked data, with each data point and data set an addressable resource. In the comments some, including myself, mention Freebase and DBpedia. There is also Swivel, where you can upload datasets. Fellow Scifoo attendee Aaron Swartz has theinfo.org, a resource for really large datasets. That’s all fine, but do we really want a centralized repository of data? Shouldn’t genomic data stay in GenBank and structural data stay at the PDB? If instead we just made our data open, and in formats that can be slurped up or used for the kinds of innovations that Bret talks about, that would be the ideal situation. Then we could use the data our way, act upon it, apply algorithms, etc.
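To make the “addressable resource” idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the example.org URIs, the field names and the two in-memory dictionaries are hypothetical stand-ins for independently hosted sources, not the real GenBank or PDB interfaces. The point is only that records can stay where they live and still link to each other by address.

```python
# Hypothetical, in-memory stand-ins for two independently hosted data sources.
# The URIs and fields are made up for illustration only.
GENE_SOURCE = {
    "http://genes.example.org/gene/G1": {
        "label": "example gene",
        # A link that points into the *other* source, by address.
        "encodes": "http://structures.example.org/entry/S1",
    },
}

STRUCTURE_SOURCE = {
    "http://structures.example.org/entry/S1": {
        "label": "example structure",
        "method": "X-ray diffraction",
    },
}

# Each source "owns" a URI prefix; no central repository holds both.
SOURCES = {
    "http://genes.example.org/": GENE_SOURCE,
    "http://structures.example.org/": STRUCTURE_SOURCE,
}


def dereference(uri):
    """Look up a URI in whichever source owns its prefix."""
    for prefix, source in SOURCES.items():
        if uri.startswith(prefix):
            return source.get(uri)
    return None


def follow(uri, link):
    """Resolve a record, then resolve the URI stored under `link`."""
    record = dereference(uri)
    if record is None or link not in record:
        return None
    return dereference(record[link])


if __name__ == "__main__":
    structure = follow("http://genes.example.org/gene/G1", "encodes")
    print(structure["label"], "/", structure["method"])
```

On the real web, `dereference` would be an HTTP request against the source that owns the address, which is exactly why the data would not need to move into one central database.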

What do you think? Do we need a Wikipedia of data? Or do you think that the web itself should be our open data commons?

Further Reading
Using data for better results
Data should be set free
The value of information



The web as platform: Research streaming

March 18, 2008

Michael has taken some of the thoughts that Neil, Cameron and I have had recently and come up with something even cooler: a research stream. The idea, described here, is really interesting. He has used an RSS WordPress plugin, research activity from Twitter, and feeds from Flickr (appropriately tagged), CiteULike and his blog (again, only research-related material). What’s cool is that those Twitter messages are generated by a script that sends Subversion log messages to Twitter (oh so cool and geeky).

This led to a follow-on thought. Right now, we are at a stage where, on the web, we have platforms, aggregators and destination sites. We can consume information passively, but only access it actively. It would be very interesting to see how we can go from passive streaming to passive stream processing. In my mind that’s a Semantic Web problem, where we can find relationships in data streams, but of course, I could be way off base. That said, until the web becomes a platform not just for automating information distribution and aggregation, but also for autodiscovery and autointelligence, it’s not quite the comprehensive platform that it can be. That’s what an intelligent backend should be able to do: interpret all that data meaningfully with minimal human intervention.


Semantic thoughts #2

March 4, 2008

In Semantic thoughts, I talked a little bit about my Semantic Web epiphany. In this post, I want to discuss some additional thoughts as I get more familiar with concepts about linked data and the Semantic Web blogosphere in general (like I needed to find one more thing to get interested in).

In recent days, in the mainstream tech blogosphere and in media in general, I’ve been reading a whole bunch of posts, some hype, many others poorly written, about what the Semantic Web is all about (in this post I am referring mostly to some of the more formal Semantic Web concepts like RDF). What really got me was Mathew Ingram’s post that the Semantic Web is boring, and therein lies my irritation. Yes, if one looks at the underpinnings of what makes up the Semantic Web, it looks somewhat hairy and academic (I was intimidated and unimpressed for the longest time). However, for some reason people’s expectations for the Semantic Web are either something glitzy or something life changing. Why is that? My gut feeling is that the Semantic Web is getting a lot of hype because some prominent blogs and even mainstream media have started talking about it, because Sir Tim Berners-Lee is Sir Tim Berners-Lee and the fact that he is involved gives it some degree of visibility, and because companies have started getting funding. But people forget one thing. The Semantic Web is essentially a backend framework for getting more information from the web of data. In a perfect world, consumers wouldn’t know or care that their application was powered by the Semantic Web, unless they were wondering why their new discovery service was so much more powerful. I don’t believe that the Semantic Web is going to be the panacea that some think, but I do agree with Nova Spivack that the next decade of the web is going to be all about how we make our data smarter. That decade will be driven as much by good data design and leveraging metadata as by brute force and cool algorithms for interrogating the current structure of the web (a la PageRank). The Semantic Web will only be one part of that smarter backend, which will include everything from new data sources (mobile, geolocation, etc.) to distributed computing and database architectures. What the front end will look like, I don’t really know.

What I do know is that the smarter backend will result in better applications, but just as most people don’t really know that Google Maps is powered by MapReduce or that Yahoo Search is now powered by Hadoop, they won’t really know that the superior performance of their application in finding and presenting information is because the web is becoming a linked data web.

Yes, the fact remains that most people developing Semantic Web apps, e.g. BioDash, are not web designers but more academic types, so their apps don’t look good, but that’s going to change. Just look at goPubmed. Perhaps that’s why so much rides on Twine. Twine is arguably the first true, highly visible Semantic Web consumer app. Freebase, much as I love it, is a platform first and foremost, and designed for the developer.

So where do we go? Personally, I don’t really care about better social networks (although FOAF, etc. do focus on connecting people), but I do care about “data finding data”. In the life sciences, there is enough structured data that I believe a semantic life science web is inevitable. If we don’t go in that direction, the life science community is hurting itself. If we want to start making sense of all the complex relationships, best expressed as a graph, then RDF is the way to go. This will only happen when there is a large enough body of developers well versed in data models who work with those who know how to build usable websites that leverage new, smarter backend technology. That will take a few years, but I am confident it will happen out of sheer necessity. One only has to look at a conference like CSHALS to see the kinds of problems that bring together the Semantic Web and the life sciences.
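As a rough illustration of why a graph is a natural fit for these relationships, here is a small Python sketch of the subject-predicate-object idea behind RDF. It is not an RDF library and not anyone’s actual data model. The `ex:` identifiers and the three statements are hypothetical, and the matching function is a toy.

```python
# Each statement is a (subject, predicate, object) triple.
# The ex: identifiers are hypothetical, chosen only for illustration.
TRIPLES = [
    ("ex:geneA", "ex:encodes", "ex:proteinA"),
    ("ex:proteinA", "ex:interactsWith", "ex:proteinB"),
    ("ex:proteinB", "ex:associatedWith", "ex:diseaseX"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def reachable(triples, start):
    """Collect every node reachable from `start` by following any predicate."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for _, _, obj in match(triples, s=node):
            if obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return seen


if __name__ == "__main__":
    # "Data finding data": start at a gene, end up at a disease.
    print(sorted(reachable(TRIPLES, "ex:geneA")))
```

Because every fact has the same three-part shape, statements from different sources can be appended to the same list and traversed together, which is the property that makes the graph model attractive for heterogeneous life science data.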

I would like to give a shoutout to Talis here. If every Semantic Web evangelist could be Paul Miller, the world would be a better place. Talis is a platform developed by a company that uses it to build library systems. It also just happens to be a platform for building RESTful web services on top of a Semantic Web backend. It’s up to developers to grok it and use it for building the next generation of applications, applications that really harness and leverage collective intelligence. That last part is why I believe that the Semantic Web is essential for the life sciences. We need to marry the collective intelligence from both the biological data and a knowledgeable community that really understands the data.

Further reading
RealTech
Geospatial Semantic Web
Nova Spivack on the Meaning and Future of the Semantic Web
Nova Spivack on Collective Intelligence and Hyperdata
Web 3G
Paul Miller channeling Richard Waters


Web as platform: A social network for ideas and bursty work

February 21, 2008

I swore that I would not be joining random social networks. For one, between Twitter and Facebook, I am all set, but also, none really add anything to my online productivity (sites that do, like Dopplr, I’ll join in a minute). Well, I just decided to join one. It’s called Kluster. What is kluster (they could really use the shift key)?

kluster is a place to harness the power of community collaboration to get stuff done. everyone has ideas, we provide a platform to get them out of heads and into the world…where they belong.

we initially built kluster to facilitate large group decision-making during product development, marketing/advertising initiatives, and event planning. then, after the system got its algorithmic brain, we realized it was powerful in virtually any decision-making activity, with groups large or small.

The only reason kluster caught my eye was a Businessweek article and the fact that the founder, Ben Kaufman, had managed to get into TED, an event I dream of speaking at some day. Anyway, I really liked what I was reading, went to the site and liked what I saw. So, I’ve signed up (where’s the OpenID, folks?).

Will the site live up to expectations? Can’t really tell, but it did give rise to some thoughts. At Ignite Seattle the other day, Justin Martenstein gave a talk on Six Hour Startup. The challenge here is what to do once an idea comes to fruition. Can a site like kluster function as an exchange for ideas and collaboration, and in addition as a place where people can bring an idea to life and allow others to potentially run with it? Here is why it might work, keeping in mind the ability of people to somehow get drawn to inane ideas, the kind that get Facebook apps funded.

Ideas that attract investments prevail, and those who invest in them gain equity in the project—whether it’s a logo, a toy, or a corporate marketing campaign. If a company buys a product or an idea from Kluster, Watts turn into dollars.

If it works the way I hope it does, that’s a pretty powerful vehicle to monetize ideas and concepts. Will be back with more once I’ve had a chance to peruse the site. I am going in with a dose of healthy skepticism, since the kinds of projects might just be a turn-off.

Also, I wonder how well Kluster sits next to Google Code or SourceForge. They definitely seem to have thought through some of the IP issues, but I am still not clear on a couple of things. If I understand correctly, the real benefit of kluster is for “open” projects, whether code-related projects or otherwise. You will be most effective using open source, open data, Creative Commons, etc. If I were the Kluster folk, I would really enable some of these capabilities.

Anyway, to sum things up while I wait for my account to get activated, Kluster seems to be part InnoCentive, part Basecamp, part marketplace, part Facebook. Definitely intriguing.

PS. Nice choice of name for their streaming server domain


Andrew Walkingshaw on making the Semantic Web usable for scientists

February 18, 2008

I first ran into Andrew Walkingshaw when I saw his excellent talk on Web 2.0 for scientists on YouTube. At Scifoo, I got a chance to meet Andrew and help coordinate a session where he presented Golem, a project that has since seen the light of day. The talk emphasized the need to abstract the technology away from the end users. This week Andrew is at Semantic Camp, where he has a presentation on automatically indexing science (slides below). The presentation gives you a real taste of what Golem seeks to accomplish.

The goal in essence is to take the chemical data captured as CML and make it more usable for a general audience. The key is slide 40, where Andrew talks about SPARQL and how scientists shouldn’t need to know SPARQL, which makes it very important to present a usable front end to your average scientist. This is a point that cannot be overemphasized. We have wonderful technology, technology that enables us to extract rich information from our datasets. The challenge to the development community is simple. How can we bring this technology to the scientific community and truly enable them? It is also why I argue so often that your average computational scientist shouldn’t be developing websites. There are just too many crummy services out there.
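One way to picture “scientists shouldn’t need to know SPARQL” is a thin layer that turns a plain-language choice into a query. The Python sketch below is only that, a sketch under stated assumptions: it is not Golem’s code, and the `ex:` vocabulary, the property names and the `build_query` helper are all hypothetical. It builds the query text and does not talk to any endpoint.

```python
# Hypothetical vocabulary prefix, made up for this example.
PREFIX = "PREFIX ex: <http://example.org/chem#>"

# Friendly names a scientist might pick from a menu,
# mapped to hypothetical predicates in that vocabulary.
PROPERTIES = {
    "melting point": "ex:meltingPoint",
    "band gap": "ex:bandGap",
}


def build_query(prop, minimum):
    """Build a SPARQL SELECT for compounds whose `prop` is at least `minimum`."""
    if prop not in PROPERTIES:
        raise ValueError(f"unknown property: {prop!r}")
    predicate = PROPERTIES[prop]
    # Coerce to float so only a number can end up in the query text.
    threshold = float(minimum)
    return "\n".join([
        PREFIX,
        "SELECT ?compound ?value WHERE {",
        f"  ?compound {predicate} ?value .",
        f"  FILTER(?value >= {threshold})",
        "}",
    ])


if __name__ == "__main__":
    # The user supplies two form fields; the query stays out of sight.
    print(build_query("melting point", 300))
```

The design choice worth noting is that the user-facing surface is a short list of property names and a number. The query language, the vocabulary and the endpoint can all change behind that surface without the scientist ever noticing.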
