Dreaming of a life science Semantic Web platform

July 15, 2008

I have long been intrigued by IO Informatics and their flagship Sentient Suite. Bio-IT World is carrying an article about the company that reminded me of that interest. I have never used it, so I wonder if anyone else out there has?

Now, the Sentient Suite doesn’t really sit on the WWW, so it’s not really an optimal Semantic Web product. But it does use RDF, a wonderful data model for the kind of data life scientists work with (rich in relationships and metadata), and SPARQL to query that data. So at least within the confines of a company’s data sources, in theory you have a rich graph of data that can be queried. I’ve long felt that much life science software either fails to leverage structured data well or over-engineers it. The company claims to be smart about how it does this, although given the complexity of the data, I do wonder how well they achieve their goals.
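To make the graph idea concrete, here is a minimal sketch in plain Python of how subject-predicate-object triples can be stored and queried by pattern. It is a toy and says nothing about how the Sentient Suite or any real RDF store works; the ex: identifiers, the sample data and the function names are all made up for illustration. A real system would use a proper triple store and SPARQL.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Illustrative data only: a few (subject, predicate, object) statements.
TRIPLES: List[Triple] = [
    ("ex:BRCA1", "rdf:type", "ex:Gene"),
    ("ex:BRCA1", "ex:encodes", "ex:BRCA1_protein"),
    ("ex:BRCA1_protein", "rdf:type", "ex:Protein"),
    ("ex:BRCA1_protein", "ex:participatesIn", "ex:DNA_repair"),
    ("ex:TP53", "rdf:type", "ex:Gene"),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for t in triples:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


def processes_for_gene(triples: List[Triple], gene: str) -> List[str]:
    """Two-hop walk: gene -> ex:encodes -> protein -> ex:participatesIn -> process."""
    found = []
    for _, _, protein in match(triples, s=gene, p="ex:encodes"):
        for _, _, process in match(triples, s=protein, p="ex:participatesIn"):
            found.append(process)
    return found


if __name__ == "__main__":
    print(processes_for_gene(TRIPLES, "ex:BRCA1"))
```

The two-hop walk is roughly what a SPARQL query with two triple patterns sharing a variable would express; the appeal of the graph model is that adding a new kind of relationship means adding triples, not changing a schema.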

My ideal would be a life science version of the Talis platform, perhaps with an industry-facing side (much the way Talis has its library business) and a public-facing side that sits on the web, with published APIs and the underlying technology that allows developers to build tools on top of it. I am sure there’d be a ton of takers. IO has one side of this (the enterprise-facing side). It would be cool if they, or someone else, made a platform available publicly, with some underlying intelligence that can be leveraged through APIs.


Collective Intelligence in the hospital

June 18, 2008

It’s a Harvard teaching hospital, which means that a procession of young doctors come through, each with a fresh line of inquiry, few of which, when fulfilled, contributes to an institutional memory. Most of the doctors I’ve seen here have been only once or twice.

Doc Searls is sick, but like Jon Udell in the comments, I was drawn to the lines above. In medicine, where information and knowledge are truly built by the collective, and where you might find non-obvious linkages, wouldn’t we really benefit from capturing this collective intelligence across doctors and patients? A lot of healthcare systems focus on hospital efficiency and on the individual patient. I wonder if they do as good a job of capturing this knowledge, and of alerting doctors to possible diagnoses or information that adds to the knowledge they have built on their own. If not, there is an opportunity there that someone needs to tap into. And yes, I think ontologies underlying data entry would be a huge plus.


Calais gets an upgrade

May 29, 2008

I really like services like Calais and Zemanta. They are both poster children for linked data and entity extraction. Anyway, I just got an email from the folks at Calais highlighting several improvements in the service.

Hi from Calais.

We’re writing to tell you about some new and exciting developments at the Calais Initiative.

Working from the outside in – we’ve released a new web site. While our initial site was adequate for our accelerated launch – we’ve always wanted more and now we have it. The new site (of course, still at www.opencalais.com) has a range of new features to make it more useful for you. You can create rich user profiles, join interest groups, post your Calais-related creations to the gallery, and send messages directly to other members. We’ve also implemented much higher-quality forums to allow you to pose questions and share answers. Your user IDs and passwords have all been migrated for you. If you’ve forgotten your password or key, just go to the Community tab to retrieve it.

Second, a number of new tools from us and our partners are on the website. They include Tagaroo, our great WordPress plugin; Marmoset, a toolkit for integrating Calais within the Yahoo! SearchMonkey framework; a series of Drupal integration modules; and many more. Some of these tools are valuable in and of themselves – some help make it much easier for you to develop using Calais.

Third, we have significantly extended and enhanced the Calais Web Service. The service now supports a number of new entity types including TV shows, sports events, and music albums. In addition, the service can now deliver results to you not only in RDF – but as “Simple Tags” and as Microformats as well.

We’ve also created special interest areas for Publishers, Bloggers, Software Providers, Content Managers, and Developers.

And, most important is the ever-growing ecosystem of Calais powered tools that are being developed by the community. With our new Gallery, developers can create entries to highlight their own creations. I’d make it my first stop on the new site.

As for what’s coming in the future – think dramatic knowledge domain expansion and think linked data.

So, take a moment to visit Calais. The sun is shining. The breeze is soft, and cool things keep happening.

Regards,

The Calais Team

You know what I’d really like to see? A similar service for life science data, ideally with a WordPress plugin, perhaps one that uses some ontologies at the backend to identify and mark up appropriate content in blogs and other content. In the life science space, I can only think of Transinsight (sticking to commercial entities) as a potential candidate for such a task. There are other text analytics companies (Linguamatics, Biowisdom, etc.), but they have completely different business models and areas of focus.


The web as platform: WikiProteins

May 28, 2008

WikiProteins is all over the web, including BoingBoing and Ars Technica (and of course all over my FriendFeed). This is the first project by WikiProfessional, essentially a Wikipedia for specific content (not unlike the idea of Wikipedia focussing on scientific topics at a high level and pointing to other sites for more technical, domain-specific detail). Sound familiar? WikiProfessional has some of the same ideas as Google’s Knol project (where is that?). The idea is to build a concept web of knowlets. To achieve that, MediaWiki has been extended to help capture some of those underlying relationships. What I think is missing (and I am not 100% sure about this) is a true RDF backend, which would really make this phenomenal. The cool part: the current Concept Web, as they call it, is all about the life sciences.

To a great degree, this is what the web and science should be all about: pulling in data from different sources to build a new resource. WikiProteins pulls in data from other sources, e.g. PubMed. This is why, IMO, every biological content site should have a RESTful API. Let me go one step further and say that every biological content site should provide access to its data in RDF. Then we can truly say we have a linked data web.
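As a rough illustration of what "access to the data in RDF" could look like, here is a small Python sketch that turns a flat record (the kind of thing a REST endpoint might hand back as JSON) into N-Triples lines. Everything here is an assumption made for the example: the example.org URIs, the vocabulary prefix, the record fields and the to_ntriples helper are hypothetical, and every value is written as a plain string literal, which a real exporter would refine.

```python
from typing import Dict


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and newlines for an N-Triples string literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def to_ntriples(subject_uri: str, record: Dict[str, object], vocab: str) -> str:
    """Emit one triple per field: <subject> <vocab + field> "value" ."""
    lines = []
    for field, value in record.items():
        literal = escape_literal(str(value))
        lines.append(f'<{subject_uri}> <{vocab}{field}> "{literal}" .')
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical record and URIs, for illustration only.
    record = {"name": "bacteriorhodopsin", "kind": "protein"}
    print(to_ntriples("http://example.org/protein/demo", record, "http://example.org/vocab#"))
```

The point is less the serialization than the identifiers: once every record has a URI and every field is a named predicate, data from different sites can be merged simply by concatenating their triples.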

WikiProteins comes from some heavyweights. Anytime your PI is Amos Bairoch, and Jimmy Wales is a co-author, you know this is serious stuff, and I really like what they’ve done. In some ways, this is better than the Encyclopedia of Life, at least when it comes to making things accessible and available. Here is the abstract for the paper in Genome Biology:

WikiProteins enables community annotation in a Wiki-based system. Extracts of major data sources have been fused into an editable environment that links out to the original sources. Data from community edits create automatic copies of the original data. Semantic technology captures concepts co-occurring in one sentence and thus potential factual statements. In addition, indirect associations between concepts have been calculated. We call on a ‘million minds’ to annotate a ‘million concepts’ and to collect facts from the literature with the reward of collaborative knowledge discovery. The system is available for beta testing at http://www.wikiprofessional.org.

Sounds like just what the doctor ordered in some ways, especially for a protein person like yours truly. A search for one of my favorite proteins, bacteriorhodopsin, yields a knowlet already populated with a ton of info (note that the information has been added automatically, not manually, but once it is there, “experts” can edit it). The knowlet is information rich, although it is sorely missing structural information. The publications chosen are also not necessarily the first ones that come to mind. I wonder how they determine relevance? There is a nice visual histogram that lets you select various pieces of information extracted from the underlying data and concepts, and that classifies them as well (whether they are predictive, factual or a co-occurrence).

WikiProteins functionality

This is probably a good time to describe knowlets and concepts. From the paper

In WikiProteins each concept can be edited by the community. Each concept page is hyperlinked to the Knowlets of all concepts mentioned in that page. A Knowlet stores relationships between a given source concept and individual target concepts. The various relationships (F, C and A) between two concepts are computed into a single composite value, named the ’semantic association’. The technology allows the coupling of all Knowlets into a larger, dynamic ontology called the ‘concept space’
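The excerpt doesn’t say how the three relationship types are combined, so the following Python sketch is only my guess at the general shape of such a calculation. It assumes each of F, C and A is already a score between 0 and 1 and combines them with a simple weighted sum; the weights, the Relation class, the field names and the function names are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Relation:
    """Scores linking a source concept to one target concept (assumed range 0..1)."""
    f: float  # the F relationship from the excerpt
    c: float  # the C relationship from the excerpt
    a: float  # the A relationship from the excerpt


def semantic_association(
    rel: Relation, weights: Tuple[float, float, float] = (0.5, 0.3, 0.2)
) -> float:
    """Collapse F, C and A into one value with a weighted sum (assumed scheme)."""
    wf, wc, wa = weights
    return wf * rel.f + wc * rel.c + wa * rel.a


def rank_targets(relations: Dict[str, Relation]) -> List[Tuple[str, float]]:
    """Order target concepts by their composite score, strongest first."""
    scored = [(target, semantic_association(rel)) for target, rel in relations.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up scores for a made-up source concept.
    demo = {
        "retinal": Relation(f=1.0, c=0.8, a=0.4),
        "proton pump": Relation(f=0.0, c=0.6, a=0.9),
    }
    for target, score in rank_targets(demo):
        print(target, round(score, 2))
```

Whatever the real formula is, the useful property is the same: one number per source-target pair, which makes it possible to rank and visualize a concept’s neighbourhood.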

The paper has a nice figure showing how they arrive at these concepts. The next section though is what really gets me excited (emphasis mine)

Knowlets and their connections can be exported into standard ontology and web languages such as the Resource Description Framework (RDF) and the Web Ontology Language (OWL). Therefore, any application using these languages will enable the use of Knowlet output for reasoning and querying with programmes such as the SPARQL Protocol and RDF Query Language. The concept space is provided in open access. The system performs a recalculation of the semantic relationships in the entire biomedical concept space at regular intervals.

Also take a look at the linker, which adds concept web capabilities to a number of resources, including PubMed

That’s all I have time for right now. More later, after I’ve had a chance to play.


The web as platform: A science Data Commons

May 24, 2008

Cameron Neylon has a wonderful post on how we can build a Data Commons for the sciences. Cameron brings together two intricately interwoven concepts: the Data Commons itself, and the tools required to record and process all this scientific information. To a degree, it’s not too far from the WWW, where we have simple protocols connecting the pieces and tools (e.g. search engines) that bring it all together. For an open data web, the Semantic Web takes on a level of importance that most people don’t appreciate, but that’s not what this post is about.

Cameron proposes a model in his post. As he notes, repositories already exist for most data types, and the majority are open. Where the Googles and Amazons can jump in is to enable these repositories, especially with next-gen sequencing and other data types pushing the scientific community’s knowledge and capabilities. Very rightly, though, he pushes the idea of long tail science, i.e. not repositories for structures, etc., but all the information we are streaming out of our labs. What will be the infrastructure that handles this data? The problem, as Cameron notes, is data capture and, perhaps most important, data re-use, for which capturing the associated metadata is critical, and having tools that allow you to consume the data is even more critical.

There are a lot more details in the post. My preference would be for these efforts to be driven by need and intention rather than by formal committees. The internet provides protocol standards, and the Semantic Web stack is essentially complete. In various scientific domains we have efforts on data formats and standards. As we start playing around with the data, the ones that resonate will bubble to the top. The key is to make sure that we as a community come together and realize that this needs to be done. The technology will follow.
