
The value of feature extraction

I haven’t mentioned Twine in this space, which may come as a shock to some, none more so than myself. Well, there are good reasons for everything. For one, I haven’t had the time to sit down and take stock. For another, I haven’t been able to elucidate my thoughts in any coherent manner. So let us do this the hard way, by taking a step back.

Semantic analysis has been a staple of bbgm. I am a strong believer in machine-readable data and the power it can provide. Whether it be a system like Freebase that allows you to add structure to data, or the kinds of entity extraction and data contexts that Jon Udell and Jeff Jonas talk about, making your data more “intelligent” is something that we should strive for, whether it be in the world of business, or the world of life science (or as Andrew Walkingshaw might point out, materials science). My old colleagues at Scitegic have a motto – “Ask more of your data”. That has always resonated with me. So let’s take yet another look at how to make your data smarter, especially when the semantic web seems to be inching towards more mainstream acceptance.

Let’s start with a quote from a talk on Ambient Findability:

  • For every search on cancer.gov, there are over 100 cancer-related searches on public search engines.
  • Of these searches, 70% are on specific types of cancer.

There is another statement of interest in the same talk:

… the ability to find anyone or anything from anywhere at anytime

The above statements bring to mind the subject of context. Let us agree that “data finds the data”. In that case we must also agree that data must be found in the correct context. Don’t believe me? Just ask Jeff Jonas. In my mind, if machines are to do this, semantic markup of some sort is the only way. Extracting information from documents, regardless of format, whether they be text, images, or video, is one of the key challenges of our times. In the life sciences, right now, I don’t really know of any way (if someone knows of one, let me know) to extract the metadata from an image or a video, correlate it with metadata in a set of text files, and automatically come to a conclusion about the potential context of the two observations. I talked about Persistent Context for the life sciences in the past. Let me steal another of Jeff’s ideas, that of Sequence Neutrality. Essentially, “context engines must constantly be on the lookout for new observations that change earlier assertions – and if a new observation provides such evidence – the invalidated assertions from the past must be remedied.” Context and feature extraction together make a very powerful mix, which can help pharma companies find better, safer drugs faster. This is especially critical in the kind of healthcare environment taking root today, with an emphasis on pharmacovigilance, early safety assessment, etc. If we can continuously update our safety databases based on new data, we are likely to identify adverse events faster, and could essentially carry out constant meta-analysis.
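
To make the Sequence Neutrality idea a little more concrete, here is a minimal sketch in Python. It is my own toy illustration, not Jeff Jonas’s implementation: the `ContextEngine` class, the `threshold` parameter and the drug/event names are all hypothetical. The point is only that every new observation triggers a re-evaluation against everything seen so far, so an earlier assertion can be withdrawn when later evidence undercuts it.

```python
from dataclasses import dataclass, field


@dataclass
class ContextEngine:
    """Toy context engine: keeps every observation and re-derives assertions."""

    # Hypothetical rule: net reports needed before a drug/event pair is flagged.
    threshold: int = 2
    observations: list = field(default_factory=list)
    flagged: set = field(default_factory=set)

    def observe(self, drug, event, retracted=False):
        """Record one observation, then re-evaluate the affected assertion."""
        self.observations.append((drug, event, retracted))
        self._reassess(drug, event)

    def _reassess(self, drug, event):
        # Net support is computed over ALL observations seen so far,
        # so the outcome does not depend on the order of arrival.
        support = sum(
            -1 if was_retracted else 1
            for d, e, was_retracted in self.observations
            if (d, e) == (drug, event)
        )
        if support >= self.threshold:
            self.flagged.add((drug, event))
        else:
            # A later observation can invalidate an earlier assertion.
            self.flagged.discard((drug, event))


engine = ContextEngine()
engine.observe("drug_x", "liver toxicity")
engine.observe("drug_x", "liver toxicity")
print(engine.flagged)  # the pair is flagged after two reports
engine.observe("drug_x", "liver toxicity", retracted=True)
print(engine.flagged)  # the retraction withdraws the earlier assertion
```

A real safety database would obviously need far richer evidence rules than a simple count, but the shape is the same: assertions are derived from the full set of observations and are never treated as final.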

Jon Udell, in a post commenting on Tim O’Reilly’s review of Twine, talks about entity extraction and a Firefox plugin called Gnosis. I had heard about Gnosis before, but only looked at it askance. However, Jon’s post made me take a second look, and all I can say is WOW. Take a look at the screenshot below. It shows the features that Gnosis extracted from my blog post on pharma futurology. The interesting thing is not the actual results, but the concept. If you could do the Freebase thing, and add additional information which gets stored in a dictionary somewhere, you have that much power available to you. Just as a note, for more complex pages, Gnosis is not always accurate, but the potential is obvious. You can also perform additional queries based on the extracted features. There will come a time when the options available will be that much more powerful. Adaptive Blue’s BlueOrganizer also takes a similar approach, recognizing books, websites, etc.
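
For anyone who wants a feel for the concept, dictionary-driven entity extraction can be sketched in a few lines of Python. To be clear, this is not how Gnosis works internally (I don’t know the details); the `ENTITY_DICTIONARY` contents and the `extract_entities` function are made up for illustration, and a real system would add disambiguation and statistical models on top.

```python
import re

# Hypothetical dictionary mapping surface terms to entity types.
ENTITY_DICTIONARY = {
    "freebase": "Product",
    "twine": "Product",
    "radar networks": "Company",
    "jon udell": "Person",
}


def extract_entities(text, dictionary=ENTITY_DICTIONARY):
    """Return (matched text, entity type, start offset) for each dictionary hit."""
    hits = []
    for term, entity_type in dictionary.items():
        # Word boundaries avoid matching a term inside a longer word.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.group(0), entity_type, match.start()))
    # Report hits in the order they appear in the text.
    return sorted(hits, key=lambda hit: hit[2])


sample = "Jon Udell wrote about Twine, from Radar Networks."
for surface, entity_type, offset in extract_entities(sample):
    print(offset, surface, entity_type)
```

Adding a new term to the dictionary immediately makes it recognizable everywhere, which is the “Freebase thing” in miniature.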

Which brings us to Twine, Radar Networks’ soon-to-be-released semantic web solution. Announced at the Web 2.0 Summit, Twine is at heart an information manager, but with the potential to be a lot more. Like Gnosis, Twine performs entity extraction on documents. Unlike Gnosis, at least as far as I can make out, the power of Twine comes from the crowd effects. Once a group of people have collected a large set of documents, you can use the tags associated with a document to screen through everything else that your collection, or twine, might contain about that tag. Twine also bases its extraction on your browsing behavior and can handle various media types. Of course, the devil lies in the details. Without testing Twine, I am not sure how useful it can really be, but it is definitely promising. Since it uses natural language processing, machine learning and network effects, I almost think about it in the same way as I might look at systems from Linguamatics or Biowisdom, which perform text analytics, except that Twine does so much more. Another thing I like about Twine is that it is queryable using web standards (RDF, SPARQL, OWL, etc.).
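
The “use the tags on one document to screen everything else” idea is essentially an inverted index. Since I haven’t tested Twine, the following Python sketch is only my guess at the general shape, not its actual design; the sample `collection` and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical collection: document title -> tags extracted from it.
collection = {
    "Pharma futurology": {"pharma", "drug safety"},
    "Freebase for scientists": {"freebase", "semantic web"},
    "Notes on pharmacovigilance": {"drug safety", "semantic web"},
}


def build_tag_index(docs):
    """Invert the document -> tags mapping into tag -> documents."""
    index = defaultdict(set)
    for title, tags in docs.items():
        for tag in tags:
            index[tag].add(title)
    return index


def related_documents(title, docs, index):
    """Find the other documents that share at least one tag with this one."""
    related = set()
    for tag in docs[title]:
        related |= index[tag]
    related.discard(title)  # a document is not related to itself
    return sorted(related)


index = build_tag_index(collection)
print(related_documents("Pharma futurology", collection, index))
```

The crowd effect comes from the fact that the index grows with every document anyone in the group adds, so each person’s tags help everyone else’s searches.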

I see a lot of potential for Twine. In a RRW post, Nova Spivack of Radar Networks mentions that content providers have expressed interest in Twine. I presume Twine can become aware of the various life science ontologies, the markup in PLoS journals, OTMI, etc.  In fact both Gnosis and Twine have a lot of potential in the publication area. A semantic plugin that allows you to find papers by an author, similar articles, or use a keyword to seed a search could be quite useful.
Collaborative research, finding data types and relationships across an organization, and so on become very real and somewhat simple possibilities. Now Twine is just an example. Conceptually, this is easy to grok. The success of these applications lies in the implementation and in how they can be made accessible to a wider (read: non-developer) audience.

One of the problems I see is that there are several models out there. Freebase is a queryable datastore, on top of which you can build structure. Twine is a smart entity extraction and behavioral analysis system. At Web 2.0, Nova Spivack suggested that Twine would tie into Freebase, and that, I think, is very important. Left to themselves, these systems just give us silos again. If Twine, Freebase, Gnosis, etc can somehow be linked together, we have a lot of power available to us, especially if we throw into the mix text analytics packages. I am inclined to agree with Spivack (and disagree with Danny Hillis) here; the web is the platform. Platforms like Freebase should only be built on top of it.

Further reading:
Freebase – The scientists perspective
The semantic web goes mainstream
New era of semantic apps
When web sites become web services


This entry was posted in Informatics, Semantic Web, Software & Internet.
  • Could you please elaborate more on
    "If Twine, Freebase, Gnosis, etc can somehow be linked together, we have a lot of power available to us, especially if we throw into the mix text analytics packages."
  • Gosh, this was a long time ago, but let me try and guess. The idea was to take some of the dictionaries and ontologies underlying Gnosis (now Open Calais), the network graphs in Twine and the structured knowledge in Freebase and mash them up together. The text and data mining ideas would be along the lines of what Toby Segaran has done with Freebase.
  • I've been struggling with translating our UsefulChem knowledgebase into something more semantically marked up for machine (or human) use. One of the key challenges is that current approaches rely on "facts" rather than the reality that scientific information is always fuzzy. A big part of maturity in science is the ability to recognize how reliable a piece of information is from the context. New students are generally not great at doing this.

    For example, we have reactions where the students reported a certain amount of compound was added. In a traditional paper, because you don't have access to the raw data you take this to be a fact. But when you are given NMRs of the reaction mixture you find that the measured ratio of compounds is not exactly what you would expect. So the experiment is compromised but it doesn't mean that a tremendous amount of useful information can't be derived. For example we can infer that all the products are stable (or not) in the solvent used for the reaction. But the concentration is "fuzzy" and we need to use fuzzy logic as scientists to process what the results mean.

  • Disclaimer

    All opinions on this blog are my own and do not reflect those of my employers, past or present