Posts Tagged ‘Natural Language Processing’

NLTK: a natural language processing toolkit in Python

October 11th, 2008

NLTK looks very useful.

“NLTK — the Natural Language Toolkit — is a suite of open source Python modules, data and documentation for research and development in natural language processing. NLTK contains Code supporting dozens of NLP tasks, along with 40 popular Corpora and extensive Documentation including a 375-page online Book. Distributions for Windows, Mac OSX and Linux are available.”

The development of NLTK is led by Steven Bird, Edward Loper, and Ewan Klein.

(Spotted on Language Log.)

Which flavour does knowledge have on the web?

October 9th, 2008

In recent debates within the KiWi - Knowledge in a Wiki project, the need arose to further refine and find a common understanding of the type of knowledge that is (ideally) managed and processed using (semantic) wikis. One of the proposals revolved around a conceptualization of knowledge put forward by Gabi Reinmann-Rothmeier, also dubbed the “Munich Model” (Münchner Modell).

In the Munich Model, knowledge comes in three states of matter: solid (like ice), liquid (like water) and gas (like water vapor).

“Frozen” knowledge is knowledge in its most tangible, manageable form, for instance the type of verified, expert-endorsed information you would find in an encyclopedia like the Encyclopedia Britannica.

“Gaseous” knowledge, on the other hand, is knowledge in its least consolidated form: think for instance of the type of heated debate you might have with folks in a pub, which is arguably the least structured, most uncontrollable, but also the most engaging type of knowledge!

And the “liquid” form of knowledge, finally, is the common knowledge of day-to-day life. It’s probably fair to say that it becomes obvious mostly when it is in the process of changing its state of matter: when it is calibrated against “frozen” or informational knowledge, or when it is debated and becomes “gaseous” knowledge that informs action. (If you’d like to know more about the Munich model and are able to read German, you might want to download the original article here - PDF, 365 KB).

When talking about knowledge that is managed, used or, respectively, that evolves online, I think it also makes sense to pay some attention to the type of community that is preferred by particular online tools or environments. The particular flavour of knowledge, in this sense, is simultaneously characterized and shaped by the state of matter of knowledge and the form of the community that applies.

N.B. The following is not an immediate translation of the “Munich model”, but rather a reconceptualization which tries to also consider that different community models (and their implementation through IT) also play a role for the whole spectrum of knowledge management on and with the web (e.g. for online communication and interaction, online publishing and documentation and maintenance of web infrastructures).

Web-Flavour 1: The Blogosphere - gas, gas, gas!

Hmm… sniff it! This is the flavour I like best because it is my flavour. On the blogosphere (and twittersphere), knowledge is exchanged, developed further and evolves almost like in a pub debate… it does have the extra advantage though that you can add links, cite resources and that you get to keep your blog posts (or tweets or equivalents thereof) for later reference or debate. Different people approach blogging differently - the approach I would favour in the context of this definition is a form of blogging that invites dialog in that it allows others to comment and react, and where contributors aren’t anonymous, distant institutions, but are addressable using their personas/identities on the web. As such, contributions are often marked or tinted by the views and personality of that real-life person behind a persona/identity. As a short cut, think of this flavour as the flavour of the social media tag cloud.

Web-Flavour 2: Wikipedia - evolving slowly with the flow

Wouldn’t you agree that Wikipedia is like a sea of knowledge? It is fed by brooks and rivers (in this analogy: for instance the articles and contributions that are invited on the Wikipedia Community portal) that make it rise and swell like tidal waves would, but mostly by millions, billions and trillions of tiny drops that trickle in on a daily basis. In comparison to the blogosphere, the world inside Wikipedia is a rather neat and orderly one: Titles of pages are unique, and where they aren’t, there are disambiguation pages (like this one) in place. Even though articles are written by real humans (I assume), there is no visible author attached to an article (unless you start developing an interest for Discussion Pages, e.g. this one; most people don’t). Wikipedia is the sea of knowledge we bathe in on a day-to-day basis without even noticing - just try to remember how many and which Wikipedia pages you have looked at today or this week - can you? Most probably not. It’s the result of a community effort, but it’s not about views and opinions of individuals, but about what we all know together or would know together if we could wire our brains to one another (tired of the Wikipedia examples? Check out Factolex instead, a collaborative, micro-content encyclopedia that allows you to extract and conceptualize bite-sized pieces of knowledge as you go).

This is the flavour I like to have around me every day, because it makes things easier without asking for a huge effort. It’s also the flavour of thesauri and metadata schemata.

Web-Flavour 3: The unfinished structure of ontologies

Flavour three is NOT (as one might expect) the flavour of the Encyclopedia Britannica Online (which is one huge data silo and therefore not relevant for my scope of interests) … instead, I would argue that it’s the flavour of the web’s infrastructures and of knowledge infrastructures like ontologies. Think of the geometry of snowflakes: they all follow rules but none of them is like another. The open world assumption of ontologies also applies to snowflakes - just because you haven’t seen a particular shape doesn’t mean it doesn’t exist! Nobody has the patent for building snowflakes - Wilson Bentley in his famous snowflake shots just captured an expression of rules that are out there, in the world, belonging to the world. Ontologies capture the structure of what, to the best of knowledge and belief, can be said about the world. Anyone can build an ontology, but we prefer to have experts do that job: members of the scientific community which has its own ranking and weeding mechanisms in place.

Flavour three is the flavour of things we wish to be able to rely on on the web, and where we can invest a trust that is much greater than the trust we invest in people. More like the trust we invest in, say, the laws of nature.

So what is the flavour of a Semantic Wiki?

A good mix of flavour two and three, I would argue. A Semantic Wiki is a vessel for the sea of relevant knowledge (relevant for instance for the members of a particular team), but enhances it with the structure of the domain knowledge that applies.

Having said that: A semantic wiki would be much spicier if it also had a bit of the flavour of the blogosphere and social media, as there are tasks where a bit of a debate, a bit of a controversial exchange and the ability to respond to people directly is highly valuable! Just as water, knowledge goes through a cycle of different states of matter, and knowledge is not processed by segregated individuals, but in communities and through networks of people.

Before publishing this, I wanted to get some feedback in particular from KiWi members working on enabling technologies - here is Peter Dolog’s take; Peter is an Assistant Professor in the Information Systems Unit at the Dept. of Computer Science Aalborg University:

Peter Dolog: I like the distinction and comparison of knowledge to some natural elements like gas, liquids, solids or snowflakes. These give a good metaphor for understanding when talking about different flavours of knowledge. It is also fascinating to see how humans move between these three categories by participating in different social processes or simply by studying these things.

It is, however, a bit more difficult to see how this can be done or supported in the most suitable form on the web or in the intranets of companies. At the same time it seems to me, from the discussions we have had in the KiWi project, that semantic wiki platforms could indeed facilitate this. Wikis naturally provide the social contexts for contributions. Semantic wikis with tags and ontology management seem to be a first step towards a flexible knowledge consolidation infrastructure where one can move easily between these categories; and other technologies such as natural language processing and automated reasoning can help further. Personalization can further provide and adjust views on the knowledge according to preferences.

I am happy that we can study these phenomena in the KiWi project at least to a certain extent and perhaps contribute to this as well. I am confident that this is relevant also for the industry and especially for large distributed companies where externalization of knowledge is a must.

So there is a lesson to be had: When building a knowledge management system using the web, think of the three states of knowledge, but most importantly, also think of the form of community and community processes that are required or preferable to allow future users to really put that knowledge to work - melt it, share it, heat it, debate it, freeze it, keep it, let it evolve!

Image sources on Wikicommons:
Water vapour by Markus Schweiss
Wake at Boelge Stor by Malene Thyssen
Ice Crystals by Petr Dlouhý

The Wild vs The Orderly: Folksonomies and Semantics (TRIPLE-I 2008)

September 4th, 2008

This second day of TRIPLE-I 2008 was my personal folksonomy day, even though the theme was already set yesterday, with Andreas Hotho’s invited talk about “Extracting Semantics from Folksonomies” which was the opening lecture of the workshop “Knowledge acquisition from the Social Web.”

Andreas Hotho is directing the Bibsonomy project at Kassel University’s Knowledge and Data Engineering research group; Bibsonomy is a social bookmark and publication sharing system catering especially for researchers who, next to bookmarking, also wish to manage publications. Among other interesting things, Bibsonomy supports the import of bookmarks from del.icio.us, Firefox bookmarks and local BibTeX files. Being a project led by a university’s computer science department, Bibsonomy is at the same time the result, the object and a stimulus for research in the area of tagging and folksonomies. Andreas describes this double appeal of folksonomies to both ordinary people and researchers in a 12-second vlog post:


Andreas Hotho’s statement about folksonomies and research (see www.bibsonomy.org) on 12seconds.tv

One of the outcomes of the research into folksonomies is FolkRank, a search algorithm that exploits the structure of folksonomies; the name reveals that it was inspired by PageRank, but as the graph of folksonomy structures does not correspond to the web graph, some adaptations had to be made. The specifics of these adaptations can be found in an online article by Andreas and his colleagues: “FolkRank: A Ranking Algorithm for Folksonomies” (PDF, 268 KB).
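
To make the idea a bit more tangible, here is a rough sketch in Python. It is not the authors' implementation: it simply treats users, tags and resources as nodes of one undirected graph, runs a PageRank-style iteration once without and once with a preference for a chosen node, and takes the difference as a topic-specific score. The toy posts, the damping value of 0.7, the size of the preference boost and all function names are assumptions made for this illustration; the paper has the actual definition.

```python
from collections import defaultdict

def build_graph(posts):
    """Turn (user, tag, resource) triples into a weighted undirected graph."""
    weights = defaultdict(lambda: defaultdict(float))
    for user, tag, resource in posts:
        nodes = (("u", user), ("t", tag), ("r", resource))
        for a in nodes:
            for b in nodes:
                if a != b:
                    weights[a][b] += 1.0  # one more co-occurrence of a and b
    return weights

def weighted_rank(weights, preference, damping, iterations=50):
    """Iterate w = damping * A w + (1 - damping) * p on the graph."""
    nodes = list(weights)
    total_pref = sum(preference[n] for n in nodes)
    p = {n: preference[n] / total_pref for n in nodes}
    w = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_w = {n: (1.0 - damping) * p[n] for n in nodes}
        for source in nodes:
            out_total = sum(weights[source].values())
            for target, weight in weights[source].items():
                new_w[target] += damping * w[source] * weight / out_total
        w = new_w
    return w

def folkrank_like(posts, preferred_node, damping=0.7):
    """Score every node by how much a preference for one node lifts it."""
    weights = build_graph(posts)
    uniform = {n: 1.0 for n in weights}
    biased = dict(uniform)
    biased[preferred_node] += len(weights)  # assumed boost for the topic
    baseline = weighted_rank(weights, uniform, damping=1.0)
    focused = weighted_rank(weights, biased, damping=damping)
    return {n: focused[n] - baseline[n] for n in weights}

# Hypothetical toy folksonomy: (user, tag, resource)
POSTS = [
    ("alice", "python", "nltk.org"),
    ("bob", "python", "python.org"),
    ("bob", "nlp", "nltk.org"),
    ("carol", "cooking", "recipes.example"),
]
scores = folkrank_like(POSTS, ("t", "python"))
```

In this toy data the "cooking" corner of the graph is disconnected from the preferred tag, so it ends up with a lower score than anything near "python", which is the kind of topic-sensitive ranking the name suggests.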

Andreas Hotho’s talk more specifically addressed the search for methods to identify tags which describe the same concept (or a more specific / a more general concept respectively) within a folksonomy. He suggested two approaches:

  1. Applying measures directly to folksonomy statistics, so that each tag can be described as a vector; e.g. co-occurrence frequency and FolkRank could serve as similarity measures (both tend to favour high-frequency tags), or cosine similarity (which is more likely to produce “siblings”)
  2. Looking up tags in an external thesaurus/vocabulary (for instance achieving semantic grounding by mapping a tag and its most similar tags to WordNet synsets)
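
The first approach can be sketched in a few lines of Python. This is my own minimal illustration, not code from the talk: each tag is described by a vector counting the other tags it shares a post with, and two tags are compared by the cosine of their vectors. The sample posts and function names are made up.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence_vectors(posts):
    """Map each tag to a Counter of the tags it shares a post with."""
    vectors = defaultdict(Counter)
    for tags in posts:
        for a, b in combinations(sorted(set(tags)), 2):
            vectors[a][b] += 1
            vectors[b][a] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical posts, each given as its list of tags
POSTS = [
    ["python", "programming", "tutorial"],
    ["java", "programming", "tutorial"],
    ["python", "programming"],
    ["java", "programming"],
    ["recipe", "cooking"],
]
vectors = cooccurrence_vectors(POSTS)
sibling_score = cosine(vectors["python"], vectors["java"])
unrelated_score = cosine(vectors["python"], vectors["recipe"])
```

Note that "python" and "java" never appear on the same post here, so their plain co-occurrence count is zero, yet their cosine similarity is high because they keep the same company. That is the "siblings" effect mentioned above.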

Future areas of interest within folksonomy research Andreas proposed were trend detection, tag recommendation, detecting spam (a major challenge!), logsonomies (i.e. the structure of search engine query log files) and learning synsets, hierarchies, and structures of folksonomies. Andreas Hotho can be contacted via his homepage, if you have any further questions regarding Bibsonomy, FolkRank or this present piece of research.

Another presentation dedicated to folksonomies - and the presentation that won my personal presentation design award - was “Seeding, Weeding, Fertilizing - Different Tag Gardening Activities for Folksonomy Maintenance and Enrichment” by Katrin Weller and Isabella Peters, both from the Dept. of Information Science at Heinrich Heine University in Düsseldorf. The entire presentation was designed to match the CI of Tagcare, a tag gardening tool that is hopefully going to go online soon.

The term “Tag Gardening” was borrowed from James Governor who wrote in a 2006 blogpost:

“Like plants or animals, tags evolve in an emergent fashion, open to hybridisation. Stewardship can help grow and put roots down.

Helping the darwinian process is tag gardening.

Tag gardening is about taking tags in the wild and tending to them, or identifying a wild tag that will do well in your south facing IT garden. I am talking about domestication here.

Just like there are professional bloggers i am pretty sure some parties will emerge that get paid for their abilities.”

I seriously hope that the latter is going to come true, even though I have the feeling that most providers will continue to consider user input and effort pro bono work!

Katrin Weller’s intro (Isabella Peters had excused herself) focused on the well-known problems with tags and folksonomies, e.g. :

  • spelling variants, synonyms, abbreviations, different natural languages
  • adhoc or personal functions of tags other than content description (e.g. “toread”, “@Henry”, “nicepic”)
  • flatness of tag clouds which allows for browsing by popularity, but not by semantic interrelations

She further distinguished three levels where tag or tag cloud improvement becomes relevant:

  • single document vs document collection level
  • Single user vs collaborative level
  • intra- and cross-platform level (e.g. different tagging conventions, tag separation with comma or blank space, etc)

To push the gardening metaphor even further, Katrin presented their ideas of weeding, seeding, fertilizing etc.:

Weeding
The weeds in this case are “bad” tags like spam or misspelled tags (weed: any plant that crowds out cultivated plants)
Aim: enhancing recall and a consistent indexing vocabulary
Achieved by: type-ahead functionality, editing functionalities, natural language processing, user guidelines for indexing and retrieval, nomination of authorized users as gardeners
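
As a hedged illustration of what automated weeding could look like (my own sketch, not TagCare's approach, which was not shown in detail): normalize spelling variants first, then fold rare near-duplicates into the most frequent similar tag. Plain string similarity from Python's difflib stands in for real natural language processing here, and the cutoff of 0.85 is an arbitrary assumption.

```python
import difflib
import re
from collections import Counter

def normalize(tag):
    """Lowercase a tag and drop punctuation, underscores and whitespace."""
    return re.sub(r"[\W_]+", "", tag.lower())

def weed(tag_counts, cutoff=0.85):
    """Merge spelling variants into the most frequent similar tag."""
    merged = Counter()
    for tag, count in tag_counts.items():
        merged[normalize(tag)] += count
    kept = []  # canonical tags chosen so far, most frequent first
    cleaned = Counter()
    for tag, count in merged.most_common():
        match = difflib.get_close_matches(tag, kept, n=1, cutoff=cutoff)
        if match:
            cleaned[match[0]] += count  # fold the variant into its neighbour
        else:
            kept.append(tag)
            cleaned[tag] += count
    return cleaned

# Hypothetical tag counts with typical variants
RAW_TAGS = {
    "Semantic-Web": 5,
    "semanticweb": 3,
    "semantic_web": 2,
    "semantikweb": 1,
    "toread": 4,
}
cleaned_tags = weed(RAW_TAGS)
```

A purely string-based gardener like this will happily merge tags that merely look alike, which is one reason why the human gardeners and user guidelines listed above would still be needed.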

Seeding
Seeding in folksonomies means to expand frequently used tags by more specific tags (called “baby tags” or “seedlings” by Katrin Weller; seedling: young plant or tree grown from a seed)

Landscaping
The idea of landscaping here means to create “flower beds” through identifying species of tags, e.g. by similarity.
Aim: enhancing precision and expressiveness

Fertilizing
Fertilizing in this context means to combine folksonomies with other knowledge organization systems (KOS): thesauri, controlled vocabularies, ontologies, etc. (fertilizer: any substance such as manure or a mixture of nitrates used to make soil more fertile). Fertilizing might work both ways, Katrin suggested: a folksonomy might be fertilized with the semantic structure of a KOS, or a KOS enhanced by terms from a folksonomy.
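
The two directions can be illustrated with a toy example. Everything below is hypothetical: the mini-thesaurus, the tag counts and the function name are invented for this sketch. Tags that match a thesaurus label inherit that concept's broader term (KOS to folksonomy), while frequent tags without a match are collected as candidate terms for the thesaurus (folksonomy to KOS).

```python
# Hypothetical mini-KOS: concept -> broader concept and accepted labels
THESAURUS = {
    "programming language": {"broader": None, "labels": {"programming language"}},
    "python": {"broader": "programming language", "labels": {"python", "py"}},
    "java": {"broader": "programming language", "labels": {"java"}},
}

def fertilize(tag_counts, thesaurus, min_count=2):
    """Ground tags in the thesaurus and collect candidate terms for it."""
    label_index = {
        label: concept
        for concept, entry in thesaurus.items()
        for label in entry["labels"]
    }
    enriched = {}
    candidates = []
    for tag, count in tag_counts.items():
        concept = label_index.get(tag.lower())
        if concept is not None:
            # KOS -> folksonomy: the tag inherits the concept's structure
            enriched[tag] = {
                "concept": concept,
                "broader": thesaurus[concept]["broader"],
            }
        elif count >= min_count:
            # folksonomy -> KOS: a frequent unknown tag is a candidate term
            candidates.append(tag)
    return enriched, sorted(candidates)

enriched, candidates = fertilize(
    {"py": 3, "Java": 1, "folksonomy": 5, "nicepic": 1}, THESAURUS
)
```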

And finally TagCare: The ambitious plan is to have a system that allows you to import tag clouds from Flickr, del.icio.us and Bibsonomy, cleanse out dissimilarities between tags, add hierarchical structure to the tag clouds, view tag statistics and probably also use community features, such as calibrating one’s tags with those of the chief gardener or activating collaborative spam elimination. It is going to be a free service, and if you want to be notified when it goes live, you might want to send an email to Katrin.

This full-service proposal for tag gardening does of course sound brilliant - yet is it going to be feasible, on a technical level? In the post-presentation discussion, somebody mentioned Faviki, which relies on DBpedia concepts to solidify the tag cloud. It didn’t exactly seem as though the TagCare team had already thought along these (semantic web) lines, even though this perfectly corresponded to their ‘Fertilizing’ idea. But if TagCare solely relies on good human gardeners, how long will it take until they have gained a big enough community to stimulate someone’s altruism? The idea of tag gardening of course is beautiful, and I am curious to learn more about the technology it is going to use.

Other folksonomy and tag related presentations that I was unable to attend or am unable to describe now, after the 10th hour of my 2nd day at TRIPLE-I, with a band performing folklore music involving yodeling and probably Schuhplattler right outside of this room:

  • Quality Metrics for Tags of Broad Folksonomies (Celine Van Damme, Martin Hepp, Tanguy Coenen, University of Brussels, Universität der Bundeswehr München)
  • Providing Multi Source Tag Recommendations in a Social Resource Sharing Platform (Martin Memmel, Michael Kockler, Rafael Schirru, German Research Centre for Artificial Intelligence DFKI)
  • Semantic Tagging and Inference in Online Communities (Yildirim Ahmet, Üsküdarli Suzan, Boğaziçi University)
  • Using Visual Features to Improve Tag Suggestions in Image Sharing Sites (Mathias Lux, Oge Marques, Arthur Pitman, Klagenfurt University)
  • Harnessing Wikipedia for Smart Tags Clustering (Maria Grineva, Maxim Grinev, Denis Turdakov, Pavel Velikhov, Russian Academy of Sciences)

Please leave a comment if you think that any of the above needs correction.

EDIT: I got the chance to record another 12 seconds definition (and am thinking of setting up a video glossary for the Semantic Web now): Rolf Sint from Salzburg Research explains what folksonomies are and why folksonomies and ontologies go together well in 12 seconds!


Rolf Sint explains folksonomies and their relation to ontologies on 12seconds.tv

Microsoft buying Powerset

June 27th, 2008

Allegedly. Powerset are natural language processing specialists. See also last year’s ISWC talk from CTO Barney Pell, “Natural Language and the Semantic Web”, discussions with Barney from last month’s Talis Semantic Web Gang chat, and earlier commentary from Paul Miller.

SemTech 2008: Nova Spivack (Radar Networks) - “Experience from the Cutting Edge of the Semantic Market”

May 20th, 2008

Nova Spivack of Radar Networks gave a keynote talk at the 2008 Semantic Technologies Conference this morning.

He started off by giving some background to Twine. Twine is a service that lets you share what you know. When Nova pitched the original idea for the underlying platform to VCs in 2003, he was told that it was a technology in search of a problem. Thanks to DARPA and SRI, Nova had carried out some research in this field for a few years. The initial proposal to VCs was to develop next-generation personal assistants based on the Semantic Web. After the initial knock back, Nova went out again to raise funding, and Paul Allen stepped in as the first outside angel with Vulcan Capital.

Radar started working on the first commercial version of the underlying platform and also began work on the Twine application. The platform underneath Twine is not something they’ve talked about much so far, and they will discuss it (not at this conference) in the Fall. Radar also want to allow non-Semantic Web savvy people to build applications that use the Semantic Web without doing any programming.

Twine was announced last October at the Web 2.0 Summit. They began the invite-only beta soon after that. The focus of Twine is interests. It’s a different type of social network. Facebook is often used for managing your relationships, LinkedIn for your career, and Twine is for your interests. He called it “interest networking” as opposed to social networking.

With Twine, you can share knowledge, track interests with feeds, carry out information management in groups or communities, build or participate in communities around your interests, and collaborate with others. The key activities are organise, share and discover.

Twine allows you to find things that might be of interest to you based on what you are doing. The key “secret sauce” is that everything in Twine is generated from an ontology. The entire site - user interface elements, sidebar, navbar, buttons, etc. - comes from an application ontology.

Similarly, the data is modelled on an ontology. Twine isn’t limited to these ontologies. Radar are beginning the process of bringing in other ontologies and using them in Twine. Later, they will allow people to make their own ontologies (e.g. to express domain specific stuff). In the long run, the community infrastructure will allow people to have a more extensible infrastructure.

Twine does natural language processing on text, mainly providing auto tagging with semantic capabilities. It has an underlying ontology with a million instances of thousands of concepts to generate these tags (right now, they are exposing just some of these). Radar are also looking at statistical analyses or clustering of related content, more of which we will see in the Fall (mainly, which people, items and interests are related to each other). For example, “here are a bunch of things that are all about movies you like”. Twine uses machine learning to create these clusters.

Twine search also has semantic capabilities. You can filter bookmarks by the companies they are related to, or filter people by the places they are from. Underneath Twine, they have also done a lot of work on scaling.

Consumer prime-time launch of Twine is slated for the Fall. A good few bugs still have to be addressed, but Nova says there has been a “wonderful flowering of participation and friendships” in Twine. Many networks of like-minded people with common interests are being formed, and it is very interesting to see this take place. Nova himself has 500 contacts in Twine, and just 300 in Facebook. He now uses it as his main news source. David Lewis (the top Twiner) has 1000+ contacts in Twine.

Twine wants to bring semantics to the masses, and is not just aiming at Semantic Web researchers: it has to be mainstream. The main common thread in feedback received is that the interface needs to be simplified more. (Nova says he shaved his head as part of this new simpler interface :-)) Someone who knows nothing about structured data or auto tagging should be able to figure out in a few minutes or even seconds how to use it. It takes a few days at the moment to get a sense of the value, but Nova says it can be very addictive when you get into it.

Individuals are the first market, even if you are on your own and don’t have any friends :-) It is even more valuable if you are connected to other people and join groups, which gives a richer network effect. The main value proposition is that you can keep track of things you like and people you know, and capture knowledge you think is important.

Motley Fool recently talked about Google killers. Twine is not one, according to Nova, as it is not trying to index the entire Web. Twine is about the information that you think is important, not everything available. Twine also pulls in related things (e.g. from links in an e-mail), capturing information around the information that you bring in.

When groups start using Twine, collective intelligence starts to take place (by leveraging other people who are researching stuff, finding things, testing, commenting, etc.). It’s a type of communal knowledge base similar to other things like Wikia or Freebase. However, unlike many public communal sites, in Twine more than half of the data and activities are private (60%). Therefore privacy and permission control is very important, and it goes deep into the Twine data.

Initially Radar had their own triple store, an LGPL one from the CALO project. They found that it didn’t scale towards web-scale applications, and it didn’t have the levels of transaction control you’d need from an enterprise application. They decided to go for a SQL database (PostgreSQL) with WebDAV. However, relational databases weren’t optimised for the “shape” of data that they were putting into it, so it needed to be tweaked. They’ve had no performance issues so far, but they may move to a federated model next year. Twine uses an eight-element tuple store (subject-predicate-object, provenance, time stamp, confidence value, and other statistics about the triple or item itself). They can do predicate inferencing across statements, access control, etc. The platform is all written in Java, and Twine then sits on top of that.
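
To picture what such an extended tuple might look like, here is a speculative Python sketch. It is emphatically not Radar's schema (the platform is Java and its details were not shown): only the first six fields are named in the talk, so the remaining elements are lumped into a free-form `stats` mapping, and all names and sample values are invented.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Statement:
    """A triple plus per-statement metadata, kept in one flat record."""
    subject: str
    predicate: str
    obj: str
    provenance: str       # where the statement came from
    timestamp: datetime   # when it was asserted
    confidence: float     # 0.0 .. 1.0
    stats: dict = field(default_factory=dict)  # unnamed extra elements (assumed)

def match(store, subject=None, predicate=None, obj=None, min_confidence=0.0):
    """Return statements matching the pattern; None acts as a wildcard."""
    return [
        s for s in store
        if (subject is None or s.subject == subject)
        and (predicate is None or s.predicate == predicate)
        and (obj is None or s.obj == obj)
        and s.confidence >= min_confidence
    ]

NOW = datetime(2008, 5, 20, tzinfo=timezone.utc)
EXAMPLE_STORE = [
    Statement("ex:alice", "ex:likes", "ex:SemanticWeb", "bookmark", NOW, 1.0),
    Statement("ex:alice", "ex:likes", "ex:Movies", "auto-tagger", NOW, 0.4),
    Statement("ex:bob", "ex:likes", "ex:Movies", "bookmark", NOW, 0.9),
]
confident_movie_fans = match(EXAMPLE_STORE, obj="ex:Movies", min_confidence=0.5)
```

The point of the design, as far as I understood it, is that provenance, time and confidence travel with each statement, so filtering on them is a simple comparison rather than a join against separate statements about statements.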

Next he talked about the Twine beta status. There have been 20,000 beta testers in the last 30 days, 9,000 twines created and 150,000 items added; 60% of twines are private, and new features are being added every four weeks (in point releases). Some of the feature requests they’ve received include import capabilities, interoperability with other apps, and the ability to use other ontologies.

Twine will stay in invite beta for the summer. Soon, they will take off the password door to the public twines, so that they will all be visible to search engines. Radar will be SEO-ing the content automatically, so you will see more “walk-ins” after that happens. They will still be able to control who gets an account, but stuff will be publicly accessible.

In the Fall, Radar will open it so that anyone can open an account. You will be able to really customise Twine, to author and develop rich semantic content. Nova says that Twine will then be a step beyond blogs and wikis when it happens (but he can’t say much about the new stuff for now).

Next, there were some questions.

Q: The first one was about privacy. What if you add something and then later you decide that you want to delete it - is it really deleted or does Twine keep it around?

A: Nova answered that currently, it is not really deleted, it goes into a non-visible triple. But they will be doing that (really deleting it) soon.

Q: What is the approach to interoperability with Twine? What other types of semantic applications will Twine work with?

A: Today, Twine works with e-mail (in / out), RSS (get feeds out), and browsers (e.g. for bookmarking). There have been lots of requests for interoperability with mindmaps, various databases, enterprise applications, etc., so Radar are giving it a lot of thought. Twine has to provide APIs. They have a REST and a SPARQL API: they are not fully ready just yet, but by end of the year Twine will have a usable REST API. Unfortunately, Radar can’t handle the long tail of requests for features, there’s just too much, but an API will help people to make their own add-ons.

Then there’s the ontology level. You will be able to get the data about you or related to you out of Twine in RDF. You should also be able to get stuff out using other ontologies that are common, e.g. using FOAF, SIOC (yay!), or Dublin Core.
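
Twine's actual export format was not shown, so the following is only a guess at what a FOAF-flavoured export could look like, written as a small Python function that emits Turtle. The FOAF namespace and the terms foaf:Person, foaf:name, foaf:knows and foaf:topic_interest are real; the profile structure, the example.org URIs and the function names are assumptions for illustration.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

def escape(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')

def profile_to_turtle(profile):
    """Serialize a simple profile dict as a FOAF description in Turtle."""
    name = escape(profile["name"])
    parts = [f'foaf:name "{name}"']
    parts += [f"foaf:knows <{uri}>" for uri in profile.get("knows", [])]
    parts += [f"foaf:topic_interest <{uri}>" for uri in profile.get("interests", [])]
    body = " ;\n    ".join(parts)
    subject = profile["uri"]
    return f"@prefix foaf: <{FOAF}> .\n\n<{subject}> a foaf:Person ;\n    {body} .\n"

# Hypothetical user profile
PROFILE = {
    "uri": "http://example.org/people/ada#me",
    "name": "Ada Example",
    "knows": ["http://example.org/people/bob#me"],
    "interests": ["http://dbpedia.org/resource/Semantic_Web"],
}
turtle = profile_to_turtle(PROFILE)
```

For anything beyond a toy, an RDF library would be a safer choice than string formatting, since it takes care of escaping and serialization details.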

They are also looking at specific adaptors that they need to build. For example, this includes importers for del.icio.us, Digg, desktop bookmark files, Outlook contacts, and a bunch of others. They will be rolling out some of these in the Fall timeframe. Also, there may be a demand for Lotus Notes interoperability - or Exchange - possibly. Radar may actually look at other semantic applications like Freebase that they could interoperate with first. They have already hardcoded in some interoperability with Amazon for example.

Q: When Radar went to VCs and were turned down, was Twine part of the pitch? (For the second time around with Paul Allen, the questioner presumed that Nova did have it as part of the pitch.)

A: In 2003, Radar had a desktop-based semantic tool called “Personal Radar”. It was basically a Java-based P2P “Twine” using RDF. It had lots of eye candy and visualisations. The VCs said “semantic what?” and it was extremely hard to explain P2P, Semantic Web, RDF, and knowledge sharing to them. He said the VCs are mainly interested in when you are going to make money for them. But most of his pitch was blue sky, with no business plan, demonstrating a piece of technology, and pushing the fact that he knows people will need it. Paul Allen was more visionary, and he really believes adding structure to the Web is inevitable. He was willing to take a bet before they were in business. Then they went on to get Series A funding. The VCs said it was too early, but they eventually got it. Series B wasn’t as hard, and it fell into place in a matter of weeks, so it was a good round.

Even though there’s a lot of talk about the Semantic Web in the press and on the Web, most VCs are still figuring it out now and they are interested in making just one bet in the space. The main thing you need to avoid is being a platform without having any applications to show. It has to be compelling, where you can envisage users using them. Valley VCs are jaded about platforms.

Q: As one imports information from various places, what exactly is there in Twine that will prevent a person having to merge any duplicate objects?

A: Nova said there is limited duplication detection at the moment, but this will be improved in a few months. Most people submit similar bookmarks and it is reasonably straightforward to identify these, e.g. when the same item is arrived at through different paths on a website and has different URLs.
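
Nova did not describe how the detection works, so here is just one plausible building block, sketched in Python: reduce each bookmark URL to a canonical form and group bookmarks that collapse to the same form. The list of tracking parameters and the normalization rules (dropping "www.", trailing slashes and fragments) are my assumptions, and this only catches syntactic variants of the same address, not genuinely different paths to one item.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameters that do not change the target page
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "sessionid"}

def canonical_url(url):
    """Reduce a URL to a normalized form for duplicate comparison."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if parts.port:
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key.lower() not in TRACKING_PARAMS
    ))
    # The fragment is dropped on purpose
    return urlunsplit((parts.scheme.lower(), host, path, query, ""))

def find_duplicates(urls):
    """Group URLs that share a canonical form; return groups of two or more."""
    groups = {}
    for url in urls:
        groups.setdefault(canonical_url(url), []).append(url)
    return [group for group in groups.values() if len(group) > 1]

# Hypothetical bookmarks
BOOKMARKS = [
    "http://www.example.org/post/42/",
    "http://example.org/post/42?utm_source=feed",
    "http://example.org/post/43",
]
duplicates = find_duplicates(BOOKMARKS)
```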

Q: Ivan Herman from the W3C asked if Radar were considering leveraging the linked open data community?

A: Nova said that DBpedia would be one of those main sources of data that they want to integrate with - the FOAF-scape, the SIOC-o-sphere, and DBpedia. Wikipedia URIs are already being used to identify tags, and this is something they will leverage.

Q: How can copyright be managed in Twine?

A: Nova said that it’s thanks to the Digital Millennium Copyright Act (DMCA). It provides a safe harbour if you cannot reasonably prevent anything and everything from being uploaded (and are unaware of it). Twine’s user agreement asks users not to add other people’s copyrighted material. Fair use is okay, and if you share something copyrighted, it is better to have a blurb with a link to the main content. Therefore, Twine is using the same procedure as other UGC sites.

Q: How are Radar going to make money?

A: Twine is focused on advertising as the first revenue stream. Twine has semantic profiles of users and groups, so it can understand their interests very well. Twine will start to show sponsored content or ads in Twine based on these interests. If something is extremely relevant to your interests, then it is almost like content (even if it is sponsored). They will be pilot testing this advertising soon.

Q: Have Radar been approached by Google or Facebook, given that the value proposition for Twine is very interesting?

A: Nova said they are not trying to compete with Facebook (right now!), but rather they are trying to find the magic formula that will work for Twine right now. Facebook has a lot of fluffy stuff: vampires, weird games, etc. Nova said he’d prefer to spin the bottle with a real person. Twine will focus on professional people who have a stronger need for a particular interest, doing things technically that are outside the scope of what they are doing at the moment.

Q: Why does Twine use tuple storage: why is it not using a quad?

A: Nova said it’s faster in their system, so for performance reasons they decided to avoid reification.

(I will also post my notes from Eric Miller’s keynote in the next day or three.)

English

CTO of Powerset Dr. Barney Pell on SemanticWeb Gang

May 16th, 2008

Yesterday I had the pleasure of listening to Dr. Barney Pell, CTO and founder of Powerset. For anyone who is interested in the Semantic Web, Natural Language Processing and the next generation of search, I highly recommend this podcast - it is very insightful.

English

Calais - What’s On the Horizon?

May 13th, 2008

Well, we’ve just finished Release 2 of Calais – so it’s time to start talking about what’s in the pipeline for our next few months. This isn’t a roadmap – but it should give you a general idea of the areas we’ll be focusing on. As always, we welcome your input and ideas.

We hate vaporware – so there are a few things we’re working on that we’re not going to talk about until they’re ready. A few things we know will be happening over the next 90-120 days…


Expanding breadth and depth of our metadata generation capabilities

R2 was a milestone for us and the technical platform underlying Calais. In this release we began to take advantage of some of the many open data assets to deploy new entities such as sporting events, entertainment awards and others. This was our trial run – which resulted in nearly a dozen new entity types. We don’t believe in a pure lexicon-driven approach (it delivers unreliable results in too many cases), but the combination of expanded lexicons with a great NLP (Natural Language Processing) wrapper is delivering high-quality results very fast. We tried it. It worked. It’s going to accelerate – and we’re very open to hearing about the types of entities you’d like to see added in the future.

Linked Data
We could write on this topic for a few hundred pages – but we won’t. One significant area of development work for us is going to be creating the connections that allow you to move from what Calais extracts to the wider world of linked data assets. We’re not going to share the details right now, but we think it’s going to open up a whole new world of potential applications.

More Integration Tools

We’re going to keep working on building tools to make Calais more accessible and easier to integrate. Two of those tools for the mid-term time frame are the hooks necessary to integrate Calais with Yahoo Pipes and Microsoft Popfly.

More plugins, extensions, modules to augment popular tools
In R2 we released our WordPress plugin and others developed modules for Drupal and a UIMA annotator.  We’ll continue this effort by releasing additional plugins and updated and improved versions of the ones we have.

As always – thanks for the support, suggestions, blog mentions and the occasional zing. It’s a wild ride.

Tom
 

English

Selling and overselling the Semantic Web

May 12th, 2008

We got on the train together.  I had just finished a four-day training/consulting session with a company doing information integration for international security.  She was doing a master's degree, with a thesis about Ontologies.  Like a good grad student, she was a voracious reader.  She had read white papers, research papers, books, web pages, magazine articles, and anything else she could get her eyes on.  The more she read, the more confused she became.

It is hard to be surprised at this.  It seems that just about everyone is jumping on the Semantics bandwagon.  Is an Ontology a top-down way to organize all human knowledge?  Or just a glorified ERD?  Or a controlled vocabulary?  Will it take an Ontologist to make them?  Or will they be something that everyone can do, like a web page?  Will Ontologies make the web come alive as a sentient, intelligent being?  You can find someone who has seriously purported variants of all of these, all using the name "Ontology".

So I just sorted it all out for her before we got to Elephant and Castle.

Well, not really.  There are just too many contradictions.  Is the Semantic Web about a top-down organization of everything?  Or a woolly free-for-all?  Are vocabularies controlled or not?  Is content authored or automatically generated?

But here's what I was able to offer.  One story, about a web of information.  A story about information sharing.  A story that builds on the success of things like Wikipedia and the World Wide Web.

In my story there is no need for natural language processing.  Inference plays a key role, but not an analytic one; it is  just a way to connect information together.  Upper ontologies are largely irrelevant, but reusable ontologies are not.

Will this technology story solve every problem?  No.  It will not diagnose diseases, it will not automatically index your library. It will not make your search engines obsolete.

But no technology story can do all that - the best we can hope for is a story that is coherent (it actually makes sense), feasible (it can be done with extant technology), and, perhaps most importantly, valuable (it provides some real business value).  I think I have such a story - and in that story, there happens to be no need for natural language processing, upper ontologies, or highly sophisticated inferencing. 

So, what is that story?  I did finish the story before we got to Elephant and Castle, but that's a 45-minute ride.  I can't fit the whole thing into a blog entry.  But I can fit it into a book.

Yes, this whole entry was a troll for the book.  Check it out.

Uncategorized

Powerset launches Powerlabs at TechCrunch40

September 17th, 2007

Powerset is undertaking a huge task: building a natural language search engine that reads and understands every sentence on the Web. The good news is that thanks to the technology that we’ve licensed from PARC combined with homegrown technology developed by our strong team of linguists and computer scientists, we’re well on the path to achieving this goal.

We realize that most companies wait to launch until they have a completely usable beta version. Because Powerset is a natural language search engine, the earlier we have input from the best natural language processing units on the planet – the brains of humans – the quicker our search engine will improve. Through a combination of quantitative feedback, qualitative suggestions and AI learning techniques, Powerset will get much smarter when people are interacting with it.

That’s the reason we’ve decided to create Powerlabs, which we are opening up to the first group of users today at TechCrunch40. Powerset Labs is a community where users can provide feedback on our product design and natural language engine. While users will not be able to interact with the Powerset open search box across the Web, we are giving users a peek at technology demonstrations that show off some of Powerset’s natural language processing capabilities.

Though the content of Powerset Labs will change based on user feedback, we wanted to share with you what we demoed today at TC40: Powermouse and Use Cases.

Powermouse is a window into Powerset’s natural language index. When Powerset reads sentences in Wikipedia, we go from open text to representations of meaning. In other words, we take text and turn it into structured “facts.” When users enter a query into Powermouse, they’ll be able to browse the “facts” stored in our index. In the example below, when wrestling star “Hulk Hogan” is entered in the first “something” box, users can see all of the facts we’ve indexed about him. Now, if you add “defeat” into the connection box, users see all of the facts that we’ve indexed from Wikipedia about wrestlers that Hogan has defeated. Here’s a Hulk Hogan screenshot. In addition to showing off the power of our index, Powermouse also shows a different type of interface that’s possible with a natural language index.
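For readers who like to see the idea in code, the “something” and connection boxes can be thought of as lookups over stored facts. The following is only a rough sketch under stated assumptions, not a description of how Powerset’s index is actually built: the FactIndex class, its add and lookup methods, and the sample facts (“Wrestler A”, “Wrestler B”, “Film C”) are all hypothetical and made up for illustration.

```python
from collections import defaultdict


class FactIndex:
    """Toy in-memory index of (subject, relation, object) facts.

    Hypothetical illustration only; not Powerset's real data structure.
    """

    def __init__(self):
        # Facts grouped by lowercased subject for case-insensitive lookup.
        self._by_subject = defaultdict(list)

    def add(self, subject, relation, obj):
        """Store one fact, e.g. ("Hulk Hogan", "defeat", "Wrestler A")."""
        self._by_subject[subject.lower()].append((subject, relation.lower(), obj))

    def lookup(self, subject, relation=None):
        """Return all facts about a subject, optionally filtered by relation."""
        facts = self._by_subject.get(subject.lower(), [])
        if relation is None:
            return list(facts)
        return [fact for fact in facts if fact[1] == relation.lower()]


# Made-up sample facts standing in for what would be extracted from text.
index = FactIndex()
index.add("Hulk Hogan", "defeat", "Wrestler A")
index.add("Hulk Hogan", "defeat", "Wrestler B")
index.add("Hulk Hogan", "star in", "Film C")

# Filling only the "something" box: every fact about the subject.
all_facts = index.lookup("Hulk Hogan")

# Adding "defeat" to the connection box: only the matching facts.
defeated = [obj for _, _, obj in index.lookup("Hulk Hogan", "defeat")]
```

The hard part, turning open text into such facts in the first place, is exactly what the natural language processing does; this sketch assumes that step has already happened.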

Use Cases demonstrates how a natural language query can exploit Powerset’s index. Unlike Powermouse, Use Cases lets users express their intent in natural language. We’ve picked about a dozen use cases that illustrate how a natural language index can return results that are qualitatively different than keyword results. For example, here’s a query of "Who mocked Blair?" that shows how Powerset understands all of the various ways "mock" can be expressed in English. After a query, we encourage users to tell Powerset which of our results are good and which are not. We also ask that users vote on which results are better: the Powerset results or the keyword results on the right. All of this user feedback is what will help make Powerset a better search engine.

Once users have tried out the applications in Powerset Labs, we invite them to submit ideas about how to make them better. Within Powerset Labs, users can browse through ideas, vote on the best ideas and comment on ideas. As users participate in Powerlabs, they’ll get karma points for everything they do. Eventually, users with the most karma will get perks within the community. The bottom line: We’re listening and we’ll try to implement your brilliance as soon as we possibly can.

As a note, Powerset has received a lot of attention over the past few months and we’ve been overwhelmed with the number of people who have signed up for Powerset Labs. Instead of letting everyone in at once, we’ve decided to let people into Powerset Labs in the order they signed up. We want to make sure that each group of Powerset Labs users gets a great experience, so we’re going to grow the community slowly and carefully. If you’d like to sign up, go to labs.powerset.com and we’ll be letting in the next wave of users as soon as possible.

We’re really excited to see the Powerset Labs community grow, to gather and implement your feedback, to share with you more and more technology demonstrations and prototypes, and ultimately to deliver a transformative search engine to the world.

Oh, and we have our first official social media press release about our launch at TechCrunch40 if you are press and need contact information, quotes or screenshots.

English