
Archive for August, 2007

Parsing Miss South Carolina’s Statement

August 31st, 2007

It’s not like it’s easy to parse Wikipedia, but at least most of its text is (usually) written with correct spelling, capitalized proper names, meaningful paragraph structure and so on. A natural question is: how will our system perform on the rest of the Web, with all of its slang and non-standard syntax? To put Powerset to the test, two of our engineers, Lukas Biewald and Brendan O’Connor, ran our entire parsing and indexing system on the hardest corpus we could find: Miss South Carolina’s response to the question, "Recent polls have shown that a fifth of Americans can’t locate the US on a map. Why do you think this is?" They fed this transcription into the XLE verbatim, disfluencies and all:

I personally believe that U.S. Americans are unable to do so because uh some uh people out there in our nation don’t *have* maps and uh I believe that our ed- education like such as in South Africa and uh the- the Iraq everywhere like such as and I believe that they should uh our education over here in the U.S. should help the U.S. or- or- should help South Africa and should help the Iraq and the Asian countries so we will be able to build up our future

One might think that such a convoluted mess of words (I hesitate to call it "English") would be impossible to parse, but here are the C-structures that our parser generates:

Unsurprisingly, the sentence is fragmented quite a bit, but the parser clearly managed to extract useful structure throughout the sentence. The last large verb phrase, “should help South Africa and should help the Iraq and the Asian countries so we will be able to build up our future”, seems very close to correct, which is pretty impressive (Language Log has more discussion on the weird “the Iraq” construction). The output of the semantics system is too long to put here, but in some ways, it’s amazing that we were able to extract any semantics at all. And, believe it or not, we can actually run some queries against the Carolina Index (as it’s known at Powerset). It’s hard to think of a reasonable question, but we asked, "Who does education help?" and the system returned and highlighted the right answer: "Americans". Or should it have returned "U.S. Americans"?


Barney to speak at the Singularity Summit

August 30th, 2007

The Singularity Institute is hosting the Singularity Summit in San Francisco on September 8-9. Dr. Barney Pell, Powerset’s own CEO, will be presenting in the first session, called “What are the Pathways and Major Challenges.” In his talk, Barney will predict that the path to Artificial General Intelligence (AGI) will be based on a rich interplay in which top-down engineering and bottom-up brain simulation approaches meet in the middle. He’ll also talk about the role of economics in accelerating the development of AI systems. Powerset is actually an example of this: once natural language becomes a part of core search, companies will invest large funds in natural language technologies, dwarfing historical investments to date. If you want to read more, you can get your tickets today (only $50!) or read Barney’s more detailed blog post or even listen to Barney’s interview with Dan Farber. We hope to see you there!


Semantic Web Yahoo - Part Deux

August 13th, 2007

It’s been nearly 2 years since I joined Yahoo! and the semantic web-based technology I helped develop has been deployed in production for some time. It has been encouraging to see the ideas gain acceptance: today I noticed that a HotJobs search for rdf yahoo near Sunnyvale shows 5 jobs open - not in my group, but in Yahoo! Local.

Our group in Sunnyvale is continuing to look for HTTP and web caching experts, designers and coders for building REST-based web services. Right here and now we have interesting, large scale, rich data problems and are applying semweb techniques to them. Contact me if any of this sounds exciting to you.

Semantic Web Yahoo - Part one


Flickcurl - C API to Flickr

August 3rd, 2007

In January 2007, just for fun, I started writing Flickcurl, a C API to Flickr that uses the Flickr web services. The name came about because it was originally built to talk to Flickr via libcurl for the HTTP work … although right now it contains more use of libxml than of libcurl.

I started this for a bunch of reasons, including to learn more about “web 2.0” web APIs, to see how RESTy the Flickr API really is (Answer: not much, it’s very much an RPC model) and to explore the issues with developing a Web API. It’s clear this is an evolved and evolving one, since now and then I discover undocumented returned attributes in the XML and cases where it is not clear why attributes were used instead of elements. The API is well suited to dynamic scripting languages, where it is easy to pass around dictionaries / hashes / associative arrays of parameters that can grow. So in some sense, making something feel like a natural API in a static language like C is rather going against the grain and rather slow work.
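To make the RPC flavour concrete, here is a minimal sketch in Python (not Flickcurl's C code) of how a call can be built from a growing dictionary of parameters. It assumes the public Flickr REST endpoint and the `flickr.photos.getInfo` method; the helper name `build_call_url` and the placeholder API key are made up for illustration, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Every call goes to the same endpoint; the operation is chosen by the
# "method" parameter rather than by the URL path, which is the RPC style.
ENDPOINT = "https://api.flickr.com/services/rest/"


def build_call_url(method, api_key, **params):
    """Return the request URL for one API call (hypothetical helper, no I/O)."""
    query = {"method": method, "api_key": api_key}
    query.update(params)  # optional parameters can keep growing per call
    # Sort the parameters so the resulting URL is deterministic.
    return ENDPOINT + "?" + urlencode(sorted(query.items()))


# "YOUR_API_KEY" is a placeholder, not a real key.
url = build_call_url("flickr.photos.getInfo", "YOUR_API_KEY", photo_id="196308964")
print(url)
```

In a dynamic language the keyword arguments absorb any new parameter for free; in C each call needs its own function signature, which is where the slow work comes from.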

There are, however, things available to help. There are method reflection APIs, so I wrote a code-generating program that can nicely automate writing many of the simpler calls that return no value or just a single one. I also used a lot of similar patterns, so that parsing the tags XML is quite similar to parsing the comments XML. The XML is primarily read via XPath and a little DOM.
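The "similar patterns" idea can be sketched roughly as follows, again in Python rather than C. One helper takes a path and a list of attribute names, so tags and comments go through the same code. The XML samples are simplified and hand-written for illustration, not captured Flickr responses, and `parse_items` is a hypothetical name; the path syntax is the limited XPath subset supported by Python's standard library.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written samples (assumptions, not real API output).
TAGS_XML = """<rsp stat="ok"><photo id="196308964"><tags>
  <tag id="t1" author="a1" raw="Jellyfish">jellyfish</tag>
  <tag id="t2" author="a1" raw="Monterey">monterey</tag>
</tags></photo></rsp>"""

COMMENTS_XML = """<rsp stat="ok"><comments photo_id="196308964">
  <comment id="c1" authorname="someone">Nice shot</comment>
</comments></rsp>"""


def parse_items(xml_text, path, attrs):
    """Select elements by path and copy the chosen attributes plus the text."""
    root = ET.fromstring(xml_text)
    items = []
    for node in root.findall(path):
        item = {name: node.get(name) for name in attrs}
        item["text"] = (node.text or "").strip()
        items.append(item)
    return items


# The same helper handles both kinds of response; only the path and the
# attribute list change.
tags = parse_items(TAGS_XML, "./photo/tags/tag", ["id", "author", "raw"])
comments = parse_items(COMMENTS_XML, "./comments/comment", ["id", "authorname"])
print(tags)
print(comments)
```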

One other nice thing about this is that this is a piece of work with a fixed size, albeit growing slowly. The Flickr API currently has 104 calls - depending on how you measure them - so it’s easy to check progress, and that’s how I’ve been doing it. I built tools to read the docu-comments (javadoc, gnome-doc, kernel-doc style) and mark the Flickcurl coverage release by release.

The news today is that I have reached the halfway point: 50% of the API is covered with the release of Flickcurl 0.11, at least until they add something more! I have also done most of what I think are the trickier parts - the uploading, searching and getting info about photos. The remaining API parts are more regular, so I feel like I’m coding downhill now.

Now there’s something else it does - and this won’t be a surprise to most given my interests. Flickcurl generates RDF descriptions from Flickr photos with a flickrdf utility, including reading Machine Tags. The namespaces are either well known ones, or invented by me, pointing at the machinetags.org wiki - you can create your own definition.
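As a rough illustration of the Machine Tag step (a Python sketch, not how flickrdf is implemented), a machine tag has the form namespace:predicate=value and can be turned into a triple once the namespace is mapped to a URI. The prefix table and the function name `machine_tag_to_triple` are assumptions made for this example. Unknown namespaces are simply skipped here, whereas flickrdf as described above points them at the machinetags.org wiki.

```python
import re

# Machine tag syntax: namespace:predicate=value
MACHINE_TAG = re.compile(r"^([A-Za-z][A-Za-z0-9_]*):([A-Za-z][A-Za-z0-9_]*)=(.*)$")

# Hypothetical prefix table, for illustration only.
PREFIXES = {
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def machine_tag_to_triple(subject, raw_tag):
    """Return (subject, predicate URI, literal value), or None if not convertible."""
    match = MACHINE_TAG.match(raw_tag)
    if match is None:
        return None  # an ordinary tag such as "jellyfish"
    namespace, predicate, value = match.groups()
    base = PREFIXES.get(namespace)
    if base is None:
        return None  # namespace not in the table; this sketch skips it
    return (subject, base + predicate, value)


PHOTO = "http://www.flickr.com/photos/dajobe/196308964/"
for tag in ["geo:lat=36.620487", "geo:long=-121.904468", "jellyfish"]:
    print(tag, "->", machine_tag_to_triple(PHOTO, tag))
```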

flickrdf uses Raptor to do nicer serializing when it is available. So this means I can turn jellyfish into Turtles. W00t! (*)

$ ./flickrdf -o turtle http://www.flickr.com/photos/dajobe/196308964/
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/#> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://www.flickr.com/photos/dajobe/196308964/>
    dc:creator [
        a foaf:Person;
        foaf:maker <http://www.flickr.com/photos/dajobe/196308964/>;
        foaf:name "Dave Beckett";
        foaf:nick "dajobe"
    ];
    dc:dateSubmitted "2006-07-23T18:16:13Z"^^xsd:dateTime;
    dc:rights <http://creativecommons.org/licenses/by-nc-sa/2.0/>;
    dc:modified "2007-02-25T07:45:46Z"^^xsd:dateTime;
    dc:issued "2006-07-23T18:16:13Z"^^xsd:dateTime;
    dc:created "2006-07-23T05:28:50Z"^^xsd:dateTime;
    geo:lat "36.620487";
    geo:long "-121.904468";
    dc:title "Jellyfish at Monterey Aquarium";
    dc:subject "jellyfish" .

After that bad joke (and it could have been worse if I had a picture of a Turtle), here’s what you need to know. Get it at flickcurl-0.11.tar.gz (md5sum eea351e4d35e8d1c63b124cd8ee257ba, sha1sum d220f6371c0c5334c824a51ba848d9358d73e533) or get the latest from the Flickcurl Subversion. It’s licensed under the GPL2 / LGPL2 / Apache 2.0 or any newer versions of any of them.

Note: I work for Yahoo! and although Flickr is part of Yahoo! this project is my own personal work.

(*) Actually I’m slightly cheating with this example: there are a couple of bug fixes in SVN after the release which are needed to get this output.
