Archive

Posts Tagged ‘HTML’

The Day after Freebase went RDF

October 30th, 2008

So what’s been happening on the blogosphere after John Giannandrea’s keynote at ISWC and the revelation that Freebase now produces Linked Data from an RDF service?

Tetherless World sums up the Freebase facts (e.g. 156,000,000 assertions made; 1370 published types; 75 domains; graph model, identity, web based) and further points out that ontology creation “is a social process, and both freebase and semantic wiki are tools that enable users to create ontological vocabulary without worrying too much on building a comprehensive ontology.”

Inkdroid notes that the RDF service release “is important news because Freebase is an active community of content creators, creating rich data-centric descriptions with a wiki style interface, fancy data loaders, and useful machine APIs.” This is followed by a quick and handy tutorial on how to get machine-readable data back from Freebase using a Freebase URI. Conclusion:

So why is this important? Because following your nose in HTML is what enabled companies like Lycos, AltaVista, Yahoo and Google to be born. It allowed for agents to be able to crawl the web of documents and build indexes of the data to allow people to find what they want (hopefully). Being able to link data in this way allows us to harvest data assets across organizational boundaries and merge them together. It’s early days still, but seeing an organization like Freebase get it is pretty exciting.

Yves Raimond was the first to wonder on the public W3C LOD mailing list: “now, to see whether it links to other datasets :-)” - the idea of having linked data without the linkage would indeed seem like love’s labour lost. Semantic Focus / James Simmons seconds: “One downside is the data doesn’t appear to link to external resources, in a sense walling itself in. It should be trivial to link the topics that came from Wikipedia back to Wikipedia as well as DBpedia (which would be killer, by the way).” This is followed up in a later post, where James expresses concerns regarding the relationship between DBpedia and Freebase: “Freebase may see a drop in userbase growth and participation if it becomes a mirror of DBpedia (or vice-versa) and the popularity once garnered by one project may shift towards the other, or away entirely.”

More News / Andrew Newman puts the Freebase RDF service release in context with the submissions to the Billion Triples Challenge: Cathrin Weiss’ “250 million triples on your iphone” entry, iMoCo, as well as DBpedia and SemaPlorer, the latter developed at the University of Koblenz:

DBPedia stood out because it was the only one that allowed you to write data to the Semantic Web rather than just read the carefully prepared triples. For a similar reason I thought SemaPlorer was good because they tried to do more than just the standard triples but went that extra bit further by making it more generic like integrating flickr. But they were all excellent, all of them showing what you get with a billion or more triples and inferencing.

That combined with the guys at Freebase making all of their data available as RDF and it was a big day for the Semantic Web.

ARQtick / AndyS plays a bit with the Blade Runner example cited by Freebase: he takes a look at the graph, looks for interesting properties and extracts author names.

N.B. If you want to follow ARQtick’s example: use the Linked Data browser plugin Tabulator or go to the Marbles site to view the RDF - without a data browser you’ll be redirected to the HTML page. You will also need it to make sense of rdf.freebase.com.


Freebase Officially Linked Data with Release of RDF Service

October 29th, 2008

At ISWC2008 Freebase released its new RDF service for generating RDF representations of Freebase topics, allowing Freebase to be used as Linked Data! To obtain the RDF data for a topic send a GET request to http://rdf.freebase.com/rdf/some.topic.id where "some.topic.id" is replaced by the desired topic identifier (slashes in the identifier must be replaced by dots). Topic data can be represented as N3, RDF/XML or Turtle depending on the preferences expressed in your client's HTTP Accept header. Try it out with the Freebase topic Semantic Web.
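The URL rule and the Accept header described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the topic id `/en/blade_runner` is an example id, the helper names are made up here, and dropping the leading slash before replacing the others with dots is an assumption about how the rule applies. The request is built but not sent.

```python
from urllib.request import Request

RDF_BASE = "http://rdf.freebase.com/rdf/"  # endpoint named in the post


def topic_rdf_url(topic_id: str) -> str:
    """Turn a Freebase id such as '/en/blade_runner' into an RDF service URL."""
    # Assumption: the leading slash is dropped, the remaining slashes become dots.
    return RDF_BASE + topic_id.strip("/").replace("/", ".")


def build_rdf_request(topic_id: str, media_type: str = "application/rdf+xml") -> Request:
    """Prepare (but do not send) a GET request asking for one RDF syntax."""
    return Request(topic_rdf_url(topic_id), headers={"Accept": media_type})


# Hypothetical example id, used only for illustration.
req = build_rdf_request("/en/blade_runner")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing a different `media_type` is how a client would state a preference for another of the serializations the service offers.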

You can also cater to clients that prefer HTML output by using the /ns end-point (http://rdf.freebase.com/ns/some.topic.id). The service performs the content negotiation automatically; delivering human-friendly HTML representations to Web browsers, and redirecting clients expecting RDF to the /rdf URL (via 302 redirect).

One downside is the data doesn't appear to link to external resources, in a sense walling itself in. It should be trivial to link the topics that came from Wikipedia back to Wikipedia as well as DBpedia (which would be killer, by the way).


Interview for Journalism.co.uk… Journalists get to know the Semantic Web!

October 29th, 2008

I was interviewed last week by Colin Meek from Journalism.co.uk on the topic of “Web 3.0” and what it means for journalists… You can read the full article in two parts (1, 2). My original answers are part of an interview on their Insite blog. I also had the chance to talk about various DERI offerings in the Semantic Web area including SIOC, SWSE, Sindice, Semantic Radar, etc.

Colin also asked me about other readable data that is being crawled by Semantic Web search engines like Sindice, SWSE or Swoogle. These search engines can usually match keywords in any data that has been crawled or integrated into a semantic store, not just people. It could be from structured information about people, places, dates, library documents, blog items or topics, whatever. In fact, there is no limit to the types of things that can be indexed and searched - since RDF (an open data model that can be adapted to describe pretty much anything) is used as the data format. Anyone can reuse existing RDF vocabularies like SIOC to publish data, or they can publish data using their own custom vocabularies (e.g. to describe stamp collecting or Bollywood movie genres or whatever), or they can combine public and custom vocabularies (e.g. take FOAF and your own vocabulary about soccer to describe players and managers on a soccer team). Geotemporal information is particularly useful across a range of domains, and provides nice semantic linkages between things. For example, having geographic information and time information is useful for describing where people have been and when, for detailing historical events or TV shows, for timetabling and scheduling of events, etc., and for connecting all of these things together (“I’m travelling to Edinburgh next week: show me all the TV shows of relevance and any upcoming events I should be aware of according to my interests…”).

The keyword searches in the Sindice search engine allow you to find more information on where resources of interest are (searching for “john breslin” will point to all public pages that contain semantic information about yours truly). Sindice also has an API that can provide results in a reusable (semantic) format that can be leveraged by other applications. Alternatively, SWSE (Semantic Web Search Engine) shows you semantic information about the object of interest (e.g. my phone number, my friends, etc.) which may be derived from multiple sources (e.g. this information on me comes from tens of sources consolidated together via unique identifiers for me or through what’s called “object consolidation”).

For me, this article highlighted the fact that the Semantic Web community needs to be very aware that one of the key features of the Social Web for journalists and for many others is the ability to find a lot of personal and sensitive information on people, and with the advent of “Web 3.0”, we need to realise that (“with great power comes great responsibility”) the availability of contextual and semantically-related information is going to become even more apparent, and people will talk about it in both positive and negative terms. Educating site owners about what semantic data they may be publishing (knowingly or unknowingly, even if it’s just RSS feeds) is needed, and developers should determine exactly what opt-in or opt-out mechanisms are required before implementing semantic solutions. Users also should be aware of the benefits and other potential uses of their semantic data.

I think now is the time to avert any scares, because in reality, the data that is on the Web or the Social Web can be used in new ways anyway, whether metadata is present or not (some facts can be derived). Google have recently implemented some discussion forum parsing algorithms to determine how many posts are on a thread, how many users posted on that thread and when the last post was made. You can see this in a search result I did for “irish pubs boards.ie” below. It’s not complete, and probably relies on identifying certain HTML structures for non-Google discussion sites, e.g. you can see two threads in the middle that don’t show details of the total posts or commenters. But it’s moving towards the SIOC vision of providing more metadata about discussions on the Web to help you in finding more relevant information - whether the site owners want to provide Semantic Web data or not!

Making data available semantically enables computers to help us do things we cannot easily do (or cannot do at all) right now, and this is what makes it so powerful. We also need to think more towards educating people about the benefits as well as how we can minimise any hazards. Is this a job for W3C SWEO? As my colleague James Cooley said: “I think scientists thought the benefits of GM food were so obvious that there was no case to make. Then you got Frankenstein Food and the game was up.”

For journalists interested in the Semantic Web, I’d recommend reading this paper entitled “SemWebbing the London Gazette” by Jeni Tennison and John Sheridan which describes how they have exposed information from their newspaper website using RDFa so that it becomes easy to re-use (slides here). You can also view some interesting slides by Colin Meek from a seminar he gave to journalists about the Social Web in Oslo a few days ago. It’s in three parts (1, 2, 3). I’ve embedded the third part (on the Semantic Web) below…

Other posts referencing this article:


This Week’s Semantic Web

October 1st, 2008

Special Edition : SIOC Update

I had a man cold when I should have been doing my duty, but with no apologies (fairly safely assuming John has a CC-with-attribution kind of policy) here’s a good proxy :

It’s time for another installment from the world of SIOC!

Previous SIOC-o-sphere articles:

#7 http://sioc-project.org/node/328
#6 http://sioc-project.org/node/310
#5 http://sioc-project.org/node/294
#4 http://sioc-project.org/node/272
#3 http://sioc-project.org/node/271
#2 http://sioc-project.org/node/138
#1 http://sioc-project.org/node/79

If you wish to contribute to the next article, join the SIOC Twine and use the tag “siocosphere9” when you add items.


Tales from the SIOC-o-sphere #8

October 1st, 2008

It’s time for another installment from the world of SIOC!

Previous SIOC-o-sphere articles:

#7 http://sioc-project.org/node/328
#6 http://sioc-project.org/node/310
#5 http://sioc-project.org/node/294
#4 http://sioc-project.org/node/272
#3 http://sioc-project.org/node/271
#2 http://sioc-project.org/node/138
#1 http://sioc-project.org/node/79

If you wish to contribute to the next article, join the SIOC Twine and use the tag “siocosphere9” to add items.


Publishing Linked Data With PHP

September 30th, 2008

For a while now I’ve been experimenting with writing my own little PHP applications that run against the Talis Platform. Most of these have never been seen in public because they’re mainly just for scratching an itch I have at the time. I’ve also used a lot of them to validate my own thinking around the types of services that the platform needs to provide to build interesting applications. The core of most of those applications became Moriarty, my PHP library for accessing the platform. I use Moriarty extensively now to kick start any development I do. I’m even using it to write PHP scripts for running at the command line. I’m not sure that PHP is going to usurp Perl from my toolbox, but it’s certainly becoming my language of choice for working with RDF.

I’ve been looking carefully at the core patterns that my PHP applications have been following to see if there’s anything else I could pull out. This is generally how I prefer to build new libraries: extracting them from several different projects. Assuming you know how your library is going to work before you’ve written any applications is almost always wrong. I like using libraries that have distilled the essence of repeated attempts at solving the same problem. That’s why I never think about modularization of a codebase until I need to.

I’ve been gravitating towards Konstrukt because it appears to be the least intrusive of the PHP web application frameworks out there and it keeps fairly true to REST principles. I used it to build Kniblet as part of a platform tutorial. However, there are some quirks that it has that I don’t like. For example, to return anything other than HTML requires you to throw an exception. That mechanism works quite well for most applications but doesn’t really suit data-rich applications that have multiple output formats.

It’s with this in mind that I’ve started a new PHP web application framework called Paget. Calling it a framework is somewhat of an overstatement. It’s a few classes that make it easy to publish RDF as linked data. It’s very primitive at the moment, but it’s quite versatile.

It uses a simple configuration array that is passed to a dispatcher that handles the request. The application’s default behaviour is specified using this configuration. One part sets up a series of regular expressions that match URI paths handled by the application and map them to the resources it provides. The data about each resource is obtained by using one or more “generators”. These are simply classes that generate RDF for the given resource. Paget runs each generator to gather the RDF data describing the resource and then handles the serving up of that data according to linked data principles. Right now that’s just enough behaviour to function as a generic linked data publishing framework.
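The configuration-plus-generators pattern described above can be sketched as follows. This is not Paget's actual API (Paget is PHP, and none of its class or option names appear here); the sketch uses Python, and every name in it (`CONFIG`, `dispatch`, the two generators, the `example.org` base) is hypothetical.

```python
import re
from typing import Callable, List, Optional, Tuple

Triple = Tuple[str, str, str]
Generator = Callable[[str], List[Triple]]

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_RESOURCE = "http://www.w3.org/2000/01/rdf-schema#Resource"


def label_generator(uri: str) -> List[Triple]:
    """Hypothetical generator: derive a label from the last path segment."""
    return [(uri, RDFS_LABEL, uri.rstrip("/").rsplit("/", 1)[-1])]


def type_generator(uri: str) -> List[Triple]:
    """Hypothetical generator: state that the resource is an rdfs:Resource."""
    return [(uri, RDF_TYPE, RDFS_RESOURCE)]


# Hypothetical configuration: URI path patterns mapped to the generators
# that produce the RDF for matching resources.
CONFIG = {
    "base": "http://example.org",
    "routes": [
        (r"^/id/[a-z0-9_-]+$", [label_generator, type_generator]),
    ],
}


def dispatch(path: str, config: dict) -> Optional[List[Triple]]:
    """Run every generator of the first matching route; None means 'not found'."""
    for pattern, generators in config["routes"]:
        if re.match(pattern, path):
            uri = config["base"] + path
            triples: List[Triple] = []
            for generate in generators:
                triples.extend(generate(uri))
            return triples
    return None


print(dispatch("/id/me", CONFIG))
```

Serving the gathered triples as HTML or a particular RDF syntax would be a separate step after `dispatch`, which is left out of this sketch.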

I have three different deployments of Paget that are publishing three RDF data sets using different generators. Each of these was quite trivial to set up, being a few lines of configuration. For my own site’s data space I wrote a generator that fetched RDF directly from one of my platform stores (this one) and served it up as HTML and various flavours of RDF. See, for example, http://iandavis.com/id/me which is the URI that identifies me.

My second deployment was for PlaceTime, a URI space that I have operated since 2003. It provides RDF data for timelike entities like instants and intervals and spacelike points. However, it hasn’t been fully linked data compliant (mainly because it pre-dated the decision on httpRange-14). I wrote a generator for each type of entity that creates trivial RDF for each valid URI in the space. Some examples:

Finally, I created a generator that reads a local RDF file. I then used it to serve up the whisky vocabulary that Tom, I and several others created at the recent VoCamp Oxford.

Admittedly, all these datasets and spaces look pretty similar but this is still early days for Paget. I have some ideas for future development that will flesh out Paget into a fully-fledged RDF driven application framework. For example: as well as generators I plan to add filters, augmenters and transformers that alter the generated data in arbitrary fashions. These could be used to trim the data down, or to convert it to a more usable structure. I can imagine that it would be very useful to be able to pull in more RDF from arbitrary locations on the Web to supplement data in the initial set, e.g. with schema information or additional details. In my opinion that’s one of the significant differences between the web of data and the web of documents: the web of data is going to enable more information to be brought automatically together for the user rather than forcing them to seek it out.

Paget’s HTML rendering of RDF is very primitive at the moment, making only basic attempts to make it human readable. It’s still extremely tabular which is hardly a great use of structured information. One area that I’ve been interested in exploring is that of dynamic user interfaces that adapt to the underlying data automatically. RDF is particularly amenable to building these kinds of interfaces because of its uniform data model. A lot of work on this was done by the Fresnel project and it would be interesting to apply some of the learnings from that project to building dynamic web applications. My goal here is to code as little specific behaviour into the application as possible, instead making the application detect patterns in the data and provide suitable user interface behaviours at runtime. This is really the only way we’re going to be able to build true open world applications, i.e. those that are tolerant of missing data and can adapt to new and unanticipated data.

What I’m still experimenting with is whether these user interface additions should be server-side or passed on to the client. Some of the augmentations could make more sense when actioned by the client based on user activity.

There’s lots to research here and hopefully some of these ideas will make it into Paget very soon.


Why Faviki is able to suggest tags in 13 languages

September 26th, 2008

Just got in touch with Vuk Miličić from Faviki recently - Faviki has been selected as a featured project on Google code, and in that context, Vuk describes the process of how Faviki retrieves its suggestions in a little more detail. It’s really interesting! It also sheds more light on the way that DBpedia is used in Faviki: Not immediately for the retrieval of tags, but for the translation of tags - long live the smartness of linked data!

  1. Faviki fetches a web page and extracts the core text (without HTML and non-relevant content).
  2. Then it tries to figure out whether the content is in English. If it isn’t, it is sent to the Google language API, which detects the original language automatically, translates it into English and returns the translation.
  3. The content is then sent to and analyzed by the Zemanta API, which finds relevant links. Faviki uses links from English Wikipedia - titles are used as semantic tags.
  4. If the user’s language is not English, the tags must be translated. Using the DBpedia datasets “Links to Wikipedia Article”, we can find the Wikipedia titles in one of 13 languages. These datasets actually contain the connections between English Wikipedia articles and articles from Wikipedia in other languages.
  5. Finally, the suggested tags are offered to the user.
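Step 4 above amounts to a lookup from English Wikipedia titles to their counterparts in other languages. A minimal sketch in Python, assuming the interlanguage links have already been loaded into a dictionary: the table below is a tiny illustrative sample rather than real DBpedia output, and the function name and the fallback to the English title are assumptions, not Faviki's documented behaviour.

```python
from typing import Dict, List

# Illustrative sample only: English title -> {language code: local title}.
INTERLANGUAGE_LINKS: Dict[str, Dict[str, str]] = {
    "Semantic Web": {"fr": "Web sémantique", "es": "Web semántica"},
    "Ontology (information science)": {"fr": "Ontologie (informatique)"},
}


def translate_tags(
    english_tags: List[str],
    lang: str,
    links: Dict[str, Dict[str, str]] = INTERLANGUAGE_LINKS,
) -> List[str]:
    """Map English Wikipedia titles to titles in `lang`.

    Assumption: a tag with no known counterpart is kept in English.
    """
    if lang == "en":
        return list(english_tags)
    return [links.get(tag, {}).get(lang, tag) for tag in english_tags]


print(translate_tags(["Semantic Web", "Ontology (information science)"], "es"))
```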

Read the whole blog post on Vuk’s Faviki blog



Released today: SemanticProxy.com

September 23rd, 2008

We released Calais a little less than nine months ago. It’s been a fascinating process and an edifying period.

On the one hand we’ve seen a level of interest and adoption well beyond anything we’d anticipated: 6,000 registered developers. Well over 1,000,000 transactions per day. Dozens of creative and inspirational applications. It’s been great.

On the other, we have been reminded that semantically enabling the web is primarily a challenge of critical mass. Publishers are waiting for semantic consumers (search engines, news aggregators and applications) before they work on adding semantic metadata to their content.  Meanwhile application developers are waiting for the publishers to act.

We know we’ll get there in the end – but it’s slower than we’d like to see.

SemanticProxy is our attempt to jumpstart the semantic consumer end of the equation. We have all the standards we need.  What we’re missing is a critical mass of semantically enhanced content.

SemanticProxy doesn’t solve that problem, but it can act as a catalyst.

SemanticProxy makes any web site – particularly news sites – behave like a semantically enabled web site. Instead of making you write the programs to fetch a page, clean the HTML, process it with Calais and then get the resulting RDF, SemanticProxy does the heavy lifting.

You hand it a URL, and SemanticProxy hands back rich semantic metadata as RDF or MicroFormats.

SemanticProxy follows the standards for publishing linked data on the web – a good overview of which can be found here.

The best way to experiment with it is to get a Calais API key and use the URL Builder. Copy the resulting URL, paste it into your browser and you’ll see the results.

While doing that will show you what’s going on, SemanticProxy is really meant for machines to talk to. You could construct a simple web crawler that fetches the semantic content of each page. You could build a browser plugin that exposes the underlying semantic content of a page while you’re browsing. We’re looking forward to seeing your ideas.

It’s still in beta. We’ve optimized it for the top 30 English language news sites – but it works quite well on Wikipedia and other sites as well. Go forth and experiment.

We’ve designed SemanticProxy to scale well with demand. It runs almost entirely in the cloud.  It can handle tens of millions of transactions a day right now, and it can scale to hundreds of millions whenever we need to.

Visit SemanticProxy.com and let us know what you think. We’d appreciate feedback, ideas and critiques.

 



Twine massive CCK08 invitations

September 13th, 2008

I am a CCK08 student and an active user of Twine. A few days ago I announced in the course forum that I had created a Twine on connectivism, and offered everyone invitations to test it.

Other participants have proposed sending invitations in bulk to all CCK08 participants. Although I thought that Nova Spivack, Radar Networks’ CEO, would show some reluctance (there are more than a thousand of us), he wrote to me today agreeing to provide us not only with as many invitations as we need but also with a contact who will help us organize the mass mailing.

I believe that it would be positive to create a semantic knowledge base on Connectivism.

So, if you want to test Twine, you can post a comment on this post, contact me or send an email and you’ll be invited to join as soon as possible. I’ve created a Twine about Connectivism and questions related to this course. Once you are on Twine, you’re invited to join it.

Thanks in advance to George Siemens and Stephen Downes for collaborating.
 

English Resources: 
-Twine Tutorials.
-Screencasts.



Google Chrome prefers XHTML

September 2nd, 2008

The blogosphere is restless about Chrome, the new open source browser developed by Google. But I’m not going to discuss its software design, its performance or its usability; there are many people talking about that. I’ll talk about a technical detail: it prefers XHTML to classic HTML.

How do we know? As you probably know, HTTP has a header called Accept that specifies which format types are acceptable. Requesting this service developed by Richard Cyganiak with Chrome, we can get the value of that header:

text/xml,application/xml,application/xhtml+xml,
text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
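To see why this header expresses a preference for XHTML, it helps to read off the quality values: a media type with no explicit `q` parameter defaults to 1.0, so `application/xhtml+xml` outranks `text/html;q=0.9`. The sketch below is a simplified Python parser written for illustration (the function name is made up, and it ignores other media-type parameters and the specificity rules of full HTTP content negotiation).

```python
from typing import List, Tuple

# The header value reported for Chrome in the post.
CHROME_ACCEPT = (
    "text/xml,application/xml,application/xhtml+xml,"
    "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"
)


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Return (media type, q) pairs, highest q first, header order kept on ties."""
    entries: List[Tuple[str, float]] = []
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0]
        if not media_type:
            continue
        quality = 1.0  # default when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    pass  # malformed q value: keep the default
        entries.append((media_type, quality))
    # sorted() is stable, so equal-q entries stay in header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


ranked = dict(parse_accept(CHROME_ACCEPT))
print(ranked["application/xhtml+xml"], ranked["text/html"])
```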


This Week’s Semantic Web

September 1st, 2008

Selected links related to Semantic Web technologies for the week ending 2008-09-01, all weeks. Also available in RDF as linked data or via GRDDL.

Summer Special!

(or Winter Special! down under)

FOAFlets of the Caribbean

No blurb, just links.

In the Media

Docs

Software News

Vocabs/Ontologies

Events etc.

Miscellany

Quote of the Week

…fighting the web is like holding back the ocean; it will route around you or it will wear you down, but will never go away, and it will never tire or give up.

- DeWitt Clinton, On Fighting the Web Itself

Summer Bonus Quote of the Week

“Language-independent” just means they invented a new language.

- Kevin Reid

~

Sources include Planet RDF, various other blogs, Semantic Web Interest Group IRC Chatlogs & Scratchpad, ESW Wiki, SemWebCentral, Sweet Tools, W3C Semantic Web Activity, mailing lists, personal emails, etc. If you see anything suitable this coming week, please mail me or use the del.icio.us tag “TWSW” - thanks!


DERI, NUI Galway launches the boards.ie SIOC Data Competition

July 30th, 2008

The Digital Enterprise Research Institute (DERI) at NUI Galway is running a unique competition from 1st August to 30th September 2008 in conjunction with boards.ie, Ireland’s largest discussion forum site. The competition is an open contest in which entrants can win over €4000 in Amazon.com vouchers by submitting an interesting creation based on a data set of discussion posts from boards.ie over the past ten years:

  • The first prize is an Amazon voucher for $4000 (~€2500)
  • The second prize is a voucher for $2000 (~€1250)
  • The third prize is a voucher for $1000 (~€625)

Read the rules and find out more information on the contest at:

The data set (approximately 9 million documents) has been represented in the Semantically-Interlinked Online Communities (SIOC) open data format developed by DERI, NUI Galway for expressing the information contained in social websites (forums, mailing lists, blogs, etc.). Entrants may create whatever they feel is interesting based on this data: it could be a novel web application that makes use of the data set, a report on analyses performed on the data, a tool that allows one to visualise or browse the semantic structure, or whatever else the imagination can come up with!

The data reflects ten years of Irish online life, collected between 1998 and 2008 from boards.ie. boards.ie is one of Ireland’s busiest websites, with over a million unique visitors a month. The most popular discussion areas are ‘after hours’, soccer, motors, poker, and computers. Popular topic threads include one about a virtual pub (over 4000 pages), member discussions (2800 pages), poker stories (1800 pages), Liverpool rumours (1250 pages), recruitment in the Gardaí (800 pages long), and a freebie list (250 pages).

To enter the competition, go to data.sioc-project.org to access the data sets and view the guidelines. There will be three prizes for the top entries, as judged by an independent panel of three experts. The contest is open to anyone except current / former researchers with DERI and employees of boards.ie Ltd. One person may make multiple entry submissions. The closing date is the 30th September 2008.

The purpose of this contest is to generate interesting applications or creations that make use of community data represented in the SIOC Semantic Web format. All rights to these creations will remain with the contest participants (not including the underlying data, whose copyright remains with the creators). Neither DERI nor boards.ie Ltd. will acquire any commercial rights to these applications or creations as submitted through this contest. Up until now, this data has been publicly viewable, but it was difficult to leverage it without any added semantics due to the fact that it was embedded in heavily-styled HTML pages.

[DERI is a Centre for Science, Engineering and Technology (CSET) established at NUI Galway in 2003 with funding from Science Foundation Ireland (SFI). After five years of operation, DERI has become an internationally-recognised institute in Semantic Web research, education and technology transfer.]


Architectural Arguments

June 30th, 2008

Further (if weak) evidence that appeals to puppies are rather non-technical (and not even socio-technical):

...HTML owns that process of extracting a valid URI-reference from an attribute’s value string. A simple string parsing description, with associated context-specific error-handling, is more than sufficient to satisfy the needs of HTML5 without appearing to override an existing standard that has recently been agreed to by all vendors, including the few browser vendors that care about HTML5.
In contrast, pretending to define a new URL standard as part of HTML5 is not acceptable. HTML5 is a user of the Web, not a definer of it. HTML will never define the identifiers for the Web. That would be a fundamental violation of the Web architecture.—Roy Fielding

This just seems to mix up process/spec structure with system structure. It’s a bit like saying that the architecture of a building is ungainly because the blueprints are all smeared up.

The first paragraph isn’t insane. There could be a dispute as to whether the existing standard is in fact agreed to and, even if it is, whether it is de facto (or merely de jure) and what to do about it.

The second paragraph doesn’t seem to appeal to any facts at all. Sentence one is just Roy saying he doesn’t like it (though expressed in pseudo-factual terms). The second seems just false on any remotely literal reading (HTML5 isn’t the kind of thing that can use anything, much less the web!). The third sentence seems more like a declaration of his intent (i.e., it’ll never happen because he’ll make it never happen). The last seems factual, but it is definitely contentless, at least without serious supplement (i.e., it should be a conclusion, but we haven’t even seen whether it’s a violation of the standards for writing blueprints or of what the blueprints say; whose blueprints are they anyway? can we sensibly talk about blueprints for the Web?)

Thus, Roy is not giving public reasons, primarily. He’s just expressing strong dissent with some coloring of expertise to hide the bruteness of that dissent. That’s not happy discussion technique, IMHO.


The Flickcurl Story

June 28th, 2008

In January 2007 I started playing with the Flickr API - the HTTP-based web service that lets you manipulate Flickr. At that point I was using it to play with machine tags and to see how the most popular Web Service API worked, especially in the area of authentication. This was in the days before OAuth if you can remember that far back.

I started with a test program in C that called libcurl and did some of the signing and parameter marshaling of the flickr.photos.getInfo call, which is where all the juicy metadata about photos is. I started thinking about ways to map photo metadata into RDF for manipulating and querying; there is an existing Perl Flickr RDF mapping but it didn’t contain everything. This state of the sources was useful; it contained a small library with the one API method plus a command-line utility to call it. Since I was using libcurl to call Flickr, I named it Flickcurl. Also CFlickr was taken! (Flickcurl also uses libxml but flickcurlibxml is just nuts).
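A minimal sketch of what that signing and parameter marshaling step can look like, written in Python for brevity rather than the C that Flickcurl uses. It assumes the pre-OAuth Flickr scheme, where the signature is an MD5 digest over the shared secret followed by the name/value pairs sorted by parameter name. The function names, key, secret and photo ID are placeholders, not taken from Flickcurl.

```python
import hashlib
from urllib.parse import urlencode


def sign_params(params, shared_secret):
    """Return the api_sig value for a dict of string call parameters."""
    # Shared secret first, then name+value pairs in alphabetical order of name.
    base = shared_secret + "".join(name + params[name] for name in sorted(params))
    return hashlib.md5(base.encode("utf-8")).hexdigest()


def build_query(method, api_key, shared_secret, **args):
    """Marshal one API call into a signed, URL-encoded query string."""
    params = {"method": method, "api_key": api_key, **args}
    params["api_sig"] = sign_params(params, shared_secret)
    return urlencode(sorted(params.items()))


# Placeholder credentials and photo ID, for illustration only.
query = build_query("flickr.photos.getInfo", "KEY", "SECRET", photo_id="123")
```

The signature is computed before api_sig itself is added, so the signed string covers only the real call parameters.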

Apart from playing with photo metadata I also had some personal reasons to make something new. I wanted a lighter weight and less formal project than the way I had been building the Redland RDF Libraries: more of an “it compiles, ship it” model and less of the “unit tests, test cases, continual make check and worrying about portability” approach. More fun might be a way to put it. I’m happiest as a free software / open source software tool-builder, and at this point in 2007 I was spending a lot of time at work doing non-coding things such as designing specifications and technical leadership, so the chance to work on some different code now and then was an appealing counterpoint to the work stuff.

Redland is a set of libraries that have been growing since mid-2000 with more and more features as the semantic web technology stack grows so at any point in time there is no clear end state. For this project I wanted a clear goal to reach so I could be clearly done at some point. This is possible with the Flickr API since there are at any time a finite number of API calls (something like 100) so progress can be measured… although Flickr did add API calls while I was working on it. The result was I made a Flickcurl API coverage page with embedded API changelog (automatically generated from source code comments).

Flickcurl 0.1 was “released” 2007-01-21 although I didn’t announce it to anyone at that point. It was more of a tarball than an actual release.

One more thing I wanted to do was to experiment with different ways to tell people about software, compared to the ways I was using with Redland, which were mostly email based but also via SourceForge and Freshmeat. So for Flickcurl I tried a bunch of different ways:

That was kind of fun, and I also followed a similar light weight process with Triplr but that’s another story. I think caring less worked out fine; people did use it and submit patches. Right now I still use the Flickr mailing list, API group, and freshmeat project.

As the library headed towards 100% of the API and beyond it did get a bit more formal and I imported what I think are the best practices from the Redland libraries:

  • objects in C design
  • always refactoring the source code: refactoring is not just for dynamic languages
  • source code docu-comments generating an HTML API reference via gtk-doc
  • folding in portability fixes
  • make it work with optional libraries for extra functionality (Raptor in this case, to allow serialising to other RDF syntaxes)
  • built in portable ANSI C
  • taking care about memory leaks with valgrind
  • comes with a utility program able to exercise the entire API (called flickcurl)
  • Debian packages (created by somebody else, yay!)
  • manual pages for the command line utilities

The general aim was to get 100% of the Flickr API done by the end of 2007 and I actually reached it for Flickcurl 1.0 on 2008-01-12 which was pretty close.

So right now the library has gone beyond 1.0; the latest release is Flickcurl 1.4 which was released last Tuesday 24th June (see release notes) which primarily added video support but I also updated the photo metadata mapping to RDF by adding a serializer class for abstracting the photo-to-triples process.

The RDF triples mapping is something that has always been there but not part of the core library. It could be optionally used inside Raptor to automatically read Flickr photo URIs as RDF data sources. I doubt it’ll ever be presented inside a public web service like Triplr since it would require passing in Flickr API authentication tokens and user credentials.

The RDF triples mapping I’ve made for the Flickr photo metadata has a mixture of vocabularies which are in 3 buckets:

  1. Existing Vocabularies: well known RDF schemas (class and properties) that have been developed over many years by multiple people and organisations, sometimes with a lot of formality.
  2. Flickr-specific Vocabularies: vocabularies I made up mostly for Flickr video and places API terms.
  3. Machine Tag Vocabularies: I made them up using machinetags.org/ns URIs as a root for the namespaces associated with the vocabularies. The terms in the vocabularies come from how people used machine tags on Flickr and are not always defined.
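As a rough illustration of the third bucket, here is a sketch in Python of turning a Flickr machine tag (namespace:predicate=value) into a triple. The exact URI layout under machinetags.org/ns is an assumption made for this example, as are the function name and the regular expression; the mapping Flickcurl actually emits may differ.

```python
import re

# Hypothetical namespace root and URI layout; not confirmed by the release notes.
MACHINE_TAG_ROOT = "http://machinetags.org/ns/"

# namespace:predicate=value, with simple identifier-like names.
MACHINE_TAG = re.compile(r"^([A-Za-z][A-Za-z0-9_]*):([A-Za-z][A-Za-z0-9_]*)=(.*)$")


def machine_tag_to_triple(photo_uri, tag):
    """Map a machine tag to a (subject, predicate, object) tuple, or None."""
    match = MACHINE_TAG.match(tag)
    if match is None:
        return None  # an ordinary tag, not a machine tag
    namespace, predicate, value = match.groups()
    predicate_uri = f"{MACHINE_TAG_ROOT}{namespace}#{predicate}"
    return (photo_uri, predicate_uri, value)
```

Because the terms come from whatever people typed as tags, the resulting predicates are only as well defined as the tagging convention behind them.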

This is a range of what might be called semantic web heavy to light although there is absolutely nothing wrong with mixing things up if you are not worried about inference. This is OK! I should probably put some html/schema documents at the vocabularies and get the redirects and all that # and / business sorted so that the linked data works out properly but what I have now is just a start and I’d be interested to see what people think. There are more details of the vocabularies and terms I’m using in the Flickcurl 1.4 release notes although I should probably add vocabulary information to the documentation too.

That’s all for now but I’ll expand some more in another post about the Flickr API itself and my experience with it and impressions of it as both a software developer and HTTP Web Service designer.


RDFa is a Candidate Recommendation

June 20th, 2008

The Semantic Web Deployment Working Group and XHTML2 Working Group have published a Candidate Recommendation of RDFa in XHTML: Syntax and Processing. Web documents contain significant amounts of structured data, which is largely unavailable to tools and applications. When publishers can express this data more completely, and when tools can read it, a new world of user functionality becomes available, letting users transfer structured data between applications and web sites, and allowing browsing applications to improve the user experience. RDFa is a specification for attributes to be used with languages such as HTML and XHTML to express structured data. See the groups' RDFa implementation report (although this document is still evolving as new implementations come in). The Working Groups also updated the companion document RDFa Primer.

A set of RDFa technology buttons have also been published, linked in from the Semantic Web logos' index page.


What is a Widget?

June 17th, 2008

I’ve had widgets on my mind lately. Both our own Fraser and our friend and investor Fred Wilson are speaking at WidgetWebExpo today about widgets. Fred has a post today on his blog talking about the term “widget” and why it is wrong for what we are doing.

Fred offers the following definition via Wikipedia:

A web widget is a portable chunk of code that can be installed and executed within any separate HTML-based web page by an end user without requiring additional compilation. They are derived from the idea of code reuse. Other terms used to describe web widgets include: gadget, badge, module, capsule, snippet, mini and flake. Web widgets often but not always use DHTML, JavaScript, or Adobe Flash.

Widgets are a big part of what we do here. We have our widget gallery where people can grab and customize widgets that display everything from their own Amazon wishlist to New York Times bestsellers, to the featured stocks from Wallstrip. People can also use our BlueOrganizer Firefox add-on to create custom widgets containing whatever books, music, movies, and more that they’d like.

What this leads me to think is that whether we call these things “widgets” or something else is not important. It does not matter what we call them, it matters what they do and how they do it. If people can easily showcase what they are interested in and find important, then you can call it whatever you want. Or to paraphrase Shakespeare:

“What’s in a name? That which we call a widget, by any other name would still be installed and executed within any separate HTML-based web page and express the interests of that page’s author.”


O’Reilly Media Implements AB Meta

May 19th, 2008

A month ago we introduced AB Meta, a simple and open format for annotating pages that are about things. We highlighted seven benefits and stressed that the purposefully simple and easily understood nature of the format would lead to adoption.

Today we’re happy to announce that O’Reilly Media has adopted AB Meta across their book pages.

By adding AB Meta markup to each book’s page, O’Reilly introduces lightweight, object-centric semantics to their website.


With AB Meta, O’Reilly makes it easy for software to identify the specific book inside the regular HTML page. Visitors to the page will be able to interact with real things rather than flat HTML pages, enjoying a better web experience.

As of today, any BlueOrganizer user who navigates to an O’Reilly book page will be able to engage and interact with the specific book - discovering and exploring the book and its attributes across the web.


AB Meta is simple to implement and easy to understand. Already delivering value to both content producers and individuals on the web, the benefit to both parties will only increase as new tools are built for the open format. We’re thrilled to have O’Reilly adopt the format and are happy that you’ll now benefit from a smarter experience on oreilly.com.


Modelling Statistical Publications: Some Notes

March 10th, 2008

Lee Feigenbaum has put together a really nice posting discussing different ways of modelling statistical data using RDF. I wanted to contribute to that discussion and add in a few comments about how I've been modelling some of the OECD's statistical publications using RDF.

Note the emphasis: what I've been doing is capturing metadata about individual statistical tables and graphs, their association with specific publications, their metadata, etc. I've not attempted to capture the detail of the statistics themselves, but do have a few relevant comments there.

The background to this is that I'm currently technically leading a project to build the latest version of OECD's electronic library. All of the metadata is stored in RDF, with content available as HTML, PDFs, Excel spreadsheets or as views into the OECD.stat application that the OECD have developed as a power tool for housing and delivering their statistical data.

As Lee discovered in the EuroStat data, regions and countries are core concepts. All of the OECD's statistical output can be classified by country and region, and these are types defined within our schema. We assign URIs to the countries using either the ISO 3166-1 alpha-2 country code or, when classifying data that refers to countries that no longer exist as a specific entity (e.g. Yugoslavia), the four-letter ISO 3166-3 country code.

A country may be associated with zero or more Regions, using an Is Part Of relationship. A region may be the European Union, OECD member states or other arbitrary grouping. I suspect the same basic requirements will apply to other statistical datasets.
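To make the scheme concrete, here is a small Python sketch of minting country URIs from those codes and recording the Is Part Of relationship. The example.org base URIs, the function name and the dictionary layout are all hypothetical; they are not the URIs or schema used in the OECD project.

```python
# Hypothetical base URI, for illustration only.
BASE = "http://example.org/country/"


def country_uri(code):
    """Build a URI from a 2-letter ISO 3166-1 or 4-letter ISO 3166-3 code."""
    code = code.strip().upper()
    if len(code) not in (2, 4) or not code.isalpha():
        raise ValueError(f"unexpected country code: {code!r}")
    return BASE + code


# Is Part Of: each country is associated with zero or more regions.
is_part_of = {
    country_uri("FR"): {
        "http://example.org/region/EU",
        "http://example.org/region/OECD",
    },
    country_uri("YUCS"): set(),  # Yugoslavia, ISO 3166-3 code; no current regions
}
```

Keeping the code itself as the last path segment means the URI can be derived mechanically from any dataset that already carries ISO codes.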

There are some other types of classification that we associate with the tables:

  • An indicator of whether the table is a "comparative table": e.g. does it include data from multiple countries?
  • An association between the table and a "Table Series" which constitute a collection of tables published over time
  • The statistical Variables that the table contains, e.g. GDP
  • A summary of the time range that the table covers, e.g. "2007", "2005-2007", "2000, 2002-2005", etc. These are captured as simple literals for now as we have to do little/no processing on them at this level.
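Although these time ranges are stored as plain literals, they are regular enough to be expanded later if needed. The Python sketch below shows one way that could work; it assumes the literals only ever contain single years and hyphenated ranges separated by commas, as in the examples above, and the function name is made up for illustration.

```python
def expand_years(literal):
    """Expand a literal such as "2000, 2002-2005" into a sorted list of years."""
    years = set()
    for part in literal.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = (int(piece) for piece in part.split("-", 1))
            years.update(range(start, end + 1))  # ranges are inclusive
        else:
            years.add(int(part))
    return sorted(years)
```

A call such as expand_years("2000, 2002-2005") yields the five individual years, which would make "tables covering 2003" a simple membership test.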

And then there's the usual collection of title, description, etc. all as multi-lingual literals. All tables are also assigned a DOI to provide a stable link that can be cited in publications. If the table was originally published in a specific Book or journal Article then that relationship is also captured.

Obviously this metadata is, largely, at a level above that which Lee has been exploring, but I thought this might provide some useful context. For anyone looking at capturing statistical data in RDF, there are some other useful places to look at for defining terms and drawing on prior experience.

Firstly the Journal of Economic Literature Classification provides some terms that can be associated with statistical publications to help categorize them. The OECD's statistical glossary fills a similar role.

Secondly, the Statistical Data and Metadata EXchange (SDMX) initiative is also worthy of a look. It's not RDF but, as well as defining XML Schemas and web services for exchanging statistical data, the guidelines include lists of cross-domain concepts and their mappings to those in use by EuroStat, OECD, IMF, etc. So there is plenty of scope for grounding RDF vocabularies for statistical data in a lot of prior art.

Finally, the OECD have some public documentation about the design and implementation of their "MetaStore" database that supports OECD.stat (it's a different beast to the Ingenta MetaStore, I should point out). For example, the document "Management of Statistical Metadata at the OECD" (PDF) has some interesting detail about the different types of metadata (structural, technical, publishing) that is stored in these multi-dimensional data cubes.


Semantic Web Search Engine Roundup

February 27th, 2008

Unlike traditional search engines, which crawl the Web gathering Web pages, Semantic Web search engines index RDF data stored on the Web and provide an interface to search through the crawled data. Below is a list of Semantic Web search engines that are currently under development.

Semantic Web Search Engine (SWSE)
SWSE is a search engine for the RDF Web on the Web, and provides the equivalent services a search engine currently provides for the HTML Web. The system explores and indexes the Semantic Web and provides an easy-to-use interface through which users can find the information they are looking for. Because of the inherent semantics of RDF and other Semantic Web languages, the search and information retrieval capabilities of SWSE are potentially much more powerful than those of current search engines. SWSE indexes RDF data from many sources, including OWL, RDF and RSS files. RSS2 is converted to RDF and they will be adding GRDDL sources soon. Developed by DERI Ireland.
Sindice
Sindice is a lookup index for Semantic Web documents built on data intensive cluster computing techniques. Sindice indexes the Semantic Web and can tell you which sources mention a resource URI, IFP, or keyword, but it does not answer triple queries. Sindice currently indexes over 20 million RDF documents. Developed by DERI Ireland.
Watson
Allows you to search through ontologies and semantic documents using keywords. At the moment, you can enter a set of keywords (e.g. "cat dog old_lady"), and obtain a list of URIs of semantic documents in which the keywords appear as identifiers or in literals of classes, properties, and individuals. You can also use wildcards in the keywords (e.g., "ca? dog*"). Developed by KMi, UK.
Yahoo! Microsearch
Microsearch is Yahoo!'s stab at Semantic Web search and provides a richer search experience by combining traditional search results with metadata extracted from Web pages. Indexes RDF, RDFa and Microformats crawled from the Web. Microsearch will soon be adding support for GRDDL.
Falcons
Falcons is a keyword-based search engine for the Semantic Web, equipped with browsing capability. Falcons provides keyword-based search for URIs identifying objects and concepts (classes and properties) on the Semantic Web. Falcons also provides a summarization for each entity (object, class, property) for rapid understanding. Falcons currently indexes 7 million RDF documents and allows you to search through 34,566,728 objects. Developed by IWS China.
Swoogle
Searches through over 10,000 ontologies. 2.3 million RDF documents indexed, currently including those written in RDF/XML, N-Triples, N3(RDF) and some documents that embed RDF/XML fragments. Currently, it allows you to search through ontologies, instance data, and terms (i.e., URIs that have been defined as classes and properties). Not only that, it provides metadata for Semantic Web documents and supports browsing the Semantic Web. Swoogle also archives different versions of Semantic Web documents. Developed by the Ebiquity Group of UMBC.
Semantic Web Search
Powered by RDF Gateway, Intellidimension's proprietary platform for Semantic Web applications and agents. Developed by Intellidimension Inc.
Zitgist Search
The Zitgist Query Service simplifies the Semantic Data Web Query construction process with an end-user friendly interface. The user need not conceive of all relevant characteristics - appropriate options are presented based on the current shape of the query. Search results are displayed through an interface that enables further discovery of additional related data, information, and knowledge. Users describe characteristics of their search target, instead of relying entirely on content keywords.
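
The wildcard keyword search described in the Watson entry above can be approximated in a few lines of Python. This is only a sketch of the idea, not Watson's implementation: it assumes "?" stands for exactly one character and "*" for any run of characters, that every keyword must match some term in a document, and it uses a tiny made-up index with example.org URIs.

```python
from fnmatch import fnmatchcase


def matching_documents(keywords, index):
    """Return URIs of documents in which every keyword pattern matches a term."""
    patterns = keywords.split()
    return sorted(
        uri
        for uri, terms in index.items()
        if all(any(fnmatchcase(term, pattern) for term in terms) for pattern in patterns)
    )


# Hypothetical index: document URI -> identifiers and literal words found in it.
index = {
    "http://example.org/pets.rdf": {"cat", "dogs"},
    "http://example.org/cars.rdf": {"car", "engine"},
}

hits = matching_documents("ca? dog*", index)
```

Here only the first document qualifies, because the second has nothing matching "dog*".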



Reinventing HTML

October 27th, 2006

Making standards is hard work. It's hard because it involves listening to other people and figuring out what they mean, which means figuring out where they are coming from, how they are using words, and so on.

There is the age-old tradeoff for any group as to whether to zoom along happily, in relative isolation, putting off the day when they ask for reviews, or whether to get lots of people involved early on, so a wider community gets on board earlier, with all the time that costs. That's a trade-off which won't go away.

The solutions tend to be different for each case, each working group. Some have lots of reviewers and some few, some have lots of time, some urgent deadlines.

A particular case is HTML. HTML has the potential interest of millions of people: anyone who has designed a web page may have useful views on new HTML features. It is the earliest spec of W3C, a battleground of the browser wars, and now the most widespread spec.

The perceived accountability of the HTML group has been an issue. Sometimes this was a departure from the W3C process, sometimes a sticking to it in principle, but not actually providing assurances to commenters. An issue was the formation of the breakaway WHAT WG, which attracted reviewers though it did not have a process or specific accountability measures itself.

There has been discussion in blogs where Daniel Glazman, Björn Hörmann, Molly Holzschlag, Eric Meyer, Jeffrey Zeldman and others have shared concerns about how W3C works, particularly in the HTML area. The validator and other subjects cropped up too, but let's focus on HTML now. We had a W3C retreat in which we discussed what to do about these things.

Some things are very clear. It is really important to have real developers on the ground involved with the development of HTML. It is also really important to have browser makers intimately involved and committed. And also all the other stakeholders, including users and user companies and makers of related products.

Some things are clearer with hindsight of several years. It is necessary to evolve HTML incrementally. The attempt to get the world to switch to XML, including quotes around attribute values and slashes in empty tags and namespaces all at once didn't work. The large HTML-generating public did not move, largely because the browsers didn't complain. Some large communities did shift and are enjoying the fruits of well-formed systems, but not all. It is important to maintain HTML incrementally, as well as continuing a transition to well-formed world, and developing more power in that world.
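The difference between the two worlds is easy to demonstrate with any XML parser. The Python sketch below uses the standard library's ElementTree purely as an illustration: the unquoted attribute and unclosed empty tag that browsers accept without complaint are rejected outright by an XML parser, while the well-formed version parses. The helper name and sample markup are invented for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Typical browser-tolerated HTML: unquoted attribute value, unclosed <br>.
tag_soup = "<p class=intro>First line<br>second line</p>"

# The same content with quoted attribute values and a slash in the empty tag.
well_formed = '<p class="intro">First line<br/>second line</p>'
```

An XML parser stops at the first error, which is exactly the kind of complaint browsers never made about HTML.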

The plan is to charter a completely new HTML group. Unlike the previous one, this one will be chartered to do incremental improvements to HTML, as also in parallel xHTML. It will have a different chair and staff contact. It will work on HTML and xHTML together. We have strong support for this group, from many people we have talked to, including browser makers.

There will also be work on forms. This is a complex area, as existing HTML forms and XForms are both form languages. HTML forms are ubiquitously deployed, and there are many implementations and users of XForms. Meanwhile, the Webforms submission has suggested sensible extensions to HTML forms. The plan is, informed by Webforms, to extend HTML forms. At the same time, there is a work item to look at how HTML forms (existing and extended) can be thought of as XForm equivalents, to allow an easy escalation path. A goal would be to have an HTML forms language which is a superset of the existing HTML language, and a subset of an XForms language with added HTML compatibility. We will see to what extent this is possible. There will be a new Forms group, and a common task force between it and the HTML group.

There is also a plan for a separate group to work on the XHTML2 work which the old "HTML working group" was working on. There will be no dependency of HTML work on the XHTML2 work.

As well as the new HTML work, there are other things I want to change. The validator, I think, is a really valuable tool both for users and in helping standards deployment. I'd like it to check (even) more stuff, be (even) more helpful, and prioritize carefully its errors, warnings and mild chidings. I'd like it to link to explanations of why things should be a certain way. We have, by the way, just ordered some new server hardware, paid for by the Supporters program -- thank you!

This is going to be hard work. I'd like everyone to go into this realizing this. I'll be asking these groups to be very accountable, to have powerful issue tracking systems on the w3.org web site, and to be responsive in spirit as well as in letter to public comments. As always, we will be insisting on working implementations and test suites. Now we are going to be asking for things like talking with validator developers, maybe providing validator modules and validator test suites. (That's like a language test suite but backwards, in a way). I'm going to ask commenters to be respectful of the groups, as always. Try to check whether the comment has been made before, suggest alternative text, one item per message, etc, and add to technical perception social awareness.

This is going to be a very major collaboration on a very important spec, one of the crown jewels of web technology. Even though hundreds of people will be involved, we are evolving the technology which millions going on billions will use in the future. There won't seem like enough thankyous to go around some days. But we will be maintaining something very important and creating something even better.

Tim BL

p.s. comments are disabled here in breadcrumbs, the DIG research blog, but they are welcome in the W3C QA weblog.


Links on the Semantic Web

December 30th, 2005

On the web of [x]HTML documents, the links are critical. Links are references to 'anchors' in other documents, and they use URIs which are formed by taking the URI of the document and adding a # sign and the local name of the anchor. This way, local anchors get a global name.

On the Semantic Web, links are also critical. Here, the local name, and the URI formed using the hash, refer to arbitrary things. When a semantic web document gives information about something, and uses a URI formed from the name of a different document, like foo.rdf#bar, then that's an invitation to look up the document, if you want more information about it. I'd like people to use them more, and I think we need to develop algorithms for deciding when to follow Semantic Web links as a function of what we are looking for.
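The step from a URI like foo.rdf#bar to the document worth fetching is mechanical. A short Python sketch using the standard library follows; the example.org base URI and the function name are placeholders chosen for illustration.

```python
from urllib.parse import urldefrag, urljoin


def document_for(base_uri, reference):
    """Resolve a URI reference and split it into (document URI, local name)."""
    absolute = urljoin(base_uri, reference)  # resolve against the current document
    document, local_name = urldefrag(absolute)  # drop the part after the hash
    return document, local_name


# A reference to foo.rdf#bar found inside http://example.org/data/me.rdf
document, local_name = document_for("http://example.org/data/me.rdf", "foo.rdf#bar")
```

The document URI is what a client would dereference; the local name only identifies the thing within the data that comes back.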

To play with semantic web links, I made a toy semantic web browser, Tabulator. Toy, because it is hacked up in Javascript (a change from my usual Python) to experiment with these ideas. It is AJAR - Asynchronous Javascript and RDF. I started off with Jim Ley's RDF Parser and added a little data store. The store understands the minimal OWL ([inverse] functional properties, sameAs) to smush nodes representing the same thing together, so it doesn't matter if people use many different URIs for the same thing, which of course they can. It has a simple index and supports simple query. The API is more or less the one which cwm had been tending toward in Python.
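Smushing of this kind can be sketched with a union-find structure. The Python below is an illustration of the general technique, not Tabulator's code (which is Javascript): it handles sameAs and inverse functional properties only, leaves functional properties out, and all class, function and prefix names are assumptions made for the example.

```python
class Smusher:
    """Union-find over node identifiers: equivalent nodes share one root."""

    def __init__(self):
        self.parent = {}

    def find(self, node):
        """Return the representative node for the given node."""
        self.parent.setdefault(node, node)
        while self.parent[node] != node:
            self.parent[node] = self.parent[self.parent[node]]  # path halving
            node = self.parent[node]
        return node

    def same_as(self, a, b):
        """Record that a and b denote the same thing."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def smush(triples, inverse_functional):
    """Merge subjects linked by owl:sameAs or sharing an inverse functional value."""
    smusher = Smusher()
    owners = {}  # (property, value) -> first subject seen with that pair
    for subject, predicate, obj in triples:
        if predicate == "owl:sameAs":
            smusher.same_as(subject, obj)
        elif predicate in inverse_functional:
            first = owners.setdefault((predicate, obj), subject)
            smusher.same_as(first, subject)
    return smusher
```

With foaf:mbox declared inverse functional, two nodes carrying the same mailbox end up with the same representative, so queries against either URI see the merged data.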

Then, with the DOM and CSS and Ecmascript standards bookmarked, the rest was just learning the difference between Javascript and Python. Fun, anyway.

The result .. insert a million disclaimers... experimental, work in progress, only runs on Firefox for no serious reason, not accessible, too slow, etc ... at least is a platform for looking at Semantic Web data in a fairly normal way, but also following links. A blue dot indicates something which could be downloaded. Download some data before exploring the data in it. Note that as you download multiple FOAF files for example the data from them merges into the unified view. (You may have to collapse and re-expand an outline).

Here is the current snag, though. Firefox security does not allow a script from a given domain to access data from any other domain, unless the scripts are signed, or made into an extension. And looking for script signing tools (for OS X?) led me to dead ends. So if anyone knows how to do that, let me know. Until I find a fix for that, the power of following links -- which is that they can potentially go anywhere -- is alas not evident!
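For readers unfamiliar with the restriction, the check a browser applies is roughly a comparison of origins. The Python sketch below assumes the usual definition of an origin as scheme, host and port; it is a simplification for illustration, not Firefox's actual logic, and the URLs and function names are made up.

```python
from urllib.parse import urlsplit

# Ports implied when a URL does not state one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple for a URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, parts.hostname or "", port)


def same_origin(script_url, data_url):
    """True if a script at script_url would be allowed to read data_url."""
    return origin(script_url) == origin(data_url)
```

A script served from one site asking for a FOAF file on another fails this test, which is why cross-site links cannot simply be followed from an ordinary page.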


So I have a blog

December 12th, 2005

In 1989 one of the main objectives of the WWW was to be a space for sharing information. It seemed evident that it should be a space in which anyone could be creative, to which anyone could contribute. The first browser was actually a browser/editor, which allowed one to edit any page, and save it back to the web if one had access rights.

Strangely enough, the web took off very much as a publishing medium, in which people edited offline. Bizarrely, they were prepared to edit the funny angle brackets of HTML source, and didn't demand a what you see is what you get editor. WWW was soon full of lots of interesting stuff, but not a space for communal design, for discourse through communal authorship.

Now in 2005, we have blogs and wikis, and the fact that they are so popular makes me feel I wasn't crazy to think people needed a creative space. In the meantime, I have had the luxury of having a web site to which I have write access, and I've used tools like Amaya and Nvu which allow direct editing of web pages. With these, I haven't felt the urge to blog with blogging tools. Effectively my blog has been the Design Issues series of technical articles.

That said, it is nice to have a machine to do the administrative work of handling the navigation bars and comment buttons and so on, and it is nice to edit in a mode in which you can do limited damage to the site. So I am going to try this blog thing using blog tools. So this is for all the people who have been saying I ought to have a blog.
