As ISWC08 is drawing to a close, it dawns on me that something which Frank van Harmelen has been forecasting for years is now happening, seemingly without conscious effort. He calls it Approximate Reasoning - have a look at his ESWC06 keynote. The basic idea behind it is to do reasoning over ontologies with a different focus, namely by giving up some reasoning correctness in order to gain better scalability.
And indeed, at ISWC08 I have seen a number of things which fit exactly into this corner (while at the same time the authors/programmers might not even be aware of it).
As part of the Billion Triple Challenge, Axel Polleres presented the SAOR system, which does approximate OWL reasoning by means of forward chaining rules. Now you can’t do OWL reasoning (in a sound and complete way) with forward chaining rules (and Axel knows this), so in the end you’re losing some consequences. But at the same time you do get some consequences when having to deal with large amounts of data.
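To make the trade-off concrete, here is a minimal sketch of forward chaining over triples. The two rules and the `ex:` data are my own illustrative assumptions, not SAOR's actual rule set; the point is only that rules of this kind are cheap to apply and derive some, but not all, of the consequences a complete OWL reasoner would find.

```python
# Toy forward chaining over triples. The two rules below are illustrative
# assumptions, not the rule set used by SAOR.
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

def forward_chain(triples):
    """Apply the rules repeatedly until no new triple is derived (fixpoint)."""
    inferred = set(triples)
    while True:
        new = set()
        for (s, p, o) in inferred:
            for (s2, p2, o2) in inferred:
                # Both rules need a subClassOf triple that starts where (s, p, o) ends.
                if p2 != SUBCLASS or o != s2:
                    continue
                if p == SUBCLASS:
                    # Rule 1: subClassOf is transitive.
                    new.add((s, SUBCLASS, o2))
                elif p == TYPE:
                    # Rule 2: instances inherit the superclasses of their class.
                    new.add((s, TYPE, o2))
        if new <= inferred:
            return inferred
        inferred |= new

# Hypothetical example data.
facts = {
    ("ex:Cat", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
    ("ex:tom", TYPE, "ex:Cat"),
}
closure = forward_chain(facts)
```

Anything that needs reasoning beyond such rules (disjunctions or cardinality restrictions, for example) is simply not derived, which is the "losing some consequences" part.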
Eyal Oren, also at the Billion Triple Challenge, presented the MARVIN system which performs approximate RDF reasoning by means of massive parallelisation. MARVIN comes out of the EU project LarKC, which is actually pursuing approximate reasoning on a large scale (pun intended).
Among the results presented at ISWC08, I found those by Claudia D’Amato on Statistical Learning for Inductive Query Answering on OWL Ontologies really amazing. She and her collaborators managed to do OWL instance retrieval without any deduction algorithm. Instead they used Support Vector Machines and learned which (named) OWL classes individuals belong to. The learning was done from a small sample set (generated by a reasoner), but the classifiers were able to generalise from the data to achieve about 90% coverage. In my opinion, this is something conceptually new and it is really remarkable that it works.
In a regular paper Eyal Oren also reported on using Evolutionary Algorithms for RDF query answering.
I foresee the importance of such approaches rising substantially in the future (and I think it’s a safe guess since Frank also seems to think so). The Billion Triple Challenge series could become one of the driving forums for this. There are exciting times ahead!
I attended the session as I was curious to learn about more applied uses of Semantic Web technology particularly in the financial and business context. In terms of content the tutorial veered wildly from overview material through to some quite detailed looks at linguistic and semantic analysis to extract information from business reports. To that end I’m not going to attempt to summarize the full content of the tutorial but will pick out a few areas of interest.
Some time was spent looking at XBRL, the standard business reporting language which is increasingly being adopted around the world as a standard means to publish and share business reports. The initiative, which began in 1999, was extended this year to include a European XBRL consortium. The broad goal of the project is to standardize the means and structure of publishing business financial reports, making it easy to compare and collate reports for regulatory and other purposes. The current financial crisis was referenced as an illustration of the need for greater transparency in business reporting and is an obvious driver for adoption of the technology.
XBRL draws on many of the same concepts as the Semantic Web, in particular the use of “taxonomies” that can be customized by specific businesses, sectors and regulatory areas, but uses XML technologies like XML Schema rather than RDF. There is growing interest in being able to capture this information using RDF and in mapping XBRL taxonomies into Semantic Web ontologies. For example there has been some early work on an XBRL ontology, as well as some independent exploration and signs that a W3C incubator or interest group might be formed. The speaker at the tutorial also suggested that before long some standard GRDDL connectors would be available to automate the transformation of XBRL documents into RDF.
Much of the tutorial was discussion of applied uses of RDF data and ontologies within the context of the Musing Project, an EU funded project exploring “next-generation business intelligence” in the areas of financial risk management, internationalisation and IT operational risk. Some of the applications that have been explored are collecting company info from a range of multilingual sources; attempting to assess chances of success of a business in a specific region; semi-automated form filling, e.g. for returns; identifying appropriate business partners; and reputation tracking and opinion mining.
Many of the issues faced in the Musing project deal with how to assemble this data with a historical context: while XBRL data may be present for current or recent years, text mining is required to extract this data from historical reports.
The last part of the tutorial was a general introduction to Information Extraction using the Gate toolkit (this starts from around Slide 75 in the Powerpoint slides). This was a good overview of the capabilities of the toolkit and showed some nice use cases. OpenCalais certainly isn’t the only game in town and, while Gate requires more effort to set up, it looks like it could provide a great deal more customisation options for businesses that really need the extra power.
One of the telling things about the overall process was the need to collate useful data from a number of different sources in order to drive the information extraction process. In order to do Named Entity Extraction a good set of reference material is required, e.g. Gazetteers for place names, or lists of people’s names. While much of this data is already available (in Musing they drew on Wikipedia and the CIA World Factbook, for example), a lot more information was either available only by crawling the web or from commercial resources. This suggests to me that there’s still some groundwork to be done in unlocking more data sets that can help drive the business intelligence use cases. There’s essentially a domino effect here: exposing often small, focused datasets can end up unlocking huge potential value further down the line.
For the past several months the EU Commission and the EU Parliament have been struggling over the so-called “Telecom Package”, a legislative initiative promoted by the Commission under heavy advocacy of France. In a nutshell the Telecom Package contains a very problematic passage, which is meant to strengthen the rights of ISPs in being able to cut off the internet access of individual users, if any violations of existing or future copyright law were detected. In other words: ISPs would be able to control who gets access to the internet, violating the universal service doctrine, which is a basic cornerstone of democracy.
In their first reading on September 24, 2008 the European Parliament voted against the “Telecom Package”, advocating the so-called “Bono Amendment” - which refers to the French Socialist MEP Guy Bono - which basically states that courts need to be involved in any disconnection procedure. In the original passage, quoted in a recent EU Observer article, it says:
“No restriction may be imposed on the rights and freedoms of end users … without a prior ruling by the judicial authorities.”
This decision has some relevant implications for any future developments of the internet. While the telcos and the media companies are struggling hard to adapt to the social logic of the internet, searching for new business models and lobbying for regulation in their favour, it is obvious that the existing abundance and innovativeness of the internet is hardly compatible with their notion of making money on the web - basically by restricting access and promoting artificial scarcity.
It is also relevant to developments like Linking Open Data, as in an increasingly interconnected and mashed-up world it is getting harder and harder to comply with strict and rigid copy- & usage rights policies - even if they are published under any sort of commons license. In this respect it is important to mention that research on judicial problems arising from the automated processing of content released under differing commons licenses is still missing (as far as I know - does anybody have a hint for me?). But with the current decision of the European Parliament we can observe a very promising shift in the notion that the internet is made up of much more than its commercial exploitability. And that any attempt to stifle this notion by imposing unbalanced regulatory restrictions on the rights of the users is a major threat not just to the internet as it exists but to democracy itself.
I am packing my bags once again: The first VoCamp (hosted at Oxford University, UK) is about to start this week. So, what is a VoCamp supposed to be? The official definition reads like this: “A VoCamp is a series (hopefully) of informal events where people can spend some dedicated time creating lightweight vocabularies/ontologies for the Semantic Web/Web of Data. The emphasis of the event(s) is not on creating the perfect ontology in a particular domain, but on creating vocabs that are good enough for people to start using for publishing data on the Web.”
I always thought that the lack of widely established vocabularies/ontologies has been very damaging to the development of the Semantic Web. The VoCamp initiative could help change this situation for the better, so I really hope that this is the start of a long series of events.
My main topics of interest are: 1) Associative Tags; 2) Agreement, Disagreement, discourse; 3) Corporate Semantic Web; 4) “Are upper level ontologies/vocabularies not so bad after all?”; 5) “Cleaner schemas and ontologies”. These interests are motivated partly by use-cases from the “KiWi – Knowledge in a Wiki” EU project, and partly by developments in the area of biomedical research at DERI Galway and the W3C Interest Group for Health Care and Life Science. Details below.
__Associative Tags__
Tagging is one of the key components of the ‘Web 2.0’, and Semantic Web technologies will help to make tagging even more powerful. Schemas such as SCOT or MOAT have already been established, and make it possible to ‘tag’ not only with simple strings, but with entities. These entities (such as concepts described in SKOS) can be associated with clear semantics and can be further described with RDF statements, to describe hierarchies of entities, or to link entities to rich data sources such as DBpedia. This enables sophisticated data-integration and cross-data source queries that would not have been possible with simple, string-based tags.
On the other hand, Semantic Web developers can learn from the simplicity that has made tagging so successful. Creating useful tags is very simple, and good user interfaces can make it simpler still with features such as autocompletion and tag recommendation. This simplicity should serve as a role model for many Semantic Web applications.
Specifically, I am interested in what I call ‘associative tags’, bundles of tags/entities/concepts that can be used for the simple representation of facts. The primary intention of creating aTags is not the categorization of the document, but the representation of the key facts inside the document. Key facts in the biomedical domain might be, for example,
“Protein A interacts with protein B” (which can be represented with an aTag comprising the three entities “Protein A”, “Molecular interaction” and “Protein B”) or
“Overexpression of protein A in tissue B is the cause of disease C” (an aTag comprising the four entities “Overexpression”, “Protein A”, “Tissue B” and “Disease C”).
Once the aTags from these different sources are aggregated, it is possible to pose a query such as “show me molecules that are associated with molecules that are associated with disease C”, yielding “protein A” as an answer. Hierarchies (in the form of rdfs:subClassOf and skos:narrower) can be used to expand queries based on background knowledge (e.g., that “disease D” is a subclass of “disease C”).
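As a rough sketch of how such aggregated aTags could be queried, the snippet below treats each aTag as a plain set of entity labels and expands a query with subclass knowledge. The first two aTags are the examples above; the third aTag, “Protein X”, the `MOLECULES` set and all function names are hypothetical and only there for illustration.

```python
# aTags as plain sets of entity labels. The first two come from the examples
# above; the third aTag, "Protein X" and the MOLECULES set are hypothetical.
ATAGS = [
    {"Protein A", "Molecular interaction", "Protein B"},
    {"Overexpression", "Protein A", "Tissue B", "Disease C"},
    {"Overexpression", "Protein X", "Disease D"},
]
MOLECULES = {"Protein A", "Protein B", "Protein X"}

# Background knowledge as child -> parent, standing in for rdfs:subClassOf.
SUBCLASS_OF = {"Disease D": "Disease C"}

def with_subclasses(entity):
    """Return the entity plus everything declared (transitively) below it."""
    result = {entity}
    changed = True
    while changed:
        changed = False
        for child, parent in SUBCLASS_OF.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result

def associated_molecules(entity):
    """Molecules sharing at least one aTag with the entity or one of its subclasses."""
    targets = with_subclasses(entity)
    hits = set()
    for tag in ATAGS:
        if tag & targets:
            hits |= tag & MOLECULES
    return hits - targets

# Expanding "Disease C" also picks up the aTag that only mentions "Disease D".
direct = associated_molecules("Disease C")
```

A two-step query like the one above would simply call `associated_molecules` again on each result. In a real system the entities would be URIs and the query would more likely be written in SPARQL.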
In many cases (especially with some ontologies in the biomedical domain), creating such associative tags can be much simpler than the creation of ‘real’ statements, i.e., relations between individuals and property restrictions of classes.
__Agreement, Disagreement, discourse__
Many people in the Semantic Web community are interested in the representation of argumentation structures on the web. For example: stating that one snippet of text contains statements that are in disagreement with another snippet of text, which is in agreement with yet another snippet of text. This can be of use for many knowledge domains, such as news articles, biomedical publications or reports submitted to a software bug tracker. Of special interest in this context are extensions of established schemas, especially SIOC. There is also another ontology called SWAN that is specifically tailored to the biomedical domain, and efforts to align SWAN with SIOC have started recently.
__Corporate Semantic Web__
As Semantic Web technologies are finally getting mature enough to allow industrial uptake, it is becoming clear that ontologies for describing organization structures and business processes are still lacking maturity. FOAF allows us to represent basic information about persons, organizations and their relationships, but lacks vocabulary for stating that one person is the boss of another person, that a project consists of several subtasks, et cetera. While there are some small projects that try to create such schemas/ontologies, a solution of widespread acceptance does not seem to be in sight at the moment.
__Are upper level ontologies/vocabularies not so bad after all?__
FOAF seemingly tried it a long time ago – foaf:Person is a subclass of “http://xmlns.com/wordnet/1.6/Person”, foaf:Document of “http://xmlns.com/wordnet/1.6/Document” and so on. Linking to external schemas/ontologies (or making use of their classes and properties directly) can definitely help in facilitating semantic interoperability. For a long time, many web developers were very skeptical about such ‘top-down’ approaches to data integration, but recently the recognition of the potential value of such resources seems to be increasing. In parallel, the last one to two years brought us some very large upper ontologies that are available as linked data, such as:
Wordnet 2.0, hosted by the W3C
Yago/DBpedia
OpenCyc (now with new URIs)
UMBEL (derived from OpenCyc and others).
I think the practice of re-using and linking to such upper ontologies should become popular (again). It helps in creating a highly interlinked Semantic Web, and helps to avoid re-inventing the wheel for each new schema/ontology. This linking should not be done post-hoc, but should be a central part of the early stages of vocabulary/ontology/data creation.
__Cleaner schemas and ontologies__
Working with established ontologies and schemas in ontology editors can be a chore. Most have dependencies on other ontologies, but don’t use owl:imports. Most use an awkward mix of OWL statements and RDF(S), resulting in ontologies that are OWL Full. Many require some OWL reasoning to make use of sameAs statements and inverse properties, but at the same time reasoning is complicated because the ontologies are OWL Full or even contain logical inconsistencies. Often enough, there seems to be no practical reason for the design choices that caused the trouble: some minor changes can turn a messy OWL Full ontology into an OWL Lite or OWL DL ontology. At the moment, many different working groups have created local versions of schemas such as FOAF or Dublin Core that are valid OWL DL to fix that problem.
It doesn’t have to be this way.
Trying to adhere to OWL Lite/DL and adding owl:imports statements can help build cleaner, modular and more sustainable ontologies, and does not require significant additional effort during the creation of ontologies. Maybe we can find a consensus that this would be a worthwhile goal, and develop plans towards reaching that goal.
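To illustrate how little tooling a first check would need, here is a small sketch that lists the external namespaces an ontology uses without importing them. It assumes the ontology is already available as (subject, predicate, object) tuples of full URIs with no literals; the `http://example.org/onto#` namespace, the sample triples and the function names are hypothetical.

```python
OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
FOAF = "http://xmlns.com/foaf/0.1/"
DC = "http://purl.org/dc/elements/1.1/"
EX = "http://example.org/onto#"  # hypothetical ontology namespace

def namespace(uri):
    """Cut a URI after its last '#' or '/' to approximate its namespace."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut + 1]

def norm(uri):
    """Drop trailing '#' or '/' so import URIs and namespaces compare equal."""
    return uri.rstrip("#/")

def missing_imports(triples, own_ns):
    """Namespaces used by the ontology that are neither built in nor imported."""
    imported = {norm(o) for (_, p, o) in triples if p == OWL + "imports"}
    used = set()
    for s, p, o in triples:
        if p == OWL + "imports":
            continue  # the import statement itself is not a "use"
        used.update(norm(namespace(term)) for term in (s, p, o))
    builtin = {norm(ns) for ns in (OWL, RDF, RDFS)}
    return used - imported - builtin - {norm(own_ns)}

# Hypothetical ontology: imports FOAF, but also uses Dublin Core without importing it.
triples = [
    (EX[:-1], OWL + "imports", FOAF),
    (EX + "Employee", RDFS + "subClassOf", FOAF + "Person"),
    (EX + "report42", DC + "creator", EX + "alice"),
]
missing = missing_imports(triples, EX)
```

This is only a heuristic: it guesses namespaces from URI shape and says nothing about whether the result is OWL DL. It is meant to show that the missing owl:imports statements are easy to spot mechanically.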
Sorry for still writing about last week, but the TRIPLE-I conference had far too many interesting topics to offer for me to be already through with them - promise, this blog post about wikis will be the last TRIPLE-I post.
An interesting use of wikis was introduced with the Moki plugin for Semantic MediaWiki, developed as a side product of the APOSDLE project. APOSDLE (EU-project leaders love their acronyms;-) aims to develop an Advanced Process-Oriented Self-Directed Learning Environment, which in plain language is a platform to support the process of learning at work. In the course of this project, a model of the enterprise knowledge had to be developed that was to be the collaborative result of domain experts within the enterprise and external knowledge engineers. The APOSDLE image video below conveys a sense of the complexity of the knowledge to be represented.
But on to Moki: As wikis are an ideal, readily available tool for collaboration, the simple solution was to build a plugin (Moki) for Semantic MediaWiki that allows users to structure and engineer the domain knowledge. Moki is a hierarchy builder that supports drag and drop so that categories and relations can easily be fitted in place - the special benefit of using Semantic MediaWiki was that the structure of the generated knowledge can be exported in Semantic Web compliant formats. Apart from the browser, no further software is required.
The APOSDLE website doesn’t yet offer any information about Moki, but a description can be found in the conference proceedings: Collaborative Knowledge Engineering via Semantic MediaWiki, by Chiara Ghidini, Marco Rospocher (who gave the presentation), Luciano Serafini, Viktoria Pammer, Barbara Kump, Andreas Faatz, Andreas Zinnen, Joanna Guss, Stefanie Lindstaedt.
For those looking for good arguments for setting up a wiki in a global business environment: Peter Kemper’s keynote was the perfect primer for that. Peter, a Knowledge Management portfolio manager at Shell’s IT-Department, gave some insights into the process of their conversion to wikis. Before there were wikis at Shell, they had global discussion forums, connecting 20,000 people around topics and questions, which were intensively used - the question whether wikis should be adopted or not alone generated 800 responses in these forums.
Instead of going for team wikis, Shell opted for the encyclopedic approach and a wiki that would be accessible to anyone at Shell, and for using MediaWiki - which was, interestingly, the first open source software ever used at Shell. Peter Kemper named scalability and the lean architecture as prime arguments for MediaWiki, and they have indeed not had any technical hiccups so far. It was also an asset that people, being used to Wikipedia, know how to use the MediaWiki interface.
Examples of use cases with which the feasibility of wikis within Shell was tested were: Drilling salt, Geology of the Atlantic Margin, and Production Chemistry. Before that, the main media for maintaining and passing on knowledge had been emails and Powerpoint - not exactly because these were considered appropriate for knowledge management, but because of the effects these media had had on the communication within Shell:
With the advent of email, people wrote fewer and fewer memos. Fewer and fewer reports were sent to the archive, because people kept Powerpoint presentations. If that same information, previously locked in emails and Powerpoint, now went into the wiki, it would finally be accessible to everyone in the company.
Peter Kemper allowed us a glimpse of the information their wiki held, for instance, about the Atlantic Margin - as geological structures are described, most of the information relies on images. It would be a nightmare to maintain this kind of information in Powerpoint! No offense meant: Powerpoint is good for presentations but not for creating and maintaining a knowledge base. According to Peter, with wikis Shell achieved six times the productivity in comparison to using Powerpoint, in particular due to the linkability of content.
Wikis also turned out to be the superior solution for the integration of curricula from an internal learning environment, as wikis support the modular structure of a learning curriculum. Furthermore, they are also a good means to sustain communication in the time between workshops or team meetings.
At Shell, they even use the wiki, for instance, for the translation of contracts into the requirements of day to day procedures - a typical contract in the business that Shell is in has around 400 pages, and it is probably not very likely that a single person is going to read (and immediately understand) the entire contract. In this regard, the wiki also serves as a tool to translate lawyer-readable prose into transparent instructions (and there are probably many more ways in which wikis can be used to support business processes, a statement also put forward by Rolf Sint from Salzburg Research; see his 12 seconds statement below).
A noteworthy detail about the integration of wikis in Shell’s IT architecture: If a user logs onto the wiki for the first time and goes beyond the disclaimer, a new wiki account is automatically created that is identical with his or her windows account - this is not about checking on people, Peter Kemper said, but about creating organisational transparency.
On the one hand, this reveals whether there are organisational units within Shell where the wiki is not as intensively used as elsewhere, meaning that these units probably have specific needs which need to be addressed first. On the other hand, people can (and do) also contact each other via the wiki, e.g. one can contact the person who created an article if one is in need of further information.
About stimulating content production: 60% of Shell’s employees will go into retirement over the next eight years, and with them knowledge that is needed in the company. They even asked and paid former employees to come out of retirement to work on the wiki - that’s what I call commitment to content creation and knowledge preservation.
The Shell wiki already has more than 40,000 registered users (with 150,000 employees in the company, plus contract staff). What is interesting regarding user activation is that the number of active users stays relatively the same, even if the number of users in total increases. Peter Kemper’s account for this was that content comes in waves, meaning that users are activated in those areas where fresh knowledge is generated.
Kemper distinguished three types of users: content owners who create content from scratch; content editors who often just correct syntax or make things ‘look nicer’; and information consumers. Kemper rejected the term ‘lurkers’ for information consumers as looking for information is an activity in itself.
FRDCSA - “Our goal is to assemble the most comprehensive ontology of FLOSS [free/libre/open source software] applications, and make packages available for every free operating system, distribution and hardware platform”
…fighting the web is like holding back the ocean; it will route around you or it will wear you down, but will never go away, and it will never tire or give up.
We submitted five papers for OWLED 2008 EU—for people who aren’t attending, or in case one or more of these aren’t accepted (hard to believe, I know!), here are the PDFs:
The first three represent products that we either support presently or are actively developing. The last two discuss areas we’re moving into in the future. All together, these five papers give a fair indication of what we’ve been up to lately.
OWLED is a community event, to be sure; but we’ve always felt a special kinship with it, since Bijan Parsia was one of the founders and all that happened while many of us were at UMD’s Mindlab. In some sense, it’s our “home conference”—which is a bit weird for a startup, granted, but our biz model is commercializing early-stage research, so that suggests having at least a few toes of one foot firmly in academia.
We’ve sponsored OWLED corporately over the past few years, and I helped organize (with Peter Patel-Schneider) the previous OWLED this past April in Washington, DC.
That said, OWLED 2008 EU is in Karlsruhe, Germany, colocated with ISWC 2008, and that’s a long and expensive (thanks to the dollar’s weakness in the EuroZone) trip. But we’ve got a lot planned for OWLED, so while we’re not yet sure who will be there representing C&P, someone will be. We’re busy now writing papers about:
As people discover from time to time, OWL isn’t really a schema language; because of its semantics (open world assumption, etc), instance data that looks different from schema data doesn’t lead to the reasoner declaring a constraint violation. Rather, it may lead to the inference of new knowledge.
This is neither a good nor bad thing; it just means there are at least two Big Use Cases people want to use OWL for: inferring new knowledge (CHECK) and abstract, expressive schema language (FAIL).
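A toy example may help show the difference between the two readings. The sketch below takes a single domain-style axiom and applies it once as an inference rule and once as a constraint check. The `ex:` vocabulary, the data and the function names are hypothetical, and this is not how Pellet or any other OWL reasoner is implemented.

```python
# Toy illustration only: hypothetical vocabulary and data.
# Maps a property to the class its subjects are expected to belong to.
DOMAINS = {"ex:hasSupplier": "ex:Product"}

data = {
    ("ex:widget", "rdf:type", "ex:Product"),
    ("ex:widget", "ex:hasSupplier", "ex:acme"),
    ("ex:gadget", "ex:hasSupplier", "ex:acme"),  # ex:gadget has no declared type
}

def infer_types(triples):
    """Open-world reading: the axiom lets us conclude the subject's type."""
    derived = {(s, "rdf:type", DOMAINS[p]) for (s, p, o) in triples if p in DOMAINS}
    return derived - triples

def violations(triples):
    """Constraint reading: the same axiom flags subjects lacking the type."""
    return {
        s
        for (s, p, o) in triples
        if p in DOMAINS and (s, "rdf:type", DOMAINS[p]) not in triples
    }

new_facts = infer_types(data)   # ex:gadget is inferred to be an ex:Product
broken = violations(data)       # ex:gadget is reported as invalid data
```

The same axiom and the same data give "new knowledge" under the first reading and "bad data" under the second, which is why an integrity constraint semantics has to be defined alongside the standard one rather than bolted on.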
But giving a new—alternative or supplemental—semantics to OWL is non-trivial, which is why when NIST put out an SBIR solicitation for R&D related to using OWL in Supply Chain Management, we submitted an integrity constraints proposal. (Also, a sidenote for OWL Haters: you can’t, by definition, get more practical or “real world” than SCM. Not possible.)
Happily, we found out recently that we were awarded this funding (with very strong technical evaluations—yay!) and work is beginning soon:
We’re working on an OWLED EU paper, with Evan Wallace of NIST, that describes the design space; I’ll post links to the paper ASAP.
We’re actively soliciting use cases and requirements for OWL integrity constraints: please get in touch if you have feedback.
There will be some software released by early 2009 at the latest for experimentation.
SBIR Phase I funding will cover this work; but the follow-on, Phase II, is considerably larger and that’s where, if we can snag Phase II, we’ll push the prototype to production-grade, integrate it with the rest of Pellet and other OWL-based systems of ours (Owlgres comes to mind), and make a concerted “push to market”.
An industrial-quality Integrity Constraint subsystem for Pellet will open it up for use in new apps and, thus, in new markets. That’s exciting and, as always, we’ll be looking for commercial partners to work with us on those challenges.
Lee Feigenbaum has put together a really nice posting discussing different ways of modelling statistical data using RDF. I wanted to contribute to that discussion and add in a few comments about how I've been modelling some of the OECD's statistical publications using RDF.
Note the emphasis: what I've been doing is capturing metadata about individual statistical tables and graphs, their association with specific publications, their metadata, etc. I've not attempted to capture the detail of the statistics themselves, but do have a few relevant comments there.
The background to this is that I'm currently technically leading a project to build the latest version of OECD's electronic library. All of the metadata is stored in RDF, with content available as HTML, PDFs, Excel spreadsheets or as views into the OECD.stat application that the OECD have developed as a power tool for housing and delivering their statistical data.
As Lee discovered in the EuroStat data, regions and countries are core concepts. All of the OECD's statistical output can be classified by country and region, and these are types defined within our schema. We assign URIs to the countries using either the ISO 3166-1 alpha-2 country code or, in the case of classifying data that refers to countries that no longer exist as a specific entity (e.g. Yugoslavia), the ISO 3166-3 four-letter country code.
A country may be associated with zero or more Regions, using an Is Part Of relationship. A region may be the European Union, OECD member states or other arbitrary grouping. I suspect the same basic requirements will apply to other statistical datasets.
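As a sketch of the scheme just described, the snippet below mints country URIs from ISO codes and records the Is Part Of links to regions. The base URI, the function names and the dictionary layout are assumptions made for illustration and do not reflect the actual OECD schema.

```python
BASE = "http://example.org/country/"  # hypothetical base URI

def country_uri(code):
    """Build a country URI from an ISO 3166-1 alpha-2 or ISO 3166-3 four-letter code."""
    code = code.strip().upper()
    if len(code) not in (2, 4) or not code.isalpha():
        raise ValueError(f"not a 2- or 4-letter country code: {code!r}")
    return BASE + code

# "Is Part Of" links: each country maps to zero or more regions.
PART_OF = {
    country_uri("FR"): {"European Union", "OECD member states"},
    country_uri("YUCS"): set(),  # former Yugoslavia (ISO 3166-3), no current region
}

def countries_in(region):
    """All country URIs recorded as part of the given region."""
    return {uri for uri, regions in PART_OF.items() if region in regions}

eu = countries_in("European Union")
```

Keeping the region membership as a separate relationship, rather than baking it into the URI, means a country can join or leave a grouping without its identifier changing.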
There are some other types of classification that we associate with the tables:
An indicator of whether the table is a "comparative table": e.g. does it include data from multiple countries?
An association between the table and a "Table Series" which constitutes a collection of tables published over time
The statistical Variables that the table contains, e.g. GDP
A summary of the time range that the table covers, e.g. "2007", "2005-2007", "2000, 2002-2005", etc. These are captured as simple literals for now as we have to do little/no processing on them at this level.
And then there's the usual collection of title, description, etc. all as multi-lingual literals. All tables are also assigned a DOI to provide a stable link that can be cited in publications. If the table was originally published in a specific Book or journal Article then that relationship is also captured.
Obviously this metadata is, largely, at a level above that which Lee has been exploring, but I thought this might provide some useful context. For anyone looking at capturing statistical data in RDF, there are some other useful places to look at for defining terms and drawing on prior experience.
Secondly, the Statistical Data and Metadata EXchange (SDMX) initiative is also worthy of a look. It's not RDF but, as well as defining XML Schemas and web services for exchanging statistical data, the guidelines include lists of cross-domain concepts and their mappings to those in use by EuroStat, OECD, IMF, etc. So there is plenty of scope for grounding RDF vocabularies for statistical data in a lot of prior art.
Finally, the OECD have some public documentation about the design and implementation of their "MetaStore" database that supports OECD.stat (it's a different beast to the Ingenta MetaStore, I should point out). For example, the document "Management of Statistical Metadata at the OECD" (PDF) has some interesting detail about the different types of metadata (structural, technical, publishing) that is stored in these multi-dimensional data cubes.