Archive

Archive for December, 2008

Help Us Build the 2009 Calais Roadmap

December 31st, 2008

We'd like to thank each and every one of you for contributing to the success of Calais over the last year. From an idea in mid-2007 to our launch in January 2008, it's been an amazing ride.

We're wrapping up the year with 9,000 registered users of the Calais service who are submitting one to two million transactions each day. You've built dozens, if not hundreds, of innovative applications and you've provided regular feedback to help us make the service more useful. Thank you. We truly could not have done it without you.

Now we'd like to ask for a little more help. We're pulling together our roadmap for 2009 and want to hear directly from you about what you'd like to see. As a user-driven project, Calais needs your feedback to make sure we're on the right path.

Please drop us a note at 2009@opencalais.com and let us know two things:

First: What would you like to see delivered in 2009? New extractions? Languages? New tools or API capabilities? Integration with your favorite content management system? Let us know - nothing is off the table.

Second: How are you using Calais? Even if it's experimental and not for public consumption, we'd love to hear what you're up to - or even what you're thinking about.

Again, thanks for your support. There are some exciting announcements coming with Release 4 of Calais in mid-January and we're looking forward to a great 2009. Your input will help make it so.

Our best wishes for a happy and healthy new year,

The Calais Team

English, Official Blog, feedback

Videos of 2008 ICWSM presentations

December 31st, 2008

The submission deadline for the Third International Conference on Weblogs and Social Media is just three weeks away. To inspire yourself to work on a submission, you can check out videos from the 2008 ICWSM which are online at Videolectures.net. Here are highlights.

English

El caparazón, new technologies, education, futures, web 3.0, web …

December 31st, 2008

UMBC ties for first in 2008 Pan-American Intercollegiate Team Chess Championship

December 30th, 2008

Congratulations to the UMBC Chess team and their advisor and our colleague, UMBC CSEE Professor Alan Sherman, for a first place tie in the 54th Pan-American Intercollegiate Team Chess Championship.

UMBC tied for first place with the University of Texas at Dallas (B Team) in the sixth and final round of the three-day 2008 Pan-Am Championship, which was held in Dallas. This year 29 four-person college teams competed in the annual event, which is known as the “World Series of College Chess”. UMBC has now won the Pan-Am tournament a record eight times. The final standings are available at swchess.

The two first-place winners will meet again with the third- and fourth-place teams, the University of Texas at Brownsville and Stanford, in the special Final Four of Chess tournament, which will be held in spring 2009.


The UMBC chess team. Front row, L to R: WGM Sabina Foisor, GM Timur Gareev, GM Sergey Erenburg, and GM Leonid Kritz (board one). Back row: UMBC coaches GM Sam Palatnik and NM Igor Epshteyn. Photo: Alexey Root.

English

Wikirage tracks what’s hot on Wikipedia

December 30th, 2008


Wikirage is yet another way to track what’s happening in the world via changes in social media, in this case, Wikipedia. As the site suggests, “popular people in the news, the latest fads, and the hottest video games can be quickly identified by monitoring this social phenomenon.”

Wikirage lists the 100 Wikipedia pages that are being heavily edited over any of six time periods from the last hour to the last month. You can see the top 100 by your choice of six metrics: number of quality edits, unique editors, total edits, vandalism, reversions, or undos. Clicking on a result shows a monthly summary for the article, for example, December 2008 Gaza Strip airstrikes, which is at the top of today’s list for number of edits as I write. I understand the Gaza article, but what’s up with the Tasmanian tiger?

The interface has some other nice features, such as marking in red the pages that have high revision, vandalism or undo rates, and showing the associated Wikipedia flags that indicate articles that need attention or don’t live up to standards. Wikirage is also available for the English, Japanese, Spanish, German and French language Wikipedias.

Wikirage was developed by Craig Wood and is a nicely done system.

(via the Porn Sex Viagra Casino Spam site)

English

Akshay Java Ph.D.: Mining Social Media Communities and Content

December 30th, 2008

Akshay Java defended his PhD dissertation this fall on discovering communities in social media systems and the submitted version is now available online. Akshay is now a scientist at Microsoft’s Live Labs. The citation, link and abstract are below.


Akshay Java, Mining Social Media Communities and Content, Ph.D. Dissertation, Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, December 1, 2008. Available at http://ebiquity.umbc.edu/paper/html/id/429/Mining-Social-Media-Communities-and-Content.

Social Media is changing the way people find information, share knowledge and communicate with each other. The important factor contributing to the growth of these technologies is the ability to easily produce “user-generated content”. Blogs, Twitter, Wikipedia, Flickr and YouTube are just a few examples of Web 2.0 tools that are drastically changing the Internet landscape today. These platforms allow users to produce and annotate content and more importantly, empower them to share information with their social network. Friends can in turn, comment and interact with the producer of the original content and also with each other. Such social interactions foster communities in online, social media systems. User-generated content and the social graph are thus the two essential elements of any social media system.

Given the vast amount of user-generated content being produced each day and the easy access to the social graph, how can we analyze the structure and content of social media data to understand the nature of online communication and collaboration in social applications? This thesis presents a systematic study of the social media landscape through the combined analysis of its special properties, structure and content.

First, we have developed a framework for analyzing social media content effectively. The BlogVox opinion retrieval system is a large scale blog indexing and content analysis engine. For a given query term, the system retrieves and ranks blog posts expressing sentiments (either positive or negative) towards the query terms. Further, we have developed a framework to index and semantically analyze syndicated feeds from news websites. We use a sophisticated natural language processing system, OntoSem, to semantically analyze news stories and build a rich fact repository of knowledge extracted from real-time feeds. It enables other applications to benefit from such deep semantic analysis by exporting the text meaning representations in Semantic Web language, OWL.

Secondly, we describe novel algorithms that utilize the special structure and properties of social graphs to detect communities in social media. Communities are an essential element of social media systems and detecting their structure and membership is critical in several real-world applications. Many algorithms for community detection are computationally expensive and generally, do not scale well for large networks. In this work we present an approach that benefits from the scale-free distribution of node degrees to extract communities efficiently. Social media sites frequently allow users to provide additional meta-data about the shared resources, usually in the form of tags or folksonomies. We have developed a new community detection algorithm that can combine information from tags and the structural information obtained from the graphs to effectively detect communities. We demonstrate how structure and content analysis in social media can benefit from the availability of rich meta-data and special properties.

Finally, we study social media systems from the user perspective. In the first study we present an analysis of how a large population of users subscribes and organizes the blog feeds that they read. This study has revealed interesting properties and characteristics of the way we consume information. We are the first to present an approach to what is now known as the “feed distillation” task, which involves finding relevant feeds for a given query term. Based on our understanding of feed subscription patterns we have built a prototype system that provides recommendations for new feeds to subscribe and measures the readership based influence of blogs in different topics.

We are also the first to measure the usage and nature of communities in a relatively new phenomenon called Microblogging. Microblogging is a new form of communication in which users can describe their current status in short posts distributed by instant messages, mobile phones, email or the Web. In this study, we present our observations of the microblogging phenomenon and user intentions by studying the content, topological and geographical properties of such communities. We find that microblogging provides users with a more immediate form of communication to talk about their daily activities and to seek or share information.

The course of this research has highlighted several challenges that processing social media data presents. This class of problems requires us to re-think our approach to text mining, community and graph analysis. Comprehensive understanding of social media systems allows us to validate theories from social sciences and psychology, but on a scale much larger than ever imagined. Ultimately this leads to a better understanding of how we communicate and interact with each other today and in future.

English

The true cost of sending SMS messages

December 28th, 2008

The NYT has an article, What Carriers Aren’t Eager to Tell You About Texting, about new interest in understanding why charges for SMS service have been increasing even while volume is up and communication costs are down.

I learned one interesting thing from the article about the length of SMS messages. I’d never thought much about where the limit on the number of characters came from. According to the article, the limit is 160 (7 bit) characters because that’s what will fit into the control channel messages that mobile phones exchange with cell towers.
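As a quick sanity check on that figure, here is a minimal sketch of the arithmetic. It derives the payload size from the article’s numbers (160 characters of 7 bits each); the 8-bit and 16-bit character widths are assumptions added only for comparison and are not mentioned in the article.

```python
# Payload size implied by the quoted figures: 160 characters of 7 bits each.
BITS_PER_OCTET = 8
GSM_7BIT_CHARS = 160
GSM_7BIT_WIDTH = 7

payload_bits = GSM_7BIT_CHARS * GSM_7BIT_WIDTH   # 160 * 7 bits
payload_octets = payload_bits // BITS_PER_OCTET  # the same payload in 8-bit bytes


def max_chars(bits_per_char, octets=payload_octets):
    """Whole characters of the given width that fit in the payload."""
    return (octets * BITS_PER_OCTET) // bits_per_char


if __name__ == "__main__":
    # Widths of 8 and 16 bits are illustrative assumptions, not from the article.
    for width in (7, 8, 16):
        print(f"{width}-bit characters: {max_chars(width)} per message")
```

So the whole message body is 140 bytes, which is why wider character encodings leave room for fewer characters per message.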

“The lucrative nature of that revenue increase cannot be appreciated without doing something that T-Mobile chose not to do, which is to talk about whether its costs rose as the industry’s messaging volume grew tenfold. Mr. Kohl’s letter of inquiry noted that “text messaging files are very small, as the size of text messages are generally limited to 160 characters per message, and therefore cost carriers very little to transmit.” A better description might be “cost carriers very, very, very little to transmit.”

A text message initially travels wirelessly from a handset to the closest base-station tower and is then transferred through wired links to the digital pipes of the telephone network, and then, near its destination, converted back into a wireless signal to traverse the final leg, from tower to handset. In the wired portion of its journey, a file of such infinitesimal size is inconsequential. Srinivasan Keshav, a professor of computer science at the University of Waterloo, in Ontario, said: “Messages are small. Even though a trillion seems like a lot to carry, it isn’t.”

Perhaps the costs for the wireless portion at either end are high — spectrum is finite, after all, and carriers pay dearly for the rights to use it. But text messages are not just tiny; they are also free riders, tucked into what’s called a control channel, space reserved for operation of the wireless network. That’s why a message is so limited in length: it must not exceed the length of the message used for internal communication between tower and handset to set up a call. The channel uses space whether or not a text message is inserted.”

There’s a lot more to the protocols, of course. The Wikipedia SMS article looks like a good place to start.

English

Yongmei Shi PhD: Linguistic Information for Speech Recognition Error Detection

December 26th, 2008

Yongmei Shi defended her PhD dissertation earlier this fall on using syntactic and semantic information to detect errors in spoken language systems, under the direction of Dr. R. Scott Cost (JHU/APL) and Professor Lina Zhou (UMBC). Her dissertation has been submitted and is now available online.

Yongmei Shi, An Investigation of Linguistic Information for Speech Recognition Error Detection, Ph.D. Dissertation, Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, October 2008.

After several decades of effort, significant progress has been made in the area of speech recognition technologies, and various speech-based applications have been developed. However, current speech recognition systems still generate erroneous output, which hinders the wide adoption of speech applications. Given that the goal of error-free output can not be realized in near future, mechanisms for automatically detecting and even correcting speech recognition errors may prove useful for amending imperfect speech recognition systems. This dissertation research focuses on the automatic detection of speech recognition errors for monologue applications, and in particular, dictation applications.

Due to computational complexity and efficiency concerns, limited linguistic information is embedded in speech recognition systems. Furthermore, when identifying speech recognition errors, humans always apply linguistic knowledge to complete the task. This dissertation therefore investigates the effect of linguistic information on automatic error detection by applying two levels of linguistic analysis, specifically syntactic analysis and semantic analysis, to the post processing of speech recognition output. Experiments are conducted on two dictation corpora which differ in both topic and style (daily office communication by students and Wall Street Journal news by journalists).

To catch grammatical abnormalities possibly caused by speech recognition errors, two sets of syntactic features, linkage information and word associations based on syntactic dependency, are extracted for each word from the output of two lexicalized robust syntactic parsers respectively. Confidence measures, which combine features using Support Vector Machines, are used to detect speech recognition errors. A confidence measure that combines syntactic features with non-linguistic features yields consistent performance improvement in one or more aspects over those obtained by using non-linguistic features alone.

Semantic abnormalities possibly caused by speech recognition errors are caught by the analysis of semantic relatedness of a word to its context. Two different methods are used to integrate semantic analysis with syntactic analysis. One approach addresses the problem by extracting features for each word from its relations to other words. To this end, various WordNet-based measures and different context lengths are examined. The addition of semantic features in confidence measures can further yield small but consistent improvement in error detection performance. The other approach applies lexical cohesion analysis by taking both reiteration and collocation relationships into consideration and by augmenting words with probability predicted from syntactic analysis. Two WordNet-based measures and one measure based on Latent Semantic Analysis are used to instantiate lexical cohesion relationships. Additionally, various word probability thresholds and cosine similarity thresholds are examined. The incorporation of lexical cohesion analysis is superior to the use of syntactic analysis alone. In summary, the use of linguistic information as described, including syntactic and semantic information, can provide positive impact on automatic detection of speech recognition errors.

English

The Contextual Web Arrives in Force (1)

December 25th, 2008

We said at the end of the article on web trends for 2009 that the future looks set to be about fighting information overload. Despite steady improvement in our cognitive ability to process it properly, that overload will push us to be far more selective and to filter our online information sources more and more, using social or (once again) semantic criteria and ever more efficient tools.

The idea is powerful and seems to go a step beyond the strict criteria that any application apparently has to meet to be considered semantic: the contextual web aims for browsers and pages to recognize more precisely what the user really wants to find. Fewer options and more meaning, fewer Google searches and more context, in pursuit of the following improvements to the user experience:

  • Relevance: understanding context better makes content more relevant to the user.
  • Efficiency (shortcuts): contextual shortcuts would make searching easier.
  • Personalization: context is based on the user’s intentions and browsing history.
  • Remixing (mashups): in open environments, it can return relevant information that interoperates across web services (Ubiquity can insert a map into an email very easily).

These contextual technologies are often built on the languages of the semantic web. They also rely on the philosophy of open APIs (which allow different web applications to interact).

Plain HTML, without semantic markup such as XML, RDF or microformats, among other semantic markup (metadata) languages, did not allow interaction with the browser at today’s levels. Today, when the browser can infer ideas about the pages we visit, it can return related and/or relevant information.

As we said when discussing the semantic web, the contextual web understands user behavior to a greater degree. Combining information about the page with the user’s behavior is what creates context and, therefore, a smarter web.

I do not think that, as Alex Iskold claims (some of the ideas in this post are drawn from him), the contextual web is going to overtake, in frequency of use, today’s habit of approaching information through search engine results. Not at first, given the logical premise that there is no context without prior information (information + behavior = context), so our first approach to what we are looking for will almost always have to go through search engines. But it will afterwards, sparing us many unnecessary clicks and making our subsequent browsing much more efficient.

I also think, going beyond the original idea, that the topic should include a few more things that should likewise be considered contextual:

- Geolocation, that is, offering content according to where the user physically is

- Content of “social” relevance: what we prefer because it is what our contacts on social networks prefer.

- It would also learn from our behavior as users, so that we do not keep running into results we consider irrelevant (Google is already putting into practice a system for filtering personalized results based on our previous ratings, Google SearchWiki)

One way to add metadata to the pages we write is microformats:

I always present microformats as easy-to-understand precursors of the semantic web. They offer a way, compatible with XHTML standards, to embed metadata about various things, telling the browser that they are people, places, events, reviews, etc.

Web Slices, introduced by Internet Explorer 8, for example, understand the hAtom microformat. Web Slices let those of us who publish content notify IE8 users of any change in the information on our web pages. Weather.com could, for example, create a Web Slice that notifies the user of any update to the local weather. The concept is similar to what content syndication systems (feeds) do, but focused more narrowly on parts of the page and letting the user interact with the site directly, through the browser, on the page.
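To make the idea of embedded metadata concrete, here is a minimal sketch of how a program might read hAtom-style markup. It uses only Python’s standard `html.parser` and looks for the `entry-title` class defined by hAtom; the `EntryTitleParser` class and the sample HTML are hypothetical illustrations, not how IE8 or any real microformat library works.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class EntryTitleParser(HTMLParser):
    """Collect the text of every element whose class list includes 'entry-title'."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._depth = 0    # nesting depth inside the current entry-title element
        self._buffer = []  # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1
        elif "entry-title" in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.titles.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


# Hypothetical page fragment marked up with hAtom class names.
SAMPLE = (
    '<div class="hentry">'
    '<h2 class="entry-title">Local <em>weather</em> update</h2>'
    '<div class="entry-content">Sunny, 12 degrees.</div>'
    '</div>'
)

if __name__ == "__main__":
    parser = EntryTitleParser()
    parser.feed(SAMPLE)
    print(parser.titles)
```

The point is only that the class names carry the meaning: the same page stays ordinary XHTML for a human reader while a program can pick out the titles.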

XML does a similar job in applications such as Cooliris, signaling to the browser whether or not a page contains images so that the visitor can view them in 3D. AdaptiveBlue works on the contextual web through ABMeta, a format for annotating pages that contain information about books, music, movies, products, restaurants, etc.

All of these approaches rely on marking up pages. And although some people who care about the semantic web take the time to do it, most pages are still written in plain HTML.

The contextual web in browsers

Both Internet Explorer and Firefox have incorporated elements of the contextual experience through different kinds of shortcuts: Internet Explorer 8 introduces a new technology for this with its Accelerators.

According to Microsoft, Accelerators offer access to common online services from any page we visit. They are small pieces of XML with variables predefined by the browser itself: the active URL, the active domain and the selected text. The most common Accelerator action is looking up contextual information based on what the user has selected. Another typical example is looking up maps from addresses.

In this case, it is not about semantics. Accelerators are still cumbersome to use and demand a fair amount of time and intervention from the user. Firefox improves on the menu-based approach by offering contextual technology through text. It does so with Ubiquity, today only an extension but quite possibly a key feature in future releases.

I have been trying it out this afternoon and we will look at it in more depth in an upcoming post, but in short, we could say that it returns user-generated mashups based on language. It works much like Accelerators: the user can select a piece of text, invoke Ubiquity and type a command. Hundreds of commands have already been implemented.


In the second part of this post we will look at widgets for blogs and add-ons for browsers (Firefox). Anyway… I now owe you two posts ;)

Have I wished you a Merry Christmas yet?

Spanish

Social media conferences, symposia, workshops and events

December 25th, 2008

JD Lasica’s SocialMedia.biz blog has a post, 2009 conferences: Social media, tech, marketing, that lists “some of the best social media, technology, media and marketing conferences for the upcoming year” in the US. The list doesn’t include any technology research-oriented conferences, but does have quite a range of others. The post invites everyone to suggest additional entries by adding comments about them. (I suggested ICWSM and the AAAI Spring Symposium on the Social Semantic Web.)

This list complements Akshay Java’s Social Media Events calendar which is focused mostly on research conferences. He also invites suggestions which you can submit by email or through comments.

English

WWGD: Understanding Google’s Technology Stack

December 24th, 2008

It’s popular to ask “What Would Google Do” these days — The Google reports over 7,000 results for the phrase. Of course, it’s not just about Google, which we all use as the archetype for a new Web way of building and thinking about information systems. Asking WWGD can be productive, but only if we know how to implement and exploit the insights the answer gives us. This in turn requires us (well, some of us, anyway) to understand the algorithms, techniques, and software technology that Google and other large scale Web-oriented companies use. We need to ask “How Would Google Do It”.

Michael Nielsen has a nice post on using your laptop to compute PageRank for millions of webpages. His post reviews PageRank and how to compute it, and shows a short but reasonably efficient Python program that can easily handle a graph with a few million nodes. While that is not sufficient for many applications, like the Web, there are lots of interesting and significant graphs this small Python program can handle: Wikipedia pages, DBLP publications, RDF namespaces, BGP routers, Twitter followers, etc.
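For readers who want a feel for the computation, here is a minimal sketch of PageRank by power iteration in plain Python. This is not Nielsen’s program; the `pagerank` function, the damping factor of 0.85, the fixed iteration count and the toy graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by dangling pages (no outgoing links) is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        # Every other page splits its damped rank equally among its targets.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical three-page graph: "a" is linked to by both "b" and "c".
    toy_graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
    for page, score in sorted(pagerank(toy_graph).items()):
        print(page, round(score, 3))
```

A dictionary-of-lists version like this is easy to read but slow; for graphs with millions of nodes one would want an array-based representation.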

The post is part of a series Nielsen is writing on the Google Technology Stack, including PageRank, MapReduce, BigTable, and GFS. The posts are a byproduct of a series of weekly lectures he began giving earlier this month in Waterloo. Here’s how Nielsen describes the series.

“Part of what makes Google such an amazing engine of innovation is their internal technology stack: a set of powerful proprietary technologies that makes it easy for Google developers to generate and process enormous quantities of data. According to a senior Microsoft developer who moved to Google, Googlers work and think at a higher level of abstraction than do developers at many other companies, including Microsoft: “Google uses Bayesian filtering the way Microsoft uses the if statement” (Credit: Joel Spolsky). This series of posts describes some of the technologies that make this high level of abstraction possible.”

Videos of the first two lectures (Introduction to PageRank and Building our PageRank Intuition) are available online. Nielsen illustrates the concepts and algorithms with well-written Python code and provides exercises to help readers master the material as well as “more challenging and often open-ended problems” which he has worked on but not completely solved.

Nielsen was trained as a theoretical physicist but has shifted his attention to “the development of new tools for scientific collaboration and publication”. As far as I can see, he is offering these as free public lectures out of a desire to share his knowledge and also to help (or maybe force) him to deepen his own understanding of the topics and develop better ways of explaining them. Either way, it is an admirable and inspiring example for us all and appropriate for the holiday season. Merry Christmas!

English

Why don’t we link?, a network for psychologists, Taller total, Pasapalabra and video tutorials on the semantic web

December 23rd, 2008

The Christmas recommendations have arrived at El caparazón. These are some of the sites I think deserve a visit this week. Some of them are very special to me, so treat them well ;)

  • Reflexiones e irreflexiones - Why don’t we link? The data Fernando presents gives plenty of food for thought about the country’s picaresque culture. Nearly last in the noble practice of linking.
  • Entre psicologos - A psychology group for discussing clinical cases, psychology questions and related therapies: As a tribute to my lost vocation, I took part in conceiving this community for psychologists, created on Ning, to clear up questions about clinical cases (diagnosis, treatment…) and, in short, to help with therapists’ often solitary professional practice.
  • PASAPALABRA ONLINE: One of my favorite vices, now online. For switching off for a while…
  • Multimèdia a eTT: A showcase for the video and Flash work of Teresa Julià, a fellow student of mine at the UOC some time ago. I am sure this year’s Flash Christmas greeting will brighten your day a little.

    nadala-2008

  • VideoLectures.net on the semantic web: More than 90 high-quality videos and talks from the seventh International Semantic Web Conference.

Spanish

Make Your Own Digital Newspaper

December 23rd, 2008

Before heading into the holidays, one may wish that the news we get every day were somehow customized to our interests. For example: “I am not really interested in baseball,” or “I’d like jazz news to appear at first glance,” or “I need to monitor emerging progress on synthetic insulin,” and so on. People have a variety of primary interests, but they have to collect this information from different places every day, or by clicking through a bunch of links. Why not have my own newspaper where every column is about an interest I selected, laid out the way I want?

We built my.hakia.com, which does exactly what is described above. A screenshot is shown below.

myhakia1

The screenshot above tells the whole story except for one important differentiator.

hakia’s semantic technology allows a high level of precision compared to other similar platforms. This lets the user park highly specific questions against emerging news, so my.hakia.com can be considered an “intelligence gathering dashboard”. Let us tell you how.

If you search Google news for Obama’s strategy for the new team, you will see that the results are mostly irrelevant. Try to create a Google alert for this query and see the results for yourself on a continuous basis.

The same search at hakia for Obama’s strategy for the new team produces dead-on results. This is because semantic technology, unlike Google-esque search engines, does not need “link referrals” to pull relevant results. For dynamic content like news, there is no time to collect link-referral statistics, and that is why Google-esque search engines fail beyond simple triggers. Try the same query at Yahoo; it displays the same confusion.

This fundamental difference is a valuable asset for my.hakia.com users because they get precise results for specific interests that they cannot get anywhere else. Some ideal use cases are outlined below:

- Monitor your business competitors
- Get information on latest progress in the treatment of diseases
- Keep an eye on your favorite artists by activity (like album releases)
- Stay in touch with particular economic developments (such as in real estate in your city)

Try my.hakia.com, and tell us if we have met your expectations.

Happy holidays

English

Videos of Semantic Web talks and tutorials from ISWC 2008 now online

December 22nd, 2008

High quality videos of tutorials and talks from the Seventh International Semantic Web Conference are now available on the excellent VideoLectures.net site. It’s a great opportunity to benefit from the conference if you were not able to attend or, even if you were, to see presentations you were not able to attend.

Videolectures captured the slides for most of the presentations (which are available for downloading), and their site shows both the speaker’s video and slides in synchronization. Videolectures used three camera crews in parallel, so they were able to capture almost all of the presentations. Here are some highlights from the ~90 videos to whet your appetite.

English

Web trends for 2009

December 22nd, 2008
In 2009 the web will evolve toward personalization, toward making the web experience something much more real (“Web real world”), adapted to our needs, less disconnected (web services assisting offline needs, such as geolocation, the mobile web, etc.) and therefore much more satisfying for the end user.

delicious

Tom Briggs Ph.D.: Constraint Generation and Reasoning in OWL

December 22nd, 2008

Tom Briggs defended his PhD dissertation last month on discovering domain and range constraints in OWL and the final copy is now available.

Thomas H. Briggs, Constraint Generation and Reasoning in OWL, 2008.

The majority of OWL ontologies in the emerging Semantic Web are constructed from properties that lack domain and range constraints. Constraints in OWL are different from the familiar uses in programming languages and databases. They are actually type assertions that are made about the individuals which are connected by the property. Because they are type assertions these assertions can add vital information to the individuals involved and give information on how the defining property may be used. Three different automated generation techniques are explored in this research: disjunction, least-common named subsumer, and vivification. Each algorithm is compared for the ability to generalize, and the performance impacts with respect to the reasoner. A large sample of ontologies from the Swoogle repository are used to compare real-world performance of these techniques. Using generated facts is a type of default reasoning. This may conflict with future assertions to the knowledge base. While general default reasoning is non-monotonic and undecidable a novel approach is introduced to support efficient contraction of the default knowledge. Constraint generation and default reasoning, together, enable a robust and efficient generation of domain and range constraints which will result in the inference of additional facts and improved performance for a number of Semantic Web applications.
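
To make the point about constraints being type assertions a little more concrete, here is a minimal sketch in plain Python. It is not code from the dissertation and it does not implement any of the three generation techniques; it only shows the standard entailment that a domain or range constraint gives you. The `ex:` names and the `infer_types` helper are made up for illustration.

```python
RDF_TYPE = "rdf:type"

def infer_types(triples, domains, ranges):
    """Return the rdf:type triples entailed by domain and range constraints.

    If property p has domain C, every subject of p is inferred to be a C.
    If property p has range D, every object of p is inferred to be a D.
    """
    inferred = set()
    for s, p, o in triples:
        if p in domains:
            inferred.add((s, RDF_TYPE, domains[p]))
        if p in ranges:
            inferred.add((o, RDF_TYPE, ranges[p]))
    return inferred

# Hypothetical data: one assertion and one constrained property.
triples = [("ex:alice", "ex:teaches", "ex:cs101")]
domains = {"ex:teaches": "ex:Teacher"}
ranges = {"ex:teaches": "ex:Course"}

inferred = infer_types(triples, domains, ranges)
for triple in sorted(inferred):
    print(triple)
```

Nothing is rejected here, which is the contrast with a database constraint: the property is not checked against the individuals, the individuals simply gain types.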

English

Web trends for 2009

December 21st, 2008

This is a necessary, carefully prepared, and obligatory post at this time of year. This time it coincides with, and grows out of, the collaboration I have begun with the DIM-Espiral Association in its periodical, specifically its section on the latest trends on the net.

It derives from the article published in issue 11 of the digital magazine BITS-Espiral, which contains interesting material; I would highlight an interview with the Director of Google Spain, which you can watch on video from the link.

My thanks from here to Juanmi Muñoz and to Irene Pelegrí, its editor, for their work on the publication.

Thank you all for the significant growth of this blog, for which you are most directly responsible. HAPPY, very HAPPY 2009.

I hope you will forgive the length of this entry and understand that there is a great deal to cover, both in what has happened and in what is still to come in this new world.

  • 2008: the socialization of the Hispanic web.

A frenetic year of news and of unstoppable development of the social web. Even more so, as we predicted in Trends for 2008, in the Spanish-speaking sphere.

Social networks and other services such as Facebook, Reddit (a news aggregator) and, more recently, Friendfeed (lifestreaming) launched and successfully conquered our market, especially the first of these.

The absolute protagonist of the Spanish-speaking 2008, following the now-legendary example of Wired, has been us, the users.

We, with our growing skill, capacity, and willingness to share on fertile ground, have left behind an ecosystem that is richer in content every day.

Countless new services, or clones of existing ones (such as Yahoobuzz, an alternative to Digg, or the various versions of Twitter), emerged in 2008, along with significant anecdotes such as Google coming to the streets of our cities, or the important role of the net in the US elections (less so in the Spanish ones), which completed the picture.

At the same time, various applications showed a rebellious streak, promising us a future, or at least a better user experience, in the promising web 3.0: 3D browsers that give us greater immersion in images or text, developments in the Internet of Things, and semantic web applications. Among the latter, Twine, the first social and content network that works internally with semantic algorithms; Zemanta, an application that enriches the content we generate; and Calais’s developments in semantic tagging occupy, in my opinion, a privileged place.

As for tools that have evolved, I think the star example is, once again, Twitter, which many saw as a passing fad and which is now becoming almost indispensable on the new web.

  • 2008: blogs as new and indispensable communication tools:

As for blogs, far from dying they kept growing, improving in quality, and widening their role from the personal journal to citizen journalism. An important figure confirmed this for us just before this article was finished:

The use of the internet by the 100 leading newspapers in the US changed radically in 2008. According to the industry’s annual report, 58% of these digital media consult and use sources of user-generated content, compared with 27% in 2007.

One more sign that, in 2008, everything is being extended, defined, and updated, in my opinion more as a fashion than out of conviction: Education, Politics, Administration, Marketing… all now carry the indispensable tagline of the year on the Spanish-language internet: the “2.0” meme.

  • 2009: new portals, open APIs:

Specialization…, increasingly elaborate proposals for new portals that take the philosophy and culture of open APIs to their ultimate consequences, as bets on the development of collective intelligence. At the mainstream (mass-use) level, this may be a utopia whose protagonist, before long, and unless Microsoft arrives to end or complete the story through a very, very predictable final purchase, will be Yahoo:

Thus the portal proposed by Yahoo for 2009, a model that many others (AOL, for example) will in all probability follow, will be an aggregate of third-party widgets and applications (video from Netflix, music from iTunes, products from Amazon…). It is the open-API philosophy of Facebook, but giving greater importance to each brand and, above all, operating outside Facebook’s own “walled gardens”.

  • 2009: single identities, managing the chaos of social networks:

Also contributing to this scenario is the development of Google Friend Connect, Facebook Connect, and other tools that will try to free us from information overload and chaos through single identities or passwords for accessing every application on the web, and to simplify our social graphs.

It is even likely that we will see applications similar to these but which respect users’ rights on the social web, freeing us from brands and opportunistic owners of our data and of the data about our declared relationships. Thus, things like Noserub for social networks, self-hosted applications for lifestreaming (the flow of our activity on the net), or… ad servers (advertising systems) such as OpenX will emerge as powerful alternatives, even to Google’s advertising monopoly.

  • 2009: more applications, web 3.0, continued development of the Hispanic web:

Many new things await us: the arrival of Digg in Spanish, Amazon in our sphere too, new attempts to bring the semantic web closer to users… and a significant increase in spam and viruses of the most varied kinds (for social networks as well) are among the other things that will be in the news in the year of globalized austerity.

Other details will be portable electronic-ink devices as thin as a pad of paper (Plastic Logic), new ways of democratizing and monetizing the work of creators in the arts, and new milestones in the evolution toward web 3.0, such as Imindi, a promising alternative to Twine as a social network for the semantic web, or Tikitag, which is building an Internet for things, objects, and consumer products using RFID chips.

  • 2009: the web 2.0 paradigm:

But the most important thing is that 2009 looks set to be, at last, the year in which “2.0” evolves from a fashion, a simple meme, into a new social, cultural, and economic paradigm that nobody wants to be left out of. It will be now that North American and European companies bet heavily on introducing Web 2.0 technologies and concepts into their processes and organizational models.


General trends on the web for 2009:

The analysis we published some time ago of the technology strategies for 2009 from Gartner (a world-leading IT consulting and research company, an indispensable partner to 60,000 clients in 10,000 organizations and 80 countries), presented at what has been described as the most important annual conference on strategic technology consulting, highlighted the trends that most organizations should take into account over the coming year:

- Cloud computing:

Although it is by now a familiar phenomenon for companies, it was in the second half of 2008, and it looks like it will be in 2009, that we read most about the term, which is already almost as meme-like as Web 2.0.

There is talk of web-oriented enterprise architectures, of virtual servers, and of virtualization of data storage and client devices (the applications or software that today sit on the company’s physical resources, machines, or servers). Platform as a service (PaaS), software as a service (SaaS), and so on are related terms.

As web 2.0 has done for the general public, the phenomenon describes a style of computing in which providers deliver a variety of Internet-based services to consumers, in this case companies and organizations.


- Mashups, more specialized systems for formal environments:

The trend today is to use common web 2.0 systems for various functions within the company or organization. The maturity reached by many social web services will lead to the appearance of new tools that are more secure and better adapted to formal environments.

As an extension of this theme, in the metaverses we saw in 2008 the emergence of Second Life for education, conferences, and business communications. Environments such as that of Fosterra, a company about to release a report according to which its specialized virtual universes are ideal environments, more cost-effective and effective than current web conferencing services, will be an area of significant development in 2009.


- Software, social networking:

Social networks, collaboration, corporate blogs (despite the controversy over their feasibility)… organizations will consider the social side of their websites or applications. The risk is that of falling behind, of being left out of today’s global conversation.


- Knowledge management, also 2.0:

Gathering and sharing knowledge is the most important cultural and technological priority of those cited in the report. Better, smarter decisions, based on better-organized content or on semantic technologies (that addition is mine) that improve and speed up business decisions in an environment that grows more competitive by the day.

In relation to this, according to a recent report by the consultancy Gartner, the virtual generation requires companies and services to create or connect with social applications. By 2010, more than 60% of the top 1000 companies in the US will be involved in or will create some kind of virtual community of practice (CoP), or a community as a means of improving customer relations. We also saw this in “2009, the year of the community managers”.


- More responsible technologies:

Regulations are trying to adapt to exponential technological growth, limiting the ecological risks of the uncontrolled growth of, for example, data centers (physical spaces for storing information).

Finding new, more environmentally responsible forms of growth is now a priority for protective governments and for companies that want to meet their social responsibility commitments without sacrificing their technological progress.

As for applications, we will probably see more social, even public, uses of technologies that have proved suitable for managing natural disasters, among other public-service problems. Microblogging (Twitter) and the social uses of mobile telephony in underdeveloped countries are promising models that we will, in all probability, see advance during the coming year.

Finally, a recent research report from Forrester (there is an interesting presentation of it at this link) puts us on the trail of the main 2009 milestones for social media marketers:


- Social shopping:

With Facebook Connect, Google Friend Connect, and OpenID, consumers will easily be able to see opinions and reviews from their social network contacts, people they trust or know.


- Twitter as a horizontal communication tool:

Despite the resistance it initially provokes, Twitter keeps growing. It is hard for anyone to resist using it.


- Measuring success in social networks:

If 2008 saw the emergence of countless services and social networks, it is now time to evaluate which are best suited to each type of campaign.


- Quality and rise of vertical social networks:

Market segmentation and sector-specific advertising, far more specialized and tailored to interests, will find their breeding ground in vertical networks in 2009. There is already an audience for this kind of network, and they will now grow.

In short… I would say that, in general, this will be the fundamental trend: fighting information overload, which, despite the steady improvement in our cognitive ability to process it properly, will lead us to be far more selective and to filter our sources of information on the net by social or (once again) semantic criteria, more and more, and with more efficient tools.

 

Conclusion: THE ECONOMIC CRISIS WILL DRIVE WEB 3.0:

I still believe what I told you some time ago: the evolution will be toward personalization, toward making the web experience something much more real (“Web real world”), adapted to our needs, less disconnected (web services assisting offline needs, such as geolocation, the mobile web, etc.) and therefore much more satisfying for the end user.


Spanish

Disco: a Map reduce framework in Python and Erlang

December 21st, 2008

Disco is a Python-friendly, open-source Map-Reduce framework for distributed computing with the slogan “massive data - minimal code”. Disco’s core is written in Erlang, a functional language designed for concurrent programming, and users typically write Disco map and reduce jobs in Python. So what’s wrong with using Hadoop? Nothing, according to the Disco site, but…

“We see that platforms for distributed computing will be of such high importance in the future that it is crucial to have a wide variety of different approaches which produces healthy competition and co-evolution between the projects. In this respect, Hadoop and Disco can be seen as complementary projects, similar to Apache, Lighttpd and Nginx.

It is a matter of taste whether Erlang and Python are more suitable for the task than Java. We feel much more productive with Python than with Java. We also feel that Erlang is a perfect match for the Disco core that needs to handle tens of thousands of tasks in parallel.

Thanks to Erlang, the Disco core remarkably compact, currently less than 2000 lines of code. It is relatively easy to understand how the core works, and start experimenting with it or adapt it to new environments. Thanks to Python, it is easy to add new features around the core which ensures that Disco can respond quickly to real-world needs.”

The Disco tutorial uses the standard word counting task to show how to set up and use Disco on both a local cluster and Amazon EC2. There is also homedisco, which lets programmers develop, debug, profile and test Disco functions on one local machine before running on a cluster. The word counting example from the tutorial is certainly nicely compact:

from disco.core import Disco, result_iterator

# Map: emit a (word, 1) pair for every word in the input entry.
def fun_map(e, params):
    return [(w, 1) for w in e.split()]

# Reduce: sum the counts for each word and write the totals to the output.
def fun_reduce(iter, out, params):
    s = {}
    for w, f in iter:
        s[w] = s.get(w, 0) + int(f)
    for w, f in s.iteritems():
        out.add(w, f)

# Submit the job to the local Disco master and wait for it to finish.
results = Disco("disco://localhost").new_job(
    name = "wordcount",
    input = ["http://discoproject.org/chekhov.txt"],
    map = fun_map,
    reduce = fun_reduce).wait()

for word, frequency in result_iterator(results):
    print word, frequency
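
If you don’t have a Disco cluster handy, the same map and reduce logic can be tried in plain Python. This is only a rough single-process sketch of the idea, not how Disco schedules or distributes work; `run_job` and the two sample lines are made up for illustration, and the function signatures are simplified compared with the tutorial’s.

```python
from collections import defaultdict

def fun_map(line):
    # Emit a (word, 1) pair for every word in the line.
    return [(w, 1) for w in line.split()]

def fun_reduce(pairs):
    # Sum the counts emitted for each word.
    counts = defaultdict(int)
    for w, f in pairs:
        counts[w] += int(f)
    return dict(counts)

def run_job(lines):
    # Hypothetical stand-in for the framework: map every input line,
    # collect all emitted pairs, then reduce them in one place.
    pairs = []
    for line in lines:
        pairs.extend(fun_map(line))
    return fun_reduce(pairs)

counts = run_job(["the cat sat", "the dog sat"])
for word in sorted(counts):
    print(word, counts[word])
```

In a real framework the map calls would run in parallel on different nodes and the pairs would be partitioned before reduction; here everything happens in one loop so the data flow is easy to follow.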

English

Library of Congress forces LCSH Linked Data site to shut down

December 21st, 2008

Back in May I was among those who welcomed the initiative by Talking with Talis interviewee Ed Summers in setting up lcsh.info.  Ed set up the site to demonstrate how the Library of Congress Subject Headings could be represented as a Semantic Web application using SKOS.

In the intervening months many, including myself, used Ed’s work as a pointer to how useful publicly available data could, with the use of open Linked Data principles, become a valuable part of sites and services across the globe.   For instance, another Talking with Talis interviewee, Martin Malmsten from the Royal Library of Sweden, almost immediately made use of the links to the LCSH data.  Ed went on to get lots of feedback, and wrote a paper which he then presented at DC2008.
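
For readers who haven’t met SKOS, here is a rough sketch of what representing a subject heading as a SKOS concept can look like. It is not Ed’s code and it does not use real lcsh.info URIs: the `example.org` base, the identifiers and the `concept_triples` helper are all made up for illustration. The only real vocabulary here is the SKOS namespace with its `Concept`, `prefLabel` and `broader` terms.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/subjects/"  # hypothetical base URI, not lcsh.info

def concept_triples(concept_id, label, broader_id=None):
    """Return N-Triples lines describing one subject heading as a SKOS concept.

    Assumes the label contains no quotes or backslashes (no escaping is done).
    """
    uri = BASE + concept_id
    lines = [
        f"<{uri}> <{RDF_TYPE}> <{SKOS}Concept> .",
        f'<{uri}> <{SKOS}prefLabel> "{label}"@en .',
    ]
    if broader_id is not None:
        # Link to the broader heading by URI, so other data can point at it.
        lines.append(f"<{uri}> <{SKOS}broader> <{BASE + broader_id}> .")
    return lines

# Made-up identifiers for two headings, one broader than the other.
lines = concept_triples("sh0001", "World Wide Web", broader_id="sh0000")
print("\n".join(lines))
```

The point of giving each heading its own URI is that anyone else’s data can link to it, which is what the Linked Data uses mentioned above rely on.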

It is therefore with great disappointment that I read this on the lcsh.info site the other day:

On December 18th I was asked to shut off lcsh.info by the Library of Congress. As an LC employee I really did not have much choice other than to comply.

As an LC employee he was put in an untenable position when they evidently decided that they didn’t like this useful service, based on publicly available data, being delivered from a domain that doesn’t end in loc.gov.  I wonder if there are any other Linked Data enthusiasts, not held back by who their employer is, who would pick up where he left off?

Ed goes on to say:

It was always my intention for concept URIs at lcsh.info to be cool. I advertised the service as ‘experimental’ and indicated it was going to hopefully inform the development of a similar continually updated service at LC where I work. …  My thought was I could leave the service running until there was something similar at LC that I could redirect the concept URIs to. After a year or two when people had rewritten there data to point at loc.gov I could retire lcsh.info. I never imagined I would be asked by LC to take it down.

LOC should have listened to Ed in the first place and taken the high ground by leading the work of creating a semantic web of data with their valuable publicly available data.  At the end of his post Ed hints that LC is still considering running a service like lcsh.info at loc.gov, but it’s not there yet.  Why, oh why, did they not learn from his work and ride the wave by introducing their own service based on his great initiative?  Instead they present to the world a short-termist, not-invented-here attitude that reminds me of other well-established leviathans of the world of library metadata.

Let’s hope that Ed’s hint is correct and we will soon be able to welcome the release of Open Linked LCSH and other Data from the electronic portals of the LofC.

Traffic Squad Police (LOC) image published in The Library of Congress’ photostream on Flickr

English

New RIF specification releases

December 20th, 2008

The W3C Rule Interchange Format (RIF) Working Group published five new Working Drafts today. Since the Last Call Working Draft of RIF Basic Logic Dialect (BLD), the group has been developing other key dialects, components, and test cases. The Working Group is nearing Last Call on these remaining elements of RIF, and welcomes feedback from rule system users and designers.

English

CoMET: Web 3.0 is the web of people

December 19th, 2008

Taking several photographs, tagging them with semantic markup (the sky, the people, the streets, the houses), and merging them intelligently to create a complete scene is one of the things I recently saw at Adobe MAX 2008 Milan - Sneak Peeks.

Shai Avidan, who is also responsible at Adobe for another impressive photo-composition tool, “content-aware scaling”, which, as we saw, Photoshop CS4 already includes, is the person behind it.

I mention it only as an example of a practical application of semantic technologies, complementary to current ones, that will improve today’s web.

The basic idea behind both that application and this one is the same: the Internet of people that I tried to convey in a recent workshop on web 3.0, expressed by Simon Bergweiller, a colleague of Matthieu Deru in creating the CoMET project at the Advanced Tangible Interface Lab (German Research Center for Artificial Intelligence), in the following terms:

“Complex operations should be hidden. Simple user gestures are what allow interaction with the complexity of Web 3.0 databases, without any need to know the complex underlying programming languages.”

In other words, the idea is that things on the internet, described more completely, can be understood by machines: as in the previous example, the artificial intelligence produced by ever more powerful computers and networks can bring important advances in usability and in the user experience.

Their kiosk, or shared interactive space, would be a good example of this:

CoMET is a new creative exchange terminal that allows interaction with semantically annotated objects.

In short, it is an experimental virtual kiosk based on an iPhone and a pointer that lets you drag icons across the touch screen.

The MP3s it contains are “things”: objects translated into machine language (ID3 tags) with information about the album and the artist. Drawing a circle around each one returns an automatic list of songs, sorted by genre, item, or artist.
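
To give a feel for why tagged objects are easy for a machine to work with, here is a tiny Python sketch of grouping songs by one of their tags. It has nothing to do with CoMET’s actual implementation; the track data and the `group_by_tag` helper are invented, and real ID3 tags would be read from the MP3 files rather than typed in by hand.

```python
from collections import defaultdict

# Made-up, ID3-style metadata for three tracks.
tracks = [
    {"title": "Song B", "artist": "Artist 1", "album": "Album X", "genre": "Jazz"},
    {"title": "Song A", "artist": "Artist 2", "album": "Album Y", "genre": "Jazz"},
    {"title": "Song C", "artist": "Artist 1", "album": "Album X", "genre": "Rock"},
]

def group_by_tag(tracks, tag):
    """Group track titles by the value of one tag, with groups and titles sorted."""
    groups = defaultdict(list)
    for track in tracks:
        groups[track.get(tag, "Unknown")].append(track["title"])
    return {value: sorted(titles) for value, titles in sorted(groups.items())}

by_genre = group_by_tag(tracks, "genre")
by_artist = group_by_tag(tracks, "artist")
print(by_genre)
print(by_artist)
```

Because every object carries the same labeled fields, switching the criterion from genre to artist is just a change of key, which is the kind of simple gesture the quote above has in mind.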

Near the screen, several “spotlets”, intelligent agents that allow different kinds of interaction with the objects, can play MP3s or search YouTube for related videos under the same criterion.

We will soon see the web version of the application, as well as the extension of the tool’s interaction modes to spoken instructions (voice recognition).

Imagine the implications for home entertainment environments… being able to talk to the television to get multimedia objects related to what we are watching…

With just a little more imagination we can extend the example to the Internet of Things: the components of any mechanism (a car, for example) could be fitted with RFID antennas that hold product information and interact with the computer or “kiosk” to provide us with technical details…

Source: Making sense of the ‘semantic Web’

Spanish

Our Approach to Modeling, Fidelity, and KR

December 19th, 2008

For some people, the point of the Semantic Web is distributed, web-friendly knowledge management and knowledge representation. Generally we’re in that camp. But that camp breaks down into several factions, and it’s useful to be clear about which faction we’re in.

There is a spectrum that runs from Maximum Fidelity to Maximum Scalability. Given our roots in Description Logic, we lie somewhere in-between these two poles. Notice that I have intentionally avoided calling these “extremes”; they are endpoints, and perfectly respectable, useful ones, depending on who you are and what you’re trying to achieve.

The Max Fidelity folks want to model some world-chunk in as fine-grained and faithful a manner as possible. This often means that they are at least first order logic fans, and sometimes higher-order logic users. They debate edge cases, corner cases, alternate and competing semantics and logics in an attempt to ever more faithfully mirror reality. The price they pay is, generally, computability. For some use cases, that price is perfectly acceptable. For other use cases, that price is entirely too high, since the most perfect representation of the world is useless if you can’t practically compute with it; at least, that’s how Max Fidelity often looks to us.

At the far end of the spectrum we have Max Scalability folks, for whom the point of the Semantic Web is rather more the “Web” than the “Semantic” part; we might playfully call them the “semantic WEB” crowd, in order to reflect their ideal ratio. Here the point isn’t to model perfectly; but, rather, to do something with lots and lots of data, ideally Webfuls of data. This means, in the argot of current tech choices, that they tend to be RDF and Linked Data fans and users, since that’s just about the only approach to doing anything at all interesting with Webfuls of data. The price they pay, of course, is expressivity. For some use cases, that’s just fine, since you don’t always need a lot or even much semantic fidelity to get the job done. Sometimes we build applications for customers that take this approach. But, as above, for other use cases, this is simply a killer, because without enough or the right semantics, you don’t get the right kind of help from the machine in figuring out complex stuff.

So what do we have so far? First, we have a notional (and idealized) spectrum that runs from Webfuls of data to, roughly, at least first order logic. Second, we have obviously tons of interesting use cases at (probably) every point along this spectrum. And, third, we have the suggestion that we aim for some kind of sweet spot in the middle—where “sweet spot” and “in the middle” are not absolute notions, but are interest-relative and goal-specific, and where the interests and goals we care about are, surprise-surprise, ours.

(In other words, I’ve set up a little fantasy where we are the Heroes, where we naturally occupy the “sweet spot”, but then, since I’m not a complete jerk, I’ve ironized or called into question that very fantasy in an effort to suggest that we, just like everyone else, try to spin things to make ourselves look smart, cool, and useful.)

And—will miracles never cease?—that’s just about where Description Logic fits along such an idealized spectrum. Technically, it’s the decidable subset of first order logic, which means that we try to balance Fidelity and Scalability in a way where we can get some of both.

The Max Fidelity folks are forever poking us with sticks to the effect that we can’t model world-chunks nearly as faithfully as they can. Well, no crap, of course we can’t! Then the Max Scalability folks poke us with different sticks to the effect that we can’t scale to Webfuls of data—again, no duh!

And then we poke back at both camps—hey, they started it!—to the effect that we can model far better than Max Scalers and we can scale far further than Max Fideliters (yes, I just made that word up…Rock!)...

Finally, a word about how this positioning issue plays out in our approach to modeling. In short, we model such that we get the right inferences, since getting the inferences is typically what our kind of applications (analysis and decision support apps, in short) are all about. So that means some edge or corner cases, even if they fit into DL, get ignored or dropped out or even distorted when there’s no point, given requirements analysis, to fidelity for its own sake. And it means, on the flip side, that we don’t worry too much that inference over Webfuls of data is not realistically achievable anytime soon. Fast enough for the customer’s data is sufficient scalability in most cases for us.

English

Why Reasoning Matters: Motivations

December 19th, 2008

The perceived utility of automated reasoning for a wide range of applications matters to us greatly, which makes sense, given that our biz proposition is “semantic infrastructure OEM”. In other words, we’re trying to make money by licensing reasoning infrastructure, and related pieces, for semantic applications to other developers to use in their apps. With the right APIs and tool maturity, as well as supporting materials, our customers should be able to treat automated reasoning as a black box—not a black art.

A problem with demonstrating automated reasoning’s utility is that automated reasoning is complex, with non-trivial logical background and framework, including oodles of domain-specific vocabulary. Another problem is that automated reasoning is, in the end, just a kind of mechanical term rewriting often according to, considered individually, quite trivial rules. (Pellet isn’t really a rules engine, but we’ll talk about that another time.)

That means that for toy cases, which is what most people new to the subject are ready for, it seems dull and unimpressive. And for the hard cases? Well, most people aren’t ready for hard cases, so they simply tune out. And who can blame them, really? It’s like my example about Emma and Jack. I mean, that example really sucked, but what’s the alternative?

This is not an easy problem to solve.

My approach, rather than showing more toy or real examples, is just to talk about the utility of automated reasoning in plain language, in an attempt to communicate not so much specific details as the general mindset or approach to solving particular sorts of problems using automated reasoning. This approach to marketing mirrors our approach to technology development: both are iterative and experimental, but not just for us. As the man said, even a blind pig occasionally finds an acorn.

English

APQC in Houston

December 19th, 2008

I don’t have slides for my time at the APQC in Houston because I was not slated to present, so there is no cool slide widget with my presentation in this post. I was merely there to observe and learn, and maybe answer some questions about POPS.

As Kendall mentioned, POPS was nominated as a best practice as part of NASA JPL’s overall efforts in Knowledge Management. The meeting at APQC was for all the nominees to give a short talk and to hear the overall findings of the study conducted by APQC, which in this case was on Expertise Location and Social Networking.

I got to see some great presentations by folks from IBM, Sun, Pratt & Whitney, Rockwell Collins, and Mitre, and got a lot of insight into what they’re doing with Expertise Location and Social Networking: challenges they faced in the past, lessons learned, and what they’re doing now and in the future to continue their efforts in these areas.

It was a great experience. The people from APQC were fantastic and very friendly, and they put on a great event. All the nominees and study partners, a group which included L3, Marathon Oil, ExxonMobil, and Wyeth, were great and added a lot to the discussions.

Hopefully I get the chance to participate or work with APQC in the future.

English

Eigenfactor.org measures and visualizes journal impact

December 18th, 2008

eigenfactor.org is a fascinating site that is exploring new ways to measure and visualize the importance of journals to scientific communities. The site is a result of work by the Bergstrom lab in the Department of Biology at the University of Washington. The project defines two metrics for scientific journals based on a PageRank-like algorithm applied to citation graphs.

“A journal’s Eigenfactor score is our measure of the journal’s total importance to the scientific community. With all else equal, a journal’s Eigenfactor score doubles when it doubles in size. Thus a very large journal such as the Journal of Biological Chemistry which publishes more than 6,000 articles annually, will have extremely high Eigenfactor scores simply based upon its size. Eigenfactor scores are scaled so that the sum of the Eigenfactor scores of all journals listed in Thomson’s Journal Citation Reports (JCR) is 100.

A journal’s Article Influence score is a measure of the average influence of each of its articles over the first five years after publication. Article Influence measures the average influence, per article, of the papers in a journal. As such, it is comparable to Thomson Scientific’s widely-used Impact Factor. Article Influence scores are normalized so that the mean article in the entire Thomson Journal Citation Reports (JCR) database has an article influence of 1.00.”
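To make the two normalizations in that description concrete, here is a minimal sketch of a PageRank-style score on a toy citation graph. This is not the Eigenfactor project's actual computation: the journal names, citation counts, article counts, damping value, and both function names are made up for illustration, and the real method involves details not covered here.

```python
def journal_scores(citations, damping=0.85, iterations=100):
    """PageRank-style scores scaled to sum to 100.

    citations[a][b] is the (hypothetical) number of citations
    from journal a to journal b.
    """
    journals = sorted(set(citations) | {t for row in citations.values() for t in row})
    n = len(journals)
    rank = {j: 1.0 / n for j in journals}
    for _ in range(iterations):
        # Every journal starts each round with the "random jump" share.
        new_rank = {j: (1.0 - damping) / n for j in journals}
        for src in journals:
            row = citations.get(src, {})
            total = sum(row.values())
            if total == 0:
                # A journal that cites nothing spreads its weight evenly.
                for j in journals:
                    new_rank[j] += damping * rank[src] / n
            else:
                for target, count in row.items():
                    new_rank[target] += damping * rank[src] * count / total
        rank = new_rank
    scale = 100.0 / sum(rank.values())
    return {j: r * scale for j, r in rank.items()}


def per_article_influence(scores, article_counts):
    """Per-article scores normalized so the mean article has influence 1.0."""
    mean_per_article = sum(scores.values()) / sum(article_counts.values())
    return {j: (scores[j] / article_counts[j]) / mean_per_article for j in scores}


# Toy data, purely illustrative.
citations = {"A": {"B": 3, "C": 1}, "B": {"A": 2}, "C": {"A": 1, "B": 1}}
article_counts = {"A": 100, "B": 50, "C": 10}

scores = journal_scores(citations)
influence = per_article_influence(scores, article_counts)
```

In this sketch the total score grows with a journal's share of incoming citation weight, which is why a very large journal can score highly on size alone, while dividing by article count gives a per-article figure that can be compared across journals of different sizes.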

For example, here are the ISI-indexed journals in the AI subject category ranked by the Article Influence score for 2006.

The site makes good use of Google Docs’ motion charts to visualize the changes of metrics for top journals in a subject area. You can also interactively explore maps that show the influence of different subject categories on one another as estimated from journal citations.

Map of Science

The details of the approach and algorithms are available in various papers by Bergstrom and his colleagues, such as

M. Rosvall and C. T. Bergstrom, Maps of random walks on complex networks reveal community structure, Proceedings of the National Academy of Sciences USA. 105:1118-1123. Also arXiv physics.soc-ph/0707.0609v3 [PDF]

(spotted on Steve Hsu’s blog)

English

Journal requires authors to include Wikipedia article with submissions

December 18th, 2008

Scientific journals are undergoing rapid evolution as they adapt to the Web and various forms of social media. As reported by Nature (Publish in Wikipedia or perish) and in ReadWriteWeb, the journal RNA Biology is experimenting with a connection to Wikipedia. Articles submitted for publication about new RNA molecules must also include a draft Wikipedia page that summarizes the work. The journal will then peer review the page before publishing it in Wikipedia.

Here are the guidelines from the RNA Biology site:

“To be eligible for publication the Supplementary Material must contain: (1) a link to a Wikipedia article preferably in a User’s space. Upon acceptance this can easily be moved into Wikipedia itself together with a reference to the published article.

At least one stub article (essentially an extended abstract) for the paper should be added to either an author’s userspace at Wikipedia (preferred route) or added directly to the main Wikipedia space (be sure to add literature references to avoid speedy deletion). This article will be reviewed alongside the manuscript and may require revision before acceptance. Upon acceptance the former articles can easily be exported to the main Wikipedia space. See below for guidelines on how to do this. Existing articles can be updated in accordance with the latest published results.”

This is definitely an interesting and forward-looking idea. Yet I cannot help having the cynical thought that it’s also a great way for the journal to boost its PageRank.

English

Calling All RDF Dumps

December 18th, 2008

Today on the Linking Open Data mailing list, Kingsley Idehen of OpenLink Software announced that he is preparing to load the entire LOD cloud into Virtuoso 6.0 Cluster Edition. The datasets are being added to a table on the ESW wiki, making it convenient for anyone doing Semantic Web research to get a hold of the datasets. Once all the datasets are added we should have a better idea of how much linked data there really is out there. This may also raise the bar for other triple stores and force them to develop methods for storing several billion triples.

Here are his instructions for adding your dataset to the table:

  • Go to: http://esw.w3.org/topic/DataSetRDFDumps
  • Add your data set to the table (if it isn't already listed) or correct erroneous entries
  • Add a URL entry to the "Archive URL" column
  • Add a Publisher URI to the "Publisher / Maintainer" column (used for the construction of Attribution Triples)

If you don't have a URI for yourself, you can get one by registering.


English

UMBC to host FIRST Lego League Maryland state championship

December 18th, 2008

UMBC will again host the 2008-09 FIRST Lego League Maryland State Championship on January 31, 2009. FIRST Lego League (FLL) is an international competition for elementary and middle school students that is run by the FIRST organization with support from Lego. FLL teams use Lego Mindstorms kits to build small autonomous robots, with a limited number of sensors and motors, that compete to perform predefined challenge tasks.

“Guided by adult mentors and their own imaginations, FLL students solve real-world engineering challenges, develop important life skills, and learn to make positive contributions to society. FLL provides students age 9-14 with an opportunity to challenge their math and science skills in an internationally recognized competitive environment. FLL combines a hands-on, interactive robotics program with a sports-like atmosphere. Teams of up to 10 players focus on team building, problem solving, creativity, and analytical thinking to develop a well thought out solution to a problem currently facing the world - the Challenge.”

UMBC’s FLL activities are led by Mechanical Engineering Professor Anne Spence.

English

Report Announced from Workshop on Semantic Web in the Oil & Gas industry

December 18th, 2008
Today W3C published a report on the W3C Workshop on Semantic Web in Oil & Gas Industry. 54 experts from 33 organizations discussed how Semantic Web technologies can help to handle the staggering amount of new data that is produced every day, as well as the challenges of interfacing to service companies and managing joint ventures between operators that are very important in this industry. Participants discussed issues related to data integration and ontology management and creation, and presented applications and tool developments in the oil & gas area. The Workshop concluded with a panel that explored the next steps that this community may take, possibly in conjunction with W3C, to explore this area further. W3C thanks Chevron for hosting the Workshop, which took place in Houston, Texas, USA, on 9 and 10 December 2008. The report also has links to the 17 accepted position papers and (most of) the presentation slides.

English