Archive

Posts Tagged ‘United States’

Google Book Search pays authors $125M and opens up access to books in the US

October 28th, 2008

Since 2005 Google have been in negotiations over US lawsuits brought by a group of authors and publishers, along with the Authors Guild and the Association of American Publishers (AAP), around copyright issues.

Today Google have announced an agreement with AAP that brings the lawsuits to a close and will result in the establishment of the Book Rights Registry.  To quote their New chapter for Google Book Search blog post:

Google is also funding the establishment of a Book Rights Registry, managed by authors and publishers, that will work to locate and represent copyright holders. We think the Registry will help address the "orphan" works problem for books in the U.S., making it easier for people who want to use older books. Since the Book Rights Registry will also be responsible for distributing the money Google collects to authors and publishers, there will be a strong incentive for rightsholders to come forward and claim their works.

The money they refer to will come from a new feature they plan to introduce, as they explain:

…in addition to being able to find and preview books more easily, users will also be able to read them. And when people read them, authors and publishers of in-copyright works will be compensated. If a reader in the U.S. finds an in-copyright book through Google Book Search, he or she will be able to pay to see the entire book online. Also, academic, library, corporate and government organizations will be able to purchase institutional subscriptions to make these books available to their members. For out-of-print books that in most cases do not have a commercial market, this opens a new revenue opportunity that didn’t exist before.

In this announcement Google also recognise the value of books to libraries, and obviously the value the Book Search service has gained from partnering with some of them:

In addition to expanding the commercial market for these books, Google, the authors and the publishers have worked hard with our library partners at Stanford, the University of Michigan, the University of California and the University of Wisconsin-Madison to ensure this agreement advances libraries’ efforts to preserve, maintain and provide access to books for students, researchers and readers. The agreement gives public and university libraries across the U.S. free, full-text viewing of books at a designated computer in each of their facilities. That means local libraries across the U.S. will be able to offer their patrons access to the incredible collections of our library partners — a huge benefit to the public.

So what does this mean to the public? If you are not in the US, very little at the moment, although they hint at intentions to spread this to ‘other countries’.

In libraries inside US borders, there will be a computer [I wonder how many users will be allowed to log on to it at once] providing access to a massive collection which was not available before.

The US public, inside and outside the library walls, will be able to find and preview books more easily, and then be able to read them.

If a reader in the U.S. finds an in-copyright book through Google Book Search, he or she will be able to pay to see the entire book online. Also, academic, library, corporate and government organizations will be able to purchase institutional subscriptions to make these books available to their members. For out-of-print books that in most cases do not have a commercial market, this opens a new revenue opportunity that didn’t exist before.

As this comes out of a legal settlement with authors and publishers, they can be forgiven for the emphasis on new revenue opportunities.

For me the big story behind this is that Google have started to complete the links in the search-for-it-discover-it-get-it chain for books in the same way that we are accustomed to for web pages.  It is all about getting to the information, and the previous century’s legal frameworks have been getting in the way of what has been technically possible for ages.  Google’s weight is starting to sweep away some of those restrictions.

The publishers are probably very pleased with their agreement [see AAP’s FAQs], but this could easily be one of the early significant steps in disintermediating their role in getting authors’ works to their readers.  Eventually, as Google becomes the de facto route to find and read, who needs publishers? We are already starting to see music being distributed with very different models that often don’t include the traditional music publishers.

I must stop hypothesising too much as the agreement behind this has yet to be finalised, and then we need to see how the details are fleshed out.  Nevertheless I think the word ‘significant’ can most definitely be associated with this announcement.


Did you try ‘my hakia’: Your Personalized Search Interface?

October 27th, 2008

We would like to tell you more about a new hakia feature - “my hakia.” Think of “my hakia” as your newspaper where you can be as specific as you like about what you would like to read!

The “my hakia” front-page personalization interface allows you to obtain search results from areas such as emerging news and dynamic content. Additionally, the service acts as a unique news monitoring dashboard that can be customized by queries. You can pose a question and track related news, such as “Should Microsoft acquire Yahoo!?” or “Why should I vote for McCain?”

Enhanced by many other options to monitor dynamic content, what we are offering is a glimpse of what the future of the Web will be, where users pull the highly customized content they need, rather than having something general (like an area of interest such as politics) pushed to them.

The custom search options available on “My hakia” include:

News search – Search results from emerging news articles for user specified queries
News headlines – Headlines for user specified region (US)
PubMed search – Emerging PubMed articles per user specified queries
Wikipedia – Wikipedia articles per user specified queries
Other – Several other items ranging from weather forecasts to YouTube videos to cartoons exclusive to hakia

To create your “my hakia” front page, please visit http://my.hakia.com.


Library Journal to recognise Movers & Shakers beyond their backyard

October 23rd, 2008

I note with interest that Library Journal’s annual Movers & Shakers nomination process has been broadened to include both international and vendor nominations.  Previous years have seen this process limited to those residing in the USA & Canada.

It is great to see recognition, in a Journal with global reach, that North America doesn’t have the monopoly on innovation, drive and passion for libraries.

A hat tip to the folks at Library Journal, for planning to include the wider library constituency in their "50-plus up-and-coming individuals" (that is the quantity of individuals, not the qualifying age!).

Now it is up to the rest of the world to rise to the challenge and nominate the folks that we all know fit into that Movers & Shakers category.

Enter your nominations on the Library Journal site, before the November 10th deadline.


XACML Policy Management (XPM): An Overture

October 16th, 2008

I hereby return from my long blogging hiatus!

The cheers, they shake my heart.

Instead of finishing my old SWRL series, I’m going to start a brand, spanking new series on policy management.
I do recommend you go back and read the older posts on policy management, especially Kendall’s musings and Markus’s report.

Policies In General


A policy is a kind of rule which governs behavior. If you adhere to the requirements of a policy, then you conform with or adhere to the policy. A policy can present positive requirements (i.e., obligations; things you must do) or negative requirements (i.e., forbiddens; things you must not do) or possibilities (i.e., things you may or may not do, according to your own judgments and needs). Policies are not “active” rules, that is, they don’t generate behavior. Instead they are a check on behavior: a constraint on the possible space of behaviors.

(We can also have, in the realm of the permitted, target goals, e.g., “keep discretionary spending under $1000/person-year”. Conformance, in this case, becomes a potentially complex optimization problem.)

So, it may be corporate policy that no one can spend more than $5000 without the sign-off of at least one other person at a comparable level (for a sanity check) plus two weeks’ notification of accounting (so they can manage the cash flow appropriately).
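To make the "check on behavior" reading concrete, here is a minimal sketch of that spending policy as a passive test on a proposed purchase. The `SpendRequest` record, its field names, and the use of an integer grade as a stand-in for "comparable level" (same grade or higher) are all illustrative assumptions, not part of any real system.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional, Tuple

SPEND_LIMIT = 5000                    # dollars, from the example policy
NOTICE_PERIOD = timedelta(weeks=2)    # advance notice owed to accounting


@dataclass
class SpendRequest:
    """Hypothetical record of a proposed purchase."""
    requester: str
    requester_level: int              # assumed stand-in for "comparable level"
    amount: float
    spend_date: date
    accounting_notified_on: Optional[date] = None
    sign_offs: List[Tuple[str, int]] = field(default_factory=list)  # (name, level)


def violations(req: SpendRequest) -> List[str]:
    """List the requirements a request fails; an empty list means it conforms."""
    if req.amount <= SPEND_LIMIT:
        return []                     # the policy only constrains larger spends
    problems = []
    has_peer = any(
        name != req.requester and level >= req.requester_level
        for name, level in req.sign_offs
    )
    if not has_peer:
        problems.append("needs sign-off from another person at a comparable level")
    notified = req.accounting_notified_on
    if notified is None or req.spend_date - notified < NOTICE_PERIOD:
        problems.append("accounting needs two weeks' notice")
    return problems
```

Note that `violations` never initiates a purchase or a sign-off; it only reports whether a proposed behavior falls inside the permitted space.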

Notice that there are different levels here. We have some high level goals (to spend money wisely or not to bounce checks but also not to have money sitting idle in the checking account) but we also have what might be termed implementation details (e.g., getting a sanity check from a peer). The implementation details may vary widely while the goals remain fixed. It is, of course, possible for the goals to vary as the implementation stays fixed! (Since the same implementation may meet many goals.)

Whether some rule is an implementation or a goal often depends on the context. For example, the overall goal is probably to be fiscally responsible. When we evaluate the success or correctness of a policy, we do so in light of higher goals.

RBAC


The kind of behavior we’re concerned with modeling can be reduced to various sorts of access (thus, our policies aim to control access). Essentially, we need to determine what actors have “performative” access to which objects (aka, resources). If we have a blueberry pie, for example, we want to ensure that only the right sorts of people (i.e., those whose first name begins with “B” and can be forced into rhyming with “Dijon”) have access to the pie. If we are willing to grant access to the appearance of the pie (but not to the taste) we might put it into a cage. If we also want to restrict smell access, we’d put it in a tupperware container. If we want to grant people like me eat access, we’d give me a key to the cage.

If the cage is too weak, or the lock easily picked, or the bars of the cage wide enough to let the tupperwared pies slip through, then the wrong sorts of people (e.g., those whose first name begins with “K” and rhymes with “End-all”) will be able to get at—and eat—my pie.

There are many sorts of general models of access control, but the sort we’re concerned with is role based access control (RBAC), as that is the basic model behind XACML and is pretty popular anyway. In RBAC we do not associate access permissions with actors directly, but with different roles an actor might have.
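As a rough illustration of that indirection, the sketch below attaches pie permissions to roles and only then assigns roles to actors. The role names, actor names, and the `is_permitted` helper are made up for this example.

```python
# Permissions hang off roles, never directly off actors.
ROLE_PERMISSIONS = {
    "viewer": {("see", "pie")},
    "sniffer": {("see", "pie"), ("smell", "pie")},
    "eater": {("see", "pie"), ("smell", "pie"), ("eat", "pie")},
}

# Actors are only ever given roles (hypothetical actors).
ACTOR_ROLES = {
    "alice": {"eater"},
    "bob": {"viewer"},
}


def is_permitted(actor, action, resource):
    """True if any role held by the actor grants the action on the resource."""
    return any(
        (action, resource) in ROLE_PERMISSIONS.get(role, set())
        for role in ACTOR_ROLES.get(actor, set())
    )


print(is_permitted("alice", "eat", "pie"))  # True
print(is_permitted("bob", "eat", "pie"))    # False
```

Handing out a key to the cage then amounts to adding the "eater" role to an actor; the permission table itself does not change.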

Deployment vs. Development (and Auditing)


As I’ve written before, it is important to distinguish deployment time and development time, especially when dealing with analysis services that do a lot of work and, thus, are computationally uncertain. This is especially true for policy management. In XACML, there has been a lot of attention on runtime behavior (e.g., Policy Decision Points (PDPs) and Policy Enforcement Points (PEPs)). It is tempting to think that we should add intelligence to PDPs which are, after all, decision points. There are two problems with this strategy, 1) PDPs tend to be time critical and high volume and 2) complexity in a PDP is a vulnerability. Even if one did wish the PDP to be smart, one would still need tools to build up confidence that the PDP wasn’t too smart for its own (and your organization’s) good. Thus, we are going to focus on development time services, i.e., how can we better support the engineering, maintenance, and auditing of complex sets of RBAC policies.
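The split can be sketched roughly as follows: a deliberately dumb decision point, an enforcement point that defers to it, and a separate development-time helper that enumerates every decision for a human to review. The rule table and function names are assumptions for illustration, and real XACML decisions also include NotApplicable and Indeterminate, which this toy omits.

```python
from itertools import product

# Assumed toy policy: the (role, action, resource) triples that are permitted.
PERMIT_RULES = {
    ("doctor", "read", "chart"),
    ("doctor", "write", "chart"),
    ("clerk", "read", "billing"),
}


def pdp_decide(role, action, resource):
    """Policy Decision Point: a plain lookup, deny by default."""
    return "Permit" if (role, action, resource) in PERMIT_RULES else "Deny"


def pep_access(role, action, resource, store):
    """Policy Enforcement Point: ask the PDP, then enforce its answer."""
    if pdp_decide(role, action, resource) != "Permit":
        raise PermissionError(f"{role} may not {action} {resource}")
    return store[resource]


def audit_table(roles, actions, resources):
    """Development-time helper: every possible decision, for offline review."""
    return {
        (r, a, x): pdp_decide(r, a, x)
        for r, a, x in product(roles, actions, resources)
    }
```

The runtime path stays a cheap lookup, while anything clever (here, just an exhaustive enumeration) happens away from the time-critical decision point.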

What are we trying to improve?


Fundamentally, people (developers, testers, and policy setters) need to understand their policy regime and to understand it in a way that gives them clear control over their behavior. If you don’t understand your set of policies, including whether the policies meet your goals, then in addition to suffering from a fair bit of anxiety, you may open yourself up for serious legal sanction. The quotidian costs can be considerable as well: every change requires extensive testing. Change, even to improve matters, is a hostile force.

Given the high stakes involved with access control, it is surprising how little attention is paid to producing tools and services that support more effective development and auditing of policy regimes. Policy languages, such as XACML, have already moved toward the more declarative, so there’s at least some hope for reasoning sensibly about XACML-based policies. However, XACML is, itself, rather complex and opaque. It’s pretty clear that people are going to have a terrible time analyzing even small XACML policy sets.

This is exactly where our experience with OWL kicks in. We already know, to a fair degree, how to build support services to help people work with and understand large, complex information structures (we just call them “ontologies”).

The key is to have the computer do the tedious aspects of reasoning and make the results of its analysis salient to a human decision maker. Interestingly, it’s generally not enough to provide a useful service, or even an amazingly useful service; you need to provide a usable service as well. This means that you must take into account existing (or related) practices. Which we do!

HIPAA, our running example


In order to make all this concrete, Markus and I have been working on a realistic example of a set of patient information access policies for a doctor’s office/hospital. Health care information is something which is intrinsically the kind of thing you want to keep private (“Whoa, your prostate is HOW big?! And, btw, we didn’t need to hack the hospital computers to know you have hair plugs. Dude. Everybody knows…”) but you also want the right people (insurance agents, the sane doctor, etc.) to know it at the right time. Plus, your information can be of high societal value by contributing to medical research. In the US, there’s a very big and complex law, the Health Insurance Portability and Accountability Act (HIPAA) which, effectively, mandates that health care providers have pretty strong mechanisms to ensure only the appropriate access to your health data. If you blow that, you are in for some serious fines.

In my next post, I’ll delve a little deeper into the example.


Brooke Aker talks with Talis about Expert System and the Semantic Web

October 10th, 2008

In our latest podcast I talk to Brooke Aker, CEO of Expert System in the United States. We talk about Expert System and Brooke’s views on the utility of semantic technologies.

During the conversation, we refer to the following resources;

This conversation was conducted using Skype on Thursday 2 October, recorded with Ecamm Network’s Call Recorder for Skype, and edited on a Mac with GarageBand.

For other Talis podcasts in this Nodalities series, see here. To subscribe to updates from all of Talis’ podcast series, see here.


A Week in Den Haag

September 29th, 2008

I was in the Netherlands for a week on business: a few days at a NATO “semantic interoperability” workshop and then a day meeting with some of our new Dutch research partners—more about that (and them) in the near future.

Thoughts and observations, more personal than professional:

  1. Nearly everyone I spoke with for more than a few minutes referred to the financial shockwaves in the US, often with more than a little schadenfreude. But I can’t blame them.
  2. The Dutch, like the Danes, are courteous and helpful as a rule. I seem to meet more Dutch who speak English than Danes, though, so that biases my perceptions of the two a bit. I’ve also spent more time in NL than DK (6 weeks versus 2). Neither are as effusively friendly and, well, warm as the Spanish or Italians in my experience.
  3. Heineken in public, casually, on trains, walking down the street—that’s cool and very Dutch, but not my thing. I don’t think I could ever live anywhere, including the Netherlands, long enough to drink a beer on a train.
  4. My wife, a graphic designer, is quite well-traveled, speaks Spanish and Italian, and is worldly-wise; but she’s still never been to the Netherlands and is envious of my trips largely because of the Rijksmuseum and Dutch design generally.
  5. Related: I claim the Netherlands is the most typographically correct culture on earth. It’s a small thing, I suppose, but it makes me happy. I suppose the Swiss or perhaps the Danes would object.
  6. From Den Haag to Delft via taxi: 50 euros. Public transit: 5 euros.
  7. The fashion trends today in Europe for young women replay the preppy fashion of the late 80s, when I was in high school. This makes me feel incredibly old, in a way few things do.
  8. If an adjacent table of self-styled “liberal Republicans” talks so loudly as to make it impossible not to listen, my failure to ignore their conversations, by virtue of eating alone myself, is not eavesdropping, it’s self-defense.
  9. That I don’t repeat their names here—public figures, both of them—and some of their more shockingly tasteless dinner conversation (“oh, it’s one of those Jewish names” and other tidbits) is an undeserved mercy.
  10. If you ever need a good dinner in Den Haag, Spijs is lovely.
  11. My server—new: her third night—at Spijs insisted that a “gamboa” was not a “shrimp”, in English, since it wasn’t very small. I, a native English speaker, tried three times to assure her that “shrimp” was correct, but she knew better.
  12. She did accept my claim that “rack of deer” was better replaced with “rack of venison”, however.
  13. Monkfish—which she described as “the one so ugly they cut off its head”—grilled in a tandoor like kebab is surprisingly good.
  14. The beach in Den Haag—in Scheveningen, precisely—is stunning this time of year, if you’re lucky enough to get clear weather, as I was for three days.
  15. I’m now officially too old—a trend, that—for the ritual walk through Amsterdam’s red light district: boring and pointless. Give me a stroll through Leiden, a bami goreng (or even just a kroket), and an afternoon with van Rijn, and I’m a pig in mud.


Mobile texting now more popular than calling in US

September 29th, 2008

The NYT has a short note (Letting Our Fingers Do the Talking) on a new Nielsen Mobile report on texting use in the US.

“In the fourth quarter of 2007, American cellphone subscribers for the first time sent text messages more than they phoned, according to Nielsen Mobile. Since then, the average subscriber’s volume of text messages has shot upward by 64 percent, while the average number of calls has dropped slightly.”

Average Number of Monthly Calls vs. Text Messages Among U.S. Wireless Subscribers

Quarter        Calls   Texts
Qtr 1, 2006      198      65
Qtr 2, 2006      216      79
Qtr 3, 2006      221      85
Qtr 4, 2006      213     108
Qtr 1, 2007      208     129
Qtr 2, 2007      228     172
Qtr 3, 2007      226     193
Qtr 4, 2007      213     218
Qtr 1, 2008      207     288
Qtr 2, 2008      204     357

Source: Nielsen Mobile
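The quoted claims can be checked against the quarterly figures above. The short sketch below finds the first quarter in which texts outnumber calls and computes the change from that quarter to Q2 2008; the variable and function names are my own.

```python
# (quarter, calls, texts) copied from the Nielsen Mobile figures above
DATA = [
    ("Q1 2006", 198, 65), ("Q2 2006", 216, 79),
    ("Q3 2006", 221, 85), ("Q4 2006", 213, 108),
    ("Q1 2007", 208, 129), ("Q2 2007", 228, 172),
    ("Q3 2007", 226, 193), ("Q4 2007", 213, 218),
    ("Q1 2008", 207, 288), ("Q2 2008", 204, 357),
]


def first_crossover(rows):
    """Return the first quarter in which texts exceed calls, or None."""
    for quarter, calls, texts in rows:
        if texts > calls:
            return quarter
    return None


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


by_quarter = {q: (calls, texts) for q, calls, texts in DATA}
crossover = first_crossover(DATA)
text_growth = percent_change(by_quarter["Q4 2007"][1], by_quarter["Q2 2008"][1])
call_change = percent_change(by_quarter["Q4 2007"][0], by_quarter["Q2 2008"][0])

print(crossover, round(text_growth), round(call_change, 1))
# prints: Q4 2007 64 -4.2
```

So the table agrees with the article: the crossover is in the fourth quarter of 2007, texts are up about 64 percent since then, and calls are down by roughly 4 percent.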

The article also points out that “Teenagers ages 13 to 17 are by far the most prolific texters, sending or receiving 1,742 messages a month”. The Nielsen data shows that this age group handles about two orders of magnitude more text messages than people aged 65 and over (1,742 versus 14 a month).


Average Number of Monthly Calls vs. Text Messages Among U.S. Wireless Subscribers by Age (Q2 2008)

Age group      Calls   Texts
All Subs         204     357
12 & Under       137     428
Ages 13-17       231    1742
Ages 18-24       265     790
Ages 25-34       239     331
Ages 35-44       223     236
Ages 45-54       193     128
Ages 55-64       145      38
Ages 65+          99      14

Source: Nielsen Mobile


Wall Street’s collapse may be IT program’s gain

September 26th, 2008

Virtually all information technology programs in the US and Europe saw their enrollments drop significantly after the dot com bubble deflated in 2001. Students decided to pursue other majors, even though the IT job market remained strong — it just wasn’t insanely strong.

At UMBC, the number of our Computer Science majors fell by almost 50%, even though the number of BS degrees we produced declined only slightly. Our Information Systems Department suffered an even greater decrease in their undergraduate programs.

One of the popular alternatives students moved toward was business, especially finance, banking and trading, where young people with good analytic skills who were willing to work hard could do very well.

Computerworld has an article, Wall Street’s collapse may be computer science’s gain, that speculates the flow will reverse.

“The collapse of Wall Street may help make computer science and IT careers attractive to students who abandoned these fields in droves after the pop of the last big bubble, the dot-com bust of 2001.

William Dally, chairman of the computer science department at Stanford University, said that for the last several years, he has watched some students interested in technology go into banking and finance because those fields could be more lucrative.

“Many thought they could make more money in hedge funds,” Dally said. He said students are returning to computer science because they like the field and not because it can necessarily make them rich.”

My only regret is that the IT industry (including the academic sector) didn’t get a multi-hundred-billion dollar federal bailout back in 2000.

(h/t to Marie desJardins)


Can you surf the web and chew gum at the same time?

September 26th, 2008

The NYT has an article (Get Off the Internet, and Chew Some Gum) on an ad campaign by Cadbury advocating that young people spend less time on the Internet and more being up close and personal, after, of course, sweetening their breath with Dentyne.

“The campaign, called “Make face time,” was created by McCann Erickson for Dentyne, a brand owned by Cadbury, the No. 2 gum maker in the United States after Wrigley. The ads feature happy people embracing and kissing — their breath presumably freshened by Dentyne — as an alternative to pounding their BlackBerrys or sending electronic messages to their Facebook friends.” (src)

Somewhat ironically, the ad campaign is going into high gear now with a Web component. There is also a three-minute version at www.makefacetime.com.

“It opens with a warning announcing that it will shut down after three minutes. “When people are surfing the Web, they’re missing the best part of life — being together,” it reads.”

When I viewed it, however, it was against a background of PHP error messages. Maybe that’s part of the message — get off the Web, it’s run by flaky machines that speak in strange and unnatural languages.

UMBC’s Zeynep Tufekci was quoted in the NYT article as a skeptic.

“That strategy could be a gamble, as the ads focus on exactly the people who are most passionate about these digital tools.

“I think most college kids would roll their eyes” at the ads, said Zeynep Tufekci, a sociologist at the University of Maryland, Baltimore County, who studies the way young people use technology to socialize. “In fact, they’re checking out these sites in the hopes that sooner or later it will end up in a hug or kiss.”

Ms. Tufekci said that the idea that social networking sites and other digital tools have separated people from those that matter in their lives will probably not sit well with the gum industry’s young customers.

“This is a false dichotomy,” she said. People use online tools as a way to be more social, she said, updating their acquaintances on what they are doing and making plans to meet in person. Her research has shown that people who use these tools have just as many offline friends and spend just as much time with them as people who do not socialize online.” (src)

This makes me wonder whether the Dentyne loses its flavor on the Web post overnight.

(Apologies to The Happiness Boys)


Extending Google: First Look at SemantiFind

September 23rd, 2008

Just stumbled upon SemantiFind via T3N, and then upon the review on ReadWriteWeb from last Thursday.

What’s it about? SemantiFind is an IE and FF browser plug-in that extends Google’s search functionality, most notably through a typeahead feature that allows you to refine your search before hitting ‘enter’. ReadWriteWeb wasn’t too impressed though:

Unfortunately, SemantiFind is one of those tools that’s good in theory, but not so good in practice. When performing some test searches, results were not as precise as they should have been. For example, in the above-mentioned search for “Georgia,” a search for the U.S. state returned Google results for the country as well.

Ambiguities due to homonyms such as Georgia vs Georgia, or Java vs Java are among the faves of people who are trying to pitch a semantic tool to you - but I really wonder whether the effects of homonyms aren’t highly overrated? How often do people really search for these, and in particular search for these without context, i.e. further search terms such as in ‘Georgia Tech’, ‘Georgia war’, ‘Java Coffee’ or ‘Java bugs’?

I must say I was quite impressed by the choice of search terms offered, and if you (like me) are easy prey for the serendipity effect, then SemantiFind can please and distract you endlessly. Here is a preview of what appears if you enter ‘serendipity’ - please note the preview of possible descriptions and definitions which you get on the Google homepage with the plugin:

Once you pick a term it turns into a kind of button (just slightly annoying: you cannot edit a term after it’s turned into a button, but would have to delete the whole thing and type again if you want to change your search query):

And then, what happens? On the search results page, you see results filtered by SemantiFind’s user-generated, user-approved labels on top of the other search results - which irritated me at first as it comes across as a search engine within the search engine. Admittedly: I’d rather sift through 13 results than through 10,900,00 search results (even though I never make it to the end of Google’s search list anyway; does anybody?) - but does the article about trees doing their best work with thermostats at 70° really deserve the second rank in SemantiFind’s list of recommended search results?

So while I agree with RWW that this “just goes to show why search engines that rely on people to filter the results might not work. Human error shouldn’t be a factor in web searches”, I am still quite fond of the suggestions and definition previews. I would probably use SemantiFind regularly if they allowed me to configure the plugin in such a way that I’d get the suggestions on the input page, but not the recommended results on the results page.

What’s the source of these results anyway? SemantiFind’s recommended results seem to rely entirely on input generated by users - to add input, you need to install their toolbar and start adding labels to websites; if a website has been labeled before, you can confirm or reject existing labels. What’s nice: a label recommender (presumably the same one that’s used for search queries) reduces ambiguity. What’s curious: You can also browse the pages you have already labeled in what they call your “catalogue” - which makes the service even more reminiscent of a bookmarking service, and which makes me wonder whether one shouldn’t possibly link this with a del.icio.us/Mr.Wong/Bibsonomy/Faviki account (Faviki would probably be the best, considering their tag recommendations are based on DBpedia, and considering that Faviki just made it past the 1 million tags mark).

Questions that remain: I’d really like to know how they maintain their list of suggested labels - ambiguity, typos, plural forms, i.e. the usual folksonomy issues, must be a big challenge. Also, I’d like to know where they get their definitions in the preview from - from Google? Or are these user-generated as well? There must, after all, be some use for the “request a new definition” form?

Too bad they don’t have a blog to which one could send a track back, and there is nothing much on their company page either.



Obama and McCain answer questions on Science policy

September 15th, 2008

Here are word clouds generated from the answers that US presidential candidates Barack Obama and John McCain gave to a set of 14 questions about science policy. Click on an image to see a larger size. Try to guess which is which and leave a comment. A link to the answer is after the word clouds.

“In November, 2007, a small group of six citizens - two screenwriters, a physicist, a marine biologist, a philosopher and a science journalist - began working to restore science and innovation to America’s political dialogue. They called themselves Science Debate 2008, and they called for a presidential debate on science. … Among other things, these signers submitted over 3,400 questions they want the candidates for President to answer about science and the future of America. Beginning with these 3,400 questions, Science Debate 2008 worked with the leading organizations listed to craft the top 14 questions the candidates should answer.”


[Images: word clouds for Candidate 1 and Candidate 2]

answer


Congress asks telcos why text messaging rates are rising

September 11th, 2008

The US Congress is asking the four major mobile phone providers why their charges for text messages have gone up by 100% over the past few years. As Chris Gaylord notes in his blog on the Christian Science Monitor, “text messages cost about $1,310 per megabyte. That seems a tad high.”

“With text-messaging rates doubling over the past three years, Sen. Herb Kohl has started asking questions. The Wisconsin Democrat and head of the Senate’s antitrust subcommittee sent a letter to the four major cellular companies on Tuesday with some interesting points.

In 2005, the industry charged about 10 cents per text. Now it’s 20 cents. All four carriers upped their rates at about the same time. The number of nationwide competitors slipped from six to four. And the remaining big-timers are gobbling up regional carriers.”
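The per-megabyte figure is easy to reproduce from the 20 cent rate. The sketch below is one way to arrive at it, assuming a full 160-character message counted at one byte per character and a binary megabyte of 1,048,576 bytes; both are my assumptions about how the number was derived, not something the article states.

```python
PRICE_PER_TEXT = 0.20          # dollars, the current rate quoted above
BYTES_PER_TEXT = 160           # assumption: 160 characters at one byte each
BYTES_PER_MEGABYTE = 2 ** 20   # assumption: binary megabyte (1,048,576 bytes)


def cost_per_megabyte(price_per_text, bytes_per_text=BYTES_PER_TEXT):
    """Dollars charged to move one megabyte as maximum-length text messages."""
    texts_per_megabyte = BYTES_PER_MEGABYTE / bytes_per_text
    return texts_per_megabyte * price_per_text


print(f"${cost_per_megabyte(PRICE_PER_TEXT):,.2f} per megabyte")
# prints: $1,310.72 per megabyte
```

Under the same assumptions the 2005 rate of 10 cents works out to half that. A standard SMS actually packs its 160 seven-bit characters into 140 bytes, so counting real payload bytes would push the figure higher still.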

US Senator Herb Kohl’s press release includes the letter to the telcos.

“Today, US Senator Herb Kohl (D-WI), chairman of the Senate Antitrust Subcommittee, asked the presidents and chief executive officers of the four largest wireless telephone companies to justify sharply rising rates for its customers to send and receive text messages. In a letter, Senator Kohl requested an explanation from Verizon Wireless, AT&T, Sprint and T-Mobile, which collectively serve more than 90 percent of the nation’s cellular phone users. The text of Senator Kohl’s letter follows below.”


A-Space: a social networking site for intelligence analysts

September 7th, 2008

Sixteen US intelligence agencies are encouraging their staff to use A-Space, a new social-networking site for analysts being developed by the US Government and slated for launch on 22 September.

A-Space is an effort sponsored by the Office of the Director of National Intelligence. The Defense Intelligence Agency is managing the project with serving as the prime contractor for development.

CNN has an article, CIA, FBI push ‘Facebook for spies’, with some of the details.

“It’s a place where not only spies can meet but share data they’ve never been able to share before,” Wertheimer said. “This is going to give them for the first time a chance to think out loud, think in public amongst their peers, under the protection of an A-Space umbrella.” Wertheimer demonstrated the program to CNN to show how analysts will use it to collaborate.

“One perfect example is if Osama bin Laden comes out with a new video. How is that video obtained? Where are the very sensitive secret sources we may have to put into a context that’s not apparent to the rest of the world?” Wertheimer asked. “In the past, whoever captured that video or captured information about the video kept it in-house. It’s highly classified, because it has so very short a shelf life. That information is considered critical to our understanding.”

Material on A-Space is, of course, highly classified and compartmentalized, so there will be stringent access control procedures. To further prevent information from being inappropriately accessed or used, A-Space will employ additional mechanisms, including monitoring for anomalous access patterns.

“We’re building [a] mechanism to alert that behavior. We call that, for lack of a better term, the MasterCard, where someone is using their credit card in a way they’ve never used it before, and it alerts so that maybe that credit card has been stolen,” Wertheimer said. “Same thing here. We’re going to actually do patterns on the way people use A-Space.”

Federal Computer Week also has a recent article on A-Space, A-Space set to launch this month.

Uncategorized

W3C Organizes Workshop on Semantic Web in Energy Industries; Part I to Focus on Oil and Gas

August 8th, 2008

W3C invites people to participate in a Workshop on Semantic Web in Energy Industries; Part I: Oil & Gas to be hosted by Chevron in Houston, Texas, USA on 9-10 December 2008. Participants will explore how Semantic Web technologies can play a role in the management and analysis of the huge amounts of data gathered from highly diverse sources in this sector of the energy industry. Position papers are due 19 September.

English

RDF as self-describing data

August 7th, 2008
From time to time, someone will give me access to an RDF data set for me to 'have a look at'. One of the advantages of how RDF works is that it is possible to query a dataset without knowing anything about the data set at the outset. There are some simple queries that you can get started with to show how this works. As an example, let's check out the dbpedia (query web page available at http://www.dbpedia.org/sparql). When I first learned about this, Orri Erling just gave me a link; he told me nothing about the dataset.

The dbpedia page starts out with a simple sample query:

SELECT DISTINCT ?Concept WHERE {[] a ?Concept}

So let's start by running that. It is a bit of an advanced query, since the query graph includes a blank node; if you aren't comfortable with blank nodes in queries, think of

SELECT DISTINCT ?Concept WHERE {?x a ?Concept} 

instead.

This gives us all classes that have any members. There are a lot of these, maybe even too many. But we can get a feeling for the sort of thing that dbpedia talks about.

Another useful first query is

SELECT DISTINCT ?p WHERE {?s ?p ?o}

This gives you all the properties that are used in this data set.

Those are starting points - but let's go a bit further. Suppose we had a class that we were interested in. For example, when I ran the default query, one of the answers on the first page was http://dbpedia.org/class/yago/Airline102690270. So perhaps we can learn about airlines.

So, let's see the airlines that dbpedia knows about. So now I make a new query, based on what I learned in the previous one.

SELECT ?air WHERE {?air a <http://dbpedia.org/class/yago/Airline102690270>}

I get a lot of answers, including http://dbpedia.org/resource/Delta_AirElite_Business_Jets

Well - now what does dbpedia know about this Delta subsidiary? We can find out with a query like this:

SELECT ?p ?o WHERE {<http://dbpedia.org/resource/Delta_AirElite_Business_Jets> ?p ?o}

Among the answers here, we get

?p                                          ?o
http://dbpedia.org/property/headquarters    http://dbpedia.org/resource/United_States

This is interesting - are there other airlines that have headquarters in the United States? Let's find out

SELECT ?other
WHERE {?other a <http://dbpedia.org/class/yago/Airline102690270> .
?other <http://dbpedia.org/property/headquarters> <http://dbpedia.org/resource/United_States> .}

We get quite a list of airlines.

We can continue in this way in a number of directions: find other places where certain airlines have headquarters, find other things that have US headquarters, etc.

What is special about RDF / SPARQL that allowed this to happen? There are a few things here - we were able to query the schema using the same query language as we did for the data. The pattern

{?x a ?Concept}
returns the set of (nonempty) classes in the data set - a schema-level result. If this were a relational database, this would be akin to querying to find out the tables in the database.

We can even mix schema and data in the same query. For instance, the pattern

{<http://dbpedia.org/resource/Delta_AirElite_Business_Jets> ?p ?o}
tells us all the properties that dbpedia records for Delta AirElite Business Jets, as well as the values of those properties. This is like querying for the columns in a table that are filled in for a particular row, along with the values in those cells.

This is a real sense in which an RDF store is 'self-describing' - there is no need to know about traditional metadata (schemas) before exploring a data set.
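To make the schema-versus-data point concrete without needing a live endpoint, here is a minimal Python sketch of the same four query shapes run over a toy in-memory list of triples. Everything in it is an assumption for illustration: the `ex:` names, the sample triples, and the helper functions are made up, and this is not how dbpedia or any SPARQL engine is implemented.

```python
# Toy triple store: the data is illustrative, not real dbpedia content.
RDF_TYPE = "rdf:type"  # the property that the SPARQL keyword `a` abbreviates

TRIPLES = [
    ("ex:Delta_AirElite", RDF_TYPE, "ex:Airline"),
    ("ex:Delta_AirElite", "ex:headquarters", "ex:United_States"),
    ("ex:Other_Air", RDF_TYPE, "ex:Airline"),
    ("ex:Other_Air", "ex:headquarters", "ex:United_States"),
    ("ex:Fjord_Air", RDF_TYPE, "ex:Airline"),
    ("ex:Fjord_Air", "ex:headquarters", "ex:Norway"),
    ("ex:United_States", RDF_TYPE, "ex:Country"),
]


def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None plays the role of a ?variable."""
    return [t for t in TRIPLES
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def classes():
    # SELECT DISTINCT ?Concept WHERE {?x a ?Concept}
    return sorted({o for _, _, o in match(p=RDF_TYPE)})


def properties():
    # SELECT DISTINCT ?p WHERE {?s ?p ?o}
    return sorted({p for _, p, _ in match()})


def describe(resource):
    # SELECT ?p ?o WHERE {<resource> ?p ?o}
    return [(p, o) for _, p, o in match(s=resource)]


def airlines_based_in(place):
    # Two triple patterns joined on the shared variable ?other.
    typed = {s for s, _, _ in match(p=RDF_TYPE, o="ex:Airline")}
    based = {s for s, _, _ in match(p="ex:headquarters", o=place)}
    return sorted(typed & based)
```

Note that `classes()` and `describe()` go through the same `match` function: the "schema" question and the "data" question are both just triple patterns, which is the point made above.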

Uncategorized

Google Data APIs (and partial YouTube) supporting OAuth

June 27th, 2008

Building on last month’s announcement of OAuth for the Google Contacts API, this from Wei on the oauth list:

Just want to let you know that we officially support OAuth for all Google Data APIs.

See blog post:

You’ll now be able to use standard OAuth libraries to write code that authenticates users to any of the Google Data APIs, such as Google Calendar Data API, Blogger Data API, Picasa Web Albums Data API, or Google Contacts Data API. This should reduce the amount of duplicate code that you need to write, and make it easier for you to write applications and tools that work with a variety of services from multiple providers. [...]

There’s also a footnote, “* OAuth also currently works for YouTube accounts that are linked to a Google Account when using the YouTube Data API.”

See the documentation for more details.

On the YouTube front, I have no idea what % of their accounts are linked to Google; lots I guess. Some interesting parts of the YouTube API: retrieve user profiles, access/edit contacts, find videos uploaded by a particular user or favourited by them plus of course per-video metadata (categories, keywords, tags, etc). There’s a lot you could do with this, in particular it should be possible to find out more about a user by looking at the metadata for the videos they favourite.

Evidence-based profiles are often better than those that are merely asserted, without being grounded in real activity. The list of people I actively exchange mail or IM with is more interesting to me than the list of people I’ve added on Facebook or Orkut; the same applies with profiles versus tag-harvesting. This is why the combination of last.fm’s knowledge of my music listening behaviour with the BBC’s categorisation of MusicBrainz artist IDs is more interesting than asking me to type my ‘favourite band’ into a box. Finding out which bands I’ve friended on MySpace would also be a nice piece of evidence to throw into that mix (and possible, since MusicBrainz also notes MySpace URIs).

So what do these profiles look like? The YouTube ‘retrieve a profile’ API documentation has an example. It’s Atom-encoded, and beyond the video stuff mentioned above has fields like:

  <yt:age>33</yt:age>
  <yt:username>andyland74</yt:username>
  <yt:books>Catch-22</yt:books>
  <yt:gender>m</yt:gender>
  <yt:company>Google</yt:company>
  <yt:hobbies>Testing YouTube APIs</yt:hobbies>
  <yt:location>US</yt:location>
  <yt:movies>Aqua Teen Hungerforce</yt:movies>
  <yt:music>Elliott Smith</yt:music>
  <yt:occupation>Technical Writer</yt:occupation>
  <yt:school>University of North Carolina</yt:school>
  <media:thumbnail url='http://i.ytimg.com/vi/YFbSxcdOL-w/default.jpg'/>
  <yt:statistics viewCount='9' videoWatchCount='21' subscriberCount='1'
    lastWebAccess='2008-02-25T16:03:38.000-08:00'/>

Not a million miles away from the OpenSocial schema I was looking at yesterday, btw.

I haven’t yet found where it says what I can and can’t do with this information…
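For anyone wanting to poke at these fields, here is a rough Python sketch that pulls the `yt:` elements out of a profile entry using only the standard library. The namespace URIs and the wrapping `<entry>` element are my assumptions about what the full Atom document looks like, and `parse_profile` is a hypothetical helper, so check both against the API documentation before relying on them.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are assumptions for this sketch, not taken from the post.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "yt": "http://gdata.youtube.com/schemas/2007",
}

# A cut-down entry built from the fields quoted above.
SAMPLE = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:yt="http://gdata.youtube.com/schemas/2007">
  <yt:age>33</yt:age>
  <yt:username>andyland74</yt:username>
  <yt:books>Catch-22</yt:books>
  <yt:music>Elliott Smith</yt:music>
  <yt:location>US</yt:location>
  <yt:statistics viewCount='9' videoWatchCount='21' subscriberCount='1'/>
</entry>"""


def parse_profile(xml_text):
    """Collect the text-valued yt: fields of one profile entry into a dict."""
    entry = ET.fromstring(xml_text)
    prefix = "{" + NS["yt"] + "}"  # ElementTree writes tags as {uri}localname
    profile = {}
    for child in entry:
        if child.tag.startswith(prefix) and child.text and child.text.strip():
            profile[child.tag[len(prefix):]] = child.text.strip()
    # Counters live in attributes rather than element text.
    stats = entry.find("yt:statistics", NS)
    if stats is not None:
        profile["videoWatchCount"] = int(stats.get("videoWatchCount", "0"))
    return profile
```

A dict like this is the "asserted" half of a profile; the evidence-based half would come from separately fetching the user's favourites and their per-video metadata.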

Uncategorized

Firefox 3 NYC Launch Party Recap

June 18th, 2008

Firefox 3 successfully set the record yesterday for the most downloads in a single day. Globally, the new browser was downloaded 8.3 million times, with 2.5 million of those downloads coming from the United States.

It hasn’t been verified yet but early numbers show that 1.3 Million of those downloads occurred at the NYC FF3 Launch Party :D.

What an evening. Between our post and the MozillaParty page we had well over 100 people RSVP. Best guesses peg the total guest count at 70.

Quick rundown in numbers:

  • 16 large Pizzas (more on this later)
  • 9L of wine
  • 150 beers
  • 4L of soda
  • 80 buttons
  • 4 t-shirts
  • 100 party hats (my personal favorite)

A big thank you to Alex at Mozilla for sending over the swag.

A few observations: (1) Firefox fans are very prompt ;) (2) Firefox fans are hungry: the initial 8 pizzas that we provided were demolished in a feeding frenzy that lasted all of 10 min. Tim and Corey, co-founders of Notch.es, came through with another 8 pies later in the evening. (3) The NYC FF community is alive and healthy. (4) Confidence is sparse when required to buzz up for access “to the, ugh… Firefox party.” (5) Interest in party hats grew proportionally to the amount of alcohol consumed.

Thanks to everyone who made it, was nice to see such a diverse crowd. We only have the one photo — if you have any please send them my way and I’ll post them on the site.

English

SWSA supports ISWC 2008 student travel grants

June 18th, 2008

The Semantic Web Science Association (SWSA) has generously contributed 10,000€ for travel grants for Ph.D students to attend the 2008 International Semantic Web Conference. SWSA is a non-profit organization whose purpose is to promote scholarly work in Semantic Web and related fields throughout the world. This gift complements the $20,000 award from the US National Science Foundation for grants for students from US institutions. There will be a single application process for students to apply that will be available on the ISWC 2008 site in July.

Uncategorized

OCLC announce more links with Google

May 22nd, 2008

From the press release:

DUBLIN, Ohio, USA, 19 May 2008—OCLC and Google Inc. have signed an agreement to exchange data that will facilitate the discovery of library collections through Google search services.

Digging into the detail, it looks like this will mean a few things. Apart from Google Book Search providing a Find this book in a library link to WorldCat.org, as already available from other parts of Google, it appears to be relevant only to OCLC member libraries which also participate in the Google Book Search program.

It means that these libraries are able to share the MARC records for the books they have contributed [to Google Book Search] with Google, to enable them to make them easier to discover.  I am not clear if OCLC rules prevented this sharing happening prior to the agreement.  

Implicitly this also means that Google, at least in the Book Search team, recognise the value of metadata created by the library profession for making books more discoverable.  Something the library community have been saying for a long time - parsing and indexing the content is only part of the solution to making books findable.

Also in the press release:

The new agreement enables OCLC to create MARC records describing the Google digitized books from OCLC member libraries and to link to them.

OCLC should therefore be creating catalogue records for the digitized books held by Google. This means that a search in WorldCat will direct a searcher to the digitized manifestation as well as to the library that contributed it. A great way to gain wider exposure to a library’s collection without necessarily increasing the number of people through its doors.

To enable OCLC to create catalogue records for items in the Google Book Search collection, Google must, I presume, have made some commitment to creating and maintaining a permanent  URI for each digitized book.  I wonder if those URIs are generally available, with a commitment to maintain them, in a way that others could reliably catalogue them?

The announcement is one of a continuing series of additions to the Google Book Search service, such as the recent release of their API.

Listening to Google Product Manager Frances Haugen in her guest slot on the Library 2.0 Gang, it is obvious that at least one person in the Book Search team is interested in and motivated by libraries - let’s hope we see even more links between them and the wider library community.

English

WWW2008 Beijing: Dr. Kai-Fu Lee (Google) - “Cloud Computing”

April 23rd, 2008

Kai-Fu Lee is Vice President of Engineering at Google, and President of Google Greater China. He joined Google in 2005. Earlier in his career he developed the first speaker-independent continuous speech recognition system, for which he won a Business Week award in 1988.

He started by talking about the “people theme”, saying that this is what the (Chinese) Internet is all about. (For April Fool’s Day, Google China announced that they were going to shut down their servers to save electricity, and that they would have to hire 25 million people to do their searches for them. They got 1,800 resumes for the positions.)

There are 235 million people on the Internet in China. What do these people want? Kai-Fu listed these things: accessibility, shareability, freedom (data wherever they are), simplicity, and security. Google believes that cloud computing solves a lot of these problems. It’s not new, so Google are just a part of it like we all are. But day by day, cloud computing is changing the way we use the Internet.

He then explained a little bit about what the Cloud is. Data is stored in the Cloud, on some server somewhere that is not necessarily known by the user, but it’s just there and accessible. Software and services are also moving to the Cloud, usually accessible via a full-featured web browser on the client device. He also advocated the use of open standards and protocols, which he says are “liked” by Google (e.g. Linux, AJAX, LAMP, etc.) so as to avoid control by one company. Finally, the Cloud should be accessible from any device, especially from phones. He said that when the Apple iPhone hit the market, they found that web usage from that device was 50 times greater than that from other web-capable phones, and that Google’s servers really felt it.

Next up was a history lesson on cloud computing. The PC era was hardware centric. Then, the client-server era was more software centric, which was great for enterprise computing. Cloud computing now abstracts that server and makes it very scalable, by hiding complexities, and with the server being anywhere. This is service centric.

Banks too have become “Clouds”, allowing people to go to any ATM and remove money from their bank wherever they are. Electricity can be thought of similarly, as it can come from various places, and you don’t have to know where it comes from: it just works.

Driving forces behind cloud-based computing include: (i) the falling cost of storage, (ii) ubiquitous broadband, and (iii) the democratisation of the tools of production. This is beginning to make cloud-based computing more like a utility. A lot of this is due to work in the 1990s by IBM and DEC, who realised that computing should be a utility. It is only now, with these three key things in place, that this is becoming a reality.

There are six further properties that make this area exciting, being: (1) user centric, (2) task centric, (3) powerful, (4) accessible, (5) intelligent, (6) programmable.

(1) User centric. The data moves with you, and the application moves with you. People don’t want to reload their address book or applications on new machines, as it is painful to do. For example, how bad do you feel if you drop or break your laptop? How easy is it to switch your cellphone? It’s hard, because synchronising your data is usually hard to do. The IR functionality on a mobile is not easy to use / user centric: how often do people use it to backup stuff to their laptops?

If data is all stored in the Cloud - images, messages, whatever - once you’re connected to the Cloud, any new PC or mobile device that can access your data becomes yours. Not only is the data yours, but you can share it with others (e.g. on Picasa Web, your photos are stored in the Cloud). You don’t have to worry about where it is. We’re not there just yet, but the time is approaching where the way we deal with photographs will change. Another example is GMail, as you can use it on any device (since large storage is not required on the device). Kai-Fu bets that everyone in the room has some kind of cloud computing-based e-mail.

PCs are normally our window to the world, but mobile devices can do more. Since services know who you are and where you are (eek!), they can give you more targeted content. There are 600 million cellphone users in China, three billion worldwide, dwarfing the number of PCs that are Internet-accessible. Intelligent mobile search is useful for cellphones, giving you local listings and results relevant to your context. The most powerful and popular application is maps, especially when people get lost, or if they spontaneously want to go somewhere. Maps are more than the traditional flat piece of paper, allowing you to search nearby, see real-time traffic flows, etc. Such mashups provide even more power - calling these integrations a map is a misnomer - the capabilities are enormous. As there’s a move from e-mail usage towards maps and photos, these new applications have to go into the Cloud as well. And with the shift in this direction, another question is how do you make this economic?

Instant information sharing is also important, e.g. via Google Docs, Page Creator, etc. Recently, Google Sites was released - Google hosts it all for you, so there’s no need for you to buy servers or hosting - 50,000 sites were set up in the first few hours after it began. Not only can you access the data, but you can create it anywhere. The browser is the platform.

(2) Task centric. The applications of the past - spreadsheets, e-mail, calendar - are becoming modules, and can be composed and laid out in a task-specific manner. For example, a task may be teachers creating a departmental curriculum, where you can see the people viewing the curriculum spreadsheet and they can have debates in parallel in real time. Spreadsheet editing allows collaboration and publishing to a selected group of people, with version control.

Google considers communication to be a task, such that in GMail you see pop-up chats and chat histories which provide zero-latency discussions combined in communications tasks. If you want, you can have real-time discussions instead of waiting for e-mail responses if people are online in the contacts list. You can also organise all of your common tasks, e.g. using iGoogle’s widgets portal.

(3) Powerful. Having lots of computers in the Cloud means that it can do things that your PC cannot do. For example, Google Search is faster than searching in Windows or Outlook or Word. Of course, Google Search has to be much faster, even though there are many more documents. In terms of how much storage is required, if there are 100 billion pages at 10 kB per page, that’s about 1000 TB of disk space. Cloud computing should have an infinite amount of disks / computation at its disposal. When you issue a query to the Google web search engine, it queries at least 1000 machines (potentially accessing 1000s of terabytes).
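The storage estimate is easy to check. This small Python sketch redoes the arithmetic with the figures from the talk, assuming decimal units (1 kB = 1000 bytes, 1 TB = 10^12 bytes). The even split across 1000 machines is my own simplification for illustration, not something Kai-Fu claimed.

```python
PAGES = 100 * 10**9          # 100 billion pages (figure from the talk)
BYTES_PER_PAGE = 10 * 10**3  # 10 kB per page, decimal units assumed
TB = 10**12                  # bytes in a terabyte, decimal units assumed


def index_size_tb(pages=PAGES, bytes_per_page=BYTES_PER_PAGE):
    """Raw storage needed for the pages alone, in terabytes."""
    return pages * bytes_per_page / TB


def tb_per_machine(total_tb, machines):
    """Share of the data per machine if it were spread evenly (an assumption)."""
    return total_tb / machines


total = index_size_tb()                      # 1000.0 TB, i.e. about 1 PB
per_machine = tb_per_machine(total, 1000)    # 1.0 TB each across 1000 machines
```

This counts a single copy of the raw pages only; indexes and replicas would push the real figure higher.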

(4) Accessible. Universal search (“searchology”) was announced by Google last year. Traditional web page search does IR / TF-IDF / PageRank stuff pretty well on the Web at large, but if you want to do a specific type of search, for restaurants, images, etc., web search isn’t necessarily the best option. It’s difficult for most people to get to the right vertical search page in the first place, since they usually can’t remember where to go. Universal search is basically a single search that will access all of these vertical searches.

This search requires simultaneously querying and searching over all the specific databases: news, images, videos, tens of such sources today, with potentially hundreds and thousands of them in the future. There are lots of these simultaneous searches which then get ranked, so it is even more computing intensive than current web search.
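As I understood it, the shape is "fan out to every vertical at once, then rank the combined results". Here is a minimal Python sketch of that shape. The three verticals, their canned results and the idea that their scores are directly comparable are all assumptions of mine; how Google actually ranks across sources was not covered in the talk.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical vertical indexes: each maps a query to (score, result) pairs.
VERTICALS = {
    "web": lambda q: [(0.9, f"web page about {q}"), (0.4, f"forum thread on {q}")],
    "images": lambda q: [(0.7, f"photo of {q}")],
    "news": lambda q: [(0.8, f"news story: {q}")],
}


def universal_search(query, limit=3):
    """Query every vertical concurrently, then merge into one ranked list."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, query) for name, fn in VERTICALS.items()}
        merged = [(score, name, text)
                  for name, future in futures.items()
                  for score, text in future.result()]
    # Single ranking across all sources, highest score first.
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return merged[:limit]
```

Every extra vertical adds another lookup per query, which is why this ends up more computing intensive than plain web search.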

(5) Intelligent. Data mining and massive data analysis are required to give some intelligence to the masses of data available (massive data storage + massive data analysis = Google Intelligence).

In their machine translation work for translate.google.com, a trillion words were collected from bilingual and monolingual text, and they wanted to not only find various orders of words but also the mappings of words. Statistical models of translation were trained, and they saw how an English-Chinese pair could be aligned. Then, they needed to extract phrases and collect statistics (e.g. how often variations of a certain translation were being used, such as variations for latest / last / newest / most recent). As more training data is added, the quality improves. Context is also an important matter for consideration, and it provides an advantage for the phrase analysis part of Google’s translators. There are estimates that their translator is equivalent to a high-school student’s level of translator quality.

Lots of data can be processed by machine analysis to generate intelligence. But this needs to be combined with humans - via their collaboration and contributions - to change a mess / mass of photos or data or whatever into a very powerful combination. People and tools together can create intelligent knowledge. Applications like Google Earth are much more useful when people can contribute to them, e.g. by National Geographic sticking loads of high-res photos into it. Reviews, 3-D buildings, etc. can turn a tool from a bunch of pictures into something special. Creativity adds connections to data-centric applications, enabling intelligent combinations of content.

With all this data comes the issue of server costs. If you are trying to choose between buying $42,000 high-end servers or cheap PC-class servers for $2,500 each, you can get 33 times cost efficiency by going for the PC-class servers. You can get a 1000 CPU PC-class cluster for the same price as a high-end 64 CPU server, with possibly 30 times the performance (figures may be out of date).

Even though there is a lower cost, there still needs to be high reliability. Google search is mainly based on low-cost commodity PCs running Linux. Failures are expected in every system every day. If we assume that there are 20,000 machines, there’s typically a failure rate of 110 per day. Google has built a custom software layer that can tolerate failure. (They have also deployed a new data centre in just three days.)
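To put the failure figure in perspective, here is the arithmetic in Python, using only the two numbers quoted above. Treating failures as spread evenly across machines is my simplifying assumption, and `expected_failures` is a hypothetical helper.

```python
MACHINES = 20_000        # cluster size assumed in the talk
FAILURES_PER_DAY = 110   # typical daily failures quoted in the talk

# Chance that a given machine fails on a given day (about 0.55%).
daily_failure_probability = FAILURES_PER_DAY / MACHINES

# Average number of days between failures for any one machine (roughly 182).
mean_days_between_failures = MACHINES / FAILURES_PER_DAY


def expected_failures(machines, days, p=daily_failure_probability):
    """Expected failure count for a fleet, assuming the same per-machine rate."""
    return machines * days * p
```

So any single box fails only about twice a year, yet the operator sees a failure every quarter of an hour or so, which is why the tolerance has to live in software.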

(6) Programmable. This follows on from the previous description of data requirements. How does one program for 10,000 “flaky servers” in a Google farm? There needs to be: (i) fault tolerance, (ii) distributed shared memory (if storing every web page in yahoo.com, no one machine can store that, so multiples are required), and (iii) new programming paradigms required for storing stuff.

For (i) fault tolerance, Google uses GFS or distributed disk storage. Every piece of data is replicated three times. If one machine dies, a master redistributes the data to a new server. There are around 200 clusters (some with over 5 PB of disk space on 500 machines).
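The replicate-three-times idea can be sketched in a few lines of Python. This is a toy model of my own and not GFS: the `Master` class, the server names and the random placement policy are all hypothetical, and real systems also have to copy the data and worry about racks, load and the master itself failing.

```python
import random

REPLICAS = 3  # replication factor mentioned in the talk


class Master:
    """Toy bookkeeping of which servers hold a copy of each chunk."""

    def __init__(self, servers, seed=0):
        self.servers = set(servers)
        self.locations = {}              # chunk id -> set of servers with a copy
        self.rng = random.Random(seed)   # seeded so runs are repeatable

    def store(self, chunk):
        # Place the chunk on three distinct servers.
        self.locations[chunk] = set(self.rng.sample(sorted(self.servers), REPLICAS))

    def fail(self, server):
        # Drop the dead server and re-replicate every chunk it was holding.
        self.servers.discard(server)
        for holders in self.locations.values():
            if server in holders:
                holders.discard(server)
                spare = sorted(self.servers - holders)
                if spare:  # only possible while spare servers remain
                    holders.add(self.rng.choice(spare))
```

For example, with five servers and ten chunks, calling `fail("s1")` leaves every chunk with three copies again, none of them on `s1`.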

The “Big Table” is used for (ii) distributed memory. The largest cells in the Big Table are 700 TB, spread over 2000 machines.

MapReduce is the solution for (iii) new programming paradigms. It cuts a trillion records into a thousand parts on a thousand machines. Each machine will then load a billion records and will run the same program over these records, and then the results are recombined. While in 2005, there were some 72,000 jobs being run on MapReduce, in 2007, there were two million jobs (use seems to be increasing exponentially). Not everything is suitable for MapReduce, e.g. parallelising SVMs. Matrix operations can’t be split and re-glued together easily. For this, they use Incomplete Cholesky Factorisation.
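The split, same-program-everywhere, recombine pattern is easy to see at toy scale. Below is a minimal single-process Python sketch using word counting as the job. The function names are my own and the "machines" are just list slices, so this shows the shape of the idea rather than Google's MapReduce itself.

```python
from collections import defaultdict


def split(records, parts):
    """Cut the input into `parts` slices, one per (pretend) machine."""
    return [records[i::parts] for i in range(parts)]


def map_phase(chunk):
    # Every machine runs the same program over its own slice: emit (word, 1).
    return [(word, 1) for record in chunk for word in record.split()]


def reduce_phase(pairs):
    # Recombine: sum the values emitted for each key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def map_reduce(records, parts=3):
    # In a real cluster the map calls would run in parallel on separate machines.
    mapped = [map_phase(chunk) for chunk in split(records, parts)]
    return reduce_phase(pair for chunk in mapped for pair in chunk)
```

Word counting works because each slice can be processed without looking at any other slice; jobs without that property are the ones that do not fit the model easily.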

Cloud computing needs new skills, especially when working with tens of thousands of machines as opposed to just one. The Academic Cloud Computing Initiative in the US and China (at Tsinghua) was launched by Google and IBM. Cloud computing is not just for web-based problems, but it can help provide solutions for scientific problems that were previously very hard to solve.

In terms of benefits, everything should just work, changing the way we work and play. IT should become “simple and safe”, by outsourcing IT to a “trusted shop” via a browser. Entrepreneurs should have new opportunities with this paradigm shift, being freed from monopoly-dominated markets as more cloud-based companies evolve that are powered by open technologies. Governments should leverage such “innovation-enabling platforms”, where people can effectively program tens of thousands of machines themselves. With $540 million of venture capital infused into China last year, Kai-Fu sees cloud-based computing as being a catalyst of economic growth. He finished up saying that cloud computing has arrived. “Embrace the Cloud!”

There was one question from the audience. The questioner said that Kai-Fu made cloud computing sound simple (i.e., it was well explained, not that the technologies or efforts were trivial). He asked what the societal change is, rather than the technological change. Assuming we have cloud-based computing, how can we start to encourage “cloud thinking” within society? The questioner works with universities looking at open access, trying to encourage people to share their intellectual outputs, but believes it is difficult to persuade knowledge workers to move their work into the Cloud. His question was, what can we do to encourage cloud thinking and “cloud knowledge”?

Kai-Fu’s answer was firstly that cloud computing is not simple, rather it is incredibly complex, but we can learn from what has happened so far. There have been efforts to categorise world knowledge, e.g. Cycorp, which Kai-Fu said has not resulted in a success yet (however, I’ll note here that they are becoming part of the Linked Data initiative: as Kingsley Idehen said yesterday, “Yoda is awake”!). There has been some success in various question and answering systems with pieces of knowledge that can be mined and found. He stated that these were the two extremes, but believes that the answer lies somewhere in the middle: some organisation, but not too much. Wikipedia is a step in this direction, so he suggested bringing the question and answering approach and the Wikipedia approach closer together.

He said that two things would be required. Firstly, he saw the need for some kind of translation capability. There is so much knowledge in English, which spoils native English speakers. In China, people are also spoiled. However, for many other countries, there is very little local language content. If auto translation doesn’t work well, some kind of assisted translation is required. Secondly, there should be mobile endeavours to make knowledge available. There may also need to be some economic incentive for people to create and share content via their mobiles.

(More reviews at 1, 2 and 3.)

English

True Knowledge: The Natural Language Question Answering Wikipedia for Facts

February 26th, 2008
True Knowledge is a natural language search engine and question answering site, but to leave it at that would not do the site justice. What makes it stand out from similar sounding services like Powerset and Freebase? True Knowledge tackles natural language search and question answering (much like Powerset and Hakia), and it also maintains a knowledge base of facts about the world (similar to DBpedia and Freebase). However, what makes True Knowledge stand out is that they've combined these features and encourage their userbase to contribute facts and add new knowledge.

A brief overview of True Knowledge

True Knowledge has combined their technologies to create something that doesn't easily fall into any one category. In fact, you can categorize it as all of the following:
Question-Answering site
You can ask questions about any subject and get a direct response. Unlike human-powered Q&A sites, you don't need to wait for someone to respond. The computer answers your question using knowledge stored in a form it can comprehend, and isn't just regurgitating text that it doesn't understand. For this reason it can answer questions it hasn't seen before and can combine knowledge through a process of inference and cross-referencing stored information to produce a reasoned answer.
Natural language search engine
True Knowledge also returns search results like a standard search engine, however not without first passing it through their natural language technology. Your query may be a standard question; even if it isn't, they may be able to work out what you are looking for and give you the answer directly. Because of the way facts are assessed you can enjoy a high degree of confidence that any information they retrieve will be accurate (unlike information on any single Web page). You aren't limited to properly constructed questions, you can also use the typical two and three word "keywordese" queries that many search engine users are accustomed to. Where what is typed is just the name of an entity, their technology can produce a small information screen giving core information about the entity (as well as search engine results).
Wikipedia for facts
The knowledge in their system comes from two main sources: information they import themselves from various sources (such as the CIA Factbook) and facts added by their userbase. A big part of their technology is enabling users to add knowledge without having to have any technical understanding of the underlying computer processes. Unlike Wikipedia, where the knowledge in each entry is buried in natural language, True Knowledge stores each piece of knowledge as a discrete fact that can be reasoned on. Once a fact has been established with enough evidence it can't be easily changed. Furthermore, facts that contradict this knowledge are also automatically prevented, which helps the system deal with vandalism.
"Universal database"
With a typical database-driven application the developers sit down and create a schema. They then write code which manipulates and processes the data in that schema and when the application is finished this code is run by users. The knowledge that such a system can process is extremely narrow and remains so because nothing that happens after launch expands the scope of the application. Users may add data to the tables but the schema remains fixed. True Knowledge is like a database application except that everything in it is amenable to expansion by users. The scope of the knowledge that it can store expands every time a user adds a new class, relation or attribute; and knowledge about every conceivable entity can be put into the system and be used to answer questions.
In short, they've created a platform for representing the world's knowledge in a form that is clear and accessible to humans, as well as being comprehensible to computers.

Information about their architecture

At the heart of the True Knowledge system is the Knowledge Base - a huge database of facts on any topic represented in a form that can be processed by computer. Facts are also inferred by the Knowledge Generator, either using Knowledge Base facts, other generated facts or external feeds of knowledge. Users can ask questions through a browser interface and those questions are translated via Natural Language Translation into queries expressed in the True Knowledge query language. Their technology has the ability to disambiguate ambiguous questions, including removing interpretations of questions that are unlikely. Questions can also be abbreviated to two or three ("keywordese") words and still be understood - similar to typical keyword search terms. Their question answering system uses the Knowledge Base and generated facts to answer queries. The API provides an alternative interface to the question answering system from remote computers. System Assessment further processes existing facts in order to maintain semantic consistency of knowledge. For example, facts can be marked as untrue if they are contradicted by other facts. The browser interface provides a means for users to assess the validity of facts (User Assessment), enabling them to endorse or contradict particular facts. A user's reputation and track record is used to automatically weight this information. In combination with System Assessment this prevents the back-and-forth battles that are common on Wikis. The Knowledge Base grows through Knowledge Addition, either from users via the browser interface, or imported in volume from external sources. A key design decision is that all components are extendable by users. In addition to users adding facts, they can also extend the questions that can be translated into whole new areas and even provide new inference rules (and even executable code for steps that involve calculation) for the Knowledge Generator.
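To make the User Assessment idea more concrete, here is a minimal Python sketch of reputation-weighted endorsement. It is only an illustration under my own assumptions: the names `Assessment`, `fact_score` and `is_accepted`, the default reputation of 0.1 and the acceptance threshold of 1.0 are all hypothetical, and True Knowledge has not published how it actually weights users.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    """One user's verdict on a single fact (hypothetical structure)."""
    user: str
    endorses: bool  # True = endorse the fact, False = contradict it


def fact_score(assessments, reputation, default_reputation=0.1):
    """Add each user's reputation for an endorsement, subtract it for a contradiction.

    Users missing from the reputation table get a small default weight
    (an assumption made for this sketch).
    """
    score = 0.0
    for a in assessments:
        weight = reputation.get(a.user, default_reputation)
        score += weight if a.endorses else -weight
    return score


def is_accepted(assessments, reputation, threshold=1.0):
    """Treat a fact as accepted once its weighted score reaches the threshold."""
    return fact_score(assessments, reputation) >= threshold
```

With a scheme like this, a handful of brand-new accounts contradicting a fact cannot easily outweigh one user with a long track record, which is one plausible way to damp the back-and-forth edits described above.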

True Knowledge API

No service such as this would be complete without an API! They say their API can execute any query you supply it with, however they are in the process of releasing a series of API services. These simple services encapsulate areas of knowledge which are well served by their current Knowledge Base. All these services can be accessed via the same query interface using a single account. Click on the names of the services below to test each one!
IP Geolocation
Converts an IP address to a probable geographical location of an internet user (e.g. the user of a website). This geographic knowledge can then be used in subsequent queries to retrieve further relevant facts about the location from the Knowledge Base: including the user's likely language, preferred currency, local time etc.
Local Time
Identifies a place either from an IP address obtained automatically or from a supplied string denoting the place and obtains a local time either now or at some past or future time. Possible applications include an online or phone conferencing system wanting to inform the participants about the date/time of the meeting in their local time zone.
Name-to-Gender
Takes a personal name (first name or full name) and returns the gender inferred by the system for that name. The system applies certain heuristics to a string representing a person's name in an attempt to judge the gender of the person. If the gender can be determined with reasonable probability, then it will be returned. This service would be useful to, for example, a social networking site wishing to use gender-specific language about a user whose name, but not gender, was known.
Email-to-Name
Takes an email address and returns the forename inferred from its local-part (if a name can safely be inferred). Businesses with access to users' email addresses but not names could use this to address emails more personally. This service can be combined with the Name-to-Gender service to infer a person's gender from his/her email address.
Trading Day
Takes a point in time and a geographical location and returns 'no' if it is a weekend day or a public holiday in the location and 'yes' otherwise.
Location-to-Language
Returns a language which can be read by a significant number of people at a location. True Knowledge has complete coverage at the national level and partial coverage for smaller areas. This can be used in combination with the IP Geolocation service to decide which language(s) are appropriate when displaying websites to international users, for example.
Telephone Number-to-Location
Returns the geographical location of the specified landline telephone number.
Don't worry, the road doesn't end there. True Knowledge says they are currently working on even more services to add to this list.
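To give a feel for what one of these services does, here is a minimal Python sketch of the Trading Day logic as described above. It is not their API: the function name `is_trading_day`, the location codes and the tiny `PUBLIC_HOLIDAYS` table are hypothetical stand-ins, and the real service presumably draws its holiday data from the Knowledge Base.

```python
from datetime import date

# Hypothetical holiday table for illustration only.
PUBLIC_HOLIDAYS = {
    "US": {date(2008, 7, 4), date(2008, 12, 25)},
    "UK": {date(2008, 12, 25), date(2008, 12, 26)},
}


def is_trading_day(day, location):
    """Return 'no' for weekend days and public holidays at the location, 'yes' otherwise."""
    if day.weekday() >= 5:  # Monday is 0, so 5 and 6 are Saturday and Sunday
        return "no"
    if day in PUBLIC_HOLIDAYS.get(location, set()):
        return "no"
    return "yes"
```

For example, `is_trading_day(date(2008, 7, 4), "US")` returns 'no' because that Friday is in the sample holiday table, while the same date with "UK" returns 'yes'.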

Adding knowledge to True Knowledge

Time for some hands-on stuff! What do True Knowledge and Jurassic Park have in common? Nothing, as far as I'm aware. However, I am going to show you step-by-step how I taught True Knowledge something it didn't know. To be more specific, I'm going to show you how to add new knowledge from start to finish and then how to expand on it. Because True Knowledge seems to update itself in real-time, I was able to see the fruits of my labor right away. Not having to wait for an index to be rebuilt made the task of adding knowledge feel more worthwhile. After playing with a few test queries I tried to find something it didn't know anything about. I asked "who is the author of jurassic park?", which returned the response "I don't know" and a more detailed explanation:
It sounds like "jurassic park" may be a thing that is published that I don't currently know about. If you want, you can add the thing that is published called "jurassic park" to the Knowledge Base.
Incidentally, the search results that appear alongside the answer are pretty relevant. The first result contains the answer to my question. By chance, the title is exactly my answer. Clicking the link took me to a screen that asked me to enter the most common name for "a thing that is published." I entered "Jurassic Park." They do ask that you don't enter information about fictional things (e.g., unicorns). I had to think for a moment if Jurassic Park is considered a fictional thing in this context. I came to the conclusion that Jurassic Park is not fictional in the sense that it is both a literary work and the title of several movies so I clicked Submit. After a quick look at the confirmation page I was ready to proceed. I should note that there are several confirmation pages along the way. If you're comfortable enough with the process you can disable each confirmation page individually by checking the box that says "Don't show me this confirmation page again." Next I was presented with a possible Wikipedia match and a helpful extract from the page. I was satisfied that the Wikipedia entry presented to me was indeed talking about the very same Jurassic Park so I clicked continue. The next screen asked me if I knew anything that Jurassic Park is that is more specific than a "thing that is published." It was trying to figure out the name of the class of things Jurassic Park belonged to. I clicked yes, entered "movie" and clicked submit. True Knowledge is already aware of what a movie is and asks me specifically if what I meant was "movie (connected cinematic narrative)." Satisfied that I had my match I clicked submit and continued on. This is where I thought things got interesting. The next screen asked me to be more specific about what kind of movie Jurassic Park is and gave me the following options to choose from:
  • Made for TV movie
  • Made for video movie
  • Big screen movie
Since we all know Jurassic Park was a major motion picture I chose "big screen movie" and clicked select. Alternatively if I didn't want to choose any of those refinements (e.g., if they didn't apply) I could simply click Yes and proceed with Jurassic Park labeled as a "movie." The next screen asked me to enter a phrase that could be used instead of Jurassic Park in all circumstances. Basically they were asking for a short but descriptive phrase that makes it absolutely clear what Jurassic Park is. They give a few examples such as "France, the Republic of France" and "Star Wars, the 1977 adventure action sci-fi movie Star Wars." Going off the Star Wars example I entered "Jurassic Park, the 1993 movie about dinosaurs" and clicked submit. I was then asked to confirm that the phrase I entered was an unambiguous way of saying Jurassic Park, which would be recognized by anyone wanting to say something about that big screen movie. After confirming a few points about the ambiguity of my phrase I clicked Yes. I was then asked to enter a few alternate names. I entered "JP" (the US promotional title) and "Jurassic Park 1" (a common way of referring to the original movie after the sequels were released). Next I had to enter a unique, human readable ID. The page informed me that [jurassic park] was available and auto-populated that value for me. I certainly couldn't think of a better ID so I clicked submit. After submitting the ID I was presented with a list of facts that the system had gathered from the information I entered. Reading through the list of facts you can see how each step along the way input the information into True Knowledge. I am listed as the source for each fact because I have not specified any other sources. Luckily I am able to do that at the bottom of the page. As I want this information to be trustworthy, I included a trustworthy source: the IMDB entry for Jurassic Park. I entered the URL for the entry on IMDB and clicked add new source.
This took me to a mini-process of adding a document stored in a remote system (i.e., a Web page). I clicked OK to start the process. The next screen asked me to verify that the contents below were what I was expecting. Everything checked out so I clicked confirm. Now that I have a new source available to me (the IMDB page) I changed the source where appropriate. Once I had the sources set I clicked add these facts to finish up the process of adding new knowledge. All done! Clicking on OK will take you to a page with your new entry. The page has a few links for adding more information that would be relevant to the entry. I wasn't done yet since I still couldn't answer the question "who is the author of jurassic park?" Of course now I have a whole new problem: I told the system that Jurassic Park was a movie, not a literary work. We'll see how the system handles this. On the add knowledge page I selected "add a new fact." On the add a fact page I was given three textboxes to enter a (subject, predicate, object) tuple about anything. Since I want to enter the author information for Jurassic Park I entered "Michael Crichton" -> "is the author of" -> "Jurassic Park" and clicked submit. The next screen actually informs me that the system is already aware of Michael Crichton, the American author born in 1942. Since we're both talking about the same person I clicked submit. On the fact confirmation page that followed I was given the option to go ahead and add the fact as-is or to change the left or right part of the fact (the subject or object). Although the proper course of action would have probably been to create a new entry in True Knowledge for the literary work Jurassic Park, I wanted to see if the property "author" could be applied to an instance of class "movie." I also wanted to determine whether or not something can belong to multiple classes ("book" and "movie"). I chose to add Michael Crichton as the author of Jurassic Park (the movie), and clicked Yes.
When it came time to list sources I told it that I was not the source, and I listed the Wikipedia entry for Jurassic Park and went through the two-step process of adding a Web page. Now True Knowledge knows about Jurassic Park (the 1993 movie about dinosaurs) and Michael Crichton, the author of Jurassic Park (the literary work). It should be noted that True Knowledge is under the impression that Michael Crichton is actually the author of the movie Jurassic Park. I tried my original question and this time I got a direct answer, including how it came to that conclusion. So you can apply an author to a movie. It feels weird to me that you can do that, because I don't feel you can be the "author" of a movie (rather, the movie's script and screenplay). Back on the add a fact page I tell True Knowledge that "Jurassic Park" -> "is a" -> "book." This time around I'm given three options of what a book might be. I chose the last option, "book (a written work intended to be published as a set of pages bound together on one side)" because I felt it was the best definition of what a book is. After confirming the fact and adding my source (Wikipedia again) I am informed that "Jurassic Park is a book" contradicts previously inserted facts. In this case, it is apparent that a movie cannot also be a book. In the end the fact did not get added because it contradicts an existing fact in the system. Today was just my first day, so I'm sure I'll get better at this.
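The behaviour I ran into can be pictured with a small Python sketch. This is my own guess at the general idea, not True Knowledge's implementation: the `KnowledgeBase` class, its `add_fact` and `ask` methods and the `DISJOINT_CLASSES` table are hypothetical names, and I am assuming that the system treats "movie" and "book" as classes that cannot share an instance.

```python
# Hypothetical: pairs of classes assumed to be mutually exclusive.
DISJOINT_CLASSES = {frozenset({"movie", "book"})}


class KnowledgeBase:
    """Toy store of (subject, predicate, object) facts."""

    def __init__(self):
        self.facts = set()

    def add_fact(self, subject, predicate, obj):
        """Add a fact unless it contradicts an existing class membership.

        Returns True if the fact was stored, False if it was rejected.
        """
        if predicate == "is a":
            for (s, p, o) in self.facts:
                clash = frozenset({o, obj}) in DISJOINT_CLASSES
                if s == subject and p == "is a" and clash:
                    return False
        self.facts.add((subject, predicate, obj))
        return True

    def ask(self, predicate, obj):
        """Return every subject for which (subject, predicate, obj) is stored."""
        return sorted(s for (s, p, o) in self.facts if p == predicate and o == obj)
```

In this toy version, adding "Jurassic Park" -> "is a" -> "movie" succeeds, the author fact is stored without complaint because nothing checks whether a movie can have an author, and a later "Jurassic Park" -> "is a" -> "book" is rejected, which mirrors what happened to me.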

My first impression of True Knowledge

I found my first experience with True Knowledge very satisfying! The user interface is simple and it's hard to get lost trying to do something new. They are still in beta, and as such they still have some polish to apply before the general public is let in, but the product is solid and I can't wait until more users are let in the gates. I'm interested to see how it will fare against similar services. Components of True Knowledge compete with many semantic services (Freebase, Hakia, Powerset, DBpedia, etc.) and even non-services like Cyc. I am of the opinion that True Knowledge has the winning combination of each approach. Got something to say? Leave a comment!

English

Conference season 2008

February 6th, 2008

JFK-SAN-AUS-SFO

The March 2008 US conference season is nearly upon us. I'm just on my way back from representing Dopplr at Social Graph Foo Camp (find out more by listening to the Citizen Garden Podcast I participated in after the camp), but I'll be back here again in three weeks.

I'm spending a few days in New York, where I'll be hosted by the lovely Chris Shiflett, and then it's on down to San Diego for ETech. That'll be swiftly followed by SXSW Interactive where I'll be on a panel entitled "Creative Collaboration: Building Web Apps Together", about working in multidisciplinary teams. Finally, a week in San Francisco decompressing and having a few meetings.

I'm particularly excited by the trip to ETech. The last two years have brought smart people together to talk mostly Web 2.0 topics, but this year looks significantly more awesome. Full of genuinely emerging technology, the lineup looks like one Matt Jones and Tony Stark would appreciate.

Some highlights for me include a talk from Google's economics groups on Prediction Markets, Computing for Socio-economic Development, and the excitingly-titled Antigenic Cartography: Visualizing Viral Evolution for Influenza Vaccine Design. Hope I see you there.


Uncategorized

New Book by Kathleen Spivack (My Mother)

December 9th, 2007
My mother, Kathleen Spivack, a Pulitzer Prize nominee, recently published her sixth book! My mother taught me how to write. She's a brilliant writer and poet who teaches widely in the USA and Europe....

English

Parsing Miss South Carolina’s Statement

August 31st, 2007

It’s not like it’s easy to parse Wikipedia, but at least most of its text is (usually) written with correct spelling, capitalized proper names, meaningful paragraph structure and so on. A natural question is: how will our system perform on the rest of the Web with all of its slang, non-standard syntax, and so on? To put Powerset to the test, two of our engineers, Lukas Biewald and Brendan O’Connor, ran our entire parsing and indexing system on the hardest corpus we could find: Miss South Carolina’s response to the question, "Recent polls have shown that a fifth of Americans can’t locate the US on a map. Why do you think this is?" They fed this transcription into the XLE verbatim, disfluencies and all:

I personally believe that U.S. Americans are unable to do so because uh some uh people out there in our nation don’t *have* maps and uh I believe that our ed- education like such as in South Africa and uh the- the Iraq everywhere like such as and I believe that they should uh our education over here in the U.S. should help the U.S. or- or- should help South Africa and should help the Iraq and the Asian countries so we will be able to build up our future

One might think that such a convoluted mess of words (I hesitate to call it "English") would be impossible to parse, but here are the C-structures that our parser generates:

Unsurprisingly, the sentence is fragmented quite a bit, but the parser clearly managed to extract useful structure throughout the sentence. The last large verb phrase “should help South Africa and should help the Iraq and the Asian countries so we will be able to build up our future” seems very close to correct, which is pretty impressive (Language Log has more discussion on the weird “the Iraq” construction). The output of the semantics system is too long to put here, but in some ways, it’s amazing that we were able to extract any semantics at all. And, believe it or not, we can actually run some queries against the Carolina Index (as it’s known at Powerset). It’s hard to think of a reasonable question, but we asked, "Who does education help?" and the system returned and highlighted the right answer: "Americans". Or should we have returned "U.S. Americans"?

English

The Tyranny of the Common Name

July 25th, 2007

An acquaintance of mine asked me to email him something, so I asked him for his email address. He gave me his business card. It had only his first and last name printed on it, nothing else.

I said, “Uh…wait a sec…so what’s your email address?”

He said, “Just put my name into any reasonable search engine and my homepage will pop right up.”

I was immediately a little bit jealous – I’d always wanted to be able to use my name alone as a ‘personal URI’, but either I’m not famous enough or my name isn’t unique enough. Or some combination of the two.

If I put my first and last name into any *reasonable search engine*, I do not *pop right up*. In fact, the most famous person with my name is an English darts player, who became even more famous for the first televised nine-dart finish.

I have on occasion told people they could find me (by which of course, I meant ‘find my little self-created electronic shrine on the web’) by putting my *full name* – first, middle, last – into a reasonable search engine and then my homepage (more or less) pops right up. And of course not everyone who might want to find me knows those details or realizes that they form a useful search key for finding me on the web.

Now, there’s a big business in trying to stand out in the results from search engines and I’m not even going to go there in this essay. “Optimizing” (some would say “gaming”) search engine results is in many ways a recapitulation of the process of trying to be first (or sometimes last) in lexicographic sort order in the yellow pages: AAAAAAA Plumbers has a better chance of getting called by a desperate homeowner with a leaking toilet than the next name on the list. Ever since I was a child I’ve thought that people with names closer to the front of the alphabet have a bit of an advantage over those of us whose names put us in the back of the pack. But perhaps this advantage is cancelled out by those occasions when being first is the last thing one would want.

I think the little accidents (and sometimes choices) of personal and cultural history that automatically improve the visibility of certain individuals and bury others are a never-ending source of wonder. Conventions for personal names are really quite complicated, a topic of long and deep study called onomastics. If you go looking for this in Wikipedia you’ll be disappointed – it’s only a placeholder for this topic. It’s a hypernym of what you really want: anthroponymy (scans like *anthropology*).

The patronymic conventions which are frequently encountered in the modern descendants of Indo-European cultures are in fact rather recent (*Johnson* was at one time the son of someone named John, and *Berger* was something like a townsperson), but when considered from a more global perspective things get really interesting. It’s been estimated that there are 296 million Lis in the world (nearly the population of the US), this being the most common surname in China. (The notion *surname* is really not quite right here since in that part of the world the family name is usually cited first, but that’s what they’re called in my source. The same is true in Japan, Korea and elsewhere). In Burma, it was pretty common to name your kid after the day of the week he or she was born on, so there’s a dearth of first names there. In places where there are not a lot of people it’s not so hard to identify someone and so a single name does just fine; just don’t try to google someone from the Maldives and expect their email address to pop right up.

I really feel sorry for people who have names that are already in use as common nouns (or adjectives) – the Browns, Greens, Bushes, and Stones of the world – they’re even further down the search engine result list. Unless, again, you’re already famous, like Robin Hood. Oh wait, we now need to distinguish him from the rest of the robin hoods of the world out there giving to the poor after stealing from the rich. Oh, the irony of having one’s personal name become a household name! Or worse, a verb (e.g. to get *borked* or *meired*).

Perhaps someday conventions will arise to allow us commoners to distinguish ourselves from the hordes of others clamoring to use our names. I know someone who signs his emails ‘Jeff (meaning “an instance of Jeff” in Lisp). Or perhaps we could start using our email addresses or screen names – it would certainly help in disambiguating names in scientific papers. Anyone for an ISO standard for adding diacritics, indexicals, or skolems to names to distinguish one from another?

I’d love to hear from you if you are interested in this topic. Just put my full name, all three of them, into any reasonable search engine and you’ll find me.

-John (Brandon) Lowe, Senior Scientist

English

Conferences 2007, Part One

February 27th, 2007

I'm on the road again. On Thursday March 1st I'm flying to San Francisco and I'll be in the USA for the whole month.

While I'm there, I'll be hanging out at the usual places - I'm staying in the Mission so it'll be Ritual Roasters by default. I'll also be at the following events:

I hope I see you there.


Uncategorized

Coming in to land

October 8th, 2006

It's nearly time to return to London for a pause and a stretch. Since I quit my job at the BBC almost exactly a year ago, I've spent 4 months snowboarding, attended 6 conferences and spoken at 3 (LIFT06, ETech, SXSW, XTech, Railsconf and Foocamp), worked on at least 5 freelance contracts, lived in 3 different countries (France, Holland and the USA) and spent time in at least 5 others (Spain, Switzerland, Germany, Austria, Finland). I've travelled more than 40,000 miles by air, taken a flight every 2 weeks on average, and probably met more people in one year than in all the previous years of my life put together.

Although it's no substitute for simply avoiding wasteful air travel, after doing the calculations for this post I paid for a 15,000 lbs CO2 carbon offset from TerraPass.

My final stop on the current journey is the Near Field Interactions workshop at NordiCHI in Oslo. I'll be representing Thinglink along with Ulla-Maaria Mutanen.

On October 17th I'll be back in my own flat in Hackney, East London and considering my next steps. 2007 has a lot to live up to. Of course, the planning for XTech 2007 has already begun and I've just submitted my talk proposal for next year's ETech.


Uncategorized

Net Neutrality: This is serious

June 21st, 2006


When I invented the Web, I didn't have to ask anyone's permission. Now, hundreds of millions of people are using it freely. I am worried that that is going to end in the USA.

I blogged on net neutrality before, and so did a lot of other people. (see e.g. Danny Weitzner, SaveTheInternet.com, etc.) Since then, some telecommunications companies spent a lot of money on public relations and TV ads, and the US House seems to have wavered from the path of preserving net neutrality. There has been some misinformation spread about. So here are some clarifications.

Net neutrality is this:

If I pay to connect to the Net with a certain quality of service, and you pay to connect with that or greater quality of service, then we can communicate at that level.
That's all. It's up to the ISPs to make sure they interoperate so that that happens.

Net Neutrality is NOT asking for the internet for free.

Net Neutrality is NOT saying that one shouldn't pay more money for high quality of service. We always have, and we always will.

There have been suggestions that we don't need legislation because we haven't had it. These are nonsense, because in fact we have had net neutrality in the past -- it is only recently that real explicit threats have occurred.

Control of information is hugely powerful. In the US, the threat is that companies control what I can access for commercial reasons. (In China, control is by the government for political reasons.) There is a very strong short-term incentive for a company to grab control of TV distribution over the Internet even though it is against the long-term interests of the industry.

Yes, regulation to keep the Internet open is regulation. And mostly, the Internet thrives on lack of regulation. But some basic values have to be preserved. For example, the market system depends on the rule that you can't photocopy money. Democracy depends on freedom of speech. Freedom of connection, with any application, to any party, is the fundamental social basis of the Internet, and, now, the society based on it.

Let's see whether the United States is capable of acting according to its important values, or whether it is, as so many people are saying, run by the misguided short-term interests of large corporations.

I hope that Congress can protect net neutrality, so I can continue to innovate in the internet space. I want to see the explosion of innovations happening out there on the Web, so diverse and so exciting, continue unabated.

Uncategorized

Neutrality of the Net

May 2nd, 2006

Net Neutrality is an international issue. In some countries it is addressed better than others. (In France, for example, I understand that the layers are separated, and my colleague in Paris attributes getting 24Mb/s net, a phone with free international dialing and digital TV for 30 euros/month to the resulting competition.) In the US, there have been threats to the concept, and a wide discussion about what to do. That is why, though I have written and spoken on this many times, I blog about it now.

Twenty-seven years ago, the inventors of the Internet[1] designed an architecture[2] which was simple and general. Any computer could send a packet to any other computer. The network did not look inside packets. It is the cleanness of that design, and the strict independence of the layers, which allowed the Internet to grow and be useful. It allowed the hardware and transmission technology supporting the Internet to evolve through a thousandfold increase in speed, yet still run the same applications. It allowed new Internet applications to be introduced and to evolve independently.

When, seventeen years ago, I designed the Web, I did not have to ask anyone's permission[3]. The new application rolled out over the existing Internet without modifying it. I tried then, and many people still work very hard, to make the Web technology, in turn, a universal, neutral, platform. It must not discriminate against particular hardware, software, underlying network, language, culture, disability, or against particular types of data.

Anyone can build a new application on the Web, without asking me, or Vint Cerf, or their ISP, or their cable company, or their operating system provider, or their government, or their hardware vendor.

It is of the utmost importance that, if I connect to the Internet, and you connect to the Internet, we can then run any Internet application we want, without discrimination as to who we are or what we are doing. We pay for connection to the Net as though it were a cloud which magically delivers our packets. We may pay for a higher or a lower quality of service. We may pay for a service which has the characteristics of being good for video, or quality audio. We each pay to connect to the Net, but no one can pay for exclusive access to me.

When I was a child, I was impressed by the fact that the installation fee for a telephone was everywhere the same in the UK, whether you lived in a city or on a mountain, just as the same stamp would get a letter to either place.

To actually design legislation which allows creative interconnections between different service providers, but ensures neutrality of the Net as a whole may be a difficult task. It is a very important one. The US should do it now, and, if it turns out to be the only way, be as draconian as to require financial isolation between IP providers and businesses in other layers.

The Internet is increasingly becoming the dominant medium binding us. The neutral communications medium is essential to our society. It is the basis of a fair competitive market economy. It is the basis of democracy, by which a community should decide what to do. It is the basis of science, by which humankind should decide what is true.

Let us protect the neutrality of the net.


  1. Vint Cerf, Bob Kahn and colleagues
  2. TCP and IP
  3. I did have to ask for port 80 for HTTP

Uncategorized