Archive for October, 2007

The Next Big Thing: User-Contributed Metadata

October 31st, 2007
Dan Farber has an interesting piece today about how user-contributed metadata will revolutionize online advertising. He mentions Facebook, Metaweb and Twine as examples. I agree, of course, with...

English

http://www.dreig.eu/caparazon/

October 31st, 2007
Blog about technology, the internet, and the information society

delicious

More News Coming out of Burma

October 26th, 2007
Only now is the full horror of the situation in Burma emerging.

English

Radar Networks Coming Out of Stealth - Friday, October 19

October 26th, 2007
News Flash! My company, Radar Networks, is coming out of stealth this Friday, October 19, 2007 at the Web 2.0 Summit, in San Francisco. I'll be speaking on "The Semantic Edge Panel" at 4:10 PM, and...

English

Understanding The Semantic Web: A Response to Tim O’Reilly’s Recent Defense of Web 2.0

October 26th, 2007
Tim O'Reilly recently blogged another article about Web 2.0 Versus Web 3.0 in which he responded to some of my points about what Web 3.0 is and is not. There are several points in his post that I...

English

First Release of HBase in Hadoop

October 16th, 2007

There has been a lot of excitement in the past few weeks, with IBM partnering with Google to create a distributed systems lab for college students based on Hadoop, and Kosmix releasing the open source Kosmos Distributed File System with Hadoop compatibility. We thought it would be a good time to talk about an area of the Hadoop project that we’ve spent a lot of time on over the past year and that we want to invite you to use and contribute to: HBase.

There is a lot to the "secret sauce" that is Powerset Web search: the XLE (licensed from PARC), our ranking algorithms, and the ever-important onomasticon. These components are proprietary.

For any other component, we try to use open source software if it is available. One of the unsung heroes that forms the foundation for all of these components is the ability to process insane amounts of data. This is especially true for a Natural Language search engine. A typical keyword search engine will gather hundreds of terabytes of raw data to index the Web. Then, that raw data is analyzed to create a similar amount of secondary data, which is used to rank search results. Since our technology creates a massive amount of secondary data through its deep language analysis, Powerset will be generating far more data than a typical search engine, eventually ranging up to petabytes of data.

Google uses a number of well-known components to fulfill their data processing needs: a distributed file system (GFS), Map/Reduce, and Bigtable.

Instead of creating a proprietary copy of these pieces of infrastructure, Powerset decided to turn to Hadoop, a Lucene sub-project that is a framework for running data-intensive applications on large clusters of commodity hardware. Powerset has already benefited greatly from the use of Hadoop: our index build process is entirely based on a Hadoop cluster running the Hadoop Distributed File System (HDFS) and makes use of Hadoop’s map/reduce features.
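
To give a flavor of what a map/reduce index build looks like, here is a minimal, single-process sketch in Python. It is not our build pipeline and it does not use the Hadoop API; the function names (`map_phase`, `reduce_phase`, `run_job`) and the two sample documents are made up for illustration. It only shows the general shape of the technique: map each document to (term, document) pairs, group the pairs by term, and reduce each group to a posting list.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_phase(term, doc_ids):
    """Collapse every doc id seen for a term into a sorted posting list."""
    return term, sorted(set(doc_ids))


def run_job(documents):
    """Run map, shuffle, and reduce in one process (hypothetical driver)."""
    # Map: every document is processed independently of the others.
    pairs = [kv for doc_id, text in documents.items()
             for kv in map_phase(doc_id, text)]
    # Shuffle: sort by term so identical terms sit next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce: one call per distinct term.
    return dict(
        reduce_phase(term, [doc_id for _, doc_id in group])
        for term, group in groupby(pairs, key=itemgetter(0))
    )


# Made-up sample input, not real crawl data.
docs = {"d1": "hadoop runs map reduce", "d2": "hbase runs on hadoop"}
index = run_job(docs)
```

On a real cluster the map and reduce calls would run on many machines and the shuffle would move data between them; the per-document and per-term independence is what makes that possible.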

Unfortunately, there was no Hadoop equivalent to Google’s Bigtable storage engine. Because we have benefited greatly from the available Hadoop technology, Powerset decided to give back to the community by developing an open source analog to Bigtable that is built on top of HDFS. After all, we needed to develop it anyway, it isn’t part of the Powerset "secret sauce", and we, in turn, could benefit from contributions from other members of the community.
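
For readers who have not seen Bigtable, the data model described in Google's published paper is a sparse, sorted map keyed by row, column, and timestamp. The toy class below sketches that model in memory, assuming a Bigtable-like engine follows the same shape. `TinyTable` and its methods are hypothetical names and not the HBase API, and nothing here is distributed or persistent.

```python
import time


class TinyTable:
    """Toy model: row key -> column key -> [(timestamp, value)], newest first."""

    def __init__(self):
        self._rows = {}

    def put(self, row, column, value, timestamp=None):
        """Store a new version of a cell; older versions are kept."""
        ts = time.time() if timestamp is None else timestamp
        versions = self._rows.setdefault(row, {}).setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp`, or None."""
        for ts, value in self._rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row="", stop_row=None):
        """Yield (row, latest cells) in sorted row order for [start_row, stop_row)."""
        for row in sorted(self._rows):
            if row >= start_row and (stop_row is None or row < stop_row):
                yield row, {c: v[0][1] for c, v in self._rows[row].items()}


# Made-up rows and columns, for illustration only.
table = TinyTable()
table.put("com.example/index.html", "contents:html", "<html>v1</html>", timestamp=1)
table.put("com.example/index.html", "contents:html", "<html>v2</html>", timestamp=2)
table.put("org.example/about.html", "anchor:home", "About", timestamp=1)

latest = table.get("com.example/index.html", "contents:html")
older = table.get("com.example/index.html", "contents:html", timestamp=1)
com_rows = [row for row, _ in table.scan(stop_row="d")]
```

Keeping rows sorted by key is what makes range scans cheap, and keeping several timestamped versions per cell lets a reader ask for the state of a cell as of an earlier time.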

For the past 10 months, Michael Stack (known to his friends as St.Ack) and I have been working full time on developing the open source Bigtable-like storage engine that we call HBase, a sub-project of Hadoop. Because of the progress that we and the other HBase contributors have made in the past couple of months, we’re happy to announce that the first HBase release will be available with the Hadoop 0.15.0 release. As with any project, there’s always more to do, but that’s exactly why we invite you to join the community. If you’re an interested software developer or a company that needs a platform for large-scale computing, and in particular if you are looking for a distributed storage system for massive amounts of structured data, try HBase. We’re happy to help you get started: all of our contact information is on the HBase site.

We hope that you’ll both use HBase and contribute! – Jim Kellerman, Senior Engineer, Powerset

English

Learning 3.0

October 16th, 2007