Monday, May 21, 2007
rkgblog writes about Jimmy Wales critiquing Google's lack of transparency.
I've just been briefly reviewing the history of Google and competitors in capsule form, as I write the new edition of the book. Kept me up late thinking about it, to be honest.
There is something so compelling about an open source search engine: maybe search can actually get better if it goes in that direction - tapping into distributed developer expertise. In non-public or low-scale settings, search engines like Nutch and its cousin Lucene SOLR have so much promise. And why not? It becomes "our" search engine that allows "us" to customize, while not being beholden to a particular overlord.
Some of that vibe, though, was what led to the Open Directory Project many years ago -- and what happened?
On balance, it looks to me that Nutch et al. (open machine algorithm) and Wiki-something are two very different approaches to the problem. Open source search in the traditional sense is open to a community of developers, and freely licensable. Wikified search is bound to be open in that looser, sometimes chaotically obscure or corrupt way somewhat analogous to the (problems and opportunities of) old ODP. Importantly, the Wiki concept still relies too much on people to produce content. This will not necessarily scale. It's useful for some things, hopeless for others. Another problem is that Wikipedia users won't necessarily be better at the production side than users distributed across many involved online communities. They might be worse.
This is a draft of some thoughts that might go into a book (below). A few older bits still need cleaning up. What are your thoughts?
Beginning life in 1998 as GnuHoo and then NewHoo, the Open Directory Project (ODP) was conceived as a competitor to the Yahoo Directory. The work was to be done by volunteer editors, and the end product was to be licensed to any portal or site that wanted to take advantage of the information. Doesn’t sound like much of a business? Well, it turned out to be a pretty good deal for the founders. The directory’s popularity led to its acquisition by Netscape, which was later acquired by AOL.
AOL became the Open Directory’s major distributor, but the directory was also licensed (at no charge to the publisher) in many other places around the Web. Google began using ODP data fairly early on, calling it the Google Directory. An innovative feature was Google’s use of an “overlay” technique: the sites in a given ODP category were listed in order of their Google PageRank scores. Each score was illustrated with a green bar (on a scale of 0 to 10, similar to the way the info is displayed to searchers using the Google Toolbar). This could have been a very useful feature indeed had there been more consistency to the underlying content in the directory. The so-called Google Directory still exists, but it has been completely de-emphasized in the Google Search user interface.
A couple of key Open Directory players, founder Rich Skrenta and marketing exec Chris Tolles, eventually moved on to a new venture: Topix, a sophisticated news search engine that competes directly with Google News. Topix is now 75% owned by three major media companies: Gannett, Knight-Ridder, and Tribune.
The ODP came under criticism for many of the same reasons Wikipedia is maligned in some quarters today: a lack of “professional” editorial quality control. The lack of transparency of site submission procedures to the website-owning public, and the huge variations in the degrees of disclosure of editors’ biographical information meant, for me, that this so-called open directory was far from it.[i]
The construction of a comprehensive high-quality human-edited directory remains an elusive (and perhaps now irrelevant) task. The ODP founders were correct in their assumption that a distributed model for vetting editorial recommendations was the only possible way to get a comprehensive categorized directory to scale with the growth of the Web. But they also oversold the value of human contributions insofar as even tens of thousands of these couldn’t scale adequately to cover the enormous explosion of online information – not as compared with improved search algorithms and search interfaces, to say nothing of the massive acceptance of the concept of online collaboration and a wider range of tools to support this. In the past I had come across a couple of alternatives to ODP; two notable ones were put forward by Steve Thomas (Wherewithal, Inc.) and Dave Winer (RSS pioneer). Both would rectify the problem of a fixed category structure being controlled solely by the category owner. They’d allow for collaborative taxonomy, so to speak. Ho, hum. Many of these seemingly radical critiques of ODP have become staples of today’s so-called Web 2.0 movement. Thus many of these early debates have been surpassed by growing acceptance of the need to develop technologies that subtly handle the “upstream” of self-organizing editorial output from many users, rather than imposing a top-down (if seemingly democratic) categorization scheme.
Contributing to the organization and sharing of information has to seem fun or worthwhile, and much of the ODP community moved on to other passions. A spinoff site called ChefMoz – also a good idea – found little appeal among the broader public and proved that ersatz claims of “officialdom” for an open-source, .org-based human review site were grandiose; sites (mere websites!) such as Chowhound and Yelp now achieve considerably greater popularity pursuing virtually the same goals. The emergence of a range of ODP offspring proved that it was never really the open directory. It was a human-powered directory that chose to call itself “open.” (A similar realization will no doubt dawn on users of and contributors to Wikipedia, too. It won’t prove to be the be-all and end-all.)
“Humans do it better,” the ODP slogan, was proved wrong in the sense that algorithmic approaches such as Google Search won mass acceptance from users over and above hand-categorized, ostensibly quality-controlled directories. That said, new methodologies of tapping the so-called “wisdom of crowds” (such as Digg) have meant that the machine algorithmic approach isn’t the only winner in the marketplace for ranking and rating online content. And certainly, algorithms can’t create content as tens of thousands of Wikipedia participants have managed to do in their improbable construction of a huge online resource.
In the meantime, then, a whole world of user-built information sources has exploded on the scene, with Wikipedia and Digg leading the pack. Many of the pathologies (and opportunities) that bedevil (and excite) Wikipedia and Digg users today were endemic to ODP. In hindsight, this makes Skrenta and Tolles pioneers to a greater extent than perhaps they realized.
The Google Difference: A Third-Generation Algorithm
If Google hadn’t moved to fill the void left by its struggling predecessors, someone else would have. Scientists in various research projects were working on new ideas about how to rank the importance of web pages vis-à-vis a given user query. What Google did was to popularize some of the best emerging ideas about how to design a large-scale search engine at a time when others were losing momentum. Some of these ideas are so central to the task of ranking pages in today’s web environment that they were adopted in some form or another by all of Google’s main competitors (including Inktomi, AltaVista, and FAST).
The working paper that explains Google’s PageRank methodology, “Anatomy of a Large-Scale Hypertextual Web Search Engine,” is frequently cited.[ii] But the field of information retrieval technology is rich with ongoing experimentation by hundreds of well-funded scientists, some well known, some not. Some scientists take a slightly different approach to the problem tackled by Page and Brin, organizing the Web into topic-based “communities.” Teoma, a search engine acquired by Ask Jeeves (now Ask.com), is the most public example of this approach.[iii] The two approaches tend to provide somewhat different results, but they are clearly cousins of a similar generation of thinking about the “hyperlinked environment,” and both have been a boon to researchers seeking that elusive piece of information online.
In practice, algorithms such as Google’s and Ask’s today are really meta-algorithms, looking for “signals” on a wide and shifting spectrum of measures of quality and relevancy, while attempting to filter out or devalue a huge volume of junk and spam results. Today’s search engines might be clever enough to measure website usage patterns, background business data, and more. (One potential signal, the age of a website, is now seen as so matter-of-fact that search marketers have a nickname for the apparent difficulty in getting well-indexed in Google if you’re a new website owner: “The Google Sandbox.”) In addition to all that, there are attempts to determine user intent in search queries, to serve up personalized results or even different types of results (news search, maps, financial charts, weather) based on the user’s history or the nature of the query. In today’s mature world of search, no one methodology is billed as “the” best way of arriving at the ultimate ranking of results on a given search query. But arguably, Google consolidated its lead in search based on the mythology that its PageRank system was an invention that led to brilliantly accurate search results.
In any case, the idea behind PageRank was brilliant and intuitive when it was brought to market in 1998. The governing principle revolves around a map of the linking structure of the Web. Pages that have a lot of other important pages pointing to them are deemed important. “PageRank can be thought of as a model of user behavior,” wrote Brin and Page. “We assume there is a ‘random surfer’ who is given a web page at random and keeps clicking on links, never hitting ‘back’ but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank.”
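To make the “random surfer” idea concrete, here is a minimal sketch in Python of the repeated calculation it implies. It is an illustration only, not Google’s production code: the four-page toy web, the function name pagerank, and the choice to let a surfer on a page with no outbound links jump to any page are all assumptions of mine. The damping value of 0.85 is the figure Brin and Page mention in their paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate the chance that a random surfer is on each page.

    links maps a page name to the list of pages it links to.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with the surfer equally likely to be anywhere.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # The "bored" surfer restarts on a random page with probability 1 - damping.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Otherwise the surfer follows one of the page's links at random.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links sends the surfer anywhere.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: nothing links to "orphan".
toy_web = {
    "home": ["news", "shop"],
    "news": ["home"],
    "shop": ["home", "news"],
    "orphan": ["home"],
}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:8s} {score:.3f}")
```

In this toy web, “home” comes out on top because every other page links to it, while “orphan” ends up with only the small share every page gets from random restarts.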
This was a significant advance over previous generations of web search. Although most major engines had experimented with a variety of ranking criteria, many of them had depended heavily on basic keyword matching criteria. Not only did this make good information hard to find because so many pages were locked in a virtual tie for first place, it made it easier for optimizers to feed keyword-dense pages into the search engine in a bid to rank their commercially oriented pages higher. Although this game of keyword optimization is quite effective to this day in ranking pages well on unpopular queries (even on Google Search), it seems to work rather poorly on common queries.
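A toy example shows why pure keyword matching produces both the “virtual tie” and the opening for keyword stuffing. This is a deliberately naive sketch, not a reconstruction of any real engine’s scoring; the page names, page text, and the keyword_score function are all made up for illustration.

```python
def keyword_score(query, text):
    """Count how many times the query's words occur in the text."""
    words = text.lower().split()
    return sum(words.count(term) for term in query.lower().split())


# Hypothetical pages: two legitimate ones and one stuffed with keywords.
pages = {
    "insurer-guide": "auto insurance comparison tips from major carriers",
    "consumer-report": "our auto insurance comparison covers price and service",
    "stuffed-page": "auto insurance auto insurance comparison auto insurance comparison cheap",
}

query = "auto insurance comparison"
scores = {name: keyword_score(query, text) for name, text in pages.items()}

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:16s} {score}")
```

The two legitimate pages tie, and the stuffed page beats them both, which is exactly the weakness a link-based signal was meant to address.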
The ascendance of PageRank means that on a Google Search for auto insurance comparison, for example, a well-known site is likely to rank well, rather than some random site that just happens to contain those keywords. When I tried the query, I saw a number of leading insurance comparison sites, and very little “junk.” This dovetails with the notion that authoritative recommendations do indeed confer authority as far as Google’s algorithm is concerned. But it won’t take you long to find a few head-scratchers in the mix. It’s difficult to get a monolithic sense of which types of pages rank well. But few would dispute the fact that a high volume of quality links pointing to one’s site is a great way of getting Google Search to treat you well. PageRank isn’t dead; it’s just part of a bigger mix of factors than ever before.
The ability to break all these “virtual ties” among similar search results was a breakthrough for search engines. Almost all major search technologies today are significantly more sophisticated than those from the mid-1990s. I recall a time when many websites used a free licensed version of Excite Search for their internal site search. The technology was weak, often providing a clutter of irrelevant results. If search was this bad in closed corporate environments, it was definitely in need of improvement if it was to help users sort through the enormous clutter of pages available on the Web. For searching relatively fixed data sets, such as finding pages within a single website, today’s technology is significantly improved over yesteryear’s. The open source movement has even brought us libraries of sophisticated search engine code (such as Lucene SOLR), meaning that a powerful small-scale search engine can be customized at a reasonable cost.
A public web crawler in the same family, Nutch, has gained notice as well. A free, open-source web search technology in 2007 is nearly as sophisticated as the industry-leading search engines of a decade ago, which were valued in the hundreds of millions of dollars, but it’s still far from beating Google at its own game. Why? Nutch – like many other search technologies – doesn’t scale as well. In the understatement of the search engine century to date, the Nutch founders write: “Much of the challenge in designing a search engine is making it scale. Writing a Web crawler that can download a handful of pages is straightforward, but writing one that can regularly download the Web's nearly 5 billion pages is much harder.”[iv]
It doesn’t stop there. Taking those billions of pages, now you’ll have to assess them all and determine how much authority each link on each page should be allowed to “pass on” to other websites and pages. Because some site owners will be up to no good (premeditated linking schemes), or simply because fortunes change, the map of how much authority (or, what type of authority) is conferred by all hyperlinks on record is going to need to be updated regularly. A web search engine must also be able to sort out “duplicate” (often stolen or “scraped”) content from the original content, so it doesn’t end up giving visibility to the wrong source. The calculation of link structures and associated authority weights alone – let alone getting the underlying approach to how to do the calculation right – is beyond the capacity of any small-scale search engine infrastructure.
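Spotting “scraped” copies is one piece of this that can be sketched in a few lines. The example below uses a common textbook technique, comparing overlapping three-word “shingles” with Jaccard similarity; I am not claiming it is what Google or Nutch actually does, and the sample texts and function names are invented. It only flags near-duplicates. Deciding which copy is the original would need further evidence, such as which version was crawled first.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Share of shingles the two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical page texts.
original = "the quick brown fox jumps over the lazy dog near the river bank"
scraped = "the quick brown fox jumps over the lazy dog near the river bank today"
unrelated = "search engines rank pages by links and many other signals"

sim_scraped = jaccard(shingles(original), shingles(scraped))
sim_unrelated = jaccard(shingles(original), shingles(unrelated))

print(f"original vs scraped:   {sim_scraped:.2f}")
print(f"original vs unrelated: {sim_unrelated:.2f}")
```

Comparing every page against every other page this way is easy for a handful of documents and very expensive for billions, which is the scaling problem in miniature.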
Beyond massive computing power and indexing technology, then, Google’s advantage continues to rely in part on the ability of PageRank and other related technologies to sort out valuable information from information that “dumbly matches” the user’s query. Want proof? Do a search on your favorite topic at Technorati.com, the blog search engine. It’s powered by Nutch. I’m betting you’ll find quite a number of “spammy” results in the mix, in spite of some recent tinkering with a weak cousin to PageRank, an “authority score.” What’s surprising is that Google’s own Blog Search also appears easier to flood with duplicate content and spammy sites than its main search index.
To be clear, the calculations involved in determining PageRank are just the beginning when it comes to determining how high a page ranks for a given user’s query on Google....
[i]. “Why the Open Directory Isn’t Open,” Traffick.com, March 30, 2000.
[ii]. Sergey Brin and Lawrence Page, “Anatomy of a Large-Scale Hypertextual Web Search Engine,” Stanford University Department of Computer Science, 1998. Jon Kleinberg, widely considered to be the leading contributor to this generation of search technology, has published many important papers on search, including “Authoritative Sources in a Hyperlinked Environment,” 1998.
[iii]. For a user-friendly overview, see Mike Grehan’s interview with Paul Gardi, “Inside the Teoma Algorithm,” July 2003, archived at e-marketing-news.co.uk.
[iv]. Mike Cafarella and Doug Cutting, ACM Queue 2:2 (April 2004).
Labels: google, jimmy wales, open source, wikipedia