Tuesday, August 07, 2007

The Linear Algebra behind Google

One of the practical applications of linear algebra is the success of Google. If a math teacher gives out this response to students, that would certainly produce many Ahh Huuu..., but truth indeed is there in the statement. When it comes to Google and its rocketing success story, some may wonder whether mathematics has so much of practical applications in it! Kurt Bryan and Tanya Leise aptly mentioned [1] that this eigenvector is worth more than $5,000,000,000! Aha, isn't this the richest eigenvector that you ever come across?

One of the core technique behind Google's multi billion dollar success story is the page rank algorithm, developed by its co founders Larry page and Sergey Brin, while they were in Stanford[3]. Let us put the statement mathematically or rather linear algebraic: It is essentially ranking web pages according to an eigenvector of a weighted link matrix. So, Google search has its thrust based on solving this eigenvector computing! Computing eignevalues and eigenvector, are sole linear algebra problems. The deal is quite big though. Let us talk a little bit deep about this problem.

Google's website [2] has only modest thing to say about this fantastic algorithm:

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at considerably more than the sheer volume of votes, or links a page receives; for example, it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important." Using these and other factors, Google provides its views on pages' relative importance.

Of course, important pages mean nothing to you if they don't match your query. So, Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines dozens of aspects of the page's content (and the content of the pages linking to it) to determine if it's a good match for your query.

  • A basic listing of the pagerank is available here at howstuffworks.com. Here is the summary extracted from there.
  • PageRank assigns a rank or score to every search result. The higher the page's score, the further up the search results list it will appear.
  • Scores are partially determined by the number of other Web pages that link to the target page. Each link is counted as a vote for the target. The logic behind this is that pages with high quality content will be linked to more often than mediocre pages.
  • Not all votes are equal. Votes from a high-ranking Web page count more than votes from low-ranking sites. You can't really boost one Web page's rank by making a bunch of empty Web sites linking back to the target page.
  • The more links a Web page sends out, the more diluted its voting power becomes. In other words, if a high-ranking page links to hundreds of other pages, each individual vote won't count as much as it would if the page only linked to a few sites.
  • Other factors that might affect scoring include the how long the site has been around, the strength of the domain name, how and where the keywords appear on the site and the age of the links going to and from the site. Google tends to place more value on sites that have been around for a while.
  • Some people claim that Google uses a group of human testers to evaluate search returns, manually sorting through results to hand pick the best links. Google denies this and says that while it does employ a network of people to test updated search formulas, it doesn't rely on human beings to sort and rank search results.


[1]
Kurt Bryan, Tanya Leise, The $25,000,000,000 eigenvector. The linear algebra behind Google. SIAM Review, 48 (3), 569-81. 2006
[2]http://www.google.com/technology/
[3]http://patft.uspto.gov/netacgi/nph-Parser?patentnumber=6,285,999

0 Comments:

Post a Comment

<< Home