August 30, 2007

The World’ s Largest Matrix Computation

The World's Largest Matrix Computation

One of the reasons why Google is such an effective search engine is the PageRank™ algorithm, developed by Google's founders, Larry Page and Sergey Brin, when they were graduate students at Stanford University. PageRank is determined entirely by the link structure of the Web. It is recomputed about once a month and does not involve any of the actual content of Web pages or of any individual query. Then, for any particular query, Google finds the pages on the Web that match that query and lists those pages in the order of their PageRank.

Imagine surfing the Web, going from page to page by randomly choosing an outgoing link from one page to get to the next. This can lead to dead ends at pages with no outgoing links, or cycles around cliques of interconnected pages. So, a certain fraction of the time, simply choose a random page from anywhere on the Web. This theoretical random walk of the Web is a Markov chain or Markov process. The limiting probability that a dedicated random surfer visits any
particular page is its PageRank. A page has high rank if it has links to and from other pages with high rank.

Let W be the set of Web pages that can reached by following a chain of hyperlinks starting from a page at Google and let n be the number of pages in W. The set W actually varies with time, but in May 2002, n was about 2.7 billion. Let G be the n-by-n connectivity matrix of W, that is,
gi,j is 1 if there is a hyperlink from page i to page j and 0 otherwise. The matrix G is huge, but very sparse; its number of nonzeros is the total number of hyperlinks in the pages in W.

Let cj and ri be the column and row sums of G.

cj = i gi,j, ri= j gi,j

The quantities ck and rk are the indegree and outdegree of the k-th page. Let p be the fraction of time that the random walk follows a link. Google usually takes p = 0.85. Then 1-p is the fraction of time that an arbitrary page is chosen. Let A be the n-by-n matrix whose elements
are

ai,j = p gi,j / cj + , where = (1-p) / n.

The matrix A is not sparse, but it is a rank one modification of a sparse matrix. Most of the elements of A are equal to the small constant . When n = 2.7·109, = 5.5·10-11.

The matrix is the transition probability matrix of the Markov chain. Its elements are all strictly between zero and one and its column sums are all equal to one. An important result in matrix theory, the Perron-Frobenius Theorem, applies to such matrices. It tells us that the largest eigenvalue of A is equal to one and that the corresponding eigenvector, which satisfies the
equation

x = Ax,

exists and is unique to within a scaling factor. When this scaling factor is chosen so that

ixi = 1

then x is the state vector of the Markov chain. The elements of x are Google's PageRank.
If the matrix were small enough to fit in MATLAB, one way to compute the eigenvector x would be to start with a good approximate solution, such as the PageRanks from the previous month, and simply repeat the assignment statement

x = Ax

until successive vectors agree to within specified tolerance. This is known as the power method and is about the only possible approach for very large n. I'm not sure how Google actually computes PageRank, but one step of the power method would require one pass over a database of Web pages, updating weighted reference counts generated by the hyperlinks between pages.

No comments:

Post a Comment