codingprolab@gmail.com

- Home
- Numerical Computing
- Numerical Computing Project 1 – PageRank Algorithm

$30.00

Order Now
Category: Numerical Computing

Description

5/5 - (12 votes)

The purpose of this project1

is to learn the importance of numerical linear algebra algorithms to solve fundamental

linear algebra problems that occur in search engines.

Page-Rank Algorithm – Linear Systems

One of the reasons why GoogleTM is such an effective search engine is the Page-RankTM algorithm developed by

Google’s founders, Larry Page and Sergey Brin, when they were graduate students at Stanford University. PageRank

is determined entirely by the link structure of the World Wide Web. It is recomputed about once a month and does

not involve the actual content of any Web pages or individual queries. Then, for any particular query, Google finds

the pages on the Web that match that query and lists those pages in the order of their PageRank. Imagine surfing the

Web, going from page to page by randomly choosing an outgoing link from one page to get to the next. This can lead

to dead ends at pages with no outgoing links, or cycles around cliques of interconnected pages. So, a certain fraction

of the time, simply choose a random page from the Web. This theoretical random walk is known as a Markov chain

or Markov process. The limiting probability that an infinitely dedicated random surfer visits any particular page is its

PageRank. A page has high rank if other pages with high rank link to it.

Let W be the set of Web pages that can be reached by following a chain of hyperlinks starting at some root page

and let n be the number of pages in W. For Google, the set W highly fluctuates over time and is not exactly known,

but was estimated to be about 30 billion in 20162

. Let G be the n-by-n connectivity matrix of a portion of the Web,

that is gij = 1 if there is a hyperlink to page i from page j and zero otherwise. The matrix G can be huge, but it is

very sparse. Its j-th column shows the links on the j-th page. The number of nonzeros in G is the total number of

hyperlinks in W. Let ri and cj be the row and column sums of G:

ri =

X

j

gij and cj =

X

i

gij . (1)

The quantities rj and cj are the in-degree and out-degree of the j-th page. Let p be the probability that the random

walk follows a link. A typical value is p = 0.85. Then 1 − p is the probability that some arbitrary page is chosen and

δ = (1 − p)/n is the probability that a particular random page is chosen. Let A be a n-by-n matrix with elements

aij =

p gij/cj + δ cj 6= 0,

1/n cj = 0.

(2)

Notice that A comes from scaling the connectivity matrix by its column sums. The j-th column is the probability of

jumping from the j-th page to the other pages on the Web. If the j-th page is a dead end – i.e., it has no out-links –

then we assign a uniform probability of 1/n to all the elements in its column. Most of the elements of A are equal

to δ, the probability of jumping from one page to another without following a link. If n = 4 · 109

and p = 0.85,

then δ = 3.75 · 10−11. Matrix A is the transition probability matrix of the Markov chain. Its elements are all strictly

between zero and one and its column sums are all equal to one. An important result from matrix theory, known as the

Perron-Frobenius theorem, applies to such matrices: a nonzero solution of the equation

x = Ax (3)

1This project is originally based on a SIAM book chapter from Numerical Computing with Matlab by Clever B. Moler.

2

see ”Estimating search engine index size variability” by van den Bosch et al., Scientometrics,

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4833824/

1

Numerical Computing, Fall Semester 2022

Lecturer: Prof. O. Schenk

Assistants: E. Vecchi, L. Gaedke-Merzhauser ¨

exists and is unique to within a scaling factor. If this scaling factor is chosen so that

X

i

xi = 1, (4)

then x is the state vector of the Markov chain and corresponds to Google’s PageRank. The elements of x are all

positive and less than one. Vector x is the solution to the singular, homogeneous linear system

(I − A)x = 0 (5)

For modest n, an easy way to compute x in Julia is to start with some approximate solution, such as the PageRanks

from the previous month, or

x = ones(n, 1) / n

Then simply repeat the assignment statement

x = A * x

until the difference between successive vectors is within a specified tolerance. This is known as the power method

and is about the only possible approach for very large n. In practice, the matrices G and A are never actually formed.

One step of the power method would be done by one pass over a database of Web pages, updating weighted reference

counts generated by the hyperlinks between pages. The best way to compute PageRank in Julia is to take advantage

of the particular structure of the Markov matrix. Here is an approach that preserves the sparsity of G. The transition

matrix can be written

A = pGD + ez|

, (6)

where D is the diagonal matrix formed from the reciprocals of the out-degrees,

djj =

1/cj cj 6= 0

0 cj = 0.

(7)

e is the n-vector of all ones, and z is the vector with components

zj =

δ cj 6= 0

1/n cj = 0.

(8)

The rank-one matrix ez|

accounts for the random choices of Web pages that do not follow links. The equation

x = Ax (9)

can be written as

(I − pGD)x = γe (10)

where

γ = z

|x. (11)

We do not know the value of γ because it depends on the unknown vector x, but we can temporarily take γ = 1. As

long as p is strictly less than one, the coefficient matrix I − pGD is nonsingular and the equation

(I − pGD)x = e (12)

can be solved for x. Then the resulting x can be rescaled so that

X

i

xi = 1. (13)

Notice that vector z is not involved in this calculation. The following Julia statements implement this approach.

2

Numerical Computing, Fall Semester 2022

Lecturer: Prof. O. Schenk

Assistants: E. Vecchi, L. Gaedke-Merzhauser ¨

c = vec(sum(G, dims=1));

k = findall(!iszero, c);

D = sparse(k, k, map(x -> 1/x, c[k]), n, n);

e = ones(n, 1);

sI = sparse(I, n, n);

x = (sI – p * G * D) \ e;

x = x/sum(x);

The power method [1, 2] can also be implemented in a way that does not actually form the Markov matrix and so

preserves sparsity. Compute

G = p * G * D;

z = ((1- p) * map(!iszero, c) + map(iszero, c)) / n;

z = reshape(z, (1,500));

Start with

x = e / n;

Then repeat the statement

x = G * x + e * (z * x);

until x settles down to several decimal places.

It is also possible to use an algorithm known as inverse iteration [1, 3] where one defines

x = e / n;

A = p * G * D + e * z;

and then the following statement is repeated until convergence given a suitable value α.

x = (alpha * I – A) \ x;

x = x / sum(x);

For details of when and why this method works, please refer to the references cited above. However, we would like

to briefly observe what happens when we take α = 1 in the expression (αI − A). Since the resulting matrix (I − A)

is theoretically singular, with exact computation some diagonal elements of the upper triangular factor of (I − A)

would, in principle, be zero and this computation should fail. But with roundoff error, the computed matrix (I −A) is

probably not exactly singular. Even if it is singular, roundoff during Gaussian elimination will most likely prevent any

exact zero diagonal elements. We know that Gaussian elimination with partial pivoting always produces a solution

with a small residual, relative to the computed solution, even if the matrix is badly conditioned. The vector obtained

with the backslash operation, (I − A)\x, usually has very large components. If it is rescaled by its sum, the residual

is scaled by the same factor and becomes very small. Consequently, the two vectors x and Ax equal each other to

within roundoff error. In this setting, solving the singular system with Gaussian elimination blows up, but it blows up

in exactly the right direction.

Figure 1 is the graph for a tiny example, with n = 6 instead of n = 4 · 109

. The pages on the Web are identified by

strings known as uniform resource locators, or URLs. Most URLs begin with https because they use the hypertext

transfer protocol. In Julia, we can store the URLs as an array of strings in a cell array. The example in Figure 1 can

be written as follows by using a 6-by-1 cell array:

U = [“https://www.alpha.com”

“https://www.beta.com”

“https://www.gamma.com”

“https://www.delta.com”

“https://www.rho.com”

“https://www.sigma.com”]

In order to access the string contained in a cell we can simply write U[k], with k = 1, . . . , 6 in this small web graph.

Thus, U[1] would, e.g., correspond to the string ”https://www.alpha.com”.

We can generate the connectivity matrix by specifying the pairs of indices (i, j) of the nonzero elements. Because

there is a link to beta.com from alpha.com, the (2, 1) element of G is nonzero. The nine connections are described by

3

Numerical Computing, Fall Semester 2022

Lecturer: Prof. O. Schenk

Assistants: E. Vecchi, L. Gaedke-Merzhauser ¨

Figure 1: A tiny web graph.

i = [ 2 6 3 4 4 5 6 1 1]

j = [ 1 1 2 2 3 3 3 4 6]

A sparse matrix is stored in a data structure that requires memory only for the nonzero elements and their indices.

This is hardly necessary for a 6-by-6 matrix with only 27 zero entries, but it becomes crucially important for larger

problems. The statements

n = 6;

G = sparse(vec(i), vec(j), vec(ones(Bool, 1, 9)));

Matrix(G)

generate the sparse representation of an n-by-n matrix with ones in the positions specified by the vectors i and j and

display its full representation.

0 0 0 1 0 1

1 0 0 0 0 0

0 1 0 0 0 0

0 1 1 0 0 0

0 0 1 0 0 0

1 0 1 0 0 0

The statement

c = sum(G, dims=1)

computes the column sums

2 2 3 1 0 1

Notice that c(5) = 0 because the 5th page, labeled rho, has no out-links. The statements

x = (sI – p * G * D) \ e;

x = x / sum(x);

solve the sparse linear system to produce

julia> x

0.3210

0.1705

0.1066

0.1368

0.0643

0.2007

4

Numerical Computing, Fall Semester 2022

Lecturer: Prof. O. Schenk

Assistants: E. Vecchi, L. Gaedke-Merzhauser ¨

Figure 2: Page Rank for the tiny web graph

The bar graph of x is shown in Figure 2. If the URLs are sorted in PageRank order and listed along with their in- and

out-degrees, the result is

index page-rank in out url

1 0.321017 2 2 https://www.alpha.com

6 0.200744 2 1 https://www.sigma.com

2 0.170543 1 2 https://www.beta.com

4 0.136793 2 1 https://www.delta.com

3 0.106592 1 3 https://www.gamma.com

5 0.0643118 1 0 https://www.rho.com

We see that alpha has a higher PageRank than delta or sigma, even though they all have the same number of

in-links. A random surfer will visit alpha over 32% of the time and rho only about 6% of the time.

For this tiny example with p = .85, the smallest element of the Markov transition matrix is δ = .15/6 = .0250.

A =

0.0250 0.0250 0.0250 0.8750 0.1667 0.8750

0.4500 0.0250 0.0250 0.0250 0.1667 0.0250

0.0250 0.4500 0.0250 0.0250 0.1667 0.0250

0.0250 0.4500 0.3083 0.0250 0.1667 0.0250

0.0250 0.0250 0.3083 0.0250 0.1667 0.0250

0.4500 0.0250 0.3083 0.0250 0.1667 0.0250

We can notice that all the columns of matrix A sum to one, thus agreeing with the definition of left stochastic matrix.

This mini-project includes the Julia file surfer.jl. A statement like

U, G = surfer(’https://www.xxx.zzz’,n)

starts at a specified URL and tries to surf the Web until it has visited n pages. If successful, it returns an n-by-1 cell

array of URLs and an n-by-n sparse connectivity matrix. Surfing the Web automatically is a dangerous undertaking

and this function must be used with care. Some URLs contain typographical errors and illegal characters. There is a

list of URLs to avoid that includes .gif files and Web sites known to cause difficulties. Most importantly, surfer can

get completely bogged down trying to read a page from a site that appears to be responding, but that never delivers the

complete page. When this happens, it may be necessary to have the computer’s operating system ruthlessly terminate

Julia. With these precautions in mind, you can use surfer to generate your own PageRank examples. The statement

U, G = surfer(’https://www.inf.ethz.ch’, 500);

5

Numerical Computing, Fall Semester 2022

Lecturer: Prof. O. Schenk

Assistants: E. Vecchi, L. Gaedke-Merzhauser ¨

accesses the home page of the Department of Computer Science at the Swiss Federal Institute of Technology in Zurich

(ETH) and generates a 500-by-500 test case. The file ETH500.mat included in the code folder was generated by

using this function in September 2014. In the following, it is indicated as the ETH500 data set. The statement

spy(G)

produces a spy plot (Figure 3) that shows the nonzero structure of the connectivity matrix. The statement

pagerank(U, G)

computes the page ranks, produces a bar graph (Figure 4) of the ranks, and prints the most highly ranked URLs in

PageRank order. The highly ranked pages are

julia> pagerank(U,G)

index page-rank in out url

57 0.145755 292 1 http://www.zope.org

466 0.049996 19 1 http://purl.org/dc/elements/1.1

39 0.028766 311 0 http://www.ethz.ch

58 0.021863 291 0 http://www.infrae.com

53 0.021202 288 0 http://www.cd.ethz.ch/services/web

101 0.016601 234 6 http://www.hk.ethz.ch/index_EN

72 0.016473 235 0 http://www.ethz.ch/index_EN

486 0.007374 14 2 http://www.handelszeitung.ch

463 0.006896 17 0 http://ns.adobe.com/xap/1.0

Figure 3: Spy plot of the Department of Computer Science (ETH Zurich) web graph.

Solve the following problems:

Preliminary: Read “A First Course on Numerical Methods”

Read Chapter 8 [1] carefully, in order to gain a better understanding of the topic summarized above.

1. Theory [20 points]

(a) What are an eigenvector, an eigenvalue and an eigenbasis?

(b) What assumptions should be made to guarantee convergence of the power method?

6

Numerical Computing, Fall Semester 2022

Lecturer: Prof. O. Schenk

Assistants: E. Vecchi, L. Gaedke-Merzhauser ¨

Figure 4: PageRank of the Department of Computer Science (ETH Zurich) web graph.

(c) What is the shift and invert approach?

(d) What is the difference in cost of a single iteration of the power method, compared to the inverse iteration?

(e) What is a Rayleigh quotient and how can it be used for eigenvalue computations?

2. Other webgraphs [10 points]

Use surfer.jl and pagerank.jl to compute PageRanks for some subset of the Web of your choice. Do you

see any interesting structure in the results (e.g., cliques, see next question)? Report on three PageRanks for different

webgraphs by showing the connectivity matrix, the PageRanks, and the ten most important entries in the graph.

3. Connectivity matrix and subcliques [5 points]

The connectivity matrix for the ETH500 data set (ETH500.mat) has various small, almost entirely nonzero, submatrices that produce dense patches near the diagonal of the spy plot. You can use the zoom button to find their indices.

The first submatrix has, e.g., indices around 80. Mathematically, a graph where all nodes are connected to each other

is known as a clique. Identify the organizations within the ETH community that are responsible for these near cliques.

4. Connectivity matrix and disjoint subgraphs [10 points]

Figure 5 reports the graph of a tiny six-node subset of the Web. Unlike the example reported in Figure 1, in this case

there are two disjoint subgraphs.

1. What is the connectivity matrix G? Which are its entries?

2. What are the PageRanks if the hyperlink transition probability p assumes the default value of 0.85?

3. Describe what happens with this example to both the definition of PageRank and the computation done by

pagerank in the limit p → 1.

5. PageRanks by solving a sparse linear system [40 points]

The function pagerank(U,G) computes PageRanks by solving a sparse linear system. It then plots a bar graph and

prints the dominant URLs.

7

Numerical Computing, Fall Semester 2022

Lecturer: Prof. O. Schenk

Assistants: E. Vecchi, L. Gaedke-Merzhauser ¨

Figure 5: Another tiny Web.

1. Create pagerank1.jl by modifying pagerank.jl to use the power method instead of solving the sparse

linear system. The key statements are

G = p * G * D;

z = ((1- p) * map(!iszero, c) + map(iszero, c)) / n;

z = reshape(z, (1,500));

while termination_test

x = G * x + e * (z * x);

end

What is an appropriate test for terminating the power iteration?

2. Create pagerank2.jl by modifying pagerank.jl to use the inverse iteration. The key statements are

while termination_test

x = (alpha * I – A) \ x;

x = x / sum(x);

end

Use your functions pagerank1.jl and pagerank2.jl (set α = 0.99) to compute the PageRanks of the

six-node example presented in Figure 1. Make sure you get the same result from each of your three functions.

3. We now want to analyse the impact of α on the inverse iteration. Using the ETH500 example, set α equal

to 0.8, 0.9, 0.95 and 1. Comment on the different number of iterations the four cases take until convergence.

Analyse your results and explain what you observe.

Hint: Check your solution x for all 4 cases. Are they always the same?

4. Use your functions pagerank1.jl and pagerank2.jl (set α = 0.99) to compute the PageRanks of the

three selected graphs from exercise 2. Report on the convergence of the two methods for these subgraphs and

summarize the advantages and disadvantages of the power method implemented in pagerank1.jl against

the inverse iteration in pagerank2.jl.

8

Numerical Computing, Fall Semester 2022

Lecturer: Prof. O. Schenk

Assistants: E. Vecchi, L. Gaedke-Merzhauser ¨

Quality of the code and of the report [15 Points]

The highest possible score of each project is 85 points and up to 15 additional points can be awarded based on the

quality of your report and code (maximum possible grade: 100 points). Your report should be a coherent document,

structured according to the template provided on iCorsi. If there are theoretical questions, provide a complete and

detailed answer. All figures must have a caption and must be of sufficient quality (include them either as .eps or .pdf).

If you made a particular choice in your implementation that might be out of the ordinary, clearly explain it in the

report. The code you submit must be self-contained and executable, and must include the set-up for all the results that

you obtained and listed in your report. It has to be readable and, if particularly complicated, well-commented.

Additional notes and submission details

Summarize your results and experiments for all exercises by writing an extended LaTeX report, by using the template provided on iCorsi (https://www.icorsi.ch/course/view.php?id=14666). Submit your gzipped

archive file (tar, zip, etc.) – named project number lastname firstname.zip/tgz – on iCorsi (see [NC

2022] Project 1 – Submission PageRank Algorithm) before the deadline. Submission by email or through any other

channel will not be considered. Late submissions will not be graded and will result in a score of 0 points for that

project. You are allowed to discuss all questions with anyone you like, but: (i) your submission must list anyone you

discussed problems with and (ii) you must write up your submission independently. Please remember that plagiarism

will result in a harsh penalization for all involved parties.

In-class assistance

If you experience difficulties in solving the problems above, please join us in class either on Tuesdays 16:30-18:15 in

room C1.03 or on Wednesdays 14:30-16:15 in room D1.15.

References

[1] The power method and variants, Chapter 8: Eigenvalues and Singular Values, SIAM Book “A First Course on

Numerical Methods”, C. Greif, U. Ascher, pp. 219-229.

[2] Power iteration, http://en.wikipedia.org/wiki/Power_iteration

[3] Inverse iteration, http://en.wikipedia.org/wiki/Inverse_iteration

9

WhatsApp us