Implementing Google’s Pagerank Algorithm

Written by masseybr | Published 2018/11/20
Tech Story Tags: seo | machine-learning | computer-science | data-science | pagerank-algorithm

TLDRvia the TL;DR App

As a student of programming and, to an extent, computer science, I frequently find new and exciting historical achievements to fascinate me, at least within the realm of computer science. This time, it was Machine Learning, but more than that, it was Google’s Pagerank algorithm. What’s more, it started with an interest in how to increase my internet visibility, then it turned to how does Google decide what gets shown and what stays hidden, finally it turned to “how do I implement this?”. These questions that I continuously ask myself lead to greater and more frequent discoveries.

Well, how does Google decide what gets shown and what stays hidden? In the simplest of terms, it is based on how many times your webpage links to a different webpage and what is the quality of information on the second webpage. In essence, the more your webpage links to other web pages, the more likely Google will promote your webpage.

How does Google determine this? When sifting through web pages, the chance of you clicking a link to go to another web page must add to 100%. Granted, in more advanced implementations of this algorithm, you must account for the chance someone will type in a web page to the URL. For now, at least, we will ignore this possibility for simplicity’s sake. In order to determine what website will send you where and with what frequency, the algorithm forms a matrix using the mathematics of Linear Algebra. The matrix is built by taking a count of each time your webpage links to another webpage. For example, take a look at the illustration below.

In the illustration,

A links to B, C, and D.

B links to A and C.

C links to A, D, and F.

D links to C

E links to D and B

F links to D, G, and C

G links to G.

This provides a matrix that looks like the one pictured below.

Image from https://matrix.reshish.com/

To implement the Pagerank algorithm, you take a vector, let's call it ‘r’. Vector r will represent the people currently on the internet looking at a particular topic. Now, repeatedly take the dot product with the matrix until the numbers stabilize. You do this to check the probability that the people on the internet will be going to each of the particular websites allotted. After so many repetitions, the numbers will converge. The resulting vector will have the probability that each of these people will be on any particular site.

For example, with the matrix above and the vector r containing 100 people random selecting one of the websites above, the result will converge at:

r1 = 0

r2 = 0

r3 = 0

r4 = 0

r5 = 0

r6 = 0

r7 = 100

Why does r7 = 100?

The longer these 100 people are on the internet, the more likely they are to get to webpage G. Webpage G only links to itself, so they will be unable to leave it once they arrive. When the convergence occurs, the only possible solution is that everyone will be on webpage G.

What if we choose a different set of web pages?

Image from https://matrix.reshish.com/

Removing the possibility of webpage G gives the result shown above. Everything else remains the same, except for web page F’s link to G and web page G no longer exists.

When we implement the above process to the new matrix, the result is as follows:

r1 = 16

r2 = 16/3

r3 = 40

r4 = 76/3

r5 = 0

r6 = 40/3

16 + 16/3 + 40 + 76/3 + 0 + 40/3 = 100

Therefore, when we reach convergence, the majority of people will be on webpage 3, followed by webpage 4.

Note* Moses asked a great question. The question was, How do Inlinks (links to a webpage as opposed to from a webpage) influence the Pagerank Algorithm.

To quote my response to his question,

“Most certainly, yes, and thank you for bringing it up. Inlinks are a major part of how these websites are ordered.

Take example 1, Webpage G had only Inlinks to itself. Webpage F linked to Webpage G and Webpage G linked to Webpage G. This caused an issue because it meant that after a reasonable amount of time, everyone would end up on Webpage G, thus putting Webpage G at the top of the list when someone searches for content related to web pages A-G.

In example 2, I corrected this error by removing any possibility of Webpage G. Doing so allowed convergence to provide the result we expected, given the parameters.”

I hope that information will satisfy curiosity toward that question in particular.

That is a basic implementation of Google’s Pagerank algorithm. If you have any questions, comments, or concerns, please leave them as a comment down below.

Thank you for reading!


Published by HackerNoon on 2018/11/20