PageRank
how to represent page as graph:
- Node: web pages
- Edge: hyperlinks
Not only page network, here are citation natwork and reference network:
The problem is how could we measure the important of a web page so that we can rank them.
the idea is: links and votes
links and votes
the intuition is that Page is more important if it has more links
consider 2 web page:
- www.stanford.edu has 23,400 in-links
- thispersondoesnotexist.com has 1 in-link
why we don’t consider out-link?
Because in your website, you can set a lot of out-links, and it is easy to implement. So the inportance only take in-link into consideration.
so, what is vote?
we use in-links to caculate the importance of node j, j have 2 in-links, so it’s importance is above.
And consider other nodes, such as node i, has 3 out-links, so its importance to others is divided to $\frac{r_i}{3}$
for a 3 node graph:
we can get the equation for each node:
Matrix Formulation
Let us start at some notion.
Stochastic adjacency matrix $M$
Here are the rules:
- $d_i$ is the out degree of node $i$
- if $i\to j$, then $M_{ji}=\frac{1}{d_i}$, which means the importance of $i$ is divided and given to $j$
Rank vector $r$
$r_i$ is the importance of page $i$, and $\sum_ir_i=1$.
and the flow equation can be written as:
$$
r=M.r
$$
Connection to Random walk
we have considered the importance of node. now lets start at one page i, and which page should we select next?
- at timepoint t, the surfer is at page i
- at timepoint t+1, surfer choose an put-link from i randomly.
- the surfer at page j
- repeat
so, here are some notions:
- $p(t)$: the coordinate of the probability of node i at the time t. it includes the next pages probability.
- $p(t)$ is a probability distribution over pages
this process can be performed as this fomula:
$$
p(t+1)=M.p(t)=p(t)
$$
$$
r=M.r
$$
$r$ is the stationary distribution for random walk
in ch2, the centrality:
$$
\lambda c=Ac
$$
c is eigenvector, $\lambda$ is eigenvalue
so eigenvector is r, eigenvector is 1
$$
1.r=M.r
$$
Solve PageRank
notion and procedure
we have the equation with eigenvalue 1 and eigenvector $r$, and its equal form is:
now we use iterative procedure to solve it:
the whole procedure is as follow:
Here are some example:
problem
this procedure has 2 problem:
dead ends
which means the page have no out-links
it will lead to important ‘fall cliff’
Spider traps
absorbe all importance
spider trap
Here is solution.
at each step:
- with the prob. $\beta$, use random link
- with the prob. $1-\beta$, jump to a teleport
- Usually us $\beta=0.8\to0.9$
so, use teleport, the surfer will stay at node m, after few step, surfer will jump to the teleport, which could be any node in the graph
dead end
the solution is same with spider trap, just use teleport
see node m, it has no out-links, we use teleport, and each prob. is $\frac{1}{N}$ N is all graph node, include m itself
Google’s solution
e.g.
because of teleport, the gree nodes, which have no in degree, also have their importance. and we can adjust $\beta$ to change its value, according to the reality. if it is important than we set, then we set $\beta$ lower.
personalized and restart pagerank
let’s see the difference:
pagerank
jump to any node
Personalized pagerank
Jump to a set of teleport nodes $S$
random walks with restart
teleport to the starting node : $S={Q}$
the form in simulation: