A Survey of Google's PageRank
Search Engine Ranking
Here we give you an insight into one of the largest and most popular search engines used on the web today. In order to get high on the search engines, you will need to adhere closely to the search engine algorithms!
|
| Within the past few years, Google has become the far most
utilized search engine worldwide. A decisive factor therefore was, besides
high performance and ease of use, the superior quality of search results
compared to other search engines. This quality of search results is substantially
based on PageRank, a sophisticated method to rank web documents.
The aim of these pages is to provide a broad survey of all aspects of PageRank. The contents of these pages primarily rest upon papers by Google founders Lawrence Page and Sergey Brin from their time as graduate students at Stanford University. It is often argued that, especially considering the dynamic of the internet, too much time has passed since the scientific work on PageRank, as that it still could be the basis for the ranking methods of the Google search engine. There is no doubt that within the past years most likely many changes, adjustments and modifications regarding the ranking methods of Google have taken place, but PageRank was absolutely crucial for Google's success, so that at least the fundamental concept behind PageRank should still be constitutive.
|
|
The PageRank Concept Since the early stages of the world wide web, search engines have developed different methods to rank web pages. Until today, the occurence of a search phrase within a document is one major factor within ranking techniques of virtually any search engine. The occurence of a search phrase can thereby be weighted by the length of a document (ranking by keyword density) or by its accentuation within a document by HTML tags. For the purpose of better search results and especially to make search engines resistant against automatically generated web pages based upon the analysis of content specific ranking criteria (doorway pages), the concept of link popularity was developed. Following this concept, the number of inbound links for a document measures its general importance. Hence, a web page is generally more important, if many other web pages link to it. The concept of link popularity often avoids good rankings for pages which are only created to deceive search engines and which don't have any significance within the web, but numerous webmasters elude it by creating masses of inbound links for doorway pages from just as insignificant other web pages. Contrary to the concept of link popularity, PageRank is not simply based upon the total number of inbound links. The basic approach of PageRank is that a document is in fact considered the more important the more other documents link to it, but those inbound links do not count equally. First of all, a document ranks high in terms of PageRank, if other high ranking documents link to it. So, within the PageRank concept, the rank of a document is given by the rank of those documents which link to it. Their rank again is given by the rank of documents which link to them. Hence, the PageRank of a document is always determined recursively by the PageRank of other documents. Since - even if marginal and via many links - the rank of any document influences the rank of any other, PageRank is, in the end, based on the linking structure of the whole web. Although this approach seems to be very broad and complex, Page and Brin were able to put it into practice by a relatively trivial algorithm.
|
|
The PageRank Algorithm The original PageRank algorithm was described by Lawrence Page and Sergey Brin in several publications. It is given by PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) where PR(A) is the PageRank of page A, The PageRank of pages Ti which link to page A does not influence the PageRank of page A uniformly. Within the PageRank algorithm, the PageRank of a page T is always weighted by the number of outbound links C(T) on page T. This means that the more outbound links a page T has, the less will page A benefit from a link to it on page T. The weighted PageRank of pages Ti is then added up. The outcome of this is that an additional inbound link for page A will always increase page A's PageRank. Finally, the sum of the weighted PageRanks of all pages Ti is multiplied with a damping factor d which can be set between 0 and 1. Thereby, the extend of PageRank benefit for a page by another page linking to it is reduced.
|
|
The Random Surfer Model In their publications, Lawrence Page and Sergey Brin give a very simple intuitive justification for the PageRank algorithm. They consider PageRank as a model of user behaviour, where a surfer clicks on links at random with no regard towards content. The random surfer visits a web page with a certain probability which derives from the page's PageRank. The probability that the random surfer clicks on one link is solely given by the number of links on that page. This is why one page's PageRank is not completely passed on to a page it links to, but is devided by the number of links on the page. So, the probability for the random surfer reaching one page is the sum of probabilities for the random surfer following links to this page. Now, this probability is reduced by the damping factor d. The justification within the Random Surfer Model, therefore, is that the surfer does not click on an infinite number of links, but gets bored sometimes and jumps to another page at random. The probability for the random surfer not stopping to click on links is given by the damping factor d, which is, depending on the degree of probability therefore, set between 0 and 1. The higher d is, the more likely will the random surfer keep clicking links. Since the surfer jumps to another page at random after he stopped clicking links, the probability therefore is implemented as a constant (1-d) into the algorithm. Regardless of inbound links, the probability for the random surfer jumping to a page is always (1-d), so a page has always a minimum PageRank.
|
|
A Different Notation of the PageRank Algorithm Lawrence Page and Sergey Brin have published two different versions of their PageRank algorithm in different papers. In the second version of the algorithm, the PageRank of page A is given as PR(A) = (1-d) / N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) where N is the total number of all pages on the web. The second version of the algorithm, indeed, does not differ fundamentally from the first one. Regarding the Random Surfer Model, the second version's PageRank of a page is the actual probability for a surfer reaching that page after clicking on many links. The PageRanks then form a probability distribution over web pages, so the sum of all pages' PageRanks will be one. Contrary, in the first version of the algorithm the probability for the random surfer reaching a page is weighted by the total number of web pages. So, in this version PageRank is an expected value for the random surfer visiting a page, when he restarts this procedure as often as the web has pages. If the web had 100 pages and a page had a PageRank value of 2, the random surfer would reach that page in an average twice if he restarts 100 times. As mentioned above, the two versions of the algorithm do not differ fundamentally from each other. A PageRank which has been calculated by using the second version of the algorithm has to be multiplied by the total number of web pages to get the according PageRank that would have been caculated by using the first version. Even Page and Brin mixed up the two algorithm versions in their most popular paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine", where they claim the first version of the algorithm to form a probability distribution over web pages with the sum of all pages' PageRanks being one. In the following, we will use the first version of the algorithm. The reason is that PageRank calculations by means of this algorithm are easier to compute, because we can disregard the total number of web pages.
|
|
The Characteristics of PageRank The characteristics of PageRank shall be illustrated by a small example. We regard a small web consisting of three pages A, B and C, whereby page A links to the pages B and C, page B links to page C and page C links to page A. According to Page and Brin, the damping factor d is usually set to 0.85, but to keep the calculation simple we set it to 0.5. The exact value of the damping factor d admittedly has effects on PageRank, but it does not influence the fundamental principles of PageRank. So, we get the following equations for the PageRank calculation: PR(A) = 0.5 + 0.5 PR(C) These equations can easily be solved. We get the following PageRank values for the single pages: PR(A) = 14/13 = 1.07692308 It is obvious that the sum of all pages' PageRanks is 3 and thus equals the total number of web pages. As shown above this is not a specific result for our simple example. For our simple three-page example it is easy to solve the according equation system to determine PageRank values. In practice, the web consists of billions of documents and it is not possible to find a solution by inspection
|
|
The Iterative Computation of PageRank Because of the size of the actual web, the Google search engine uses an approximative, iterative computation of PageRank values. This means that each page is assigned an initial starting value and the PageRanks of all pages are then calculated in several computation circles based on the equations determined by the PageRank algorithm. The iterative calculation shall again be illustrated by our three-page example, whereby each page is assigned a starting PageRank value of 1.
Also, by means of the iterative calculation, the sum of all pages' PageRanks still converges to the total number of web pages. So the average PageRank of a web page is 1. The minimum PageRank of a page is given by (1-d). Therefore, there is a maximum PageRank for a page which is given by dN+(1-d), where N is total number of web pages. This maximum can theoretically occur, if all web pages solely link to one page, and this page also solely links to itself
|
|
The Implementation of PageRank in the Google Search Engine Regarding the implementation of PageRank, first of all, it is important
how PageRank is integrated into the general ranking of web pages by the
Google search engine. The proceedings have been described by Lawrencec
Page and Sergey Brin in several publications. Initially, the ranking of
web pages by the Google search engine was determined by three factors:
In order to provide search results, Google computes an IR score out of page specific factors and the anchor text of inbound links of a page, which is weighted by position and accentuation of the search term within the document. This way the relevance of a document for a query is determined. The IR-score is then combined with PageRank as an indicator for the general importance of the page. To combine the IR score with PageRank the two values are multiplicated. It is obvious that they cannot be added, since otherwise pages with a very high PageRank would rank high in search results even if the page is not related to the search query. Especially for queries consisting of two or more search terms, there is a far bigger influence of the content related ranking criteria, whereas the impact of PageRank is mainly visible for unspecific single word queries. If webmasters target search phrases of two or more words it is possible for them to achieve better rankings than pages with high PageRank by means of classical search engine optimisation. If pages are optimised for highly competitive search terms, it is essential for good rankings to have a high PageRank, even if a page is well optimised in terms of classical search engine optimisation. The reason therefore is that the increase of IR score deminishes the more often the keyword occurs within the document or the anchor texts of inbound links to avoid spam by extensive keyword repetition. Thereby, the potentialities of classical search engine optimisation are limited and PageRank becomes the decisive factor in highly competitive areas.
|
|
The PageRank Display of the Google Toolbar PageRank became widely known by the PageRank display of the Google Toolbar. The Google Toolbar is a browser plug-in for Microsoft Internet Explorer which can be downloaded from the Google web site. The Google Toolbar provides some features for searching Google more comfortably. The Google Toolbar displays PageRank on a scale from 0 to 10. First of all, the PageRank of an actually visited page can be estimated by the width of the green bar within the display. If the user holds his mouse over the display, the Toolbar also shows the PageRank value. Caution: The PageRank display is one of the advanced features of the Google Toolbar. And if those advanced features are enabled, Google collects usage data. Additionally, the Toolbar is self-updating and the user is not informed about updates. So, Google has access to the user's hard drive. If we take into account that PageRank can theoretically have a maximum value of up to dN+(1-d), where N is the total number of web pages and d is usually set to 0.85, PageRank has to be scaled for the display on the Google Toolbar. It is generally assumed that the scalation is not linearly but logarithmically. At a damping factor of 0.85 and, therefore, a minimum PageRank of 0.15 and at an assumed logaritmical basis of 6 we get a scalation as follows:
|
|
The Toolbar's PageRank Files Even webmasters who do not want to use the Google Toolbar or the Internet Explorer permanently for security and privacy concerns have the possibility to check the PageRank values of their pages. Google submits PageRank values in simple text files to the Toolbar. In former times, this happened via XML. The switch to text files occured in August 2002. The PageRank files can be requested directly from the domain www.google.com. Basically, the URLs for those files look like follows (without line breaks): http://www.google.com/search? There is only one line of text in the PageRank files. The last cipher in this line is PageRank. The parameters incorporated in the above shown URL are inevitable for the display of the PageRank files in a browser. The value "navclient-auto" for the parameter "client" identifies the Toolbar. Via the parameter "q" the URL is submitted. The value "Rank" for the parameter "features" determines that the PageRank files are requested. If it is omitted, Google's servers still transmit XML files. The parameter "ch" transfers a checksum for the URL to Google, whereby this checksum can only change when the Toolbar version is updated by Google. Thus, it is necessary to install the Toolbar at least once to find out about the checksum of one's URLs. To track the communication between the Toolbar and Google, often the use of packet sniffers, local proxies an similar tools is suggested. But this is not necessarily needed, since the PageRank files are cached by the Internet Explorer. So, the checksums can simply been found out by having a look at the folder Temporary Internet Files. Knowing the checksums of your URLs, you can view the PageRank files in your browser and you do not have to accept Google's 36 years lasting cookies. Since the PageRank files are kept in the browser cache and, thus, are clearly visible, and as long as requests are not automated, watching the PageRank files in a browser should not be a violation of Google's Terms of Service. However, you should be cautious. The Toolbar submits its own User-Agent to Google. It is: Mozilla/4.0 (compatible; GoogleToolbar 1.1.60-deleon; OS SE 4.10) 1.1.60-deleon is a Toolbar version which may of course change. OS is the operating system that you have installed. So, Google is able to identify requests by browsers, if they do not go out via a proxy and if the User-Agent is not modified accordingly. Taking a look at IE's cache, one will normally notice that the PageRank files are not requested from the domain www.google.com but from IP addresses like 216.239.33.102. Additionally, the PageRank files' URLs often contain a parameter "failedip" that is set to values like "216.239.35.102;1111" (Its function is not absolutely clear). The IP addresses are each related to one of Google's seven data centers and the reason for the Toolbar querying IP-addresses is most likely to control the PageRank display in a better way, especially in times of the "Google Dance".
|
|
The PageRank Display at the Google Directory Webmasters who do not want to check the PageRank files that are used by the toolbar have another possibility to receive information about the PageRank of their sites by means of the Google Directory (directory.google.com). The Google Directory is a dump of the Open Directory Project (dmoz.org), which shows the PageRank for listed documents similarly to the Google Toolbar display scaled and by means of a green bar. In contrast to the Toolbar, the scale is from 1 to 7. The exact value is not displayed, but it can be determined by the divided bar respectively the width of the single graphics in the source code of the page if one is not sure by looking at the bar. By comparing the Toolbar PageRank of a document with its Directory PageRank, a more exact estimation of a pages PageRank can be deduced, if the page is listed with the ODP. This connection was mentioned first by Chris Raimondi (www.searchnerd.com/pagerank). Especially for pages with a Toolbar PageRank of 5 or 6, one can appraise if the page is on the upper or the lower end of its Toolbar scale. It shall be noted that for the comparison the Toolbar PageRank of 0 was not taken into account. It can easily be verified that this is appropriate by looking at pages with a Toolbar PageRank of 3. However, it has to be considered that for a verification pages of the Google Directory respectively the ODP with a Toolbar PageRank of 4 or lower have to be chosen, since otherwise no pages linked from there with a Toolbar PageRank of 3 will be found. |
|
The Effect of Inbound Links It has already been shown that each additional inbound link for a web page always increases that page's PageRank. Taking a look at the PageRank algorithm, which is given by PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) one may assume that an additional inbound link from page X increases the PageRank of page A by d × PR(X) / C(X) where PR(X) is the PageRank of page X and C(X) is the total number of its outbound links. But page A usually links to other pages itself. Thus, these pages get a PageRank benefit also. If these pages link back to page A, page A will have an even higher PageRank benefit from its additional inbound link. The single effects of additional inbound links shall be illustrated by an example. We regard a website consisting of four pages A, B, C and D which are linked to each other in circle. Without external inbound links to one of these pages, each of them obviously has a PageRank of 1. We now add a page X to our example, for which we presume a constant Pagerank PR(X) of 10. Further, page X links to page A by its only outbound link. Setting the damping factor d to 0.5, we get the following equations for the PageRank values of the single pages of our site: PR(A) = 0.5 + 0.5 (PR(X) + PR(D)) = 5.5 + 0.5 PR(D) Since the total number of outbound links for each page is one, the outbound links do not need to be considered in the equations. Solving them gives us the following PageRank values: PR(A) = 19/3 = 6.33 We see that the initial effect of the additional inbound link of page A, which was given by d × PR(X) / C(X) = 0,5 × 10 / 1 = 5 is passed on by the links on our site.
|
|
The Influence of the Damping Factor The degree of PageRank propagation from one page to another by a link is primarily determined by the damping factor d. If we set d to 0.75 we get the following equations for our above example: PR(A) = 0.25 + 0.75 (PR(X) + PR(D)) = 7.75 + 0.75 PR(D) Solving these equations gives us the following PageRank values: PR(A) = 419/35 = 11.97 First of all, we see that there is a significantly higher initial effect of additional inbound link for page A which is given by d × PR(X) / C(X) = 0.75 × 10 / 1 = 7.5 This initial effect is then propagated even stronger by the links on our site. In this way, the PageRank of page A is almost twice as high at a damping factor of 0.75 than it is at a damping factor of 0.5. At a damping factor of 0.5 the PageRank of page A is almost four times superior to the PageRank of page D, while at a damping factor of 0.75 it is only a little more than twice as high. So, the higher the damping factor, the larger is the effect of an additional inbound link for the PageRank of the page that receives the link and the more evenly distributes PageRank over the other pages of a site.
|
|
The Actual Effect of Additional Inbound Links At a damping factor of 0.5, the accumulated PageRank of all pages of our site is given by PR(A) + PR(B) + PR(C) + PR(D) = 14 Hence, by a page with a PageRank of 10 linking to one page of our example site by its only outbound link, the accumulated PageRank of all pages of the site is increased by 10. (Before adding the link, each page has had a PageRank of 1.) At a damping factor of 0.75 the accumulated PageRank of all pages of the site is given by PR(A) + PR(B) + PR(C) + PR(D) = 34 This time the accumulated PageRank increases by 30. The accumulated PageRank of all pages of a site always increases by (d / (1-d)) × (PR(X) / C(X)) where X is a page additionally linking to one page of the site, PR(X) is its PageRank and C(X) its number of outbound links. The formula presented above is only valid, if the additional link points to a page within a closed system of pages, as, for instance, a website without outbound links to other sites. As far as the website has links pointing to external pages, the surplus for the site itself diminishes accordingly, because a part of the additional PageRank is propagated to external pages. The justification of the above formula is given by Raph Levien and it is based on the Random Surfer Model. The walk length of the random surfer is an exponential distribution with a mean of (d/(1-d)). When the random surfer follows a link to a closed system of web pages, he visits on average (d/(1-d)) pages within that closed system. So, this much more PageRank of the linking page - weighted by the number of its outbound links - is distributed to the closed system. For the actual PageRank calculations at Google, Lawrence Page und Sergey Brin claim to usually set the damping factor d to 0.85. Thereby, the boost for a closed system of web pages by an additional link from page X is given by (0.85 / 0.15) × (PR(X) / C(X)) = 5.67 × (PR(X) / C(X)) So, inbound links have a far larger effect than one may assume.
|
|
The PageRank-1 Rule Users of the Google Toolbar often notice that pages with a certain Toolbar PageRank have an inbound link from a page with a Toolbar PageRank which is higher by one. Some take this observation to doubt the validity of the PageRank algorithm presented here for the actual ranking methods of the Google search engine. It shall be shown, however, that the PageRank-1 rule complies with the PageRank algorithm. Basically, the PageRank-1 rule proves the fundamental principle of PageRank. Web pages are important themselves if other important web pages link to them. It is not necessary for a page to have many inbound links to rank well. A single link from a high ranking page is sufficient. To show the actual consistance of the PageRank-1 rule with the PageRank algorithm several factors have to be taken into consideration. First of all, the toolbar PageRank is a logarithmically scaled version of real PageRank values. If the PageRank value of one page is one higher than the PageRank value of another page in terms of Toolbar PageRank, than its real PageRank can at least be higher by an amount which equals the logarithmical basis for the scalation of Toolbar PageRank. If the logarithmical basis for the scalation is 6 and the toolbar PageRank of a linking Page is 5, then the real PageRank of the page which receives the link can be at least 6 times smaller to make that page still get a toolbar PageRank of 4. However, the number of outbound links on the linking page thwarts the effect of the logarithmical basis, because the PageRank propagation from one page to another is devided by the number of outbound links on the linking page. But it has already been shown that the PageRank benefit by a link is higher than PageRank algorithm's term d(PR(Ti)/C(Ti)) pretends. The reason is that the PageRank benefit for one page is further distributed to other pages within the site. If those pages link back as it usualy happens, the PageRank benefit for the page which initially received the link is accordingly higher. If we assume that at a high damping factor the logarithmical basis for PageRank scalation is 6 and a page receives a PageRank benefit which is twice as high as the PageRank of the linking page devided by the number of its outbound links, the linking page could have at least 12 outbound links so that the Toolbar PageRank of the page receiving the link is still at most one lower than the toolbar PageRank of the linking page. A number of 12 outbound links admittedly seems relatively small. But normally, if a page has an external inbound link, this is not the only one for that page. Most likely other pages link to that page and propagate PageRank to it. And if there are examples where a page receives a single link from another page and the PageRanks of both pages comply the PageRank-1 rule although the linking page has many outbound links, this is first of all an indication for the linking page's toolbar PageRank being at the upper end of its scale. The linking page could be a "high" 5 and the page receiving the link could be a "low" 4. In this way, the linking page could have up to 72 outbound links. This number rises accordingly if we assume a higher logarithmical basis for the scalation of Toolbar PageRank.
|
|
The Effect of Outbound Links Since PageRank is based on the linking structure of the whole web, it is inescapable that if the inbound links of a page influence its PageRank, its outbound links do also have some impact. To illustrate the effects of outbound links, we take a look at a simple example. We regard a web consisting of to websites, each having two web pages. One site consists of pages A and B, the other constists of pages C and D. Initially, both pages of each site solely link to each other. It is obvious that each page then has a PageRank of one. Now we add a link which points from page A to page C. At a damping factor of 0.75, we therefore get the following equations for the single pages' PageRank values: PR(A) = 0.25 + 0.75 PR(B) Solving the equations gives us the following PageRank values for the first site: PR(A) = 14/23 We therefore get an accumulated PageRank of 25/23 for the first site. The PageRank values of the second site are given by PR(C) = 35/23 So, the accumulated PageRank of the second site is 67/23. The total PageRank
for both sites is 92/23 = 4. Hence, adding a link has no effect on the
total PageRank of the web. Additionally, the PageRank benefit for one
site equals the PageRank loss of the other.
|
|
The Actual Effect of Outbound Links As it has already been shown, the PageRank benefit for a closed system of web pages by an additional inbound link is given by (d / (1-d)) × (PR(X) / C(X)), where X is the linking page, PR(X) is its PageRank and C(X) is the number of its outbound links. Hence, this value also represents the PageRank loss of a formerly closed system of web pages, when a page X within this system of pages now points by a link to an external page. The validity of the above formula requires that the page which receives the link from the formerly closed system of pages does not link back to that system, since it otherwise gains back some of the lost PageRank. Of course, this effect may also occur when not the page that receives the link from the formerly closed system of pages links back directly, but another page which has an inbound link from that page. Indeed, this effect may be disregarded because of the damping factor, if there are enough other web pages in-between the link-recursion. The validity of the formula also requires that the linking site has no other external outbound links. If it has other external outbound links, the loss of PageRank of the regarded site diminishes and the pages already receiving a link from that page lose PageRank accordingly. Even if the actual PageRank values for the pages of an existing web site were known, it would not be possible to calculate to which extend an added outbound link diminishes the PageRank loss of the site, since the above presented formula regards the status after adding the link.
|
|
Intuitive Justification of the Effect of Outbound Links The intuitive justification for the loss of PageRank by an additional external outbound link according to the Random Surfer Modell is that by adding an external outbound link to one page the surfer will less likely follow an internal link on that page. So, the probability for the surfer reaching other pages within a site diminishes. If those other pages of the site have links back to the page to which the external outbound link has been added, also this page's PageRank will deplete. We can conclude that external outbound links diminish the totalized PageRank of a site and probably also the PageRank of each single page of a site. But, since links between web sites are the fundament of PageRank and indespensable for its functioning, there is the possibility that outbound links have positive effects within other parts of Google's ranking criteria. Lastly, relevant outbound links do constitute the quality of a web page and a webmaster who points to other pages integrates their content in some way into his own site
|
|
Dangling Links An important aspect of outbound links is the lack of them on web pages. When a web page has no outbound links, its PageRank cannot be distributed to other pages. Lawrence Page and Sergey Brin characterise links to those pages as dangling links. The effect of dangling links shall be illustrated by a small example website. We take a look at a site consisting of three pages A, B and C. In our example, the pages A and B link to each other. Additionally, page A links to page C. Page C itself has no outbound links to other pages. At a damping factor of 0.75, we get the following equations for the single pages' PageRank values: PR(A) = 0.25 + 0.75 PR(B) Solving the equations gives us the following PageRank values: PR(A) = 14/23 So, the accumulated PageRank of all three pages is 36/23 which is just over half the value that we could have expected if page A had links to one of the other pages. According to Page and Brin, the number of dangling links in Google's index is fairly high. A reason therefore is that many linked pages are not indexed by Google, for example because indexing is disallowed by a robots.txt file. Additionally, Google meanwhile indexes several file types and not HTML only. PDF or Word files do not really have outbound links and, hence, dangling links could have major impacts on PageRank. In order to prevent PageRank from the negative effects of dangling links, pages wihout outbound links have to be removed from the database until the PageRank values are computed. According to Page and Brin, the number of outbound links on pages with dangling links is thereby normalised. As shown in our illustration, removing one page can cause new dangling links and, hence, removing pages has to be an iterative process. After the PageRank calculation is finished, PageRank can be assigned to the formerly removed pages based on the PageRank algorithm. Therefore, as many iterations are needed as for removing the pages. Regarding our illustration, page C could be processed before page B. At that point, page B has no PageRank yet and, so, page C will not receive any either. Then, page B receives PageRank from page A and during the second iteration, also page C gets its PageRank. Regarding our example website for dangling links, removing page C from the database results in page A and B each having a PageRank of 1. After the calculations, page C is assigned a PageRank of 0.25 + 0.375 PR(A) = 0.625. So, the accumulated PageRank does not equal the number of pages, but at least all pages which have outbound links are not harmed from the danging links problem. By removing dangling links from the database, they do not have any negative effects on the PageRank of the rest of the web. Since PDF files are dangling links, links to PDF files do not diminish the PageRank of the linking page or site. So, PDF files can be a good means of search engine optimisation for Google.
|
|
The Effect of the Number of Pages Since the accumulated PageRank of all pages of the web equals the total number of web pages, it follows directly that an additional web page increases the added up PageRank for all pages of the web by one. But far more interesting than the effect on the added up PageRank of the web is the impact of additional pages on the PageRank of actual websites. To illustrate the effects of addional web pages, we take a look at a hierachically structured web site consisting of three pages A, B and C, which are joined by an additional page D on the hierarchically lower level of the site. The site has no outbound links. A link from page X which has no other outbound links and a PageRank of 10 points to page A. At a damping factor d of 0.75, the equations for the single pages' PageRank values before adding page D are given by PR(A) = 0.25 + 0.75 (10 + PR(B) + PR(C)) Solving the equations gives us the follwing PageRank values: PR(A) = 260/14 After adding page D, the equations for the pages' PageRank values are given by PR(A) = 0.25 + 0.75 (10 + PR(B) + PR(C) + PR(D)) Solving these equations gives us the follwing PageRank values: PR(A) = 266/14 As to be expected since our example site has no outbound links, after adding page D, the accumulated PageRank of all pages increases by one from 33 to 34. Further, the PageRank of page A rises marginally. In contrast, the PageRank of pages B and C depletes substantially.
|
|
The Reduction of PageRank by Additional Pages By adding pages to a hierarchically structured websites, the consequences for the already existing pages are nonuniform. The consequences for websites with a different structure shall be shown by another example. We take a look at a website constisting of three pages A, B and C which are linked to each other in circle. The pages are then joined by page D which fits into the circular linking structure. The regarded site has no outbound links. Again, a link from page X which has no other outbound links and a PageRank of 10 points to page A. At a damping factor d of 0.75, the equations for the single pages' PageRank values before adding page D are given by PR(A) = 0.25 + 0.75 (10 + PR(C)) Solving the equations gives us the follwing PageRank values: PR(A) = 517/37 = 13.97 After adding page D, the equations for the pages' PageRank values are given by PR(A) = 0.25 + 0.75 (10 + PR(D)) Solving these equations gives us the follwing PageRank values: PR(A) = 419/35 = 11.97 Again, after adding page D, the accumulated PageRank of all pages increases by one from 33 to 34. But now, any of the pages which already existed before page D was added lose PageRank. The more uniform PageRank is distributed by the links within a site, the more likely will this effect occur. Since adding pages to a site often reduces PageRank for already existing pages, it becomes obvious that the PageRank algorithm tends to privilege smaller web sites. Indeed, bigger web sites can counterbalance this effect by being more attractive for other webmasters to link to them, simply because they have more content. None the less, it is also possible to increase the PageRank of existing pages by additional pages. Therefore, it has to be considered that as few PageRank as possible is distributed to these additional pages.
|
|
The Distribution of PageRank Regarding Search Engine Optimisation Up to this point, it has been described how the number of pages and the number of inbound and outbound links, respectively, influence PageRank. Here, it will mainly be discussed in how far PageRank can be affected for the purpose of search engine optimisation by a website's internal linking structure. In most cases, websites are hierachically structured to a certain extend, as it is illustrated in our example of a web site consisting of the pages A, B and C. Normally, the root page is withal optimised for the most important search phrase. In our example, the optimised page A has an external inbound link from page X which has no other outbound links and a PageRank of 10. The pages B and C each receive a link from page A and link back to it. If we set the damping factor d to 0.5 the equations for the single pages' PageRank values are given by PR(A) = 0.5 + 0.5 (10 + PR(B) + PR (C)) Solving the equations gives us the follwing PageRank values: PR(A) = 8 It is generally not advisable to solely work on the root page of a site for the purpose of search engine optimisation. Indeed, it is, in most cases, more reasonable to optimise each page of a site for different search phrases. We now assume that the root page of our example website provides satisfactory results for its search phrase, but the other pages of the site do not, and therefore we modify the linking structure of the website. We add links from page B to page C and vice versa to our formerly hierarchically structured example site. Again, page A has an external inbound link from page X which has no other outbound links and a PageRank of 10. At a damping factor d of 0.5, the equations for the single pages' PageRank values are given by PR(A) = 0.5 + 0.5 (10 + PR(B) / 2 + PR(C) / 2) Solving the equations gives us the follwing PageRank values: PR(A) = 7 The result of adding internal links is an increase of the PageRank values of pages B and C, so that they likely will rise in search engine result pages for their targeted keywords. On the other hand, of course, page A will likely rank lower because of its diminished PageRank. Generally spoken, PageRank will distribute for the purpose of search engine optimisation more equally among the pages of a site, the more the hierarchically lower pages are interlinked
|
|
Well Directed PageRank Distribution by Concentration of Outbound Links It has already been demonstrated that external outbound links tend to have negative effects on the PageRank of a website's web pages. Here, it shall be illustrated how this effect can be reduced for the purpose of search engine optimisation by the systematic arrangement of external outbound links. We take a look at another hierarchically structured example site consisting of the pages A, B, C and D. Page A has links to the pages B, C and D. Besides a link back to page A, each of the pages B, C and D has one external outbound link. None of those external pages which receive links from the pages B, C and D link back to our example site. If we assume a damping factor d of 0.5, the equations for the calculation of the single pages' PageRank values are given by PR(A) = 0.5 + 0.5 (PR(B) / 2 + PR(C) / 2 + PR(D) / 2) Solving the equations gives us the follwing PageRank values: PR(A) = 1 Now, we modify our example site in a way that page D has all three external outbound links while pages B and C have no more external outbound links. Besides this, the general conditions of our example stay the same as above. None of the external pages which receive a link from pages D link back to our example site. If we, again, assume a damping factor d of 0.5, the equations for the calculations of the single pages' PageRank values are given by PR(A) = 0.5 + 0.5 (PR(B) + PR(C) + PR(D) / 4) Solving these equations gives us the follwing PageRank values: PR(A) = 17/13 As a result of our modifications, we see that the PageRank values for each single page of our site have increased. Regarding search engine optimisation, it is therefore advisable to concentrate external outbound links on as few pages as possible, as long as it does not lessen a site's usabilty.
|
|
Link Exchanges for the purpose of Search Engine Optimisation For the purpose of search engine optimisation, many webmasters exchange links with others to increase link popularity. As it has already been shown, adding links within closed systems of web pages has no effects on the accumulated PageRank of those pages. So, it is questionable if link exchanges have positive consequences in terms of PageRank at all. To show the effects of link exchanges, we take a look at an an example of two hierarchically structured websites consisting of pages A, B and C and D, E and F, respectively. Within the first site, page A links to pages B and C and those link back to page A. The second site is structured accordingly, so that the PageRank values for its pages do not have to be computed explicitly. At a damping factor d of 0.5, the equations for the single pages' PageRank values are given by PR(A) = 0.5 + 0.5 (PR(B) + PR(C)) Solving the equations gives us the follwing PageRank values for the first site PR(A) = 4/3 and accordingly for the second site PR(D) = 4/3 Now, two pages of our example sites start a link exchange. Page A links to page D and vice versa. If we leave the general conditions of our example the same as above and, again, set the damping factor d to 0.5, the equations for the calculations of the single pages' PageRank values are given by PR(A) = 0.5 + 0.5 (PR(B) + PR(C) + PR(D) / 3) Solving these equations gives us the follwing PageRank values: PR(A) = 3/2 We see that the link exchange makes pages A and D benefit in terms of PageRank while all other pages lose PageRank. Regarding search engine optimisation, this means that the exactly opposite effect compared to interlinking hierachically lower pages internally takes place. A link exchange is thus advisable, if one page (e.g. the root page of a site) shall be optimised for one important key phrase. A basic premise for the positive effects of a link exchange is that both involved pages propagate a similar amount of PageRank to each other. If one of the involved pages has a significantly higher PageRank or fewer outbound links, it is likely that all of its site's pages lose PageRank. Here, an important influencing factor is the size of a site. The more pages a web site has, the more PageRank from an inbound link is distributed to other pages of the site, regardless of the number of outbound links on the page that is involved in the link exchange. This way, the page involved in a link exchange itself benefits lesser from the link exchange and cannot propagate as much PageRank to the other page involved in the link exchange. All the influencing factors should be weighted up against each other bevor one trades links. Finally, it shall be noted that it is possible that all pages of a site benefit from a link exchange in terms of PageRank, whereby also the other site taking part in the link exchange does not lose PageRank. This may occur, when the page involved in the link exchange already has a certain number of external outbound links which don't link back to that site. In this case, less PageRank is lost by the already existing outbound links. |
|