pagerank

Can PDF, Flash and MS Office documents have PageRank?

The question today is – does Google assign PageRank to non-HTML files such as PDF files, Word documents or Flash files? Here is the definitive answer.



Introduction

PageRank is just one of the many algorithms that Google uses to rank web pages. However, it is definitely the best known and, due to the Google Toolbar, one of the most visible.

PageRank originally applied only to web pages, and not to other types of files such as Adobe PDF files or Microsoft Office documents. However, Google has indexed these types of files for a long time now, so it would make perfect sense for Google to try to treat them in a similar way to web pages.

A caveat regarding the robots exclusion protocol

As with any test, it is important to ensure that there are no external factors which could affect the results. In this particular case, the Robots Exclusion Protocol is one such factor.

This quote from Matt Cutts sums the issue up nicely:

“a page that is blocked by robots.txt can still accrue PageRank. In the old days, ebay.com blocked Google in robots.txt, but we still wanted to be able to return ebay.com for the query [ebay], so uncrawled urls can accumulate PageRank and be shown in our search results.”

This means that we have to be careful about which files we check. If a non-HTML file shows PageRank, it could simply be that the URL is blocked by robots.txt and has accrued PageRank as an uncrawled URL, rather than the file itself having been crawled and given PageRank. To be sure that Google really does assign PageRank to a particular type of file, we have to ensure that the file is not blocked by robots.txt.

Note: Although the quote above applies to robots.txt, we have also checked that the files do not have an X-Robots-Tag HTTP header.
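The two checks described above can be sketched in a few lines of Python. This is a minimal sketch rather than the exact procedure we followed: the function names are our own, the robots.txt text and the response headers are assumed to have been fetched already, and the standard library parser does not necessarily match Google's own robots.txt handling in every detail.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text disallows `url` for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


def has_x_robots_tag(headers: dict) -> bool:
    """Return True if a mapping of response headers contains X-Robots-Tag."""
    # HTTP header names are case-insensitive, so compare in lower case.
    return any(name.lower() == "x-robots-tag" for name in headers)


# Hypothetical robots.txt content and URLs, for illustration only.
example_rules = "User-agent: *\nDisallow: /private/\n"
blocked = is_blocked(example_rules, "http://example.com/private/report.pdf")
allowed = not is_blocked(example_rules, "http://example.com/brochure.pdf")
```

A file only counts as evidence if is_blocked returns False for its URL and has_x_robots_tag returns False for its response headers.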

What types of files does Google index?

If you go to Google’s Advanced Search page, Google provides options to search for files in a number of formats:

[Image: Google Advanced Search supported file types]

Google also has a list of supported file types on its file types FAQ page.

Note: We are not going to test an exhaustive list of file types in this post, but the above list is a good place to start. Also note that we have not looked at images or videos, which have their own Google search verticals.

How we looked for non-HTML files to test

To find non-HTML files which might have PageRank as quickly as possible we used Google’s filetype: operator. We used this operator on its own, rather than combining it with a search query. For example, to search for PDF files we used the query [filetype:pdf].

Note that Google’s filetype: operator isn’t perfect – for example, it will return normal web pages ending with the same extension (for example, here’s a web page with a URL ending with .doc). Therefore, we also have to check each URL to make sure it’s actually the type of file we are looking for.
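One way to confirm that a URL really serves the file type its extension suggests is to look at the first few bytes of the response body. The sketch below is an illustration under our own assumptions, not the tool we used: the function name and return labels are hypothetical, and it only recognises a handful of well-known file signatures.

```python
def sniff_file_type(data: bytes) -> str:
    """Guess a file type from the leading bytes of a downloaded file."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        # OLE2 compound file: the container used by legacy .doc and .xls files.
        return "ole2"
    if data[:3] in (b"FWS", b"CWS", b"ZWS"):
        # Uncompressed, zlib-compressed and LZMA-compressed Flash files.
        return "swf"
    if data.startswith(b"PK\x03\x04"):
        # ZIP archive: .docx files are ZIP containers.
        return "zip"
    head = data.lstrip()[:15].lower()
    if head.startswith((b"<!doctype html", b"<html")):
        # An ordinary web page whose URL merely ends in .doc, .pdf and so on.
        return "html"
    return "unknown"
```

A URL ending in .doc whose body sniffs as "html" is a normal web page and should be discarded from the test set. Plain text and CSV files have no signature, so they come back as "unknown" and need checking by eye or via the Content-Type header.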

Results

Note that we are not interested in how high or low the PageRank scores are – what we are looking for here is simply whether the files have any PageRank at all.


Adobe Portable Document Format (.pdf)

http://www.deetonline.org/brochure.pdf

PageRank 4

Microsoft Word documents (.doc)

http://www.wvnn.com/privacy_policy.doc

PageRank 4

Flash files (.swf)

http://www.uclalive.org/ucla_live_event_news.swf

PageRank 6

Excel spreadsheets (.xls)

http://www.post.ch/pm_dp_jahresplan.xls

PageRank 3

Plain text files (.txt)

http://www.rarlab.com/themes_new.txt

PageRank 5

We also wanted to check whether Google gives PageRank to file types which aren’t on the list, so we checked a few additional file types:

Microsoft Word 2007 documents (.docx)

http://www.antor.com/EUROPEAN_TRADE_AND_CONSUMER_SHOWS_CALENDAR_2009.docx

PageRank 4

"Comma-separated values" files (.csv)

(a format used for spreadsheets and storing data)

http://www.edeltutiyama.com/hayami2008.csv

PageRank 1

Conclusion

Our research has shown that Google PageRank does not just apply to web pages – it also applies to a range of other documents.

Please note that proving that PageRank applies to the file types examined above only shows that it applies to these particular file types – to be absolutely certain that PageRank applies to a particular file type not listed above, you’d have to check it in the same way.


What is PageRank?

Take an in-depth look at Google PageRank. We explain what PageRank is, how pages get it and share it (or don’t), how Google really uses it, and explore some tantalising hints for what else Google might be doing with it.

Named after Google co-founder Larry Page, PageRank is Google’s way of scoring the importance of a web page. The long definition from Google is:

“PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results.

PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value. We have always taken a pragmatic approach to help improve search quality and create useful products, and our technology uses the collective intelligence of the web to determine a page’s importance.”

The simple definition, as given by Google employee Matt Cutts in a presentation to a WordPress users conference, is "the number and importance of links pointing to you". In other words, Google treats the links to your web page as votes on its quality.

Getting PageRank is not automatic. There are many sites on the web with no PageRank at all, in many cases due to the quality of the site. Google is also able to manually adjust PageRank if a site "breaks the rules". A well-publicised example of this was when Google reduced the PageRank of the Google Japan website, which had been using a paid blogging campaign in an effort to boost its market share against Yahoo, from PR9 to PR5.

The type of page that you link to and, therefore, the pages that you pass PageRank to are also important. Having links out to too many low quality pages can mark your page as low quality. To avoid this, Google (and other search engines) state that webmasters should block such links using some form of robots exclusion, such as the rel=nofollow attribute. This is a signal to Google that you do not vouch for the quality of the page that you are linking to and that you don’t want to pass any PageRank on to that page. This was reiterated in a recent post on Matt Cutts’ blog about PageRank sculpting. Note that the "noindex" directive (either in the meta robots tag or in the X-Robots-Tag HTTP header) does not prevent a page from passing PageRank, although a page carrying the directive will not appear in Google’s index.

The problem with talking about PageRank is that there are different types of PageRank. Firstly, there is Google’s Toolbar PageRank. This appears as a little green bar graphic on the Google Toolbar. Secondly, there is the Google Directory PageRank. Google Directory is essentially results from the DMOZ (ODP) Directory with a representation of Google’s Toolbar PageRank displayed alongside the page listing. Google has also discussed using different types of PageRank in the past.

However, these are all subsets of Google’s internal PageRank. Toolbar PageRank, for example, is a 0 to 10 non-linear scale that represents internal PageRank. The relationship is probably logarithmic, although that is by no means certain. In answering a question about how PageRank is stored internally, Matt Cutts said:

“It’s more accurate to think of it as a floating-point number. Certainly our internal PageRank computations have many more degrees of resolution than the 0-10 values shown in the toolbar.”

Internal PageRank is not published and remains a closely guarded secret. Although the original maths behind it has been well publicised, PageRank is calculated differently nowadays, as Matt Cutts mentioned in this blog post:

“Even when I joined the company in 2000, Google was doing more sophisticated link computation than you would observe from the classic PageRank papers. If you believe that Google stopped innovating in link analysis, that’s a flawed assumption. Although we still refer to it as PageRank, Google’s ability to compute reputation based on links has advanced considerably over the years. I’ll do the rest of my blog post in the framework of ‘classic PageRank’ but bear in mind that it’s not a perfect analogy”
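For reference, the "classic PageRank" that Matt Cutts refers to can be sketched as a simple iterative calculation. The code below illustrates the published formulation only and is not how Google computes PageRank today. The function name and the four-page link graph are hypothetical, the damping factor of 0.85 is the value suggested in the original paper, and every linked-to page is assumed to appear as a key in the graph.

```python
def classic_pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute classic PageRank for a dict of page -> outlinks."""
    pages = list(links)
    n = len(pages)
    # Start with rank spread evenly across all pages.
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its rank equally between the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by every other page.
example_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
example_ranks = classic_pagerank(example_graph)
```

In this example "c" ends up with the highest score because every other page links to it, while "d", which nothing links to, ends up with the lowest. This matches the "number and importance of links pointing to you" definition given earlier.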

Having a high PageRank is a good thing. A high PageRank means that Google will crawl a site more often, and will crawl it deeper and earlier than sites with a lower PageRank. It also used to be an indicator as to whether or not a page was in Google’s supplemental index. Google no longer labels any results as supplemental.

The problem is that we can never be certain what the actual PageRank of a page is. We can look at Toolbar PageRank but it is, at most, a useful barometer of the way in which Google views any given page. Actual PageRank is only one of more than 200 (and counting) factors that Google uses to score a page, and Toolbar PageRank does not directly reflect actual PageRank. Google’s internal PageRank is calculated continually, as Matt Cutts explained in this video:

“Some data refreshes happen all the time. For example, we compute PageRank continually and continuously, so there’s always a bank of machines refining PageRank based on incoming data, and PageRank goes out all the time, any time there’s an update in our index, which happens pretty much every day.”

Toolbar PageRank, however, was at one time only updated every three to four months and so lagged behind the internal version. The updates are now less regular and Google has not announced one in some time. There is, for example, an update to Toolbar PageRank happening as this post is being written, with as yet no official word from Google. Additionally, PageRank doesn’t always directly correlate with rankings, although there is a tendency for higher PageRank pages to rank more highly in the results.

So obsessing about PageRank is not really time well spent. Interestingly, Google stopped displaying PageRank in Google Webmaster Tools in the middle of October. Perhaps this was to send the message that PageRank is really not that important?


Google Toolbar PageRank update 23rd June

Google seems to be switching from 3-4 month gaps between Toolbar PageRank updates to a much shorter interval – we’ve just had another update, less than a month after the last one.

Google updated its “Toolbar PageRank” (the PageRank values shown in the Google Toolbar) on the 23rd of June. This is the second update in a row to come earlier than expected – the previous Toolbar PageRank update happened less than two months after the one preceding it. The gap this time around is even smaller – less than one month since the last such update.

Historically, Google has updated its Toolbar PageRank values roughly every 3-4 months (prior to the last update, which came earlier than expected, the previous PageRank updates happened on April 1st, 31st December and, before that, in September). Are we seeing the start of a shift towards more frequent updates?

Please note that the Toolbar PageRank does not necessarily reflect the current standing of a page – Google continually updates PageRank values internally (at least every day), but only provides a “snapshot” every so often.


How old are Toolbar PageRank values?

Google only updates the PageRank values seen in its toolbar every few months, but calculates new values internally much more frequently.

In this piece of research we try to answer the question “how old are the PageRank values shown when they are published?”, and uncover something surprising in the process.

Please note: The web pages used in this article are used for reference only. LBi does not endorse any of the pages linked to from this article.

Google updates the PageRank values shown in the Google Toolbar every 3-4 months (and sometimes more often). However, Google also calculates the PageRank values that it uses internally much more frequently (at least daily). The PageRank shown in the Google Toolbar is therefore a "snapshot" of values at some point in time.

A commonly asked question when Google updates the PageRank values displayed within its toolbar is "How old are these new PageRank values?" – are they fresh, up-to-date values which have just been calculated, or are they several months old? Although, in general, we would recommend not obsessing about Google’s green bar too much, knowing the answer to this question has several implications – for example, if you know how recent the values are, you can determine whether any recent linkbuilding activity is being accounted for within the new PageRank values.

Methodology

The methodology for this experiment is fairly simple – to know how old the values are, we need to establish how much time elapsed between a page last being given a PageRank value and the Toolbar PageRank update becoming visible. Therefore, we need to find:

  • The most recent page possible which has a PageRank value
  • The earliest possible mention of the recent PageRank update

Earliest mentions of the PageRank update

For the purposes of finding the earliest possible date that a PageRank update was mentioned, we have looked at a number of different SEO discussion sites in order to find the earliest mention by a member of their community. We’ll convert all times into British Summer Time (GMT+1) for comparison.

  • Digital Point forums – many posts here, but the earliest is dated "May 28th 2009, 1:01 am". Times are GMT-7, so this is 09:01 BST on May 28
  • High Rankings Forum – there is a post at "7:38pm" – the forum appears to be 6 hours behind BST, so the time of the post is 01:38 BST on May 28
  • SEORoundTable – the first forum post is at 06:12 AM on May 28 – as this time is GMT-5, the time is 12:12 BST on May 28
  • WebmasterWorld – the earliest post is at "10:12pm UTC" – this is 23:12 BST on May 27

There are lots of other sites, but we’ve picked a selection of the earliest posts. The earliest one seems to be the WebmasterWorld thread, with a time of 23:12 BST on May 27th.
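The conversions above can be reproduced with Python's datetime module. This is a small sketch using the times and offsets quoted in the list. The helper name is our own, the High Rankings offset is taken as UTC-5 because the forum appears to be 6 hours behind BST, and its local date of May 27 is inferred from the converted time.

```python
from datetime import datetime, timedelta, timezone

# British Summer Time is one hour ahead of UTC.
BST = timezone(timedelta(hours=1), "BST")


def to_bst(year, month, day, hour, minute, utc_offset_hours):
    """Convert a forum's local timestamp to British Summer Time."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    local_time = datetime(year, month, day, hour, minute, tzinfo=local_zone)
    return local_time.astimezone(BST)


# Earliest post found on each forum, as local time plus UTC offset.
sightings = {
    "Digital Point": to_bst(2009, 5, 28, 1, 1, -7),
    "High Rankings": to_bst(2009, 5, 27, 19, 38, -5),  # date inferred
    "SEORoundTable": to_bst(2009, 5, 28, 6, 12, -5),
    "WebmasterWorld": to_bst(2009, 5, 27, 22, 12, 0),
}

# The forum with the earliest converted timestamp.
earliest = min(sightings, key=sightings.get)
```

Comparing the converted timestamps directly avoids mistakes with posts that cross midnight, which is exactly what happens with the High Rankings post.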

Newest articles with PageRank

The next step requires finding the most recent page possible which has a PageRank value. Please note that this does not mean the most recent page with a PageRank of 1 or more – a PageRank value of "zero" also constitutes a page having a PageRank value assigned to it. A PageRank of zero simply means that, on the sliding scale used by Google, the page falls into the set of pages with the lowest PageRank values. This is different from having no PageRank value at all.

The best place to look for recent pages which may have PageRank is to look for a high-PageRank, high-traffic site which is frequently updated and which uses web feeds to ensure that new pages are rapidly indexed. News sites are ideal for this. We’ve picked The Guardian because the website includes detailed date information, including both the original publication date and the date that the articles were last updated, whereas many other online newspapers don’t include the original article publication dates.

Here are a few of the most recent articles found, along with their dates. These articles are all PageRank zero.

We have not listed articles with no PageRank values at all (to narrow down the interval further) as Google may have simply not crawled these pages yet.

Hang on… what’s this?

Having looked around a number of articles, we suddenly stumbled across this article, which has a PageRank value assigned (zero). The "article history" says:

"This article was first published on guardian.co.uk at 00.01 BST on Thursday 28 May 2009. It appeared in the Guardian on Thursday 28 May 2009 on p35 of the Editorials & reply section. It was last updated at 00.05 BST on Thursday 28 May 2009."

This poses something of a puzzle – here we have an article which has a PageRank score and which was apparently posted 49 minutes after the PageRank update started happening. Thinking caps on! Here are the possible causes of this seemingly paradoxical situation.

Theory 1 – The dates are wrong

This is the simplest explanation. Either the date on the WebmasterWorld thread is wrong, or the date on the rogue Guardian article is wrong.

Theory 2 – Datacenters, datacenters, datacenters

"Datacenters" – the standard fall-back answer to many a Google puzzle. As we know that different datacenters will start showing updated PageRank values at different times, it could be that the datacenter currently serving up the PageRank values that we are seeing is different to the one which first served new results to the poster who started the WebmasterWorld thread listed above.

This theory has interesting implications – given the time gap it would mean that different datacenters calculate PageRank independently of each other.

Theory 3 – Rolling PageRank update

Another possibility is that the PageRank update happens in a number of stages or over a period of time – this would mean that the update had begun when it was first noticed but had not yet been completed by the time that Google found the aforementioned Guardian article.

Conclusion

When Google performs a Toolbar PageRank update it would appear that the values are fresh and up-to-date.

There may also be a further mechanism at work which can sometimes result in PageRank values being assigned to some pages shortly after the Toolbar PageRank update has occurred.

Got any comments about this research piece? Let us know in the comments field below!


Google Toolbar PageRank update 27th May

Google updated its Toolbar PageRank on the 27th of May. This is slightly unusual as it comes around a month earlier than expected.

Google updated its “Toolbar PageRank” (the PageRank values shown in the Google Toolbar) on the 27th of May. The timing of the update has caught many by surprise as it comes less than two months after the last update.

Typically, Google updates its Toolbar PageRank values roughly every 3-4 months (the previous PageRank updates happened on April 1st, 31st December and before that, in September).

Please note that the Toolbar PageRank does not necessarily reflect the current standing of a page – Google continually updates PageRank values internally (at least every day), but only provides a “snapshot” every so often.
