flash

Can PDF, Flash and MS Office documents have PageRank?

The question today is: does Google assign PageRank to non-HTML files such as PDF files, Word documents or Flash files? Here is the definitive answer.


Introduction

PageRank is just one of the many algorithms that Google uses to rank web pages. However, it is definitely the best known and, thanks to the Google Toolbar, one of the most visible.

PageRank originally applied only to web pages, and not to other types of files such as Adobe PDF files or Microsoft Office documents. However, Google has indexed these types of files for a long time now, so it would make perfect sense for Google to try to treat them in a similar way to web pages.

A caveat regarding the robots exclusion protocol

As with any test, it is important to ensure that there are no external factors which could affect the results. In this particular case, the Robots Exclusion Protocol is one such factor.

This quote from Matt Cutts sums the issue up nicely:

“a page that is blocked by robots.txt can still accrue PageRank. In the old days, ebay.com blocked Google in robots.txt, but we still wanted to be able to return ebay.com for the query [ebay], so uncrawled urls can accumulate PageRank and be shown in our search results.”

This means that we have to make sure that none of the files we check are blocked by robots.txt. If a blocked URL shows PageRank, that PageRank may simply have accrued to the uncrawled URL rather than to the non-HTML file itself. Only files that are not blocked can show that Google really does assign PageRank to a particular type of file.

Note: Although the quote above applies to robots.txt, we have also checked that the files do not have an X-Robots-Tag HTTP header.
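As a rough illustration of the two checks described above, here is a minimal Python sketch. It assumes you have already downloaded the site's robots.txt and the file's HTTP response headers yourself; the function names and the example.com URLs are our own inventions for illustration, and this is not necessarily how the checks were carried out for this post.

```python
from urllib.robotparser import RobotFileParser


def blocked_by_robots_txt(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text disallows `url` for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


def has_x_robots_tag(headers: dict) -> bool:
    """Return True if the response headers include an X-Robots-Tag (any casing)."""
    return any(name.lower() == "x-robots-tag" for name in headers)


# Hypothetical robots.txt and URLs, for illustration only.
example_robots = "User-agent: *\nDisallow: /private/\n"
print(blocked_by_robots_txt(example_robots, "http://example.com/private/brochure.pdf"))
print(blocked_by_robots_txt(example_robots, "http://example.com/brochure.pdf"))
print(has_x_robots_tag({"Content-Type": "application/pdf"}))
```

A file is only a useful test case if both checks come back negative. Note that Python's standard robots.txt parser does simple prefix matching, so it may not agree with Google on wildcard rules.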

What types of files does Google index?

If you go to Google’s Advanced Search page, Google provides options to search for files in a number of formats:

[Image: Google Advanced Search supported file types]

Google also has a list of supported file types on its file types FAQ page.

Note: We are not going to cover an exhaustive list of file types in this post, but the above list is a good place to start. Also note that we have not looked at images or videos, which have their own Google search verticals.

How we looked for non-HTML files to test

To find non-HTML files which might have PageRank as quickly as possible we used Google’s filetype: operator. We used this operator on its own, rather than combining it with a search query. For example, to search for PDF files we used the query [filetype:pdf].

Note that Google’s filetype: operator isn’t perfect. For example, it will also return normal web pages whose URLs end with the same extension, such as a web page with a URL ending in .doc. Therefore, we also have to check each URL to make sure it’s actually the type of file we are looking for.
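One way to check that a URL really points to the expected type of file is to look at the first few bytes of the download rather than trusting the extension. The Python sketch below is our own illustration (the function name is hypothetical) and assumes you have already fetched the start of the file. It only recognises the container format, so a legacy .doc and a legacy .xls look the same to it, as do a .docx and any other ZIP file.

```python
def sniff_file_type(head: bytes) -> str:
    """Guess a file type from the first bytes of a download."""
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head[:3] in (b"FWS", b"CWS", b"ZWS"):
        return "swf"  # uncompressed, zlib-compressed or LZMA-compressed Flash
    if head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "ole2"  # container used by legacy .doc and .xls files
    if head.startswith(b"PK\x03\x04"):
        return "zip"  # container used by .docx files
    if head.lstrip()[:15].lower().startswith((b"<!doctype html", b"<html")):
        return "html"  # an ordinary web page, whatever the URL ends with
    return "unknown"


# A web page served from a URL ending in .doc would be caught here.
print(sniff_file_type(b"<!DOCTYPE html><html><head>"))
print(sniff_file_type(b"%PDF-1.4\n"))
```

Plain text and CSV files have no signature, so for those the Content-Type header and a manual look at the file are the more realistic check.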

Results

Note that we are not interested in how high or low the PageRank scores are. What we are looking for here is simply whether the files have any PageRank at all.


Adobe Portable Document Format (.pdf)

http://www.deetonline.org/brochure.pdf

PageRank 4

Microsoft Word documents (.doc)

http://www.wvnn.com/privacy_policy.doc

PageRank 4

Flash files (.swf)

http://www.uclalive.org/ucla_live_event_news.swf

PageRank 6

Excel spreadsheets (.xls)

http://www.post.ch/pm_dp_jahresplan.xls

PageRank 3

Plain text files (.txt)

http://www.rarlab.com/themes_new.txt

PageRank 5

We also wanted to check whether Google gives PageRank to file types which aren’t on the list, so we checked a few additional file types:

Microsoft Word 2007 documents (.docx)

http://www.antor.com/EUROPEAN_TRADE_AND_CONSUMER_SHOWS_CALENDAR_2009.docx

PageRank 4

"Comma-separated values" files (.csv)

(a format used for spreadsheets and storing data)

http://www.edeltutiyama.com/hayami2008.csv

PageRank 1

Conclusion

Our research has shown that Google PageRank does not just apply to web pages – it also applies to a range of other documents.

Please note that proving that PageRank applies to the file types examined above only shows that it applies to these particular file types – to be absolutely certain that PageRank applies to a particular file type not listed above, you’d have to check it in the same way.


To Flash or not to Flash

It was announced recently that Google will now be crawling .swf Flash files more efficiently.

This could be viewed as a major technological step forward. However, is this good or bad for Search?

The general consensus amongst the SEO community has been that Flash should be avoided for website architecture (except for displaying video content) due to search engines’ inability to effectively index Flash content. With Google’s announcement, we may well see Yahoo and MSN following suit in the coming months, thus increasing the likelihood of Flash-based sites being indexed more effectively.

What’s good about Flash being crawled & indexed?

The good news for website owners who have Flash-heavy sites is that such sites are now more likely to be crawled and subsequently indexed. Where Flash sites are seen to be relevant, they may now be returned for search queries.

Google have moved on from simply parsing Flash for text and are now using robots which actively navigate the file, ‘clicking’ buttons and following links. Additionally, it is a distinct possibility that Adobe may make it possible for Google to use parameters to find specific points within a Flash file, thus surfacing the most important content and taking the user to the right point within the Flash file.

The bad news

There’s also some bad news with this new move by Google. Carefully crafted Flash files could be set up to contain large amounts of spam, and thus could pollute search results with irrelevant content.

At present, users finding a Flash file in their search results will be taken to the beginning of the Flash file, rather than to the specific area of the file that they searched for.

Some designers will likely take the view that “Flash is indexed now”, and optimization of Flash files will stop.

This may lead to some clients being persuaded to move to a shiny new Flash website. However, there will be a distinct advantage for those who don’t follow the hype and stick to HTML and well-founded SEO principles!

It is highly unlikely that Flash files will outrank well-crafted HTML websites on merit alone.

Although Flash files will be crawled more efficiently, this doesn’t solve the usability and accessibility issues that come with using Flash for web page design. In most cases, Flash fails to comply with the Disability Discrimination Act (DDA).

In conclusion

The reality of this new move by Google is that Flash should still only be used for certain content (i.e. video) and the website itself should be designed using HTML.

Optimization of Flash files won’t be easy, and in the early days results are likely to vary in terms of search engine ranking.

Ideally, Flash files should still be excluded from search engine spiders, and a page should have noscript tags that give details about the embedded Flash file whilst providing branding and links back to the main site.
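To make that recommendation a little more concrete, here is a rough Python sketch of both halves: a robots.txt rule that keeps spiders away from Flash files, and a noscript block that describes the Flash content and links back to the main site. The /flash/ directory, the example.com URLs and the function names are all assumptions of ours for illustration, and the noscript approach only applies where the Flash file is embedded using JavaScript.

```python
from html import escape
from urllib.robotparser import RobotFileParser

# Assumption: all Flash files on the site live under a /flash/ directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /flash/
"""


def flash_is_excluded(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if ROBOTS_TXT keeps `user_agent` away from `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return not parser.can_fetch(user_agent, url)


def noscript_fallback(description: str, site_name: str, home_url: str) -> str:
    """Build a noscript block with a description, branding and a link home."""
    return (
        "<noscript>\n"
        f"  <p>{escape(description)}</p>\n"
        f'  <p><a href="{escape(home_url, quote=True)}">{escape(site_name)}</a></p>\n'
        "</noscript>"
    )


print(flash_is_excluded("http://example.com/flash/intro.swf"))
print(noscript_fallback("Event news & listings", "Example Site", "http://example.com/"))
```

The description text is escaped so that characters such as & do not break the surrounding HTML.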
