The top three search engines have jointly announced a new meta tag to help combat the issue of duplicate content. We look through the potential uses of this new tag, show the places where it shouldn’t be used and illustrate where it’s a fantastic new addition to the SEO toolbox.
A common problem for search engines when indexing websites is that of duplicate content. Having multiple pages with identical or very similar content can create numerous problems for the search engines, such as wasting resources on unnecessary spidering and attempting to determine which version of a page is most relevant. Search engines are interested in finding and indexing unique content, not hundreds of identical pages!
If your site contains multiple pages of duplicate content with little or no variation, this can lead to a number of potential issues. When confronted with multiple pages of duplicate content, a search engine will attempt to identify a canonical page and display it within the search results. The page it chooses may not be the one you would prefer.
An additional issue is that if links are made to these different pages, the benefits of these links might potentially be split between different page variants, as search engines are not always able to identify duplicate pages. The inherent value that comes from anchor texts and link weight, through both internal and external links to the page, will be diluted by having multiple pages with duplicate content.
Many websites, especially e-commerce sites, have multiple ways to navigate to individual items of content. This in turn leads to multiplication of pages with little or no variation. For example, here are some common variations of a site’s homepage:
- http://example.com
- http://www.example.com
- https://www.example.com
- http://www.example.com/index.html
- http://www.example.com/?referer=page.html
Each of these URLs would show a visitor an identical copy of the site’s homepage.
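To make the duplication concrete, here is a minimal Python sketch that collapses the homepage variants listed above into a single preferred URL and reports when a 301 redirect would be needed. The preferred scheme and host, the list of index file names, and the function names are all assumptions made for illustration. Dropping the whole query string is only reasonable here because the homepage is assumed to take no meaningful parameters.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration: the site owner prefers the http://www form,
# and these file names all serve the same page as their directory.
PREFERRED_SCHEME = "http"
PREFERRED_HOST = "www.example.com"
BARE_HOST = "example.com"
INDEX_FILES = {"index.html", "index.htm"}


def preferred_homepage_url(url):
    """Collapse the homepage variants into one preferred URL."""
    parts = urlsplit(url)

    # Treat the non-www host as an alias of the preferred host.
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = PREFERRED_HOST

    # An empty path and "/" are the same page; index files are stripped.
    path = parts.path or "/"
    last_segment = path.rsplit("/", 1)[-1]
    if last_segment in INDEX_FILES:
        path = path[: -len(last_segment)]

    # Query string and fragment are dropped (homepage-only assumption).
    return urlunsplit((PREFERRED_SCHEME, host, path, "", ""))


def redirect_for(url):
    """Return (301, target) if the URL is a duplicate variant, else None."""
    target = preferred_homepage_url(url)
    return None if url == target else (301, target)


if __name__ == "__main__":
    for variant in (
        "http://example.com",
        "http://www.example.com",
        "https://www.example.com",
        "http://www.example.com/index.html",
        "http://www.example.com/?referer=page.html",
    ):
        print(variant, "->", redirect_for(variant))
```

All five variants map to the same target, which is the behaviour a set of 301 redirect rules on the server would normally provide.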
Generally these sorts of duplicate URLs are dealt with by using 301 redirects to send visitors to the correct version of a page. However, there are some instances where you might actually want to have multiple pages which are very similar, and therefore a 301 redirect is not suitable. For example, on many e-commerce sites, you can often sort lists of products in numerous ways. In these instances, the Robots Exclusion Protocol is usually called upon to block these duplicate pages from the search engines.
Yesterday Google, Yahoo! and Microsoft jointly announced a new HTML tag in an effort to help site designers and search engines more accurately define a website’s canonical pages. The tag provides search engines with a suggestion from the site’s owner that a specific page should be considered the canonical version and therefore more authoritative than its duplicate brothers. This new tag is as follows:
<link rel="canonical" href="http://example.com" />
This is inserted within the <head> element of duplicate pages and enables search engine spiders to identify which URL is the site owner’s preferred canonical page. The tag transfers search ‘signals’, such as Google PageRank, to the preferred canonical URL rather than dispersing them across multiple URLs.
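If your pages are generated from templates, the tag can be produced by a small helper. The following Python sketch is one possible way to do it; the function name is hypothetical, and the only real work is escaping the URL so that characters such as & remain valid inside the href attribute.

```python
from html import escape


def canonical_link_tag(canonical_url):
    """Build the link element for the <head> of a duplicate page.

    Hypothetical helper: the URL is HTML-escaped so that query strings
    containing & or quotes do not break the attribute value.
    """
    return '<link rel="canonical" href="%s" />' % escape(canonical_url, quote=True)


if __name__ == "__main__":
    print(canonical_link_tag("http://example.com"))
    print(canonical_link_tag("http://www.example.com/list?cat=1&page=2"))
```

A template would call this once per page and place the result inside the <head> element.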
This new tag should be considered as a new tool in the SEO toolbox, and does not necessarily mean that other methods of reducing duplicate content are no longer useful. In general, it is still going to be better to use 301 redirects in the majority of cases. Here are several reasons why:
- This tag only provides a hint to search engines – they will consider it as part of their algorithm, but it is by no means certain that a search engine will pick your choice. Each search engine may weigh the hint differently, so behaviour may differ between them.
- It requires pages to be identical or almost-identical. Exactly how “identical” will be down to each search engine to decide, which again leads to different results in different search engines, and it won’t work at all if the pages are significantly different.
- It doesn’t work in any other search engines (at least yet). This is more important internationally, where different search engines may have different market shares, and local players may have a significant presence.
- Search engines have to crawl the duplicate URLs, leading to increased server load and potentially reduced coverage of the rest of your site.
- It doesn’t work across different domains (although it does work across different subdomains on the same domain).
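Given these limitations, it can be worth checking what your pages actually declare. The sketch below uses Python's standard html.parser module to read the canonical hint from a page's <head>, then checks that the target stays within the same site, since cross-domain hints are not honoured. The class and function names are hypothetical, and the site's domain is passed in explicitly as an assumption rather than derived from the URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class CanonicalFinder(HTMLParser):
    """Record the first rel="canonical" link found inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head and self.canonical is None:
            attr = dict(attrs)
            rels = (attr.get("rel") or "").lower().split()
            if "canonical" in rels and attr.get("href"):
                self.canonical = attr["href"]

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_canonical(html_text):
    """Return the canonical href declared in the page head, or None."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    finder.close()
    return finder.canonical


def same_site(page_url, canonical_href, site_domain):
    """True if both URLs are on site_domain or one of its subdomains.

    site_domain (e.g. "example.com") is supplied by the caller; this
    sketch makes no attempt to work it out automatically.
    """
    target = urljoin(page_url, canonical_href)  # resolves relative hrefs
    hosts = [urlsplit(u).hostname or "" for u in (page_url, target)]
    return all(h == site_domain or h.endswith("." + site_domain) for h in hosts)


if __name__ == "__main__":
    page = (
        '<html><head><title>Shoes</title>'
        '<link rel="canonical" href="http://www.example.com/list" />'
        '</head><body></body></html>'
    )
    href = find_canonical(page)
    print(href, same_site("http://shop.example.com/list?sort=price", href, "example.com"))
```

A check like this only confirms that the hint is present and points within the site; whether a search engine honours it is still up to the search engine.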
There are also instances of duplicate content where it may not be the most appropriate tool to use, for example:
- Anywhere you would usually use a 301 redirect (see above for why), such as dealing with the HTTPS version of a site, the non-www site version or index.html pages. A 301 redirect is definitely the best solution in these situations.
- Accessible text-only versions of a site – contrary to popular belief these are actually recommended against by the W3C, and should not be used at all. It is quite possible to make your site accessible while retaining all the bells and whistles.
- Printer-friendly pages – a better solution is to create CSS style sheets for print media, which will make any page on your site printable.
- Duplicate pages caused by use of session IDs or referrer tracking in URLs – These cause spidering problems and should not be used. Additionally, as URLs may be shared, these are not reliable.
However, this tool definitely introduces a number of new ways of dealing with certain difficult types of duplicate content issues. Here are some great uses for this new tag:
- As mentioned earlier, many e-commerce sites allow you to sort lists of products in numerous ways. In these instances, the Robots Exclusion Protocol is usually used to block these pages. However, it is now possible to use this tag to point to the canonical version of a page, thus effectively merging these URLs in the eyes of the search engines, whilst leaving the site experience unaltered.
- In the same way as sorting, this is potentially a suitable tool to use with pagination, although if the pages are not similar enough the search engines may ignore the hint. We would, however, generally recommend allowing the search engines to spider at least one paginated form.
- One common method of split testing involves using URL parameters to differentiate pages (although this isn’t the only way of doing it). Using this new tag allows you to perform split testing in this way without causing duplicate content issues. A side note – if you’re not doing split testing, you should be!
- Another issue common to e-commerce sites is multiple methods of navigating to a particular product. For example, a product may be included in a general category and also be accessible in a category for the product brand. In some instances a 301 redirect may be possible, but sometimes the branding and navigation may need to be kept intact, and this tag is a suitable tool in this instance.
- Where you want to handle people using invalid URL parameters. In fact, you could argue that every single page on a site should use this new link tag for this reason alone. With this tag in place, you no longer have to worry about links to your site being made with weird URL parameters – they’ll all be consolidated for you!
A final, if somewhat niche, way of using this tag is where you are showing multiple revisions of a document. A good example of this is a wiki, where documents can be edited and each revision of a page is stored for posterity. In fact, Wikia was the partner for the search engines to help them test this new tag out.
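Several of the uses above (sorting, split testing and stray URL parameters) come down to the same operation: deciding which query parameters actually change the content, and pointing every other variant at the URL without them. Here is a minimal Python sketch of that idea. The parameter names in CONTENT_PARAMS and the function name are hypothetical; on a real site the list would depend on how its URLs are built.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical allowlist: parameters that select different content.
# Anything else (sort order, split-test variant, junk) is treated as
# producing a duplicate of the same page.
CONTENT_PARAMS = ("category", "product")


def canonical_url(url, content_params=CONTENT_PARAMS):
    """Return the URL to use in the canonical link tag for this request."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in content_params
    ]
    kept.sort()  # stable parameter order, so reordered URLs also merge
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    for url in (
        "http://www.example.com/list?category=shoes&sort=price",
        "http://www.example.com/list?variant=b&category=shoes",
        "http://www.example.com/list?category=shoes&foo=bar",
    ):
        print(url, "->", canonical_url(url))
```

Using an allowlist rather than a blocklist means that unexpected parameters are consolidated by default, which is the behaviour described for links made with weird URL parameters.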
The exact impact of this new HTML tag is yet to be accurately measured and, in general, it will still be much more beneficial to use a traditional 301 redirect. However, for those instances where a 301 is not the appropriate solution, the rel=canonical link tag provides an invaluable new addition to the SEO toolbox.
Update: Ask.com are also going to support the canonical tag.
Additional research by John Trivett
