There are very few occasions when duplicate content is a good thing. The search engines are not fond of it; it takes up unnecessary space in their indices and, in some cases, stops them from showing the right page. It should be made clear, however, that there is no such thing as a "duplicate content penalty", at least where Google is concerned.
Nonetheless, duplicate content is something that really should be avoided. Link authority can be split if people link to different copies of the same page. It can also skew visitor tracking, as people click on different copies that have made it into the search results. It can be irritating for a user to click on a link and find content identical to something they have previously looked at. Having lots of duplicate pages on a site also makes spidering less efficient, as search engine spiders will spend time downloading what is essentially the same page over and over again, rather than spidering new or changed pages.
The problem is that there are many ways in which duplicate content can inadvertently be created. I’ll discuss just a few.
Perhaps the most common causes stem from the way in which a site’s URLs are set up in the first place. A web server often serves the same page at both the root URL and a URL that names the default document. http://www.example.com/ would be the root, for example, and the page with the default document would be http://www.example.com/index.html. The default document doesn’t have to be index.html. It could be any one of a number of things, including index.htm, index.php, index.asp, index.aspx, default.aspx, etc.
There are also situations in which a site can be reached at its "non-www" version (for example, http://example.com/), and situations in which the site has a duplicate https: version (for example, https://www.example.com/).
If a site had both the "non-www" and "https:" duplicate content issues there would be four copies of every page on the site, and if the issue affected a large site, the total number of duplicates would increase rapidly. Add printer-friendly pages and dynamic URLs to the possible causes and you can see that duplicate content can easily get out of hand.
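To see how quickly the copies multiply, here is a rough sketch in Python that lists every URL under which one home page might be reachable. The scheme, host and default document values are assumptions chosen to match the examples above, and the function name is made up for illustration.

```python
from itertools import product

# Illustrative values only; adjust them to the site being checked.
SCHEMES = ("http", "https")
HOSTS = ("www.example.com", "example.com")
DEFAULT_DOCS = ("", "index.html")  # "" stands for the bare root URL


def duplicate_variants(schemes=SCHEMES, hosts=HOSTS, docs=DEFAULT_DOCS):
    """Return every URL under which the same home page could be served."""
    return [
        f"{scheme}://{host}/{doc}"
        for scheme, host, doc in product(schemes, hosts, docs)
    ]


variants = duplicate_variants()
for url in variants:
    print(url)
print(len(variants))  # 2 schemes x 2 hosts x 2 documents = 8
```

With the default document left out, the count drops to the four copies described above; each extra source of duplication multiplies the total again.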
Just as there are many ways in which duplicate content can be created, there are many ways in which the problem can be alleviated. The most obvious solution is not to create it in the first place – avoid session IDs in URLs, for example.
Another tried and trusted method of dealing with duplicate content, especially where the actual pages are identical, is the 301 redirect. This will prevent the site from displaying duplicate pages on different URLs.
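As a rough illustration of the decision behind such a redirect, the Python sketch below works out which URL a request should be sent to with a 301. It is not tied to any particular web server; the preferred scheme and host are assumptions that a site owner would choose, and redirect_target is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for this sketch: the site owner has picked one scheme and host.
PREFERRED_SCHEME = "http"
PREFERRED_HOST = "www.example.com"
DEFAULT_DOCUMENTS = {
    "index.html", "index.htm", "index.php",
    "index.asp", "index.aspx", "default.aspx",
}


def redirect_target(url):
    """Return the URL to 301-redirect to, or None if url is already preferred."""
    parts = urlsplit(url)
    path = parts.path or "/"
    directory, _, filename = path.rpartition("/")
    # Drop a trailing default document so /index.html collapses to /.
    if filename.lower() in DEFAULT_DOCUMENTS:
        path = directory + "/"
    target = urlunsplit((PREFERRED_SCHEME, PREFERRED_HOST, path, parts.query, ""))
    return None if target == url else target


print(redirect_target("http://example.com/index.html"))
print(redirect_target("https://www.example.com/"))
print(redirect_target("http://www.example.com/"))
```

The first two calls return http://www.example.com/ and the third returns None, meaning no redirect is needed. In practice the same rules would normally live in the web server's own rewrite configuration rather than in application code.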
Where the pages are not exactly identical, the rel=canonical tag can be used to indicate to the search engines that one particular page is the definitive version. As an aside, Google has recently announced that it will be adding cross-domain support for the canonical tag in the near future, and both Bing and Yahoo have said that they will be adding support for canonicalization across the same domain by the end of the year.
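For anyone who wants to check which definitive version a page declares, here is a minimal sketch using Python's standard html.parser module. The sample markup, the widgets.html URL and the find_canonical helper are all invented for illustration.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel can hold several space-separated values, in any letter case.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in html, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# A made-up printer-friendly page pointing at its definitive version.
page = """<html><head>
<title>Widgets (printer-friendly)</title>
<link rel="canonical" href="http://www.example.com/widgets.html" />
</head><body>...</body></html>"""

print(find_canonical(page))  # http://www.example.com/widgets.html
```

Running a check like this over the near-duplicate pages on a site is one way to confirm that they all point at the same definitive URL.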
As mentioned in a recent blog post, Google now offers a parameter handling tool which can be used to help with duplicate content from dynamic URLs. It is probably best used if the solutions given above are not feasible and possibly only in situations where session IDs and similar issues are causing the problem. In fact, Google indicated in its official blog that there are situations in which using a rel="canonical" tag is wiser, especially as it is supported by many other search engines as well as by Google.
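To get a feel for what ignoring a parameter means, the Python sketch below strips session-style parameters from dynamic URLs so that the copies collapse to one address. It is only a conceptual illustration, not a description of how Google's tool works, and the parameter names sessionid and sid are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names; use whatever the site actually appends.
IGNORED_PARAMETERS = {"sessionid", "sid"}


def strip_ignored_parameters(url, ignored=IGNORED_PARAMETERS):
    """Return url with the ignored query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignored
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


urls = [
    "http://www.example.com/product.php?id=7&sessionid=abc123",
    "http://www.example.com/product.php?id=7&sessionid=xyz789",
    "http://www.example.com/product.php?id=7",
]
unique = {strip_ignored_parameters(url) for url in urls}
print(unique)  # {'http://www.example.com/product.php?id=7'}
```

Three addresses that differ only by session ID reduce to a single URL, which is the page that should end up in the index.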
Tags: canonical, duplicate content, redirects, SEO
