That’s why it is worth devoting some attention to the topic of duplicate content in the context of 2011. Here you will find detailed information about what duplicate content is, how it appears, how to find it, and how to get rid of it.
1. What is duplicate content?
Let’s start with the basics. Duplicate content occurs when any two (or more) pages have the same content. For example:
It's simple, isn’t it? So why does such a simple concept create so many problems? One problem is that people often make the mistake of thinking of a page as a file or document sitting on a web server. To a spider (crawler) such as Googlebot, a page is any unique URL it manages to find, usually via internal or external links. On large, dynamic sites in particular, creating two URLs that point to the same content is surprisingly easy (and often accidental).
2. Why is duplicate content so important?
The problem of duplicate content appeared long before the "Panda" algorithm update, and it has taken many forms as the algorithm has changed. Here's a quick look at some of the major issues duplicate content has caused over the years...
Supplemental index
In Google's early days, indexing the Web was a huge computational task. To cope with it, pages that were duplicates or simply of poor quality were stored in a secondary index, called the «Supplemental» index. These pages automatically became second-class citizens in SEO terms and lost the ability to compete with other pages.
At the end of 2006, Google integrated the supplemental results back into the main index, but those results were still often filtered out. You could tell that pages had been filtered when you saw the following notice at the bottom of Google's results:
Even though the index was unified, some results were still "left out", with obvious SEO implications. Of course, in many cases these pages really were duplicates or had little search value, so the practical SEO impact was usually slight, but not always.
Spider “budget”
Talking about limits is always tricky when it comes to Google, because people want to hear absolute numbers. There is no absolute budget or fixed number of pages that Google will index on a site. However, there is a point at which Google may stop crawling your site for a while, especially if you keep sending its spiders deep into the site structure.
Although the budget is not absolute, even for a given site, you can get an idea of how Google's crawl activity is distributed in Google Webmaster Tools.
What happens when Googlebot wades through masses of duplicate URLs and pages? In practice, the pages you actually want indexed may not even get crawled; at best, they will probably be crawled less often.
Indexing "volume"
Similarly, there is no fixed number of pages Google will index on a site. There does seem to be some dynamic limit, and it is related to the site's authority. If you fill the index with useless duplicate pages, you can crowd out more important, deeper pages. For example, if you let thousands of internal search result pages get indexed, Google may not index all of your important pages. Many people make the mistake of thinking that the more pages are in the index, the better; I have seen many situations where the opposite was true. All else being equal, a bloated index dilutes your ability to rank.
3. Three types of duplicates
Before we look at examples of duplicate content and the tools for working with them, I’d like to identify three main categories of duplicates: real duplicates, partial duplicates, and cross-domain duplicates. I will refer to these three main types throughout the rest of this post.
1. Real duplicates
A real duplicate is any page that is 100% identical (in content) to another page; such pages differ only in their URL:
2. Partial duplicates
Partial duplicates differ from one another only slightly: a block of text, a picture, or even the order in which the content is output.
3. Cross-domain duplicates
Cross-domain duplicates appear when two sites share the same content, in full or in part. Contrary to what most people think, cross-domain duplication can be a problem even for legitimate, licensed content.
4. Tools for fighting duplicates
It may seem backwards to discuss the tools before the problems themselves, but I’d like to cover the tools for working with duplicates first, so that I can refer to them when we move on to specific examples.
1. 404 error
Of course, the easiest way to deal with a duplicate page is simply to remove it and return a 404 error. If the content genuinely has no value to visitors or search engines, and it has no significant inbound links or traffic, then complete removal is an ideal solution.
2. 301 redirect
Another way to remove a page is a 301 redirect. Unlike a 404, a 301 tells visitors (humans and bots) that the page has moved to another address permanently. Human visitors land straight on the new page, and for SEO purposes the majority of the inbound link authority is also passed to the new page. If your duplicate content has its own distinct URL and that duplicate brings in traffic and inbound links, a 301 redirect is the perfect solution.
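As a minimal sketch, assuming an Apache server, a 301 redirect can be declared in an .htaccess file; the two paths here are hypothetical:

```apache
# Permanently redirect the duplicate URL to the main version
# (/old-page.html and /new-page.html are hypothetical paths)
Redirect 301 /old-page.html http://www.site.com/new-page.html
```

Visitors and bots requesting /old-page.html will receive an HTTP 301 status along with the new address.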
3. Robots.txt
Robots.txt is another way to get rid of duplicates while leaving them publicly accessible; it is the oldest and easiest method. It looks like this:
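A typical robots.txt rule for keeping spiders out of a duplicate section might look like this (the /search/ folder is a hypothetical example):

```
User-agent: *
Disallow: /search/
```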
One advantage of robots.txt is that it makes it easy to block sub-folders, categories, and even URL parameters. The disadvantage is its unreliability: robots.txt is suitable for blocking content that has not yet been indexed, but it is not the best solution for content that is already in the index. In general, Google itself does not recommend robots.txt as a weapon against duplicates.
4. Meta Robots
You can also control the behavior of the search robot at the page level, using a header-level directive known as the "Meta Robots" tag. It looks like this:
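A minimal example, blocking both indexing and link-following on a page:

```html
<meta name="robots" content="noindex, nofollow">
```

The tag goes inside the &lt;head&gt; of the page it should apply to.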
This directive tells the robot not to index a given page and/or not to follow the links on it.
Another useful variant of Meta Robots is "NOINDEX, FOLLOW", which lets the bot crawl the paths on a page without adding the page itself to the search index. This can be handy for internal search result pages, when you want to block certain content variations but still want the robot to follow the paths to your product pages.
There is no need to add "INDEX, FOLLOW", as that is the default behavior.
5. Rel=Canonical
In 2009, the search engines came together to create the rel=canonical directive. It allows webmasters to specify the canonical version of any page. The tag is placed in the page header (like Meta Robots), and here's a simple example:
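A sketch of the tag, with a hypothetical canonical URL:

```html
<link rel="canonical" href="http://www.site.com/page/">
```

Every variation of the page (tracking parameters, session IDs, and so on) would carry this same tag, all pointing at the one canonical address.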
When search engines encounter the canonical tag, they attribute the page's ranking properties to the canonical URL it points to.
6. Google URL removal tool
In Google Webmaster Tools, you can manually submit a request to remove an individual page (or directory) from the index.
It is important to know that before you request the removal of a page, it must meet one of these requirements:
- The page returns a 404 error;
- It is blocked in robots.txt;
- It is blocked by Meta Noindex.
7. Blocking URL parameters in Google Webmaster Tools
8. Rel=Prev & Rel=Next
This year, a new tool was introduced for fighting one kind of partial duplicate: paginated pages (page-number sequences).
In this example, the search robot has landed on page 3 of the results, so two tags are needed: a rel=prev tag pointing to page 2, and a rel=next tag pointing to page 4. You will almost always have to generate these tags dynamically, since the results are most likely produced from a single template.
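A sketch of what the &lt;head&gt; of page 3 might contain (the URLs are hypothetical):

```html
<link rel="prev" href="http://www.site.com/results?page=2">
<link rel="next" href="http://www.site.com/results?page=4">
```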
Bing currently ignores these tags, and in fact there is not yet much information on their effectiveness. I will briefly describe other ways to work with such content in the next section.
9. Syndication-Source
In November 2010, Google introduced a set of tags for publishers of syndicated content. The syndication-source meta tag can be used to indicate the original source of a republished article, like this:
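A sketch with a hypothetical source URL:

```html
<meta name="syndication-source" content="http://www.site.com/original-article/">
```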
Even Google's own advice about when to use this tag and when to use the cross-domain canonical tag is not entirely clear. Google has launched the tag as "experimental": it is interesting to know about, but do not count on it.
This tag is appropriate when you want to publish one article on several resources and you have access to add the tag to the &lt;head&gt; of the page.
10. Internal linking
It is important to remember that the best cure for duplicates is preventing duplicate content in the first place. Unfortunately, that is not always possible, but if you have cleaned up most of your problems, you should probably double-check the internal linking in your site structure.
When you correct a duplicate content problem, with a 301 redirect or a canonical tag for example, it is also important to reflect that change in other parts of the site. It often happens that someone puts a 301 redirect or a canonical tag on one version of a page, and then keeps linking from internal pages to the non-canonical version and fills the XML sitemap with non-canonical URLs. Internal links are strong signals, and by sending mixed signals you only create problems for yourself.
Examples of duplicate content:
1. "www" and non-www
The most common mistake, one that creates full-site duplication:
www.site.com
site.com
To solve this problem, use a 301 redirect; it is the best solution in this case.
You can also set the preferred domain in your Google Webmaster Tools panel. To do this, you must add both the www and non-www versions of the domain to your Webmaster Tools account.
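As a sketch, assuming an Apache server and www.site.com as the preferred version, the non-www domain can be 301-redirected in .htaccess with mod_rewrite:

```apache
RewriteEngine On
# Redirect any request for site.com to www.site.com, keeping the path
RewriteCond %{HTTP_HOST} ^site\.com$ [NC]
RewriteRule ^(.*)$ http://www.site.com/$1 [R=301,L]
```

If the non-www version is preferred instead, the condition and target are simply swapped.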
2. Stage of site development
During site development, subdomains are often created for testing the website.
For example:
site.com
test.site.com
Do not forget to block such subdomains using robots.txt. If one has already been indexed, you should probably merge those pages into the main site with a 301 redirect, or use the Meta Noindex tag.
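To block an entire test subdomain, a robots.txt file placed at the root of that subdomain (e.g. test.site.com/robots.txt) can disallow everything:

```
User-agent: *
Disallow: /
```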
3. Slash ("/") at the end of URL
site.com
site.com/
Technically, under the HTTP protocol these are different addresses. Nowadays, browsers automatically add the trailing slash to such a path in most cases, and Matt Cutts has said in a video that Google automatically recognizes such URLs in most cases as well.
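For deeper paths (e.g. site.com/page vs. site.com/page/), a sketch of an Apache rewrite rule that 301-redirects the slash-less version, assuming the slashed version is the preferred one:

```apache
RewriteEngine On
# Only rewrite requests that do not correspond to a real file
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*[^/])$ /$1/ [R=301,L]
```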
4. HTTPS
5. Duplicates of the main pages
6. ID sessions
7. International duplicates
8. Search sorting
9. Filters in the search
10. Search pagination
11. Online magazine options
12. Stolen content
How to find duplicates
1. Google Webmaster Tools
2. A site: query in Google
3. Looking through the site