Google explains how it detects duplicate content and performs canonicalization
The developers talked about this in the new episode of the Search Off The Record podcast.
Google employees John Mueller, Martin Splitt, Gary Illyes, and Lizzi Harvey went into detail on duplicate content and how Google handles canonicalization. We have picked out the most important points.
How Google detects duplicate pages
The mechanism turned out to be fairly simple: Google computes a checksum for each page, a compact fingerprint derived from the text of the page. If two pages have matching checksums, Google treats them as duplicates. Checksums are also widely used elsewhere, for example to verify the integrity of data during transmission.
The checksum is not calculated over the whole page but over its central element, the main content without headers, footers, and sidebars. Once the checksums are calculated, matching pages are grouped into a cluster of duplicates, and Google chooses one page from the cluster to appear in the SERP. This way the search engine can catch partial duplicates as well as full ones.
Martin Splitt on partial duplicate detection: “We have several algorithms that detect and ignore the template part of pages. For example, this is how we exclude navigation from the checksum calculation and remove the footer. We are left with what we call the central element: the central content of the page, something like its very essence. After calculating and comparing checksums, we combine those that are strongly or partially similar to each other into a duplicate cluster.”
Reducing each page to a checksum simplifies the work: the developers see no point in comparing the full text of every page, since that would take more resources and give the same result.
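The idea of hashing only the central content and grouping pages by the result can be illustrated with a minimal sketch. It is not Google's actual algorithm, which is not public: the function names, the SHA-256 hash, and the whitespace and case normalization are all assumptions made for illustration, and the input is assumed to be text from which navigation, footers, and sidebars have already been stripped. An exact hash like this only catches identical content; detecting partially similar pages would need a similarity-preserving fingerprint instead.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def content_checksum(main_text: str) -> str:
    """Hash the central content after normalizing case and whitespace.

    Assumption: boilerplate (navigation, footer, sidebars) was removed earlier.
    """
    normalized = " ".join(main_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cluster_duplicates(pages: Dict[str, str]) -> List[List[str]]:
    """Group URLs whose central content produces the same checksum."""
    clusters = defaultdict(list)
    for url, main_text in pages.items():
        clusters[content_checksum(main_text)].append(url)
    return [sorted(urls) for urls in clusters.values()]


# Hypothetical pages: the first two differ only in spacing and letter case.
pages = {
    "https://example.com/a": "Blue widgets   on sale today.",
    "http://example.com/a?ref=1": "blue widgets on sale today.",
    "https://example.com/b": "Red widgets are out of stock.",
}
clusters = cluster_duplicates(pages)
for cluster in clusters:
    print(cluster)
```

Comparing short fixed-length checksums instead of full page text is what keeps the grouping step cheap: each page is read once, and the comparison itself is a dictionary lookup.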
How Google selects a canonical page
The podcast also spelled out the difference between duplicate detection and canonicalization: first, duplicate pages are identified and grouped together, and then the main page of the group is chosen. That second step is canonicalization.
Canonicalization is the process of selecting the main (canonical) page in a cluster. To make the selection objective, Google uses more than 20 signals. A neural network assigns their weights. When the weight of one signal goes down, the weight of another goes up, and vice versa.
Martin Splitt on signals: “Obviously one of them is the content of the page. But there may be other signals: which page has a higher PageRank, which protocol the page is on (http or https), is the page included in the sitemap, is it redirected to another page, is the rel=canonical attribute set … Each of these signals has its own weight, which we use machine learning to calculate. After comparing all signals for all pairs of pages, we arrive at the actual definition of the canonical.”
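To make the idea of weighted signals concrete, here is a rough sketch that scores each page in a duplicate cluster and picks the highest-scoring one. Everything in it is hypothetical: the `PageSignals` fields cover only a few of the signals named in the quote, the weights are invented rather than learned, and a plain weighted sum stands in for the pairwise comparison Google describes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageSignals:
    url: str
    pagerank: float           # assumed to be normalized to the range 0..1
    is_https: bool
    in_sitemap: bool
    is_redirect_target: bool  # other URLs redirect to this one
    has_rel_canonical: bool   # other pages name this URL in rel=canonical


# Invented weights for illustration; in the article they come from machine learning.
WEIGHTS = {
    "pagerank": 0.3,
    "is_https": 0.2,
    "in_sitemap": 0.1,
    "is_redirect_target": 0.2,
    "has_rel_canonical": 0.2,
}


def score(page: PageSignals) -> float:
    """Weighted sum of the page's signals (booleans count as 0 or 1)."""
    return sum(weight * float(getattr(page, name)) for name, weight in WEIGHTS.items())


def pick_canonical(cluster: List[PageSignals]) -> PageSignals:
    """Return the page with the highest combined signal score."""
    return max(cluster, key=score)


cluster = [
    PageSignals("http://example.com/a?ref=1", 0.6, False, False, False, False),
    PageSignals("https://example.com/a", 0.5, True, True, False, True),
]
canonical = pick_canonical(cluster)
print(canonical.url)
```

In this example the https page wins despite its slightly lower PageRank, because the sitemap, protocol, and rel=canonical signals together outweigh the difference.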
Finally, the developers noted that canonicalization has nothing to do with ranking.