Google Algorithm Leak – What Can We Learn From It?

28 May 2024
Google uses search engine click-through data and Chrome browser data to re-rank SERPs. It also extracts dates and authors’ names from content and pays more attention to text written in larger fonts. These are just some of the insights to be found in Google’s recently leaked documentation!

Nobody expects an algorithm leak. Yet, due to a tiny configuration error, part of Google’s internal documentation, describing in simple terms how the search engine works, has come to light. This was thoroughly described by Mike King, whose article I sincerely recommend: https://ipullrank.com/google-algo-leak. It’s a long yet compelling read (at least for SEO enthusiasts).

There is a lot of material, and not all of it has been analyzed yet, so we should expect new insights and analyses to surface in the coming days (and weeks). Remember, too, that not all of the documentation leaked and that context can significantly change how individual records are interpreted. So, what can we find there?

  • The leaked documentation includes over 2,500 modules containing a total of 14,000 attributes (features) – that is, factors that Google’s algorithm can take into account. However, these are not explicitly “ranking factors.”
  • The leaked modules apply not only to Google’s main search engine but also to YouTube, video search, Google Books, Google Assistant, and page crawling infrastructure.

Revealed Algorithm Data vs. Official Google Representatives’ Statements

The leaked documentation confirms a common industry belief that not every word from Google representatives should be taken at face value. After all, there’s a reason why Matt Cutts, who for years served as something of a Google Search spokesman, earned the nickname Matt the Liar. Statements by Google officials were often at odds with the experience and test results of SEO specialists. Now it’s as clear as day that the official narrative diverges from how the algorithm actually works.

Here are some examples:

  • We don’t have anything like domain authority – Gary Illyes, Google Search team analyst. Meanwhile, there is a siteAuthority parameter in the documentation. It’s not clear exactly how it works, nor should it be equated with Domain Authority from Moz, Domain Rating from Ahrefs, or analogous parameters from other third-party tools.
  • We don’t use clicks for rankings – This is not news; evidence that clicks on organic results are used to change rankings had already surfaced during the US Department of Justice antitrust lawsuit against Google. Rand Fishkin, founder of Moz, had also previously spoken about the use of organic clicks, but Google representatives denied it. Chapeau bas for Rand. In fact, the NavBoost system, a part of Google’s algorithm that focuses largely on click signals, is said to be one of the strongest ranking factors.
  • There is no sandbox – The leaked data confirms the theory that new domains need time before Google starts displaying them more widely. John Mueller denied the existence of a sandbox; meanwhile, the documentation includes attributes related to the age of the host (i.e., domain).
  • We don’t use anything from Chrome for ranking – The leaked documentation also reveals attributes related to site traffic data recorded by the Chrome browser. This indicates that Google’s browser collects data on user activity on individual sites (most likely engagement-related), which is then used to re-evaluate search results.

Google officials like John Mueller or Gary Illyes are severely limited in what they can and cannot say in public. Their maneuvering between corporate guidelines and the complexity of Google’s algorithm means their words should be taken with a pinch of salt.

Google’s Ranking System Architecture

The disclosed documentation also confirms that Google’s algorithm is not a single system but rather a conglomerate of microservices operating simultaneously. Below, you’ll find the most important of them, divided according to their tasks.

Crawling

  • Trawler – The web crawling system. It maintains a crawl queue, manages crawl rates, and analyzes how often pages change (a toy scheduler in this spirit is sketched below).
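
The leak names Trawler but says nothing about its actual scheduling policy. As a purely illustrative sketch under that caveat, here’s how a crawl queue that revisits frequently changing pages sooner might look – all class names, URLs, and intervals below are hypothetical:

```python
import heapq
import time

# Toy recrawl scheduler in the spirit of the Trawler description above:
# a priority queue of URLs ordered by next-due crawl time, where pages
# that change often are revisited sooner. Purely illustrative - the leak
# does not describe Trawler's actual scheduling policy.

class CrawlQueue:
    def __init__(self):
        self._heap = []  # entries: (next_due_timestamp, url, change_interval_s)

    def add(self, url, change_interval_s, now=None):
        now = time.time() if now is None else now
        heapq.heappush(self._heap, (now + change_interval_s, url, change_interval_s))

    def pop_due(self, now=None):
        # Return the next URL whose recrawl time has arrived and reschedule it.
        now = time.time() if now is None else now
        if self._heap and self._heap[0][0] <= now:
            _, url, interval = heapq.heappop(self._heap)
            heapq.heappush(self._heap, (now + interval, url, interval))
            return url
        return None

q = CrawlQueue()
q.add("https://news.example/home", change_interval_s=60, now=0)     # changes often
q.add("https://docs.example/faq", change_interval_s=86_400, now=0)  # rarely changes
print(q.pop_due(now=120))  # the frequently changing page comes up first
```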

Indexing

  • Alexandria – The core indexing system.
  • SegIndexer – A system that places documents into tiers within the index.
  • TeraGoogle – Secondary indexing system for documents stored long-term on disk.

Rendering

  • HtmlrenderWebkitHeadless – JavaScript page rendering system.

Processing

  • LinkExtractor – Extracts links from pages.
  • WebMirror – System for managing canonicalization and duplication.

Ranking

  • Mustang – Main scoring, ranking, and serving system.
  • Ascorer – The primary ranking algorithm that ranks pages prior to any re-ranking adjustments.
  • NavBoost – Re-ranking system based on click logs of user behavior (see the sketch after this list).
  • FreshnessTwiddler – Re-ranking system for documents based on freshness.
  • WebChooserScorer – Defines feature names used in snippet scoring.
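
The leak names these systems but not their math, so the following is only a minimal sketch of how a primary scorer and independent re-rankers (“twiddlers” in the NavBoost/FreshnessTwiddler spirit) might compose – every weight, formula, and field name here is a hypothetical illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    url: str
    base_score: float      # score from the primary ranker (Ascorer-like)
    good_clicks: int = 0   # click-log signal (NavBoost-like input)
    impressions: int = 0
    age_days: int = 365    # document age (FreshnessTwiddler-like input)
    score: float = field(init=False)

    def __post_init__(self):
        self.score = self.base_score

def navboost_twiddler(docs):
    # Boost documents with a high click-through rate (hypothetical weighting).
    for d in docs:
        if d.impressions:
            d.score *= 1.0 + 0.5 * (d.good_clicks / d.impressions)

def freshness_twiddler(docs):
    # Mildly favor fresher documents (hypothetical decay curve).
    for d in docs:
        d.score *= 1.0 + 0.2 / (1.0 + d.age_days / 180)

def rerank(docs, twiddlers):
    # Each twiddler adjusts scores independently; the final sort serves results.
    for twiddle in twiddlers:
        twiddle(docs)
    return sorted(docs, key=lambda d: d.score, reverse=True)

results = rerank(
    [Doc("a.example", 0.80, good_clicks=40, impressions=100, age_days=30),
     Doc("b.example", 0.85, good_clicks=5, impressions=100, age_days=900)],
    [navboost_twiddler, freshness_twiddler],
)
for d in results:
    print(d.url, round(d.score, 3))
```

Note how the page with the lower base score ends up first purely on click and freshness signals – which is the point the leak makes about NavBoost’s weight.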

Serving

  • Google Web Server – The server that Google’s frontend interacts with. It receives data to display to the user.
  • SuperRoot – The brain of the Google search engine; SuperRoot sends messages to Google’s servers and manages the post-processing system for re-ranking and presentation of results.
  • SnippetBrain – The system responsible for generating snippets.
  • Glue – The system gluing universal results together using user behavior.
  • Cookbook – The system that generates signals. There are indications that some values are created at runtime.

Practical Tips Based on the Leaked Documentation

  • To rank higher in search results, a website must earn clicks for a growing number of phrases and steadily acquire links.
  • Quality organic traffic matters – you need to polish the UX layer and match the site’s content with the search intent.
  • Google can and attempts to extract information about the author of the content. Thus, signing articles and clearly marking authorship actually makes sense.
  • Links should match the topic of the target page – topically relevant backlinks carry more weight.
  • Fresh content is treated as more important – it’s worth planning a content update process.
  • Google can catch a mass influx of spammy links and simply ignore them, and the leaked documentation does not mention the Disavow Tool even once. Disavowing links seems to make little sense.
  • Google also stores information about what was at a particular URL in the past, but for re-ranking it “only” analyzes the last 20 versions. In practice, it’s worthwhile to repeatedly modify (update, optimize, expand, etc.) the content.
  • Font size matters – it’s only natural that if we make text larger and more visible to the user, the algorithm should also pay more attention to it. After all, extracting the size of a given piece of text from the CSS isn’t complicated.
  • The homepage is important – the algorithm extracts information about the site’s credibility from it and uses that information for new URLs for which it has not yet collected behavioral data. This makes the homepage the most crucial page on the site to optimize.
  • Google tokenizes the content on a page and examines the number of unique tokens. It also has a limit on the number of tokens it can process, so the most important information should appear high in the page code (see the sketch after this list).
  • Short content is scored mostly for its originality. Thus, whether we are dealing with thin content does not depend on its length.
  • It’s important to include the targeted keywords at the beginning of the title tag. However, the algorithm does not count the length of this tag or of meta descriptions, so if titles or meta descriptions are too long, shortening them won’t help unless the shorter version is more inviting to click.
  • Dates are super important. If they are not marked, Google will try to pull them from the content. It’s essential that the dates match (e.g., the date in the URL or title and the publication date).
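
As a toy illustration of the tokenization point above (not Google’s actual pipeline – the real tokenizer and token budget are unknown), here’s how a hard token limit makes anything below the cutoff invisible to the index:

```python
import re

TOKEN_LIMIT = 20  # arbitrary budget for demonstration; the real limit is unknown

def tokenize(text):
    # Naive lowercase word tokenizer; production systems are far more sophisticated.
    return re.findall(r"\w+", text.lower())

page = (
    "Complete guide to espresso brewing. Grind size, water temperature "
    "and dose are the three variables that matter most. "
    "Further down the page: a detail the author considered crucial."
)

tokens = tokenize(page)[:TOKEN_LIMIT]  # everything after the cutoff is dropped
unique_tokens = set(tokens)            # originality proxy: unique-token count

print(f"{len(tokens)} tokens kept, {len(unique_tokens)} unique")
print("'crucial' indexed?", "crucial" in tokens)  # False - it fell past the limit
```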

Naturally, the leaked documentation on the Google algorithm provides much more information. It’s also worth reviewing the list of modules and attributes compiled by iPullRank based on the leaked data. We can certainly expect more analyses, studies, and tests based on these findings soon. If you’d like to stay up to date with this kind of information, sign up for our newsletter!

Author
Wojciech Urban, Senior SEO R&D Specialist

R&D specialist in SEO and web analytics. He feels most comfortable in the area of technical SEO, and his main task is to ensure that websites are optimized for search engines and achieve high rankings in search results.
