How the massive Google SEO leak is impacting the search marketplace

Utsav Gandhi discusses the findings of the May 2024 Google SEO leak, which gave analysts a new, albeit speculative, look at how Google might choose to promote and demote content. The findings have potential implications for businesses and news organizations struggling to compete for views, and suggest that transparency could become an increasingly important factor in search as new, AI-powered competitors enter the market.


From the launch of Google in 1998 to the subsequent widespread adoption of mobile technology, the rise of voice assistants and today’s rapidly evolving artificial intelligence landscape, the experience of searching for information online seems to have remained consistent and reliable. But perhaps for the first time, Google Search, which has dominated the internet search market for over two decades, is at a crucial crossroads and could turn upside down before our eyes in the next handful of years.

Google faces several pitfalls in the burgeoning market for its core product: It has come under intense scrutiny for monopolizing the search market, reducing clicks to news sites in favor of AI-generated summaries, and playing fast and loose with data privacy. In addition, users, especially businesses that rely on Google Search to promote and disseminate content, have long been frustrated by the secrecy surrounding Google Search’s underlying architecture and the signals they must send to help users find their content, a practice known as “search engine optimization” (SEO).

SEO has played a critical role in how information is ranked, presented and displayed to consumers, and has spawned a $74 billion industry, mostly in consulting and marketing practices that leverage Google’s rules to organize the web’s content. The rules and algorithms that determine Google’s search rankings often create make-or-break moments for small businesses and media organizations. Over the years, Google has remained mostly tight-lipped about how SEO works, but a recent leak of internal documents (code dated March 2024) may provide some of the most sweeping revelations yet. While the leak contains only the variable names used in the raw data for SEO, and nothing about the underlying algorithms themselves, it could have implications for the future of Google Search in a market flirting with the introduction of new and serious competitors. Google, for its part, has confirmed the authenticity of the leak but cautioned that some documents may be outdated or incomplete.

The insights from the leak can be thought of as the ingredients of how Google drives SEO rather than the recipe for how these different ingredients are weighed in the final result. More than 2,500 modules (or pages representing different components of SEO) were leaked in API code documentation from Google’s internal “Content API Warehouse,” shedding light on more than 14,000 attributes (features or signals that Google can use to determine ranking). Documentation like this “repository” exists at almost every technology company, helping to familiarize the internal staff of a project with the data available to it. However, it is rarely seen by the public. SEO experts Rand Fishkin and Mike King first disclosed information about the leak and independently published analyses of the documents and their content.
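To make the distinction between “modules” and “attributes” concrete, the snippet below is a minimal, hypothetical sketch of what one entry in such an internal documentation repository might look like. The module name, attribute names and descriptions are invented for illustration; they are not drawn from the leaked files, and the real documentation reportedly takes the form of protocol-buffer-style definitions rather than Python.

```python
from dataclasses import dataclass, field

@dataclass
class Attribute:
    """One signal (feature) that a ranking system can read for a document."""
    name: str          # e.g., a hypothetical "siteTopicFocus"
    type: str          # declared data type in the documentation
    description: str   # human-readable note for internal engineers

@dataclass
class Module:
    """One 'page' of the documentation: a named group of related attributes."""
    name: str
    attributes: list[Attribute] = field(default_factory=list)

# A purely illustrative module: names and descriptions are made up.
quality_signals = Module(
    name="PerDocQualitySignals",
    attributes=[
        Attribute("siteTopicFocus", "float", "How tightly a site sticks to one topic."),
        Attribute("pageCommercialScore", "float", "Estimated commercial intent of the page."),
        Attribute("lastSignificantUpdate", "int64", "Timestamp of the last meaningful content change."),
    ],
)

# Documentation like this describes what data exists -- the "ingredients" --
# but says nothing about how (or whether) each attribute is weighted in ranking.
print(f"{quality_signals.name}: {len(quality_signals.attributes)} attributes")
```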

The new information reveals properties that marketers and SEO experts suspected existed, as well as items they didn’t even know could be tracked:

1. It confirmed the existence of “twiddlers” – reranking algorithms that can boost or demote content with rewards or penalties. It seems intuitive that such algorithms shape how the internet is structured and presented to us, but the problem is the lack of further transparency into their nature, scope and impact. If these are “reordering” algorithms, when and how exactly are they applied? When exactly is content demoted or boosted? (A hypothetical sketch of such a reranking pass follows this list.)

2. Some attributes revealed in the leak suggest that Google detects how commercial a page or document is, and that this can be used to keep a page from being considered for an informational query. Take, for example, a user searching for “Stanley Cup” (the trophy awarded to the National Hockey League champion) versus searching for Stanley cups (the giant tumblers that have gone viral on TikTok). This seems sensible, but additional data on error rates (false positives and false negatives) would be useful, especially for researchers.

3. The leak confirmed the importance of previously known ranking factors, such as content quality (“E-A-T,” or expertise-authoritativeness-trustworthiness, as it is known in the SEO world), backlinks (hyperlinks from one website to another), regularly updated content and user interaction metrics (clicks, time spent on the site, etc.). The more of these factors a website exhibits, the higher it ranks. The leak also showed that Google keeps track of the topics a website publishes on (e.g., ProMarket publishes extensively on antitrust) and how far each individual page strays from that larger topic. Again, these factors illustrate where the value of the leak lies, and where it ends: it reveals the “ingredients” but not the “recipe.”
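As promised in item 1 above, here is a deliberately simplified, hypothetical sketch of what a “twiddler”-style reranking pass could look like, using the kinds of “ingredients” listed in item 3. None of the weights, thresholds or function names come from the leak; they are assumptions made purely to show the shape of boost-and-demote reranking.

```python
from dataclasses import dataclass

@dataclass
class Result:
    url: str
    base_score: float       # score from the initial retrieval/ranking stage
    quality: float          # e.g., an E-A-T-style content quality estimate (0-1)
    backlinks: int          # count of inbound links
    days_since_update: int  # content freshness
    commercial_score: float # estimated commercial intent of the page (0-1)

def twiddle(results: list[Result], query_is_informational: bool) -> list[Result]:
    """Hypothetical second-pass reranker: apply boosts/penalties, then re-sort.

    The weights below are invented for illustration only.
    """
    def adjusted(r: Result) -> float:
        score = r.base_score
        score += 0.5 * r.quality                    # reward higher-quality content
        score += 0.1 * min(r.backlinks, 100) / 100  # capped backlink boost
        if r.days_since_update > 365:
            score -= 0.2                            # demote stale content
        if query_is_informational and r.commercial_score > 0.8:
            score -= 1.0                            # penalize commercial pages on informational queries
        return score

    return sorted(results, key=adjusted, reverse=True)
```

The point of the sketch is the structure, not the numbers: a later-stage pass can reorder results for reasons invisible from the outside, which is exactly the transparency gap the leak highlights.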

The leak also reveals information contradicting several SEO claims Google has made over the years:

1. Historically, Google has denied that click-through rates matter in ranking (i.e., that if the third result on a page is clicked more often than the first, it will over time rise to the second or first position). The leak, SEO analysts say, suggests otherwise. This has implications (at least in the pre-AI search era) for “clickbait,” because what users click on a search results page is the title of a web page. (A hypothetical sketch of how click signals could feed back into rankings follows this list.)

2. Google has claimed that site ranking does not follow a “sandbox” pattern – that there is no rule forcing newer sites to wait before they can rank highly. The leak suggests otherwise by revealing a metric called “hostAge.” Why would Google collect data about a site’s age if it doesn’t use it?

3. Google has also claimed that it does not use data from Google Chrome, such as how many people visit a website through the browser, to determine rankings. The leak suggests otherwise: mechanisms have been in place for years for Google to collect Chrome data, raising the question of why the data is collected if it is not used. For example, the original motivation behind the launch of Google Chrome was to collect more clickstream data, a detailed log of a user’s activity, including the pages they visit, how long they spend on each page, and where they go next. Recent research has also shed more light on how Chrome helps Google strengthen its dominance.

4. Perhaps more importantly for smaller sites, the leaked documents indicate that while Google isn’t necessarily working against their visibility, it doesn’t go out of its way to value them highly either. In a piece published recently, a company called HouseFresh, which evaluates and reviews air purifiers, describes how it has “virtually disappeared” from search results: its search traffic has dropped 91 percent in recent months, from about 4,000 visitors a day in October 2023 to 200 a day in 2024. This drop in traffic coincided with a series of Google algorithm changes, after which HouseFresh reviews began to be buried under recommendations from brand-name publications. “It seemed like media companies were grabbing affiliate revenue without the expertise that sites like HouseFresh had worked hard to cultivate — and it looked like Google was rewarding them for doing so,” the analysis explains.
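As flagged in item 1 of this list, the sketch below shows, in purely hypothetical terms, how click-through-rate data could be folded back into rankings over time. The expected click rates, learning rate and update rule are invented; the leak names click-related attributes but does not disclose how (or how strongly) they are used.

```python
# Hypothetical illustration of click feedback: if a lower-ranked result is
# clicked more often than expected for its position, its stored click signal
# rises, which a later ranking pass could use to promote it.

# Rough prior click-through rates by position (illustrative numbers only).
EXPECTED_CTR = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}

def update_click_signal(signal: float, position: int, was_clicked: bool,
                        learning_rate: float = 0.01) -> float:
    """Nudge a per-URL click signal toward 'better than expected for this slot'."""
    expected = EXPECTED_CTR.get(position, 0.03)
    observed = 1.0 if was_clicked else 0.0
    return signal + learning_rate * (observed - expected)

# Example: the third result keeps getting clicked, so its signal drifts upward.
signal = 0.0
for _ in range(500):
    signal = update_click_signal(signal, position=3, was_clicked=True)
print(round(signal, 3))  # a positive value a reranker could treat as a boost
```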

Other researchers have raised important concerns about Google’s stranglehold on political and health information simply because of its dominance in search. Relatedly, the leak revealed that during the Covid-19 pandemic, Google used whitelists for websites that could appear high in the results for Covid-related searches. Similarly, during democratic elections, Google has employed whitelists of sites to display (or demote) for election-related information. There are several references to flags for “isCovidLocalAuthority” and “isElectionAuthority” in the documentation. Again, the leak does not provide additional information on how these authorities are determined.
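The documentation names flags such as “isCovidLocalAuthority” and “isElectionAuthority” but says nothing about how they are applied. Purely as a sketch of one way such flags could be used, the snippet below boosts whitelisted “authority” sites for sensitive query categories; the data, boost value and function are assumptions for illustration, not reconstructions of Google’s actual behavior.

```python
# Hypothetical use of authority flags for sensitive query categories.
# The flag names mirror those reported in the leak; everything else is invented.

SENSITIVE_TOPICS = {"covid": "isCovidLocalAuthority", "election": "isElectionAuthority"}

def apply_authority_boost(results: list[dict], topic: str, boost: float = 2.0) -> list[dict]:
    """Re-sort results so that flagged 'authority' sites rise for sensitive topics."""
    flag = SENSITIVE_TOPICS.get(topic)
    if flag is None:
        return results  # not a sensitive topic: leave the ordering alone
    return sorted(
        results,
        key=lambda r: r["score"] + (boost if r.get(flag) else 0.0),
        reverse=True,
    )

# Example: for a Covid-related query, the flagged health-authority site rises
# above a higher-scoring but unflagged page.
results = [
    {"url": "viral-blog.example", "score": 3.1, "isCovidLocalAuthority": False},
    {"url": "health-dept.example", "score": 2.4, "isCovidLocalAuthority": True},
]
print([r["url"] for r in apply_authority_boost(results, "covid")])
```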

The leak underscores the complexity and opacity that small business owners and media organizations must navigate to maintain an online presence and generate revenue. As the dominant search engine, with over 90% of the US search market, Google determines what news content and which companies people see. Google’s SEO is one of the most important determinants of competition in the online and offline economy. Without access to the rules, companies play a guessing game to compete with each other.

The leak also raises concerns about taking the company’s public messaging at face value, and underscores the need for marketers to continue experimenting with user experience design and content strategy. It also raises questions about whether the same rules apply to Google’s own web properties, such as Travel, Shopping and Flights.

The leak also has implications for Google’s precarious position in the rapidly evolving search landscape. A group of new rivals, built on new AI software, has emerged. These include OpenAI’s recently announced direct competitor to Google (“SearchGPT”), Microsoft’s long-ignored search engine Bing (now powered by ChatGPT) and its AI assistant Copilot, and Perplexity, a high-profile AI-powered chatbot. The quality of these new search engines varies, but they add competition to a long-stagnant market. Google’s lack of transparency over its SEO rules has frustrated its users. For over a decade, users didn’t have much choice of search engine, and Google could potentially manipulate its SEO without losing consumers. That may no longer be the case.

At the recent Stigler Center Antitrust and Competition Conference, author and attorney Cory Doctorow revealed that he has begun paying $10 a month to use a new search engine called Kagi. Instead of monetizing users through targeted marketing, Kagi offers three monthly pricing tiers that provide an ad-free, personalized search experience in which “your information provider’s incentives are aligned to what’s best for you, not what’s best for advertisers.” Doctorow said at the conference: “The problem isn’t that Google is scraping us. The problem is that we can’t scrape Google.” In other words, we don’t know why Google shows us the information it does. Google’s SEO leak has raised questions about transparency. Even if Google ignores the leak and the questions it raises, it will have a much harder time ignoring the new search competitors who may offer higher-quality and perhaps more transparent services to consumers.

Author Disclosure: The author reports no conflicts of interest. You can read our disclosure policy here.

Articles represent the opinions of their authors, not necessarily those of the University of Chicago, Booth School of Business or its faculty.