- What Are Common Robots.txt Mistakes?
- Not Placing the Robots.txt File in the Root Directory
- Wrong Use of Wildcards
- Putting ‘NoIndex’ in Robots.txt
- Blocking Scripts and Style Sheets
- Not Including the Sitemap URL
- Unnecessary Use of Trailing Slash
- Ignoring Case Sensitivity
- Using One Robots.txt File for Different Subdomains
- Not Blocking Access to Sites Under Construction
- How Can I Recover from a Robots.txt Mistake?
- An Optimized Site Relies on Proper Robots.txt Files
What Are Common Robots.txt Mistakes?
Looking for every error in a robots.txt file takes time, but there are some common areas to focus on. Knowing the most frequent mistakes makes them much easier to avoid. The damage a broken robots.txt file can cause is significant and can drastically affect your website, and correcting it can be a laborious process, but with our help you'll be able to identify and fix anything that might arise! Making robots.txt work in your favor, rather than break your domain, is critical to success. Below are the common issues you may face moving forward.
1. Not Placing the Robots.txt File in the Root Directory
To start the list, it is essential to understand the correct location of the robots.txt file. The file must always sit in the root directory of your site, meaning its path immediately follows the domain name. If you place it anywhere else, web crawlers will be unable to locate it, and every directive it contains will simply be ignored. An example of proper placement:
placeholder.com/files/robots.txt – INCORRECT
placeholder.com/robots.txt – CORRECT
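Once the file is in the right place, even a minimal version is valid. Here is a bare-bones sketch (placeholder.com and the sitemap path are assumed examples) that allows all crawling and points crawlers at the sitemap:
User-Agent: *
Disallow:
Sitemap: https://www.placeholder.com/sitemap.xml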
2. Wrong Use of Wildcards
Wildcards are characters with special meaning inside the directives of a robots.txt file. There are two wildcards to pay attention to – the * and the $ symbols. The * character is shorthand for “zero or more valid characters,” in other words “every instance of.” The $ character marks the end of a URL. Using these two characters properly in your robots.txt file is essential. For example, to apply rules to every user agent, block every URL beginning with /assets, and block every URL ending in .pdf:
User-Agent: *
Disallow: /assets*
Disallow: /*.pdf$
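Under those rules (placeholder.com and the paths below are assumed examples), crawlers would treat URLs like this:
placeholder.com/assets/logo.png – blocked by Disallow: /assets*
placeholder.com/files/report.pdf – blocked by Disallow: /*.pdf$
placeholder.com/files/report.pdf?page=2 – NOT blocked, because the URL no longer ends in .pdf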
3. Putting ‘NoIndex’ in Robots.txt
An outdated strategy that no longer needs to be considered: putting a “NoIndex” directive in your robots.txt file no longer works, because Google stopped supporting it in 2019. At best, this means you have useless code sitting in your robots.txt file; at worst, pages you intended to keep out of search results get indexed anyway. Proper practice nowadays is to use the robots meta tag instead for this type of use case. The following code can be placed into the page code of the URLs you want to block Google from indexing:
<meta name="robots" content="noindex" />
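For resources where a meta tag cannot be added, such as PDFs or images, the same effect can generally be achieved with an HTTP response header (a sketch, assuming your server lets you set response headers):
X-Robots-Tag: noindex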
4. Blocking Scripts and Style Sheets
The web runs on scripts and style sheets, so blocking them is a bad idea. For Google’s crawlers to render your pages and evaluate how they perform, they need to be able to fetch and run these resources. It is imperative not to block any scripts or style sheets in your robots.txt file for this reason. If crawlers cannot load them, they cannot render your pages properly, and your domain’s rankings can suffer badly as a result.
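If you must disallow a broad directory that happens to contain scripts or style sheets, one approach (the directory name below is an assumed example) is to re-allow those file types explicitly, since for Google the more specific matching rule generally wins:
User-Agent: *
Disallow: /private/
Allow: /private/*.css$
Allow: /private/*.js$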
5. Not Including the Sitemap URL
Listing your sitemap’s location in robots.txt allows crawlers to discover the sitemap easily, which in turn helps your pages get found and indexed. Making it easier on the crawlers that determine your domain’s rankings is always a bonus for optimization purposes, so putting the sitemap location in the robots.txt file is a very useful thing to do. Here is an example of how to reference your sitemap’s URL:
Sitemap: https://www.placeholder.com/sitemap.xml
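If your site uses more than one sitemap, each can be listed on its own line (the file names below are assumed examples):
Sitemap: https://www.placeholder.com/sitemap-posts.xml
Sitemap: https://www.placeholder.com/sitemap-pages.xml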
6. Unnecessary Use of Trailing Slash
Trailing slashes (slashes at the end of a path, as in /example/) can give incorrect information to the bots scanning your site, and giving Google the right information in the right format is essential for proper crawling and ranking. If you want to block a specific URL in your robots.txt file, it needs to be formatted correctly. For example, say you wanted to block placeholder.com/category but wrote the following command:
User-Agent: *
Disallow: /category/
Because of the trailing slash, this rule only matches URLs inside the /category/ directory, so the page at placeholder.com/category itself would remain crawlable. To block that URL, drop the trailing slash:
User-Agent: *
Disallow: /category
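One further nuance (the paths here are assumed examples): Disallow: /category is a prefix match, so it also blocks URLs such as placeholder.com/category/shoes and placeholder.com/categories. If you only want to block that single URL, the $ wildcard from section 2 can be combined with it:
User-Agent: *
Disallow: /category$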
7. Ignoring Case Sensitivity
A simple yet important fact that is easy to overlook: URL paths are case-sensitive to search crawlers. placeholder.com/Test and placeholder.com/test are two different URLs as far as a crawler is concerned! This means your robots.txt file needs to reflect this reality. If you are using your robots.txt file to define directives for specific URLs, casing matters. For example, if you wanted to block placeholder.com/test, this would be INCORRECT:
User-Agent: *
Disallow: /Test
This would be CORRECT:
User-Agent: *
Disallow: /test
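If both casings of a path genuinely exist on your site and both should be blocked (an assumed scenario), list each one explicitly:
User-Agent: *
Disallow: /test
Disallow: /Test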
8. Using One Robots.txt File for Different Subdomains
To give Google the most precise instructions, you should have a separate robots.txt file for every single subdomain of your website, including staging sites. Crawlers only read the robots.txt file of the host they are currently crawling, so a single file cannot control several subdomains; without its own file, a subdomain you do not want indexed (such as a new, still-under-construction location) may end up in the index anyway. Taking the time to give each of your subdomains its own carefully written file will pay off in the long run!
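In practice this means one file per host (the subdomain names below are assumed examples):
https://www.placeholder.com/robots.txt – governs the main site
https://blog.placeholder.com/robots.txt – governs the blog
https://staging.placeholder.com/robots.txt – governs the staging environment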
9. Not Blocking Access to Sites Under Construction
Staging sites, or sites that are under construction, are a crucial part of web development, and you want as much control over the creation process as possible. Every fully functional website was once a staging site that was deployed later, and those staging versions should never have been indexed by Google. Getting a page that’s under construction indexed can be very detrimental to the overall growth of your domain – having your traffic go to an unfinished page instead of a finished one won’t help you! Blocking crawlers from crawling your under-construction pages is important to ensure they aren’t ranked. Add the following directives to the staging site’s robots.txt file to do so:
User-Agent: *
Disallow: /