
Robots.txt: The Tiny Website File That Can Make or Break Your SEO

When my washing machine stopped working about a month ago, I tried everything to fix it.

I ran an extra wash cycle, cleaned the drum, and repeated the process.

But when I went to empty the filter with my aunt’s help, we only succeeded after we found a €2 coin.

With it, we were able to open the filter door and empty it for good.

That little coin saved me, and my washing machine.

The robots.txt file impacts SEO just like that little coin had a huge impact on making my washing machine function again.

Robots.txt is a tiny text file located at the root folder of your website, and it can make a ton of difference to your overall SEO.

It’s a pity that sometimes web hosts don’t create a robots.txt file by default.

But, you can still create (and optimize) one yourself.

That’s exactly what this guide is for—from the basics of robots.txt SEO to the technical details made easy to understand.

Let’s dive in!


Introducing the Robots.txt File

Essentially, a robots.txt file is what you see below—a small text file located at the root of your website:

 

[Screenshot: a sample robots.txt file]
 

It states, on three lines:

  • the User-agent you want to work with
  • a Disallow field to tell search engine bots and crawlers what not to crawl
  • an Allow field to let them know what to crawl instead

The syntax is simple:

After each field name (or directive), you add a colon followed by the value that you want the robots to consider.

While field names are case-insensitive, value names are not. So, for example, if your folder is named “/My-Work/,” you can’t put “/my-work/” in your robots.txt file.

It won’t work correctly.
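
Putting it all together, here’s a minimal sketch of a robots.txt file using all three fields (the /My-Work/ folder from above and the gallery.html file are just hypothetical examples):

# Example only: these paths are hypothetical
User-agent: *
Disallow: /My-Work/
Allow: /My-Work/gallery.html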

Let me explain robots.txt fields and values in more detail below.

User-agent

This field declares what user agent we want to work with. In the language of robots.txt, a user agent is a spider bot or crawler.

The syntax is:

 

User-agent: (value)

 

For example, if I wanted to make the subsequent rules (values) apply to all user agents, I would enter the following:

 

User-agent: *

 

And if I wanted to say that the rules apply to a specific agent, it would look like this (replacing “AgentName” with the name of the user agent you want to work with):

 

User-agent: AgentName

 

Examples of commonly used user agents for search engines and social networks are:

  • Googlebot
  • Googlebot-Image (Google Images’ crawler)
  • Bingbot
  • Slurp (Yahoo!’s web crawler)
  • DuckDuckBot
  • YandexBot
  • Baiduspider
  • Facebot (Facebook’s crawler)
  • Twitterbot (Twitter’s crawler)
  • ia_archiver (Alexa’s crawler)

Here you can find the complete list of user agents used by Google.

Disallow

Disallow is the blacklisting directive in the robots.txt language.

The basic usage is as follows:

 

User-agent: *
Disallow:

 

Simply writing “Disallow:” without any value after it means that you want all robots to crawl your entire website.

On the other hand, if you don’t want robots to crawl your website at all (not even a small portion of it), you’d enter:

 

User-agent: *
Disallow: /

 

You can also use the Disallow field when you want bots to crawl your entire website minus one or more specific files or areas.

For example:

 

User-agent: Googlebot
Allow: /public.jpg
Disallow: /private.jpg

 

Allow

The robots.txt whitelisting directive!

This is a good way to tell robots that you want one or more specific files to be crawled when they’re located inside an area of your site that you’ve previously disallowed with another rule.

For example, you may want to have Googlebot crawl only one image in a private area of your site, but not the rest of the private area.

To honor this intention, you can use this syntax:

 

User-agent: Googlebot
Disallow: /private/
Allow: /private/the-only-image-you-can-see.jpg

 

Robots.txt File Comments and Length

To add a comment to your robots.txt file, simply place a hash symbol (#) before the line you’re writing.

For example:

 

# This rule blocks Bingbot from crawling my blog directory
User-agent: Bingbot
Disallow: /blog/

 

A robots.txt file can be any length; there’s no set maximum, though some crawlers, like Googlebot, only process the file up to a certain size (roughly 500 KB in Google’s case).

Wanna take a look at Google’s? You can find it at https://www.google.com/robots.txt.

(You might have to scroll a little bit there.)

How Robots.txt Can Serve Your SEO Efforts

As I previously mentioned, the robots.txt file can hugely impact SEO. In particular, it affects the indexing of your pages and of other content types (such as media and images).

Here’s how you can use the robots.txt file to better your SEO outcomes.

Using “User-agent” for SEO

As you’ve seen, when you write the User-agent field, you have the option to apply certain rules to all search engines and crawlers (with the asterisk *), or to single robots.

Or both, when you want to handle a mix of different behaviors.

Take a look at this example from one of my websites:

 

[Screenshot: my robots.txt file disallowing Google Images’ and Alexa’s crawlers]
 

Here I wanted to exclude Google Images from indexing my images after I found out that some of my artwork from this and a similar website was scraped years ago. I also wanted to block Alexa’s web crawler from scanning my site.

I applied this SEO and reputation management decision to the robots.txt file by simply writing down Google Images’ and Alexa’s user agents and applying a Disallow rule to both of them, one per line.
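
For reference, the rules in that screenshot boil down to something like this, using Google Images’ and Alexa’s user agent tokens:

User-agent: Googlebot-Image
Disallow: /

User-agent: ia_archiver
Disallow: /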

As an SEO, you know what search engines (or parts of search engines) you want to appear in, for whatever reason.

Robots.txt lets you tell web services what you allow and what you don’t, de facto determining the way your site appears (or doesn’t appear) on each platform.

Another common application of this field is when you don’t want the Wayback Machine (Archive.org) to save snapshots of your website.

By adding these two lines to your robots.txt file:

 

User-agent: archive.org_bot
Disallow: /

 

you can exclude the Internet Archive from crawling and snapshotting your website.

Using “Disallow” and “Allow” Directives for SEO

The Disallow and Allow directives are powerful tools to tell search engines and web mining tools exactly what to crawl and index.

So far, you’ve seen how to use them to exclude (or include) files and folders from being scanned and indexed. If you use these directives properly, you can optimize your crawl budget to leave out duplicate pages and service pages that you don’t want to rank in the SERPs (for example, thank you pages and transactional pages).

Here’s how I’d do that for a thank you page:

 

User-agent: Googlebot
Disallow: /thank-you-for-buying-heres-your-guide/

 

(Heck, can you imagine how many sales you could lose if a page like that gets indexed?)

The Dangers of Not Taking Care of Your Robots.txt

In a case study for Search Engine Land, Glenn Gabe reports how a company’s badly written robots.txt file led to URL leaks and index drop outs.

The kind of bad things you definitely don’t want to happen!

The company in question found themselves with a case sensitivity issue when disallowing category folders (“/CATEGORY/” instead of “/Category/”), and had also disallowed their entire website by using “Disallow: /” (with the slash) instead of “Disallow:” (without it).
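
To illustrate (this is a hedged reconstruction, not the company’s actual file), the problematic rules would have looked something like this:

User-agent: *
# Wrong capitalization: the real folder was /Category/
Disallow: /CATEGORY/
# This line blocks crawling of the entire site
Disallow: /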

Because blocked URLs don’t drop out of the index all at once but leak out slowly, the company watched their rankings decline over an extended period of time.

Gabe also wrote a longer article about what happened when another company mistakenly disallowed their entire site.

It’s easy to see that a regular audit (and good maintenance) of your robots.txt file is critical to preventing such disastrous SEO issues.

Robots.txt Hacks for SEO and File Security

In addition to basic robots.txt usage, you can implement a few more hacks to help support and boost your SEO strategy.

Add a Sitemap Rule to Your Robots.txt File

You can add a sitemap to your robots.txt file—even more than one, actually!

The screenshot below shows how I did this for my business website:

 

[Screenshot: my robots.txt file with three Sitemap entries]
 

I added three sitemaps, one for my main site and two from subsites (blogs) that I want counted as part of the main site.
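
As a rough sketch of what that looks like (the example.com URLs below are placeholders, not my real domains), the sitemap lines simply sit alongside your other directives:

# example.com is a placeholder domain
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/blog-one/sitemap.xml
Sitemap: https://www.example.com/blog-two/sitemap.xml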

While adding a sitemap to your robots.txt file is no guarantee of better site indexing, it has worked for some webmasters, so it’s worth a try!

Hide Files That You Don’t Want Search Engines or Users to See

It could be that .PDF e-book you’re selling on your blog for your most loyal readers only.

Or it might be a subscriber-only page that you don’t want common mortals to get their hands on.

Or a legacy version of a file that you no longer want findable except through private exchange.

Whatever the reason for not wanting a file to be available to the public, you have to remember this common sense rule:

Even though search engines will ignore a page or file listed in your robots.txt file, human users will not.

As long as they’re able to load the robots.txt file in their browser, they can read your blocked URLs, copy and paste them into their browser, and get full access to them.

So when it comes to robots.txt, SEO and common usage aren’t enough. You also have to ensure that human users keep their hands off the confidential material that you’ve entrusted the robots.txt file to keep out of search engines!

Now the question is: How do you do it?

I’m happy to tell you it only takes three steps:

1. Create a specific folder for your secret files

2. Add index protection to that folder (so nobody browsing it can see its contents)

3. Add a Disallow rule to that folder (not to the files under it because they’ll inherit the rule)

Let’s get to putting that into practice.

Step #1: Create a specific folder for your secret files

First, log in to your website administration panel and open the file manager that comes with it (e.g. File Manager in cPanel). Alternatively, you can use a desktop-based FTP client such as FileZilla.

This is how I created the folder “/secret-folder/” in my website using cPanel’s File Manager:

 

[Screenshot: creating /secret-folder/ in cPanel’s File Manager]
 

Step #2: Add index protection to the folder

Second, you have to add protection to that folder’s index.

If you use WordPress, you can protect all folders by default by downloading and activating the free Protect Uploads plugin from the WordPress plugin repository.

In all other cases, including if you want to protect this one folder only, you can use one of two methods (continuing from my example above):

A. .htaccess 403 Error Method

Create a new .htaccess file under “/secret-folder/” and add this line to it:

 

Options -Indexes

 

This line tells the web server not to generate a directory listing when someone tries to browse the folder.

If that doesn’t work on your web server, use:

 

Deny from all

 

instead. (Note that this directive blocks all web access to the folder and its files, not just the directory listing.)

B. Index.html File Method

Create an index.html (or default.html) under “/secret-folder/.”

This file should be empty or contain a small string of text to remind users who are browsing that this directory is inaccessible (e.g. “Shoo away. Private stuff here!”).
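
If you go this route, a minimal index.html placeholder could be as simple as this (the wording is just an example):

<!DOCTYPE html>
<html>
  <!-- Placeholder page: the text below is only an example -->
  <head><title>Private</title></head>
  <body>Shoo away. Private stuff here!</body>
</html>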

Step #3: Add a Disallow rule to the folder

As the third and last action, go back to your robots.txt file at the root of your website and Disallow the entire folder.

For example:

 

User-agent: *
Disallow: /secret-folder/

 

And you’re done!

 

As you can see, doing robots.txt SEO is not a waste of your time on a minor ranking factor.

Your robots.txt file might seem as small and insignificant as the coin I used to “fix” my washing machine, but it can be just as powerful and critical to your website’s good standing in search engines.

So take good care of it!
