This post is part of The Changing Web series - articles noting discoveries or announcements from across the web. We keep our ear to the ground to call out the fads, report on the good stuff and celebrate the awesome.
Google has announced plans to turn the Robots Exclusion Protocol (REP) - better known as robots.txt - into an official internet standard.
For 25 years, the Robots Exclusion Protocol (REP) has been a critical component of shaping the web as we know it.
Chances are you haven’t heard of it by name, but if you’ve ever been involved in launching a website, you will have come across a robots.txt file. It's key to making sure your website is properly crawled and indexed by search engines.
What is a robots.txt file?
A robots.txt file is a plain text file that lives at the root of your website (for example, https://www.example.com/robots.txt) and instructs web robots (often search engine crawlers) how to crawl the pages of your website.
Robots.txt files indicate whether certain web-crawling software can or cannot crawl parts of a website. These crawl instructions are specified by “disallowing” or “allowing” the behaviour.
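To make "allowing" and "disallowing" concrete, here is a minimal sketch that uses Python's standard-library urllib.robotparser module to read a small robots.txt file and ask whether a crawler may fetch a URL. The site (www.example.com), the paths and the crawler name ExampleBot are all made up for illustration, and Python's parser is only one of several interpretations of the protocol.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site.
# Every path and the "ExampleBot" crawler name are illustrative assumptions.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press/
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(user_agent: str, path: str) -> bool:
    """Return True if the rules above let this crawler fetch the path."""
    return parser.can_fetch(user_agent, "https://www.example.com" + path)


for agent, path in [
    ("Googlebot", "/about"),
    ("Googlebot", "/private/report"),
    ("Googlebot", "/private/press/launch"),
    ("ExampleBot", "/about"),
]:
    print(agent, path, may_crawl(agent, path))
```

In this sketch the first group of rules applies to every crawler, while the second group shuts out one named crawler entirely.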
Search engines like Google and Bing rely on crawlers to know what content to index and therefore display in their search results. But for website owners, the purpose of a robots.txt file hasn’t always been clear, which creates a nervousness around how best to use it - you don’t want to inadvertently deny search engines access to key information on your website!
Why do you need a robots.txt file?
Because the robots.txt file handles which content on your website is crawled, the file is useful if you…
- Want to prevent search engines from crawling a particular area of your website
- Want to prevent search engines from crawling certain asset files (PDFs, images, etc)
- Want to prevent search engines from crawling internal search result pages
- Want to tell search engines to ignore any duplicated pages on your website
- Want to tell search engines where your sitemap is located
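The uses listed above might look something like the following in practice. This is a sketch only: the paths, the sitemap URL and the AnyBot crawler name are assumptions invented for the example, and it relies on Python's standard-library urllib.robotparser module (the site_maps() method needs Python 3.8 or later).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site; every path is illustrative.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/          # a particular area of the site
Disallow: /assets/pdfs/    # asset files such as PDFs
Disallow: /search/         # internal search result pages
Disallow: /print/          # duplicated (printer-friendly) pages

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def crawl_report(paths):
    """Map each path to True (crawlable) or False (blocked) for a generic crawler."""
    base = "https://www.example.com"
    return {path: parser.can_fetch("AnyBot", base + path) for path in paths}


report = crawl_report(["/", "/products/widget", "/admin/login", "/search/results"])
sitemaps = parser.site_maps()  # list of Sitemap URLs found in the file

print(report)
print(sitemaps)
```

Running a handful of important URLs through a check like this before publishing a robots.txt change is one way to avoid accidentally blocking key pages.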
Robots.txt files shape what search engines know about your website - they let you influence what is crawled and, in turn, indexed, which affects what you appear for when potential customers search online.
Why does Google want to create a new internet standard?
Google recognises that, despite being in use for 25 years, REP has never become an official internet standard, which means developers have interpreted the protocol differently.
To remove the ambiguity surrounding robots.txt files, and to make it easier for you to create and maintain a successful website, Google is open-sourcing its own robots.txt parser. The parser is the component that Googlebot (Google’s web crawler) uses to determine which URLs may be accessed based on the rules set out in the robots.txt file.
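To see how two parsers can read the same file differently, the sketch below compares two interpretations of one pair of rules. The first function is a simplified illustration of the "most specific rule wins" approach, where the longest matching path takes precedence and Allow wins a tie; it is not Google's parser, and it ignores wildcards and user-agent groups. The second uses Python's standard-library urllib.robotparser, which applies rules in the order they appear in the file. The rules and paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: a broad Disallow followed by a narrower Allow.
RULES = [("disallow", "/shop/"), ("allow", "/shop/sale/")]


def longest_match_allows(path: str, rules=RULES) -> bool:
    """Simplified sketch: the longest matching prefix wins, Allow wins ties,
    and a path that matches no rule is allowed."""
    best_len, allowed = -1, True
    for kind, prefix in rules:
        if not path.startswith(prefix):
            continue
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and kind == "allow"
        if longer or tie_for_allow:
            best_len, allowed = len(prefix), kind == "allow"
    return allowed


def first_match_allows(path: str, rules=RULES) -> bool:
    """Python's standard-library parser: the first matching rule in file order wins."""
    lines = ["User-agent: *"]
    lines += [f"{kind.capitalize()}: {prefix}" for kind, prefix in rules]
    parser = RobotFileParser()
    parser.parse(lines)
    return parser.can_fetch("AnyBot", "https://www.example.com" + path)


for path in ["/about", "/shop/cart", "/shop/sale/shoes"]:
    print(path, longest_match_allows(path), first_match_allows(path))
```

The two functions agree on /about and /shop/cart but disagree on /shop/sale/shoes, which is the kind of divergence a shared standard and reference parser are meant to remove.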
With the parser open-sourced, developers can continue to ensure that the instructions given to search engines are as effective as possible.
Google’s decision to open-source its robots.txt parser is a win for internet standards and should help to ease concerns about whether your website is being accurately represented in search engines. The standard and the parser are great news for both developers and website owners!