To convince the developer community to adopt the Robots Exclusion Protocol (REP) as an industry standard, Google has decided to encourage interest by open-sourcing its own robots.txt parsing library.
The Robots Exclusion Protocol, proposed as a standard by the Dutch software engineer Martijn Koster in 1994, has become the method most widely used by websites to tell automated crawlers which parts of a site should not be processed.
Google's crawler, Googlebot, for example, analyzes the robots.txt file when indexing websites, checking for instructions on which sections it should ignore; if no such file exists in the site's root directory, it assumes it is allowed to crawl (and index) the entire site. These files are not always used purely to give crawl instructions: they are sometimes also stuffed with keywords in an attempt to improve search engine optimization.
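The crawler behavior described above can be tried out with Python's standard-library `urllib.robotparser`. The robots.txt content and the `example.com` URLs below are purely illustrative:

```python
import urllib.robotparser

# A hypothetical robots.txt, as a site might serve it from its root directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /no-google/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Googlebot's own group applies to it, so /no-google/ is off limits...
print(rp.can_fetch("Googlebot", "https://example.com/no-google/page"))  # False
# ...while paths not disallowed for Googlebot remain crawlable.
print(rp.can_fetch("Googlebot", "https://example.com/public/page"))     # True
```

If a site serves no robots.txt at all, crawlers following the protocol treat everything as fetchable, which matches the "no file means crawl everything" default described above.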
While the Robots Exclusion Protocol is often referred to as a “standard”, it has never become a true Internet standard as defined by the Internet Engineering Task Force (IETF) – the open, non-profit organization that develops Internet protocol standards.
But there are also lots of typos in robots.txt files. Most people miss colons in the rules, and some misspell them. What should crawlers do with a rule named “Dis Allow”? pic.twitter.com/nZEIyPYI9R
— Google Webmasters (@googlewmc) July 1, 2019
Google reported that the REP, as it stands, is open to interpretation and may not cover every use case (the Internet Archive, for example, stopped honoring it several years ago). For this reason, Google wants the rules to be precisely specified, which would let its tools index web pages even more accurately and make its search engine more complete.