What Is a Robots.txt File and How Can I Add It to My Website?
What is a robots text file?
A robots.txt file is simply an ASCII or plain-text file that tells search engines where they are not allowed to go on a site – a convention also known as the Standard for Robot Exclusion. Any files or folders listed in this file will not be crawled and indexed by search engine spiders. Having a robots.txt file, even a blank one, shows that you acknowledge search engines are allowed on your site and may have free access to it. We recommend adding a robots.txt file to your main domain and all sub-domains on your site.
What program should I use to create /robots.txt?
You can use anything that produces a text file.
On Microsoft Windows, use notepad.exe, or wordpad.exe (Save as Text Document), or even Microsoft Word (Save as Plain Text)
On the Macintosh, use TextEdit (Format->Make Plain Text, then Save as Western)
On Linux, vi or emacs
How do I check the robots.txt file on my website?
You can access it by typing "yourwebsiteurl.com/robots.txt" into your browser.
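Since the file always lives at the same well-known path, the URL can also be built programmatically. A minimal Python sketch (example.com is a placeholder domain, and `robots_url` is an illustrative helper name):

```python
from urllib.parse import urlunparse

def robots_url(domain, scheme="https"):
    """Return the conventional robots.txt URL for a domain."""
    # robots.txt always lives at the root of the host
    return urlunparse((scheme, domain, "/robots.txt", "", "", ""))

print(robots_url("example.com"))  # https://example.com/robots.txt
```

From there you could fetch the file with any HTTP client (for example `urllib.request.urlopen`) to confirm it exists.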
How to create a robots.txt file
Where to put it
The short answer: in the top-level directory of your web server.
How to add a robots.txt file to your site
Robots.txt options for formatting
Writing a robots.txt is an easy process. Follow these simple steps:
Open Notepad, Microsoft Word or any text editor and save the file as ‘robots’, all lowercase, making sure to choose .txt as the file type extension (in Word, choose ‘Plain Text’).
Next, add the following two lines of text to your file:
User-agent: *
Disallow:
‘User-agent’ refers to the robots or search engine spiders. The asterisk (*) denotes that this line applies to all spiders. Here, no file or folder is listed on the Disallow line, meaning every directory on your site may be accessed. This is a basic robots text file.
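You can confirm that this allow-all file really opens every path using Python's standard urllib.robotparser module (a sketch; the URL is a made-up example):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse the two-line allow-all robots.txt shown above
rp.parse(["User-agent: *", "Disallow:"])

# With an empty Disallow, every URL is crawlable for every agent
print(rp.can_fetch("*", "https://example.com/any/page.html"))  # True
```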
Blocking the search engine spiders from your whole site is also one of the robots.txt options. To do this, add these two lines to the file:
User-agent: *
Disallow: /
If you’d like to block the spiders from certain areas of your site, your robots.txt might look something like this:
User-agent: *
Disallow: /database/
Disallow: /scripts/
The above three lines tell the spiders to stay out of the database and scripts directories. You can also point crawlers to your XML sitemap by adding a Sitemap line:
Sitemap: http://www.mydomain.com/sitemap.xml
Once complete, save and upload your robots.txt file to the root directory of your site. For example, if your domain is www.mydomain.com, you will place the file at www.mydomain.com/robots.txt.
Once the file is in place, check the robots.txt file for any errors.
Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line.
Note also that globbing and regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".
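In other words, each Disallow value is matched as a plain string prefix against the URL path. A rough sketch of that logic in Python (the helper name and sample paths are illustrative, not part of any standard API):

```python
def is_excluded(path, disallow_prefixes):
    """Return True if path falls under any Disallow prefix.
    Plain string prefix matching -- no globbing, no regular expressions."""
    return any(prefix and path.startswith(prefix) for prefix in disallow_prefixes)

# One prefix per Disallow line, exactly as they appear in robots.txt
rules = ["/cgi-bin/", "/tmp/"]
print(is_excluded("/cgi-bin/search?q=x", rules))  # True: under /cgi-bin/
print(is_excluded("/index.html", rules))          # False: no prefix matches
```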
What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:
To exclude all robots from the entire server
User-agent: *
Disallow: /
To allow all robots complete access
User-agent: *
Disallow:

(or just create an empty "/robots.txt" file, or don't use one at all)
To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
To exclude a single robot
User-agent: BadBot
Disallow: /
To allow a single robot
User-agent: Google
Disallow:

User-agent: *
Disallow: /
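You can check how two records like these interact with Python's urllib.robotparser (the page URL is a placeholder):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: Google",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
])

# Google matches its own record (empty Disallow = full access);
# every other robot falls through to the catch-all record.
print(rp.can_fetch("Google", "https://example.com/page.html"))  # True
print(rp.can_fetch("BadBot", "https://example.com/page.html"))  # False
```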
To exclude all files except one, move the files to be disallowed into a separate subdirectory (here /~joe/stuff/) and block that directory:

User-agent: *
Disallow: /~joe/stuff/
Alternatively, you can explicitly disallow each page:

User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
You can check your robots.txt through this tool: http://tool.motoricerca.info/robots-checker.phtml

NOTE: We can only correctly detect the XML sitemap if it follows the standard naming, that is, if it is hosted at, for example, https://sitename.com/sitemap.xml. We also detect a non-standard XML sitemap if it is referenced in robots.txt, for example:

User-agent: *
sitemap: https://sitename.com/1_en_0_sitemap.xml
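The Sitemap reference can also be read back programmatically; Python's urllib.robotparser exposes it via site_maps() (Python 3.8+; the sitemap URL is the example from the note above):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Sitemap: https://sitename.com/1_en_0_sitemap.xml",
])

# site_maps() returns every Sitemap URL found, or None if there were none
print(rp.site_maps())  # ['https://sitename.com/1_en_0_sitemap.xml']
```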