Robots.txt Introduction
What is robots.txt?
You might have come across robots.txt while building a website or looking through a site’s root directory, but what exactly is it? The robots.txt protocol, also known as the robots exclusion standard, uses a plain-text file to tell search engine robots which pages and files they may crawl and which they should ignore. A robots.txt file can keep well-behaved search engines away from content you would rather not have listed, and it can reduce unnecessary bandwidth load on the server. Compliance is voluntary and the file itself is publicly readable, so it is not a substitute for real access control. A web administrator decides what instructions to give the robots by filling in robots.txt. Just like META tags, robots.txt is not a required part of a website, but it doesn’t hurt to create one: it gives search engines an idea of how they should treat your site, which might even help your page ranking a bit!
Create your own robots.txt
You can create robots.txt with any text editor, such as Notepad or WordPad, as long as you save it as plain text. The basic file has only two fields: User-agent and Disallow. User-agent names the spider/bot the rules apply to, while Disallow gives the directory or file you want to exclude from crawling. The basic format of robots.txt is as follows:
| User-agent: *
Disallow: / |
“*” is a wildcard character. If you don’t want to single out a particular robot, just put “*” after User-agent and the rules will apply to every robot. Combined with “Disallow: /”, this forbids robots from crawling any page of the site.
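One way to see how a crawler might read these two fields is to feed the rules to a parser. The sketch below assumes Python’s standard-library urllib.robotparser module; the bot name ExampleBot and the example.com URLs are made up for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# The basic rules shown above: every robot, whole site disallowed.
rules = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "ExampleBot" is a hypothetical crawler name; the "*" group applies to it.
home_allowed = parser.can_fetch("ExampleBot", "http://www.example.com/")
page_allowed = parser.can_fetch("ExampleBot", "http://www.example.com/about.html")
print(home_allowed, page_allowed)  # False False
```

Because the only rule is “Disallow: /”, every path on the site starts with the disallowed prefix, so both checks come back negative.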
To disallow a certain directory, fill in the directory name.
| User-agent: *
Disallow: /file01/ |
To disallow a certain bot, fill in the bot name.
| User-agent: Googlebot-Image
Disallow: / |
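To check that a rule group aimed at one bot leaves other bots alone, the rules can again be run through a parser. This is a minimal sketch using Python’s urllib.robotparser; ExampleBot and the example.com URL are hypothetical names chosen for the demonstration.

```python
from urllib.robotparser import RobotFileParser

# Rules that target a single named bot, as in the example above.
rules = """\
User-agent: Googlebot-Image
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

url = "http://www.example.com/photos/cat.jpg"  # hypothetical URL

# The named bot is blocked everywhere.
image_bot_allowed = parser.can_fetch("Googlebot-Image", url)
# No group matches "ExampleBot" (a made-up name), so it is allowed by default.
other_bot_allowed = parser.can_fetch("ExampleBot", url)
print(image_bot_allowed, other_bot_allowed)  # False True
```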
To disallow multiple directories and files, give each one its own Disallow line.
| User-agent: *
Disallow: /file01/
Disallow: /file02/
Disallow: /file03/test.htm |
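Since each path needs its own Disallow line, a long exclusion list is easy to generate from a script. The helper below is a hypothetical sketch (build_robots_txt is not part of any library), and it uses Python’s urllib.robotparser only to confirm that the generated text behaves as intended.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(user_agent, disallowed_paths):
    """Return robots.txt text with one Disallow line per path."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    return "\n".join(lines) + "\n"


text = build_robots_txt("*", ["/file01/", "/file02/", "/file03/test.htm"])
print(text)

# Sanity check: only the listed file inside /file03/ is excluded.
parser = RobotFileParser()
parser.parse(text.splitlines())
test_htm_allowed = parser.can_fetch(
    "ExampleBot", "http://www.example.com/file03/test.htm"
)
other_htm_allowed = parser.can_fetch(
    "ExampleBot", "http://www.example.com/file03/other.htm"
)
print(test_htm_allowed, other_htm_allowed)  # False True
```

The last rule names a single file, so other files in /file03/ remain crawlable.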
There are also some extensions to the original standard that are supported by some major crawlers:
Allow – Allows bots to crawl a file within a disallowed directory.
| User-agent: *
Disallow: /file01/
Allow: /file01/test.doc |
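When an Allow rule and a Disallow rule both match the same path, a crawler needs a way to choose between them. RFC 9309 specifies that the most specific rule wins, meaning the one with the longest matching path, with Allow preferred on a tie. Some simpler parsers instead apply the first matching rule in file order, so results can differ between crawlers. The function below is a simplified sketch of the longest-match idea; is_allowed is a hypothetical name, and wildcard handling is deliberately left out.

```python
def is_allowed(rules, path):
    """Evaluate (directive, prefix) rules using longest-match precedence.

    Simplified sketch: plain prefix matching only, no "*" or "$" wildcards.
    """
    best_directive, best_length = "allow", -1  # no matching rule means allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rules are skipped
        longer = len(prefix) > best_length
        tie_prefers_allow = len(prefix) == best_length and directive == "allow"
        if longer or tie_prefers_allow:
            best_directive, best_length = directive, len(prefix)
    return best_directive == "allow"


# The Allow example from above, as (directive, path-prefix) pairs.
rules = [("disallow", "/file01/"), ("allow", "/file01/test.doc")]

doc_allowed = is_allowed(rules, "/file01/test.doc")      # longer Allow rule wins
other_allowed = is_allowed(rules, "/file01/secret.doc")  # only Disallow matches
outside_allowed = is_allowed(rules, "/file02/index.htm")  # no rule matches
print(doc_allowed, other_allowed, outside_allowed)  # True False True
```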
Crawl-delay – Sets the number of seconds a bot should wait between successive requests to the same server.
| User-agent: *
Disallow: /
Crawl-delay: 10 |
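A crawler that honours Crawl-delay has to read the value before scheduling its requests. As a rough sketch, Python’s urllib.robotparser exposes it through crawl_delay() (available from Python 3.6); ExampleBot is a made-up bot name, and the fallback of zero seconds is an assumption made for this example.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# crawl_delay() returns None when no delay is set for the matching group.
delay = parser.crawl_delay("ExampleBot")

# Assumed policy for this sketch: wait 0 seconds if no delay is given.
wait_seconds = delay if delay is not None else 0
print(wait_seconds)  # 10
```

A real crawler would then pause for wait_seconds (for example with time.sleep) between requests to the same server.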
Sitemap – Gives crawlers the URL of the site’s sitemap. More than one Sitemap line may be listed.
| User-agent: *
Disallow: /search
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
Sitemap: http://www.google.com/sitemaps_webmasters.xml |
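Sitemap lines can be read back programmatically as well. The sketch below assumes Python 3.8 or later, where urllib.robotparser provides site_maps(); it parses the example rules above from a string rather than fetching anything over the network.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /search
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
Sitemap: http://www.google.com/sitemaps_webmasters.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# site_maps() returns the listed URLs in file order, or None if there are none.
sitemaps = parser.site_maps()
for sitemap_url in sitemaps or []:
    print(sitemap_url)
```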