
Robots.txt Introduction

What is robots.txt?

You might have come across robots.txt while building a website or looking through a site's root directory, but what exactly is it? The robots.txt file implements the robots exclusion standard: it is a plain text file, placed in the root directory of a site, that tells search engine robots which pages and files they may crawl and which they should ignore. It can keep well-behaved crawlers away from private areas of a site and reduce unnecessary bandwidth load on the server. Note that the file is only a request; it does not stop a robot that chooses to ignore it. A web administrator decides what instructions to give the robots by filling in robots.txt. Like META tags, robots.txt is not a required component of a website, but it doesn't hurt to create one, as it gives search engines an idea of how they should treat your site, which might help your page ranking a bit.


Create your own robots.txt

You can create robots.txt with any text editor, such as Notepad or WordPad. The basic format has only two fields: User-agent and Disallow. User-agent names the spider/bot the rules apply to, while Disallow gives the directory or file you want to exclude from crawling. The basic format of robots.txt is as follows:

User-agent: *
Disallow: /

"*" is a wildcard character. If you don't want to single out a particular robot, just put "*" after User-agent and the rules apply to every robot. "Disallow: /" covers the whole site, so with this example no page of the site may be crawled.
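One way to check what a rule set like this does is Python's standard urllib.robotparser module. The following is a minimal sketch, not part of the original article; the bot name "AnyBot" and the example.com URLs are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, one directive per line.
rules = [
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)  # parses in-memory lines; no network request is made

# "AnyBot" and example.com are placeholders, not real crawler or site names.
home_ok = parser.can_fetch("AnyBot", "http://example.com/")
page_ok = parser.can_fetch("AnyBot", "http://example.com/about.html")
print(home_ok, page_ok)  # both False: every path is disallowed for every bot
```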

To disallow a certain directory, fill in the directory name.

User-agent: *
Disallow: /file01/

To disallow a certain bot, fill in the bot name.

User-agent: Googlebot-Image
Disallow: /
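To see that a named User-agent only affects that bot, the same rules can be fed to Python's urllib.robotparser. This is a small sketch under assumptions: "SomeOtherBot" and the example.com URL are made-up placeholders.

```python
from urllib.robotparser import RobotFileParser

# Rules that single out one crawler by name.
rules = [
    "User-agent: Googlebot-Image",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

url = "http://example.com/photo.jpg"  # placeholder URL
image_bot_ok = parser.can_fetch("Googlebot-Image", url)
other_bot_ok = parser.can_fetch("SomeOtherBot", url)  # hypothetical bot name
print(image_bot_ok, other_bot_ok)  # False True: only the named bot is blocked
```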

To disallow multiple directories and files, give each one its own Disallow line.

User-agent: *
Disallow: /file01/
Disallow: /file02/
Disallow: /file03/test.htm
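Each Disallow value is matched as a prefix of the URL path, so a directory rule blocks everything beneath it while a file rule blocks only that file. The sketch below illustrates this with Python's urllib.robotparser; the helper name check_paths, the bot name and the example.com host are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /file01/",
    "Disallow: /file02/",
    "Disallow: /file03/test.htm",
]

def check_paths(rule_lines, paths, bot="AnyBot", host="http://example.com"):
    """Return {path: True/False} telling whether `bot` may fetch each path."""
    parser = RobotFileParser()
    parser.parse(rule_lines)
    return {p: parser.can_fetch(bot, host + p) for p in paths}

result = check_paths(rules, [
    "/file01/page.htm",   # inside a disallowed directory
    "/file03/test.htm",   # the single disallowed file
    "/file03/other.htm",  # same directory, but not listed
    "/index.htm",         # not mentioned at all
])
for path, ok in result.items():
    print(path, "allowed" if ok else "blocked")
```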

There are also some extensions to the original standard that are supported by most major crawlers:

Allow – Lets bots crawl a file inside an otherwise disallowed directory.

User-agent: *
Disallow: /file01/
Allow: /file01/test.doc
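Crawlers differ in how they resolve a conflict between Allow and Disallow. Many, following RFC 9309, let the most specific (longest) matching rule win, with Allow winning a tie. The function below is a hand-written sketch of that idea using plain prefix matching only (no "*" or "$" patterns); is_allowed and its tuple format are hypothetical names for this example, not any library's API. Parsers that apply the first matching rule instead, as Python's own urllib.robotparser does, would report test.doc as blocked with the rule order shown above.

```python
def is_allowed(rules, path):
    """Longest-match check for one user-agent group.

    rules: list of (directive, prefix) tuples, e.g. ("Disallow", "/file01/").
    """
    best_len = -1
    allowed = True  # a path that matches no rule may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed

rules = [
    ("Disallow", "/file01/"),
    ("Allow", "/file01/test.doc"),
]

doc_ok = is_allowed(rules, "/file01/test.doc")     # Allow is the longer match
other_ok = is_allowed(rules, "/file01/other.doc")  # only Disallow matches
print(doc_ok, other_ok)  # True False
```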

Crawl-delay – Number of seconds a bot should wait between successive requests to the same server. Not every crawler honors it; Googlebot, for example, ignores this directive.

User-agent: *
Disallow: /
Crawl-delay: 10
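A polite crawler would read this value and pause between requests. As a rough sketch, Python's urllib.robotparser exposes it through crawl_delay() (available since Python 3.6); "AnyBot" is a placeholder bot name.

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /",
    "Crawl-delay: 10",
]

parser = RobotFileParser()
parser.parse(rules)

# Returns the delay in seconds for the matching group,
# or None when no Crawl-delay applies to this bot.
delay = parser.crawl_delay("AnyBot")
print(delay)  # 10
```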

Sitemap – Tells crawlers where to find the site's sitemap. It takes a full URL and can appear more than once.

User-agent: *
Disallow: /search
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
Sitemap: http://www.google.com/sitemaps_webmasters.xml
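Sitemap lines are not tied to any User-agent group, so a tool can collect them independently of the crawl rules. The sketch below assumes Python 3.8 or later, where urllib.robotparser provides site_maps(); it parses the example lines in memory and makes no network request.

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /search",
    "Sitemap: http://www.google.com/hostednews/sitemap_index.xml",
    "Sitemap: http://www.google.com/sitemaps_webmasters.xml",
]

parser = RobotFileParser()
parser.parse(rules)

# site_maps() returns the listed URLs in order, or None if there are none.
sitemaps = parser.site_maps()
for url in sitemaps or []:
    print(url)
```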
