What is a Web robot:
Basically, a Web robot is simply a program that automatically and recursively traverses a Web site to retrieve document content and information. Search engine spiders are the most common type of Web robot. These robots visit Web sites and follow the links they find to add more information to the search engine database.
Web robots often go by different names. You may hear them called:
- spiders
- bots
- crawlers
Most commonly, Web robots are used to index a site for a search engine, but they can be used for other purposes as well. Some of the more common uses are:
- Link validation – Robots can follow all the links on a site or a page, testing them to make sure they return a valid page code. The advantage of doing this programmatically is obvious: the robot can visit all the links on a page in a minute or two and report the results far more quickly than a human could manually (a rough sketch of such a robot follows this list).
- HTML validation – Similar to link validation, robots can be sent to various pages on your site to evaluate the HTML coding.
- Change monitoring – There are services available on the Web that will tell you when a Web page has changed. These services work by sending a robot to the page periodically to check whether the content has changed. When it is different, the robot files a report.
- Web site mirroring – Similar to the change-monitoring robots, these robots evaluate a site, and when there is a change, the robot transfers the changed information to the mirror site location.
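As a rough illustration of the link-validation robot described above, here is a minimal sketch in Python using only the standard library. The page URL and the LinkCollector/check_links names are just placeholders for this example, not part of any particular tool:

from urllib.request import urlopen
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Collect the href value of every <a> tag on the page
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def check_links(page_url):
    # Fetch the page, extract its links, and report each link's HTTP status
    html = urlopen(page_url).read().decode("utf-8", errors="replace")
    collector = LinkCollector()
    collector.feed(html)
    for link in collector.links:
        absolute = urljoin(page_url, link)
        try:
            status = urlopen(absolute).status
        except Exception as error:
            status = error
        print(absolute, status)

check_links("http://blogspalace.com/")

A real link-checking robot would also respect robots.txt, throttle its requests, and avoid visiting the same URL twice.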
How To Write a robots.txt File:
Writing a valid robots.txt file for search engine spiders is easy to learn. The steps below will walk you through it:
In a text editor, create a file named robots.txt. Note that the name must be all lower case, even if your Web pages are hosted on a Windows Web server. You’ll need to save this file to the root of your Web server. For example:
http://blogspalace.com/robots.txt
The format of the robots.txt file is –
User-agent: robot
Disallow: files or directories
You can use wildcards to indicate all robots or all robots of a certain type. For example, to specify all robots:
User-agent: *
To specify all robots whose names start with the letter A, use:
User-agent: A*
The Disallow lines can specify files or directories. For example, the following disallows all files and directories:
Disallow: /
To prevent robots from viewing the index.html file, use:
Disallow: /index.html
If you leave the Disallow value blank, all files can be retrieved. For example, you might want Googlebot to see everything on your site:
User-agent: Googlebot
Disallow:
If you disallow a directory, then all files below it will be disallowed as well; for example, the following blocks /norobots/ and everything inside it, such as /norobots/page.html:
Disallow: /norobots/
You can also use multiple Disallows for one User-agent, to deny access to multiple areas:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
You can include comments in your robots.txt file by putting a pound sign (#) at the front of the line to be commented:
# Allow Googlebot anywhere
User-agent: Googlebot
Disallow:
Robots follow the rules in order. For example, if you address Googlebot specifically in one of your first directives, it will ignore a directive lower down that is set to a wildcard. For example:
# Allow Googlebot anywhere
User-agent: Googlebot
Disallow:

# Allow no other bots on the site
User-agent: *
Disallow: /
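If you want to check how a given robots.txt will be interpreted, Python's standard urllib.robotparser module applies the same first-match rules. Here is a minimal sketch using the example above; the robot names and URL are only illustrations:

from urllib.robotparser import RobotFileParser

# The same rules as in the example above
rules = """
# Allow Googlebot anywhere
User-agent: Googlebot
Disallow:

# Allow no other bots on the site
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot matches its own record, so it may fetch anything
print(parser.can_fetch("Googlebot", "http://blogspalace.com/index.html"))     # True

# Any other robot falls through to the wildcard record and is blocked
print(parser.can_fetch("SomeOtherBot", "http://blogspalace.com/index.html"))  # False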
Finally, keep these points in mind:
- Find robot User-agent names in your Web logs
- Always match the capitalization of the agent names and of the files and directories. If you disallow /IMAGES, the robots will still spider your /images folder
- Put your most specific directives first, and your more inclusive ones (with wildcards) last, as in the combined example below
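Putting these tips together, a complete robots.txt built from the examples in this article might look like the following (the directory names are only illustrations):

# The more specific Googlebot record comes first and allows everything
User-agent: Googlebot
Disallow:

# Every other robot is kept out of the script and image directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/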