The Search Engine Professionals at Rank for $ales.com --- In business since 1997.

The Robots.txt file

January 16, 2004

(Updated from my June 2002 article).
A search engine crawler, or spider, is a Web robot and, as such, normally follows the robots.txt file if one is present. The robots exclusion protocol was developed at the end of 1993 and remains, even today, the Web's standard for controlling how search engine robots access a particular Web site.

Most major search engines claim to support it, but no robot, including a search engine spider, has to support it.

The purpose of the robots.txt protocol is to give web servers a mechanism for telling search engine crawlers which parts of the server should not be accessed. In other words, it is meant to keep robots from reading certain parts of a server, which could contain sensitive or confidential information.
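To make the mechanism concrete, here is a minimal Python sketch of how a well-behaved robot could consult a robots.txt file before reading a page. The file contents, the robot name "ExampleBot" and the example.com URLs are all invented for illustration. The parsing is done by Python's standard urllib.robotparser module, not by any search engine's actual code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the paths are made up for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant robot would be allowed to read `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://www.example.com/index.html",
                "http://www.example.com/private/report.html"):
        print(url, may_fetch(ROBOTS_TXT, "ExampleBot", url))
```

With this sample file, the home page is reported as readable and the page under /private/ is not. Nothing in the sketch enforces the answer: a robot that never calls such a check can still request the page.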

How does this purpose relate to preventing a search engine from indexing a particular resource? Unfortunately, the general answer to this question is "It doesn't".

If the robots.txt file can be used to prevent access to certain parts of a web site, it can also prevent access to the whole site! On more than one occasion, Rank for Sales has found the robots.txt file to be the main reason a site wasn't listed in certain search engines. If it isn't written correctly, it can cause all kinds of problems and, worst of all, you will probably never find out about it just by looking at your actual HTML code.

When a client asks us to analyse a web site that has been online for about a year and is not listed in certain engines, the first place we look is the robots.txt file. Once we have corrected that and have optimized the client's most important keywords and keyphrases, the rankings usually go way up within the next thirty to sixty days.
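A quick way to catch this kind of mistake is to ask whether a compliant robot would be allowed to read the site's root at all. The following is a small sketch of such a check, again using Python's standard urllib.robotparser module. The function name blocks_whole_site, the robot name "ExampleBot" and the two sample files are assumptions made for this example, not part of any tool we use.

```python
from urllib.robotparser import RobotFileParser


def blocks_whole_site(robots_txt: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if `robots_txt` forbids `user_agent` from reading "/"."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


# A single slash after Disallow shuts compliant robots out of everything.
BLOCK_EVERYTHING = "User-agent: *\nDisallow: /\n"

# An empty Disallow line means nothing is disallowed.
BLOCK_NOTHING = "User-agent: *\nDisallow:\n"

if __name__ == "__main__":
    print(blocks_whole_site(BLOCK_EVERYTHING))
    print(blocks_whole_site(BLOCK_NOTHING))
```

The two sample files differ by one character, which is why this kind of error is so easy to make and so hard to spot from the HTML alone.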

More on the robots.txt file
The Disallow line in a robots.txt file means "disallow reading", but that does not mean "disallow indexing". In other words, a disallowed resource may be listed in a search engine's index, even if the search engine follows the protocol. The most obvious demonstration of this is the Google search engine. Google can add files to its index without reading them, merely by considering links to those files.

In theory, Google can build an index of an entire Web site without ever visiting that site or ever retrieving its robots.txt file. In so doing it is not breaking the robots.txt protocol, because it is not reading any disallowed resources. It is simply reading other web sites' links to those resources, which Google constantly uses in its PageRank algorithm.
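The idea of indexing a page without reading it can be shown with a toy example. The sketch below builds a tiny keyword index purely from the anchor text of links found on other sites; the target pages themselves are never requested. It is only an illustration of the principle, not a description of how Google actually works, and all URLs, anchor texts and names in it are invented.

```python
from collections import defaultdict

# Hypothetical crawl data: pages the robot was allowed to read, each mapped to
# the (target URL, anchor text) links found on it. All URLs are invented.
LINKS_SEEN = {
    "http://site-a.example/": [
        ("http://target.example/private/plan.html", "secret plan"),
    ],
    "http://site-b.example/": [
        ("http://target.example/private/plan.html", "launch plan"),
        ("http://target.example/", "home page"),
    ],
}


def build_link_index(links_seen):
    """Index target URLs by the words in the anchor text pointing at them."""
    index = defaultdict(set)
    for links in links_seen.values():
        for target, anchor_text in links:
            for word in anchor_text.lower().split():
                index[word].add(target)
    return index


if __name__ == "__main__":
    index = build_link_index(LINKS_SEEN)
    # The disallowed page is findable even though it was never fetched.
    print(sorted(index["plan"]))
```

Here a search for "plan" would return the page under /private/, although a Disallow rule on target.example could have kept every compliant robot from reading it.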

A resource does not necessarily need to be read in order to be indexed. How, then, can the robots.txt file be used to prevent a search engine from listing a particular resource in its index? In practice, most search engines have placed their own interpretation on the robots.txt file, one that allows it to be used to keep resources out of their index.

Most search engines interpret a resource being disallowed by the robots.txt file as meaning they should not add it to their index and, if it is already in their index (placed there by previous spidering activity), that they should remove it. This last point is important.
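As a rough illustration of that interpretation, the sketch below takes a list of URLs already held in an index and keeps only those that the site's current robots.txt still allows. It is a simplified model built on Python's standard urllib.robotparser module. The function name prune_index, the robot name "ExampleBot" and the sample data are assumptions, and real search engines do not publish how or when they perform this clean-up.

```python
from urllib.robotparser import RobotFileParser


def prune_index(indexed_urls, robots_txt, user_agent="ExampleBot"):
    """Return the indexed URLs that robots.txt still allows `user_agent` to read."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in indexed_urls if parser.can_fetch(user_agent, url)]


# Hypothetical index contents left over from earlier spidering.
INDEXED = [
    "http://www.example.com/index.html",
    "http://www.example.com/old/pricing.html",
    "http://www.example.com/products.html",
]

# The site owner has since disallowed the /old/ directory.
NEW_ROBOTS_TXT = "User-agent: *\nDisallow: /old/\n"

if __name__ == "__main__":
    print(prune_index(INDEXED, NEW_ROBOTS_TXT))
```

In this model the pricing page under /old/ drops out of the index the next time the robots.txt file is consulted, while the other two pages stay.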

The anomalies and inadequacies of the robots.txt file and the robots meta tag point to what can sometimes be a bigger problem. It is impossible to prevent any directly accessible resource on a site from being linked to by external sites, be they partner sites, competitors' sites or search engines. There is no legal or technical reason why anyone must honor the robots.txt file or the robots meta tag, least of all humans creating links, for whom these standards were not even written.

This may not seem a bad thing, but there are many instances when a site owner would rather a particular page were never linked to from any other site on the Web. In such cases, the robots.txt file will, to a certain degree, help the site owner achieve his or her goals.

Article written by Serge Thibodeau,
President & CEO,
Rank for $ales
Copyright (c) Serge Thibodeau 2003

Unless otherwise specified, all content and material on this site is copyrighted by Serge Thibodeau of rankforsales.com and may not be reproduced by any means without express written permission. Using my content without permission is a theft of my work. Please contact sthibodeau@rankforsales.com to discuss certain reprint options that would be acceptable.
