
Robots.txt is the common name of a text file that is uploaded to a Web site's
root directory. The robots.txt file is used to provide instructions about the
Web site to Web robots and spiders. Web authors can use robots.txt to keep
cooperating Web robots from accessing all or parts of a Web site that they want
to keep private.
For example, the file would be located at:
http://www.yourwebsite.com/robots.txt
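Because the file sits in the root directory, its address can be worked out from any page URL on the same site. The following is a minimal Python sketch using the standard library's urljoin; the helper name robots_url and the blog page URL are made up for illustration:

```python
from urllib.parse import urljoin


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    # The leading slash makes urljoin resolve against the site root,
    # discarding whatever path the page itself has.
    return urljoin(page_url, "/robots.txt")


print(robots_url("http://www.yourwebsite.com/blog/2015/post.html"))
# prints http://www.yourwebsite.com/robots.txt
```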
Here’s a simple robots.txt file:
User-agent: *
Allow: /wp-content/uploads/
Disallow: /
1. The first line explains which agent (crawler) the
rule applies to. In this case, User-agent: * means the rule applies to
every crawler.
2. The subsequent lines set which paths can (or cannot) be crawled. Allow:
/wp-content/uploads/ permits crawling of your uploads folder (images),
and Disallow: / means no other file or page should be crawled, apart from
what has been allowed above it. You can have multiple rules for a given
crawler.
3. The rules for different crawlers can be listed in sequence, in the same
file.
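One way to see how these rules play out is to feed the simple file above to Python's standard urllib.robotparser module and ask it about a few paths. This is only a sketch: the agent name "AnyBot" and the two test paths are invented for illustration. Python's parser applies the first rule that matches, in file order, and other crawlers may resolve overlapping rules differently.

```python
from urllib.robotparser import RobotFileParser

# The simple robots.txt file discussed above.
SIMPLE_RULES = """\
User-agent: *
Allow: /wp-content/uploads/
Disallow: /
"""


def build_parser(robots_text):
    """Parse robots.txt text (no network access) and return the parser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SIMPLE_RULES)

# "AnyBot" is a hypothetical crawler name; it falls under "User-agent: *".
print(parser.can_fetch("AnyBot", "/wp-content/uploads/photo.jpg"))  # True
print(parser.can_fetch("AnyBot", "/wp-admin/"))                     # False
```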
Robots.txt Examples
This rule lets crawlers index everything. Because nothing is blocked, it’s like
having no rules at all:
User-agent: *
Disallow:
This rule lets crawlers index everything under the
“wp-content” folder, and nothing else:
User-agent: *
Allow: /wp-content/
Disallow: /
This lets a single crawler (Googlebot, Google’s crawler) index everything, and
blocks the site for everyone else:
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
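The per-crawler grouping can be checked the same way. The sketch below assumes Google's crawler identifies itself as Googlebot and uses a made-up second crawler, "OtherBot", to stand in for everyone else; it relies only on Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# One group for Googlebot (nothing disallowed), one for all other crawlers.
GOOGLE_ONLY = """\
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
"""

google_only_parser = RobotFileParser()
google_only_parser.parse(GOOGLE_ONLY.splitlines())


def allowed_agents(parser, agents, path):
    """Return the crawler names from agents that may fetch path."""
    return [agent for agent in agents if parser.can_fetch(agent, path)]


# "OtherBot" is a hypothetical crawler name used only for this check.
print(allowed_agents(google_only_parser, ["Googlebot", "OtherBot"], "/any/page.html"))
# prints ['Googlebot']
```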
Some hosts may have default entries that block system files (you don’t want
bots kicking off CPU-intensive scripts):
User-agent: *
Disallow: /tmp/ 
Disallow: /cgi-bin/ 
Disallow: /~uname/
Block all crawlers from a specific file:
User-agent: *
Disallow: /dir/file.html
Block Googlebot (Google’s crawler) from indexing URLs with a query parameter
(which is often a generated result, like a search):
User-agent: Googlebot
Disallow: /*?
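The * in this rule is a wildcard. To my knowledge, Python's standard urllib.robotparser only does plain prefix matching, so the sketch below hand-rolls the matching instead. It assumes * stands for any run of characters and a trailing $ anchors the end of the URL; the function names and sample paths are hypothetical, and this is not how any particular crawler is implemented.

```python
import re


def rule_to_regex(rule_path):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape each literal piece, then let every "*" match any run of characters.
    body = ".*".join(re.escape(piece) for piece in rule_path.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(rule_path, url_path):
    """True if url_path falls under rule_path (matched from the start of the path)."""
    return rule_to_regex(rule_path).match(url_path) is not None


print(rule_matches("/*?", "/search?q=robots"))  # True
print(rule_matches("/*?", "/about.html"))       # False
```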
Google’s Webmaster Tools (now called Google Search Console) can help you check
your robots.txt rules.

 