Sunday, 15 March 2015

What is a robots.txt file?


Robots.txt is the common name of a text file that is uploaded to a Web site's root directory. The robots.txt file gives instructions about the site to Web robots and spiders (crawlers). Web authors can use robots.txt to keep cooperating robots from accessing all or part of a Web site they want to keep private.

For example:

http://www.yourwebsite.com/robots.txt
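
If you want to see how a crawler would interpret a site's robots.txt, Python's standard urllib.robotparser module can fetch the file and answer "may this agent fetch this URL?". This is only a minimal sketch: the domain above is a placeholder and "MyCrawler" is an invented agent name, so substitute real values before running it.

# Minimal sketch using Python's standard library to read a robots.txt
# file and ask whether a crawler may fetch a URL. The domain and the
# crawler name "MyCrawler" are placeholders, not real values.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.yourwebsite.com/robots.txt")
rp.read()  # downloads and parses the file

# can_fetch(user_agent, url) returns True if the rules allow that agent
print(rp.can_fetch("MyCrawler", "http://www.yourwebsite.com/some-page.html"))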

Here’s a simple robots.txt file:

User-agent: *

Allow: /wp-content/uploads/
Disallow: /


1. The first line says which user agent (crawler) the rules apply to. In this case, User-agent: * means they apply to every crawler.


2. The subsequent lines set which paths may (or may not) be crawled. Allow: /wp-content/uploads/ permits crawling of the uploads folder (images), and Disallow: / means nothing else on the site should be crawled aside from what has been allowed above. You can list multiple rules for a given crawler (see the sketch after this list).



3. The rules for different crawlers can be listed in sequence, in the same file.
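
To see these rules in action without touching a live site, the same lines can be fed straight to Python's robots.txt parser. This is just a sketch; the crawler name and the paths being tested are made up:

# Sketch: evaluate the example rules above with Python's built-in parser.
# The crawler name and the paths being tested are illustrative only.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Allow: /wp-content/uploads/",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# The uploads folder is explicitly allowed...
print(rp.can_fetch("MyCrawler", "/wp-content/uploads/photo.jpg"))  # True
# ...while everything else falls under "Disallow: /"
print(rp.can_fetch("MyCrawler", "/about/"))                        # False

Note that Python's parser applies rules in file order, so the more specific Allow line has to come before the blanket Disallow; Google's crawler instead follows the longest matching rule, which gives the same result here.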



Robots.txt Examples

This rule lets crawlers index everything. Because nothing is blocked, it’s like having no rules at all:


User-agent: *
Disallow:



This rule lets crawlers index everything under the “wp-content” folder, and nothing else:


User-agent: *
Allow: /wp-content/
Disallow: /


This lets a single crawler (Googlebot, Google’s main crawler) index everything, and blocks the site for everyone else:


User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
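
A quick way to convince yourself that the right group is picked per crawler is to run the same rules through Python's parser. In this sketch, "SomeOtherBot" is just an invented name used for contrast:

# Sketch: per-crawler groups checked with Python's parser.
# "SomeOtherBot" is an invented name used for contrast.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: Googlebot",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "/index.html"))     # True  - empty Disallow allows everything
print(rp.can_fetch("SomeOtherBot", "/index.html"))  # False - caught by the catch-all block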


Some hosts may have default entries that block system files (you don’t want bots kicking off CPU-intensive scripts):


User-agent: *
Disallow: /tmp/ 
Disallow: /cgi-bin/ 
Disallow: /~uname/


Block all crawlers from a specific file:


User-agent: *
Disallow: /dir/file.html
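
The directory blocks and the single-file block above can be sanity-checked the same way; every path in this sketch is a placeholder:

# Sketch: checking the directory and single-file blocks above.
# All paths are placeholders.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /tmp/",
    "Disallow: /cgi-bin/",
    "Disallow: /dir/file.html",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("MyCrawler", "/cgi-bin/heavy-script.cgi"))  # False - whole folder blocked
print(rp.can_fetch("MyCrawler", "/dir/file.html"))             # False - this one file blocked
print(rp.can_fetch("MyCrawler", "/dir/other.html"))            # True  - the rest of /dir/ is fine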


Block Google’s crawler from URLs with a query parameter (which is often a generated result, like a search). The * wildcard is an extension supported by Google and some other crawlers, not part of the original robots.txt standard:


User-agent: Googlebot
Disallow: /*?
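
Python's built-in robotparser treats * literally rather than as a wildcard, so here is a rough sketch of how a wildcard-aware crawler might test a URL against a pattern like /*? using a regular expression. The helper name and sample URLs are purely illustrative:

# Rough sketch: how a crawler that supports the "*" extension might test
# a URL path against the pattern "/*?". The helper name and sample URLs
# are illustrative; Python's built-in robotparser does not expand "*".
import re

def robots_pattern_to_regex(pattern):
    # Escape the pattern, then turn the escaped "*" back into "match anything".
    regex = re.escape(pattern).replace(r"\*", ".*")
    # A trailing "$" in robots.txt anchors the match to the end of the URL.
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.compile(regex)

blocked = robots_pattern_to_regex("/*?")

print(bool(blocked.match("/search?q=robots")))  # True  - has a query string, blocked
print(bool(blocked.match("/about.html")))       # False - no "?" in the URL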



Google’s Webmaster Tools can help you check and test your robots.txt rules.
