Robots Exclusion Protocol

Introduction

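The Robots Exclusion Protocol (REP) is the set of conventions that tell web crawlers which parts of a site they may crawl and index. It covers robots.txt, XML Sitemaps, robots meta tags, the X-Robots-Tag HTTP header, and the nofollow link attribute.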

Crawling Versus Indexing

Crawling is the process by which a web spider retrieves web documents. Crawling by itself does not determine how pages are ranked.
Indexing is the algorithmic process of analyzing and storing the crawled information in an index.

robots.txt Configurations

Disallowing image crawling

User-agent: Yahoo-MMCrawler
Disallow: /images/
Allow: /images/public/

User-agent: msnbot-media
Disallow: /images/
Allow: /images/public/

User-agent: Googlebot-Image
Disallow: /images/
Allow: /images/public/

User-agent: *
Disallow: /images/
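
How a crawler resolves the overlap between Disallow: /images/ and Allow: /images/public/ depends on its precedence rule. The following is a minimal Python sketch, assuming the longest-match precedence described in RFC 9309 (the most specific matching path wins, and Allow wins a tie). The function name is_allowed and the sample paths are hypothetical, and crawlers that apply the first matching rule instead would behave differently.

```python
# Minimal sketch of longest-match rule evaluation for one robots.txt group.
# Assumption: precedence follows RFC 9309 (longest matching path wins,
# Allow wins ties). Wildcards are not handled here; rules are plain prefixes.

def is_allowed(rules, path):
    """rules: list of (directive, path_prefix) tuples for a single group."""
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = directive.lower() == "allow"
            # Prefer the longer prefix; on equal length prefer Allow.
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len = len(prefix)
                allowed = is_allow
    return allowed

# Rules from the image-crawler groups and from the catch-all group above.
image_bot_rules = [("disallow", "/images/"), ("allow", "/images/public/")]
default_rules = [("disallow", "/images/")]

print(is_allowed(image_bot_rules, "/images/public/logo.png"))   # True
print(is_allowed(image_bot_rules, "/images/private/logo.png"))  # False
print(is_allowed(default_rules, "/images/public/logo.png"))     # False
print(is_allowed(default_rules, "/about.html"))                 # True
```

Under this rule the order of the Allow and Disallow lines inside a group does not matter; under a first-match rule it would, so placing the Allow line first is the more portable choice.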

Blocking Office documents

User-agent: *
Disallow: /*.doc$
Disallow: /*.xls$
Disallow: /*.ppt$
Disallow: /*.mpp$
Disallow: /*.mdb$
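
The patterns above rely on two wildcard characters: * matches any run of characters, and a trailing $ anchors the pattern to the end of the URL. The following Python sketch shows one way such a pattern could be translated into a regular expression. The helper names pattern_to_regex and matches and the sample paths are hypothetical, and the sketch assumes the wildcard semantics described in RFC 9309; crawlers that do not support wildcards will ignore or misread these rules.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the
    pattern to the end of the URL. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def matches(pattern, path):
    # re.match anchors at the start, as robots.txt rules do.
    return pattern_to_regex(pattern).match(path) is not None

print(matches("/*.doc$", "/files/report.doc"))      # True
print(matches("/*.doc$", "/files/report.docx"))     # False
print(matches("/*.doc$", "/files/report.doc?x=1"))  # False
print(matches("/*.doc", "/files/report.docx"))      # True
```

The last two calls show why the $ matters: with the anchor, a URL carrying a query string is no longer blocked, and without it, the rule also catches any URL that merely contains .doc.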

Robots Meta Directives

There are two types of meta directives: those that are part of the HTML page, and those that the web server sends as HTTP headers.
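
As an illustration of the two delivery channels, the following Python sketch reads directives from a robots meta tag in an HTML page and from an X-Robots-Tag response header. The sample HTML, the headers dictionary, and the names RobotsMetaParser and parse_directives are hypothetical. The sketch assumes a plain comma-separated directive list and ignores crawler-specific forms.

```python
from html.parser import HTMLParser

def parse_directives(value):
    """Split a comma-separated directive string into lowercase tokens."""
    return [d.strip().lower() for d in value.split(",") if d.strip()]

class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            self.directives += parse_directives(attr.get("content") or "")

# Hypothetical page and response headers, for illustration only.
html = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
headers = {"X-Robots-Tag": "noarchive"}

parser = RobotsMetaParser()
parser.feed(html)
page_directives = parser.directives
header_directives = parse_directives(headers.get("X-Robots-Tag", ""))

print(page_directives)    # ['noindex', 'nofollow']
print(header_directives)  # ['noarchive']
```

The header form is the only option for non-HTML resources such as PDF files or images, which have no place for a meta tag.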

The nofollow Link Attribute

Links marked with the nofollow attribute (rel="nofollow") do not pass link juice, that is, ranking credit, to the page they point to.
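
To make the attribute concrete, the following Python sketch separates the links in an HTML fragment into those marked nofollow and those that are not. The class name LinkCollector and the sample markup are hypothetical. The sketch treats rel as a space-separated token list, so values such as rel="external nofollow" are also recognized.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Sorts <a href> targets by whether rel contains the nofollow token."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel is a space-separated list of tokens; compare case-insensitively.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)

# Hypothetical markup, for illustration only.
sample = (
    '<a href="/about">About</a>'
    '<a href="https://example.com/ad" rel="nofollow">Ad</a>'
    '<a href="https://example.org/x" rel="external nofollow">X</a>'
)

collector = LinkCollector()
collector.feed(sample)
print(collector.followed)  # ['/about']
print(collector.nofollow)  # ['https://example.com/ad', 'https://example.org/x']
```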

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-Share Alike 2.5 License.