Robots Exclusion Protocol
Introduction
REP stands for Robots Exclusion Protocol. It involves robots.txt, XML Sitemaps, robots meta tags, the X-Robots-Tag HTTP header, and the nofollow link attribute.
Crawling Versus Indexing
Crawling is the process, initiated by a web spider, of retrieving web documents. Crawling by itself does not determine how pages are ranked.
Indexing is the algorithmic process of analyzing and storing the crawled information in an index.
robots.txt Configurations
Disallowing image crawling
User-agent: Yahoo-MMCrawler
Disallow: /images/
Allow: /images/public/

User-agent: msnbot-media
Disallow: /images/
Allow: /images/public/

User-agent: Googlebot-Image
Disallow: /images/
Allow: /images/public/

User-agent: *
Disallow: /images/
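When a Disallow and an Allow rule both match a path, the outcome depends on how the crawler resolves the conflict. The sketch below assumes the longest-match convention described in RFC 9309 (the most specific rule wins, and Allow wins a tie); older crawlers may behave differently. The names RULES and is_allowed are hypothetical, and only plain path prefixes are handled.

```python
# Sketch: longest-match precedence within one robots.txt group.
# Assumption: plain prefix rules only, no wildcards.
RULES = [
    ("disallow", "/images/"),
    ("allow", "/images/public/"),
]

def is_allowed(path, rules=RULES):
    """Return True if `path` may be crawled under `rules`."""
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for kind, prefix in rules:
        if not path.startswith(prefix):
            continue
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and kind == "allow"
        if longer or tie_for_allow:
            best_len = len(prefix)
            allowed = (kind == "allow")
    return allowed
```

Under these assumptions, is_allowed("/images/logo.png") is False, while is_allowed("/images/public/logo.png") is True because the Allow rule is the longer match.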
Blocking Office documents
User-agent: *
Disallow: /*.doc$
Disallow: /*.xls$
Disallow: /*.ppt$
Disallow: /*.mpp$
Disallow: /*.mdb$
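To make the wildcard syntax concrete, here is a minimal sketch of how such patterns could be evaluated, assuming the common interpretation in which * matches any run of characters and a trailing $ anchors the pattern to the end of the URL path. The helper names pattern_to_regex and matches are hypothetical, and real crawlers may differ in details such as percent-encoding.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text; each '*' becomes '.*' (any run of characters).
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def matches(pattern, path):
    # re.match anchors at the start of the path, as robots.txt rules do.
    return pattern_to_regex(pattern).match(path) is not None
```

With this sketch, matches("/*.doc$", "/files/report.doc") is True and matches("/*.doc$", "/files/report.docx") is False. A pattern without the trailing $, such as "/*.doc", would also match the .docx path, since it then behaves as a prefix rule.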
Robots Meta Directives
There are two types of robots meta directives: those embedded in the HTML page as robots meta tags, and those that the web server sends as HTTP headers (X-Robots-Tag).
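As an illustration of the two delivery mechanisms, the sketch below collects directives from both a robots meta tag and an X-Robots-Tag response header. The names RobotsMetaParser, split_directives, and robots_directives are hypothetical. It assumes the headers arrive as a plain dict keyed exactly by "X-Robots-Tag", and it ignores crawler-specific forms such as a user-agent prefix in the header value.

```python
from html.parser import HTMLParser

def split_directives(value):
    """Split a comma-separated directive list into a lowercase set."""
    return {d.strip().lower() for d in value.split(",") if d.strip()}

class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            self.directives |= split_directives(attrs.get("content") or "")

def robots_directives(html, headers):
    """Merge directives found in the page with those sent in the header."""
    parser = RobotsMetaParser()
    parser.feed(html)
    header_value = headers.get("X-Robots-Tag", "")
    return parser.directives | split_directives(header_value)
```

For example, a page containing a robots meta tag with content "noindex, nofollow", served with the header X-Robots-Tag: noarchive, yields the set {"noindex", "nofollow", "noarchive"}. The header form is the only option for non-HTML resources such as PDFs or images, which cannot carry a meta tag.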
The nofollow Link Attribute
Links marked with the nofollow attribute (rel="nofollow") will not pass any link juice, that is, ranking credit, to the page they point to.
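A minimal sketch of how a tool might separate nofollow links from ordinary ones, assuming the attribute appears as a whitespace-separated token in the rel attribute of an anchor tag. LinkCollector and classify_links are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Sorts <a href> targets by whether rel contains 'nofollow'."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # rel may hold several tokens, e.g. rel="noopener nofollow".
        rel_tokens = (attrs.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)

def classify_links(html):
    """Return (followed, nofollowed) lists of href values."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.followed, collector.nofollowed
```

Calling classify_links on a page returns two lists of href values, which can be handy for auditing which outbound links carry the attribute.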