
Crawling robots, blocking bad bots

Sites are crawled around the clock by bots (robots), most often search engines such as Googlebot, msnbot, YandexBot, bingbot and others. These robots index the content of sites in order to provide more accurate and up-to-date search results, and they strive to crawl as many pages of your site as possible. You can specify which directories should not be crawled by these bots using a robots.txt file in the root directory of the site.

Before crawling the site, the robot checks the robots.txt file to find out which directories can be indexed and which cannot. The syntax of the file is quite simple:

User-agent: *
Disallow: /

The User-agent line specifies which bot the listed restrictions apply to, and Disallow lists the forbidden directories. In the example above, the wildcard (*) states that the rules apply to absolutely all bots, and Disallow: / denies access to the root directory, i.e. to the entire site and all of its subdirectories. If you leave the value after Disallow: empty, no directories are restricted and bots are free to crawl everything in your hosting account, as shown below.
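For example, a robots.txt file that places no restrictions and allows all bots to crawl the entire site looks like this:

# An empty Disallow value means nothing is restricted
User-agent: *
Disallow: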

To restrict Googlebot's access to the /admin directory, for example, the robots.txt file should look like this:

User-agent: Googlebot
Disallow: /admin

You can also deny access to a specific file:

User-agent: Googlebot
Disallow: /DirectoryName/FileName.html

If you are not sure about the exact name of the bot you want to restrict, you can find it in the Awstats statistics or in the Raw Access Log of the site (an example log entry is shown below). Detailed information about the robots.txt file and how to use it can be found at the following address:

https://www.robotstxt.org/robotstxt.html

The site also includes a list of a large number of robots and a brief description of each.
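In the raw access log, the User-agent with which a robot introduces itself appears as the last quoted field of each line. A hypothetical entry (the IP address, date and requested path are made up for illustration) could look like this:

203.0.113.5 - - [01/Jan/2024:12:00:00 +0000] "GET /index.html HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The quoted string at the end is the name you would use in robots.txt or in the .htaccess rules described below.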
Bad bots

There are other robots whose crawling does not help the positioning of the site on the web; on the contrary, they scan the site in order to try to abuse it. This includes searching for security vulnerabilities, posting SPAM through contact forms, collecting email addresses to which SPAM is then sent, and more. We call such robots bad bots, and we can use the .htaccess file if we want to restrict their access.

An effective method for blocking bad robots is to match the User-agent with which the robot introduces itself. You can restrict such a User-agent with Rewrite rules in .htaccess:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^(.*)Surfbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*)ChinaClaw [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*)Zeus [NC]
RewriteRule .* - [F]

In this example, the Surfbot, ChinaClaw, and Zeus robots will receive a 403 Forbidden response when they try to access the contents of the directory where the .htaccess file is placed. You can add more robots by adding another RewriteCond line for each of them; every line except the one for the last User-agent must end with the [OR] (or) flag. Keep in mind, however, that adding too many rules to the .htaccess file can slow down the loading of the site in some cases.

When using such a block, it is recommended that the 403 Forbidden and 404 Not Found error pages exist as static files. If these pages are dynamically generated by your system, the blocked requests can cause additional load (see the example below).
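One way to serve static error pages, assuming you have uploaded plain 403.html and 404.html files in the root directory of the site (the file names here are only an example), is to point to them in the same .htaccess file:

# Serve static error pages instead of dynamically generated ones
ErrorDocument 403 /403.html
ErrorDocument 404 /404.html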

Another way to block a User-agent is to use SetEnvIfNoCase, again in the .htaccess file. Here is an example:

SetEnvIfNoCase User-Agent "^(.*)Surfbot" bad_bot
SetEnvIfNoCase User-Agent "^(.*)ChinaClaw" bad_bot
SetEnvIfNoCase User-Agent "^(.*)Zeus" bad_bot

<Limit GET POST HEAD>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>

The first part defines the User-agents that will be recognized as bad and sets the bad_bot variable for them, and the second part denies GET, POST, and HEAD requests from such robots.
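Note that Order, Allow and Deny are the older Apache 2.2 access-control directives. If your server runs Apache 2.4 without the mod_access_compat module (this depends on the hosting setup, so treat the following as a sketch rather than a drop-in rule), an equivalent block that reuses the same bad_bot variable would rely on the Require directives instead:

<Limit GET POST HEAD>
    <RequireAll>
        # Allow everyone except requests marked as bad_bot above
        Require all granted
        Require not env bad_bot
    </RequireAll>
</Limit>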
