It is easy to write a Python crawler, but writing one safely takes more than technical knowledge; you also need to stay on the right side of the law. The Robots protocol is part of that. If you ignore it and crawl something you should not, you may even face jail time!

1. Introduction to the Robots Protocol

The Robots protocol is also called the crawler protocol or the robot protocol. Its full name is the Robots Exclusion Protocol, and it is used to tell crawlers and search engines which pages may be crawled and which may not. The rules are usually placed in a text file named robots.txt, located in the root directory of the website.

Note that robots.txt only tells a crawler what it should and should not crawl; it does not technically prevent the crawler from fetching the prohibited resources. It is purely a notification. You can write a crawler that ignores robots.txt, but as an ethical, educated, and disciplined crawler author you should follow the rules it describes as far as possible; otherwise you may run into legal disputes.

When a crawler visits a website, it first checks whether a robots.txt file exists in the site's root directory. If it does, the crawler restricts itself to the crawling range defined in the file; if it does not, the crawler crawls all directly accessible pages of the site. Let's look at an example of a robots.txt file:
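A minimal sketch of such a file, assuming the rule described next (any crawler, only the test directory allowed):

```
User-agent: *
Allow: /test/
Disallow: /
```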
This rule applies to all crawlers and forbids crawling anything outside the test directory. If this robots.txt file is placed in the root directory of a website, search engine crawlers will only crawl resources under the test directory, and we will find that the search engine can no longer surface resources in other directories. The User-agent line names the crawler the rule applies to; setting it to * means the rule applies to all crawlers. We can also target specific crawlers, as in the following setting that explicitly names Baidu's crawler:
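A sketch of the same rule restricted to Baidu's crawler, whose published User-agent name is Baiduspider:

```
User-agent: Baiduspider
Allow: /test/
Disallow: /
```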
There are two important directives in a robots.txt file: Disallow and Allow. The former prohibits crawling and the latter permits it; in other words, Disallow is a blacklist and Allow is a whitelist. Here are a few example Robots protocol rules.

1. Prohibit all crawlers from crawling any resource on the website
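A sketch of such a rule:

```
User-agent: *
Disallow: /
```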
2. Prevent all crawlers from crawling resources in the /private and /person directories of the website
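A sketch, assuming /private and /person are directories (hence the trailing slashes):

```
User-agent: *
Disallow: /private/
Disallow: /person/
```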
3. Prohibit only the Baidu crawler from crawling website resources
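A sketch; crawlers with no matching User-agent entry are allowed by default, so only Baiduspider is blocked:

```
User-agent: Baiduspider
Disallow: /
```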
Many search engine crawlers have specific names. Table 1 lists some commonly used crawler names.

Table 1 Common crawler names

| Crawler name | Search engine |
| --- | --- |
| Baiduspider | Baidu |
| Googlebot | Google |
| Bingbot | Microsoft Bing |
| 360Spider | 360 Search |
| Sogou web spider | Sogou |

2. Parsing the Robots Protocol

We do not need to parse the Robots protocol ourselves: the robotparser module of the urllib library provides the corresponding API for parsing robots.txt files, namely the RobotFileParser class. The class can be used in several ways. For example, you can set the URL of the robots.txt file through the set_url method, read the file, and then query it. The code is as follows:
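A runnable sketch of this usage; the example.com URLs are placeholders, so the actual output depends on the robots.txt of whichever site you point the parser at:

```python
from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt, then download and parse it
rp = RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')
rp.read()

# Ask whether a given User-agent may fetch a given URL
print(rp.can_fetch('*', 'https://www.example.com/test/index.html'))
print(rp.can_fetch('*', 'https://www.example.com/private/index.html'))
```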
The can_fetch method checks whether, according to the Robots protocol, a given URL on the website may be crawled: it returns True if crawling is allowed and False otherwise. The constructor of the RobotFileParser class can also accept the robots.txt URL directly, after which can_fetch is used in the same way to determine whether a page may be fetched.
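A sketch of the constructor form, again with a placeholder URL:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt URL can be passed straight to the constructor
rp = RobotFileParser('https://www.example.com/robots.txt')
rp.read()

# Check permission for a specific crawler name
print(rp.can_fetch('Baiduspider', 'https://www.example.com/test/index.html'))
```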
The following example uses the parse method to supply the contents of a robots.txt file directly and then prints whether different URLs are allowed to be crawled. This is another way to use the RobotFileParser class.
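A sketch of this approach, reusing the example rules from earlier. Note that urllib's parser applies the first rule that matches a URL, so Allow must come before the blanket Disallow for the test directory to remain crawlable:

```python
from urllib.robotparser import RobotFileParser

# Rules supplied as text instead of being downloaded from a website
robots_txt = """\
User-agent: *
Allow: /test/
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch('*', 'https://www.example.com/test/index.html'))
print(rp.can_fetch('*', 'https://www.example.com/private/index.html'))
```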
The results for the sketch above are as follows:
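```
True
False
```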