It is easy to write a Python crawler, but writing one safely takes more than technical knowledge; you also need to stay on the right side of the law. The Robots protocol is part of that. If you ignore it and crawl something you should not, you may even face jail time!

1. Introduction to the Robots Protocol

The Robots protocol is also called the crawler protocol or the robot protocol. Its full name is the Robots Exclusion Protocol, and it is used to tell crawlers and search engines which pages may be crawled and which may not. The rules are usually placed in a text file named robots.txt, located in the root directory of the website.

Note that robots.txt only tells a crawler what it should and should not crawl; it does not technically prevent the crawler from fetching the prohibited resources. It is purely a notification. You can write a crawler that ignores robots.txt, but as an ethical, educated, and disciplined crawler author you should follow the rules it describes as far as possible; otherwise you may run into legal disputes.

When a crawler visits a website, it first checks whether a robots.txt file exists in the site's root directory. If it does, the crawler restricts itself to the crawling range defined in the file; if it does not, the crawler crawls all directly accessible pages of the site. Let's look at an example of a robots.txt file:
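A minimal sketch of such a file, assuming the rule described next (any crawler, only the test directory allowed):

```
User-agent: *
Allow: /test/
Disallow: /
```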
This rule applies to all crawlers and forbids crawling anything outside the test directory. If this robots.txt file is placed in the root directory of a website, search engine crawlers will only crawl resources under the test directory, and we will find that the search engine can no longer surface resources in other directories. The User-agent line names the crawler the rule applies to; setting it to * means the rule applies to all crawlers. We can also target specific crawlers, as in the following setting that explicitly names Baidu's crawler:
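A sketch of the same rule restricted to Baidu's crawler, whose published User-agent name is Baiduspider:

```
User-agent: Baiduspider
Allow: /test/
Disallow: /
```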
There are two important directives in a robots.txt file: Disallow and Allow. The former prohibits crawling and the latter permits it; in other words, Disallow is a blacklist and Allow is a whitelist. Here are a few example Robots protocol rules.

1. Prohibit all crawlers from crawling any resource on the website
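A sketch of such a rule:

```
User-agent: *
Disallow: /
```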
2. Prevent all crawlers from crawling resources in the /private and /person directories of the website
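A sketch, assuming /private and /person are directories (hence the trailing slashes):

```
User-agent: *
Disallow: /private/
Disallow: /person/
```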
3. Prohibit only the Baidu crawler from crawling website resources
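A sketch; crawlers with no matching User-agent entry are allowed by default, so only Baiduspider is blocked:

```
User-agent: Baiduspider
Disallow: /
```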
Many search engine crawlers have specific names. Table 1 lists some commonly used crawler names.

Table 1 Common crawler names

| Crawler name | Search engine |
| --- | --- |
| Baiduspider | Baidu |
| Googlebot | Google |
| Bingbot | Microsoft Bing |
| 360Spider | 360 Search |
| Sogou web spider | Sogou |

2. Parsing the Robots Protocol

We do not need to parse the Robots protocol ourselves: the robotparser module of the urllib library provides the corresponding API for parsing robots.txt files, namely the RobotFileParser class. The class can be used in several ways. For example, you can set the URL of the robots.txt file through the set_url method, read the file, and then query it. The code is as follows:
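A runnable sketch of this usage; the example.com URLs are placeholders, so the actual output depends on the robots.txt of whichever site you point the parser at:

```python
from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt, then download and parse it
rp = RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')
rp.read()

# Ask whether a given User-agent may fetch a given URL
print(rp.can_fetch('*', 'https://www.example.com/test/index.html'))
print(rp.can_fetch('*', 'https://www.example.com/private/index.html'))
```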
The can_fetch method checks whether, according to the Robots protocol, a given URL on the website may be crawled: it returns True if crawling is allowed and False otherwise. The constructor of the RobotFileParser class can also accept the robots.txt URL directly, after which can_fetch is used in the same way to determine whether a page may be fetched.
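A sketch of the constructor form, again with a placeholder URL:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt URL can be passed straight to the constructor
rp = RobotFileParser('https://www.example.com/robots.txt')
rp.read()

# Check permission for a specific crawler name
print(rp.can_fetch('Baiduspider', 'https://www.example.com/test/index.html'))
```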
The following example uses the parse method to supply the contents of a robots.txt file directly and then prints whether different URLs are allowed to be crawled. This is another way to use the RobotFileParser class.
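A sketch of this approach, reusing the example rules from earlier. Note that urllib's parser applies the first rule that matches a URL, so Allow must come before the blanket Disallow for the test directory to remain crawlable:

```python
from urllib.robotparser import RobotFileParser

# Rules supplied as text instead of being downloaded from a website
robots_txt = """\
User-agent: *
Allow: /test/
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch('*', 'https://www.example.com/test/index.html'))
print(rp.can_fetch('*', 'https://www.example.com/private/index.html'))
```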
The results for the sketch above are as follows:
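```
True
False
```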