When we write a crawler, we may need to generate a new URL based on the current URL in the crawler. For example, the following pseudo code:
The running effect is shown in the figure below: But sometimes, the page turning parameter is not necessarily a number. For example, on some websites, visit a URL: https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD When you access this URL, it returns a JSON string, and this JSON contains the following fields:
This situation is more common in information flow websites. It can only scroll down infinitely to view the next page, and cannot jump directly by page number. The parameter after the next page is returned every time a request is made. When you want to access the next page, use this parameter to replace the parameter after after= in the current URL. In this way, replacing the parameters in the URL is not a simple task. Because the URL may have 4 situations:
You can try to cover these 4 situations and generate the URL of the next page using regular expressions. In fact, we don't need to use regular expressions. Python's built-in urllib module already provides a solution to this problem. Let's take a look at a piece of code:
The running effect is shown in the figure below: As can be seen from the figure, in these four cases, we can successfully add the parameter after = 0000000 for the next page. There is no need to consider how regular expressions can adapt to all situations. Among them, urlparse and urlunparse are a pair of opposite functions. The former converts the URL into a ParseResult object, and the latter converts the ParseResult object back to a URL string. The .query property of the ParseResult object is a string, which is the content after the question mark in the URL, in the following format: parse_qs and urlencode are also a pair of opposite functions. The former converts the string output by .query into a dictionary, while the latter converts the field into a string in the form of .query: After using parse_qs to convert the query into a dictionary, you can modify the parameter value and then convert it back again. Since the .query property of the ParseResult object is a read-only property and cannot be overwritten, we need to call an internal method ._replace to replace the new .query field and generate a new ParseResult object. Finally, convert it back to a URL. The above is what we introduced today, how to use the functions that come with urllib to replace the fields in the URL. |
<<: GSMA: 5G networks will cover two-fifths of the world's population by 2025
>>: 5G in numbers: 5G trends revealed by statistics in the first half of 2021
As a product of the deep integration and applicat...
Recently, the Guangzhou Municipal Bureau of Indus...
Aoyou Host is a long-established foreign VPS host...
Software-defined data center is a data management...
The recent maturation of technologies such as hig...
Aruba, a Hewlett Packard Enterprise (NYSE: HPE) c...
As the demand for high-speed, reliable networks c...
IDC believes that the acceptance and adoption of ...
SSH (Secure Shell) is an end-to-end encrypted net...
Hello everyone, I am Xiaolin. A reader was asked ...
At Cheap Windows VPS, we are always innovating ou...
This year is a period of large-scale 5G construct...
Wireless networking is truly part of the culture ...
On April 23, 2021, the 79th China Educational Equ...
HostXen is a DIY-configurable cloud hosting platf...