One skill a day: Even joining URLs can trip you up when writing a crawler

When writing crawlers, we often need to parse the list pages of a website. Take the following page as an example:

 <html>
 <head>
        <meta charset= "utf-8" >
 <title>Test relative path</title>
 </head>
 <body>
 <div>
 <h1>Book List</h1>
 <ul>
 <li> <a href="http://127.0.0.1:8000/book/1.html" > First Book</a></li>
 <li> <a href="http://127.0.0.1:8000/book/2.html" > The Second Book</a></li>
 <li> <a href="http://127.0.0.1:8000/book/3.html" > The third book</a></li>
 <li> <a href="http://127.0.0.1:8000/book/4.html" > The fourth book</a></li>
 <li> <a href="http://127.0.0.1:8000/book/5.html" > Fifth Book</a></li>
 </ul>
 </div>
 </body>
 </html>

Rendered in a browser, the page looks like the figure below:

In this case, getting the URL of each item is simple: a single XPath expression is enough, as shown below:
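As a rough illustration of that extraction step, here is a minimal sketch using only the Python standard library. It is not necessarily the tool used for the original screenshot: `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup, so the sketch feeds it just the `<ul>` part of the page. The names `LIST_FRAGMENT` and `extract_hrefs` are made up for this example; a real crawler would more likely use a full HTML parser with XPath support.

```python
import xml.etree.ElementTree as ET

# Well-formed fragment of the list page above (only the <ul> part, shortened).
LIST_FRAGMENT = """
<ul>
  <li><a href="http://127.0.0.1:8000/book/1.html">First Book</a></li>
  <li><a href="http://127.0.0.1:8000/book/2.html">The Second Book</a></li>
  <li><a href="http://127.0.0.1:8000/book/3.html">The third book</a></li>
</ul>
"""


def extract_hrefs(markup):
    """Return the href of every <a> that sits directly inside an <li>."""
    root = ET.fromstring(markup.strip())
    # ".//li/a" is within the limited XPath subset that ElementTree supports.
    return [a.get("href") for a in root.findall(".//li/a")]


hrefs = extract_hrefs(LIST_FRAGMENT)
for href in hrefs:
    print(href)
```

Because the links are absolute, the extracted values can be requested directly without any further processing.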

If you look closely, you will find that every URL starts with http://127.0.0.1:8000, which is also the address of the current list page. So, for simplicity, the <a> tags can use relative paths instead:

 <html>
 <head>
        <meta charset= "utf-8" >
 <title>Test relative path</title>
 </head>
 <body>
 <div>
 <h1>Book List</h1>
 <ul>
 <li> <a href="/book/1.html" > First Book</a></li>
 <li> <a href="/book/2.html" > The Second Book</a></li>
 <li> <a href="/book/3.html" > The third book</a></li>
 <li> <a href="/book/4.html" > The fourth book</a></li>
 <li> <a href="/book/5.html" > Fifth Book</a></li>
 </ul>
 </div>
 </body>
 </html>

Rendered in a browser, the page looks like the figure below. XPath now extracts only the path portion of each URL, not the full address:

The browser, however, resolves such relative addresses correctly, and clicking a link takes you to the right page:

If the relative path starts with /, the browser prepends the scheme and host (and port, if any) of the current page to it.
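The rule for paths that start with / can be sketched in Python as follows. The variable names and the deeper page URL `http://127.0.0.1:8000/some/deep/page.html` are assumptions made up for illustration; `urlsplit` and `urljoin` are standard-library functions from `urllib.parse`.

```python
from urllib.parse import urljoin, urlsplit

page_url = "http://127.0.0.1:8000/"
href = "/book/1.html"

# Manual version: keep only scheme and host:port of the page, then append the path.
parts = urlsplit(page_url)
origin = f"{parts.scheme}://{parts.netloc}"
manual = origin + href

# Standard-library version of the same resolution.
joined = urljoin(page_url, href)

# The path of the current page does not matter for a root-relative link.
joined_deep = urljoin("http://127.0.0.1:8000/some/deep/page.html", href)

print(manual)
print(joined)
print(joined_deep)
```

All three values come out identical, because a path with a leading slash replaces the whole path of the current page.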

But what if the address of the current list page partially overlaps with the relative path of the link? As shown in the following figure:

The current page is served from the /book/ folder (for example, http://127.0.0.1:8000/book/index.html), and the relative address is /book/1.html. In this case, you can simplify further by dropping the leading /book/ from the relative path and changing the HTML to:

 <html>
 <head>
        <meta charset= "utf-8" >
 <title>Test relative path</title>
 </head>
 <body>
 <div>
 <h1>Book List</h1>
 <ul>
 <li> <a href="1.html" > First Book</a></li>
 <li> <a href="2.html" > The Second Book</a></li>
 <li> <a href="3.html" > The third book</a></li>
 <li> <a href="4.html" > The fourth book</a></li>
 <li> <a href="5.html" > Fifth Book</a></li>
 </ul>
 </div>
 </body>
 </html>

Rendered in a browser, the page looks like the figure below:

The browser still resolves these links correctly, as shown in the following figure:

The browser knows that when a relative path does not start with /, it must be resolved against the URL of the current page. Note, however, that only the part of the current URL up to and including the rightmost slash is kept; everything to the right of it is discarded. In other words, the relative path is joined to the folder that contains the current file, not to the file itself. As shown in the figure below:

If you can't remember these rules, you can let Python's built-in urllib.parse.urljoin do the joining, as shown below:
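A minimal sketch of both approaches, assuming the page URLs below: `join_manually` is a hypothetical helper that applies the "keep everything left of the rightmost slash" rule by hand (it ignores cases such as `..` segments or URLs without any path), while `urljoin` is the real standard-library function.

```python
from urllib.parse import urljoin


def join_manually(page_url, relative):
    """Naive rule: drop whatever follows the rightmost slash, then append."""
    folder = page_url.rsplit("/", 1)[0] + "/"
    return folder + relative


examples = [
    "http://127.0.0.1:8000/book/index.html",  # file inside /book/
    "http://127.0.0.1:8000/book/",            # the /book/ folder itself
    "http://127.0.0.1:8000/book",             # no trailing slash: "book" counts as a file
]

results = {}
for page_url in examples:
    results[page_url] = urljoin(page_url, "1.html")
    # For these simple inputs the naive rule and urljoin agree.
    assert join_manually(page_url, "1.html") == results[page_url]
    print(page_url, "->", results[page_url])
```

The third case is the easy one to get wrong: without a trailing slash, `book` is treated as the file name and discarded, so the link resolves to the site root rather than to the /book/ folder.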

At this point you may be thinking that I have padded out yet another article today. Is something this simple really worth writing about?

Then take a look at the following example:

The page URL is http://127.0.0.1:8000/book/index.html and the relative path is 1.html, so why does the browser resolve the link to www.kingname.info/1.html?

The key to this puzzle is the <base> tag in the source code:

 <html>
 <head>
        <meta charset= "utf-8" >
 <title>Test relative path</title>
        <base href= "http://www.kingname.info" >
 </head>
 <body>
 <div>
 <h1>Book List</h1>
 <ul>
 <li> <a href="1.html" > First Book</a></li>
 <li> <a href="2.html" > The Second Book</a></li>
 <li> <a href="3.html" > The third book</a></li>
 <li> <a href="4.html" > The fourth book</a></li>
 <li> <a href="5.html" > Fifth Book</a></li>
 </ul>
 </div>
 </body>
 </html>

If the <head> of the HTML contains a <base> tag, relative paths are resolved against the value of its href attribute instead of the URL of the current page.

If you don't know this, your crawler may build the wrong sub-page URLs. A website can also use this mechanism to set up a honeypot: the URL built from the <base> tag is the real sub-page address, while the URL built from the current page URL is the honeypot address. A crawler that requests the latter will either collect fake data or be blocked immediately.

For a detailed description of the <base> tag, please read: The Document Base URL element[1].
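One way a crawler might take this tag into account is sketched below, using only the standard library. `LinkCollector` and `resolve_links` are hypothetical names invented for this example, and the sketch simply honours the first <base> tag that carries an href; it is an illustration under those assumptions, not a complete implementation of the HTML specification.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the first <base href> and every <a href> in a document."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and self.base_href is None and attrs.get("href"):
            self.base_href = attrs["href"]
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def resolve_links(page_url, html):
    """Resolve links against <base href> if present, else against page_url."""
    collector = LinkCollector()
    collector.feed(html)
    base_url = page_url
    if collector.base_href:
        # The base href may itself be relative, so resolve it against the page.
        base_url = urljoin(page_url, collector.base_href)
    return [urljoin(base_url, href) for href in collector.hrefs]


PAGE_URL = "http://127.0.0.1:8000/book/index.html"
WITH_BASE = (
    '<html><head><base href="http://www.kingname.info"></head>'
    '<body><a href="1.html">First Book</a><a href="2.html">The Second Book</a></body></html>'
)
WITHOUT_BASE = '<html><body><a href="1.html">First Book</a></body></html>'

with_base_links = resolve_links(PAGE_URL, WITH_BASE)
without_base_links = resolve_links(PAGE_URL, WITHOUT_BASE)
print(with_base_links)
print(without_base_links)
```

With the <base> tag present the links point at www.kingname.info; without it they fall back to the /book/ folder of the current page.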

References

[1] The Document Base URL element: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base
