When writing a crawler, we often need to parse the list pages of a website. Take a simple list page as an example.
In this case, getting the URL of each item is simple: a single XPath expression is enough. If you look closely, you will notice that every URL starts with http://127.0.0.1:8000, which is also the address of the current list page. So, for simplicity, the links in the <a> tags can be written as relative paths.
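As a rough illustration of what the extraction returns once the links are relative, here is a minimal sketch. The markup and the href values are hypothetical, and it uses the small XPath subset in the standard-library xml.etree.ElementTree rather than a full XPath engine, which only works because the snippet is well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical list-page markup, written here only for illustration.
LIST_PAGE = """
<html>
  <body>
    <ul>
      <li><a href="/book/1.html">Book 1</a></li>
      <li><a href="/book/2.html">Book 2</a></li>
      <li><a href="/book/3.html">Book 3</a></li>
    </ul>
  </body>
</html>
"""

# ElementTree needs well-formed markup and supports only a subset of XPath,
# which is enough to select every <a> inside an <li>.
root = ET.fromstring(LIST_PAGE.strip())
hrefs = [a.get("href") for a in root.findall(".//li/a")]

print(hrefs)  # ['/book/1.html', '/book/2.html', '/book/3.html']
```

The extracted values contain only the path, with no scheme or host.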
With relative paths, XPath can extract only half of each URL. The browser, however, recognizes such relative addresses correctly, and clicking a link jumps to the right address: if a relative path starts with /, the site's origin (scheme, host, and port) is prepended to it. But what if the address of the current list page partially overlaps with the relative path of the link? Suppose the address of the current page is http://127.0.0.1:8000/book and the relative address is /book/1.html. In this case, the link can be simplified further by dropping the slash in front of the relative path in the HTML.
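As a quick check that both forms point to the same address, here is a minimal sketch using urllib.parse.urljoin from the Python standard library. The no-slash value book/1.html is an assumption made for illustration.

```python
from urllib.parse import urljoin

page_url = "http://127.0.0.1:8000/book"

# A leading slash means "start from the root of the site".
with_slash = urljoin(page_url, "/book/1.html")

# No leading slash means "start from the folder of the current page".
# The href value here is an assumed example.
without_slash = urljoin(page_url, "book/1.html")

print(with_slash)     # http://127.0.0.1:8000/book/1.html
print(without_slash)  # http://127.0.0.1:8000/book/1.html
```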
Without the leading slash, the browser still resolves the link correctly. It knows that when a relative path does not start with /, the path has to be combined with the URL of the current page. Note, however, that only the part of the current URL up to and including its rightmost slash is kept; everything to the right of that slash is discarded. This is equivalent to joining the relative path onto the folder that contains the current file. If you find the rules hard to remember, you can let urllib.parse.urljoin from the Python standard library do the joining.

At this point you may be thinking that I am just padding out another article, and wondering whether something this simple deserves a post of its own. So consider the following example: the address of the current page is http://127.0.0.1:8000/book/index.html and the relative path is 1.html, so why does the browser resolve the link to www.kingname.info/1.html? The key lies in the <base> tag in the head of the page's source code.
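A page of this kind, and one way a crawler might handle it, can be sketched as follows. The page markup, the https scheme of the base address, and the names LinkCollector and resolve_links are all assumptions made for illustration. The sketch reads the href of the <base> tag with the standard-library html.parser and joins relative links against it when it is present.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the first <base href> and every <a href> in a page."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if href is None:
            return
        if tag == "base" and self.base_href is None:
            self.base_href = href
        elif tag == "a":
            self.hrefs.append(href)


def resolve_links(page_url, html):
    """Return absolute link URLs, using <base href> when the page has one."""
    collector = LinkCollector()
    collector.feed(html)
    # The <base> href may itself be relative, so resolve it against the page URL first.
    if collector.base_href:
        base_url = urljoin(page_url, collector.base_href)
    else:
        base_url = page_url
    return [urljoin(base_url, href) for href in collector.hrefs]


# Hypothetical page source; the base address is an assumed value.
PAGE_URL = "http://127.0.0.1:8000/book/index.html"
PAGE_HTML = """
<html>
  <head><base href="https://www.kingname.info/"></head>
  <body><a href="1.html">Book 1</a></body>
</html>
"""

print(resolve_links(PAGE_URL, PAGE_HTML))  # ['https://www.kingname.info/1.html']
print(urljoin(PAGE_URL, "1.html"))         # http://127.0.0.1:8000/book/1.html
```

The second print shows what a crawler would build if it joined against the current page URL alone.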
If the head of the HTML contains a <base> tag, the value of its href attribute, rather than the URL of the current page, is used to turn relative paths into absolute ones. If you are not aware of this, your crawler may build the wrong URLs for sub-pages.

A website can also use this mechanism to set up a honeypot: the URL built from the <base> tag is the real sub-page address, while the URL built from the current page's URL is the honeypot address. A crawler that visits the honeypot address will collect fake data or be blocked immediately.

For a detailed description of the <base> tag, see The Document Base URL element [1].

References
[1] The Document Base URL element: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base