Preface

The "daily bug troubleshooting" series is about troubleshooting simple bugs. In it, the author introduces simple tricks for tracking down bugs while accumulating reference material.

Bug scene

I recently hit a problem where the number of connections on a machine, after climbing to about 45,000, suddenly dropped to a few hundred. In the application this showed up as a large number of connection errors, and the system stopped responding.

Ideas

My first thought was to suspect the code. A quick look showed that the application uses a mature framework rather than hand-rolled connection handling, so the code itself was unlikely to be the culprit.

Monitoring information

With that in mind, I turned to the monitoring data:

- CPU: consumption is high, close to 70%. But CPU starvation generally only slows responses down; it does not match the symptom of connections being dropped outright.
- Bandwidth: utilization is around 50%, which is not high.
- Memory: a lot is in use, with RSS at 26G. Against 128G of RAM, though, that consumption is clearly not a bottleneck.

So none of the three obvious resources had hit a limit. Still, I suspected from the start that memory use might have triggered a more specific limit: only when memory cannot be allocated would a TCP connection error out and then be dropped.
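The three checks above came from the monitoring system, but they can also be spot-checked from the command line. A minimal sketch with standard Linux tools (the specific commands are my choice, not the author's):

```shell
# Quick resource triage on a Linux host: the same three signals the
# article checks -- CPU, bandwidth, and memory.
top -b -n 1 | head -n 5    # CPU utilization snapshot (the "%Cpu(s)" line)
cat /proc/net/dev          # raw per-interface byte counters (sample twice to get a rate)
free -g                    # used vs. total memory in GiB
ps -o rss= -p $$           # RSS of one process in KiB ($$ = this shell, as a placeholder)
```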
TCP monitoring information

When conventional monitoring is no longer enough to analyze the problem, I reach straight for my most effective statistical command for TCP issues. Going carefully through the TCP- and TCP-memory-related lines of its output, I found something very unusual: exactly the memory limit I had guessed at. TCP memory was exhausted, so allocations failed when reading or writing data, and the TCP connections themselves were dropped.

Modify kernel parameters

Because I have read the Linux TCP source code and all of its tunable kernel parameters in detail, I already had some idea about the TCP memory limit. With GPT, knowing the general direction is enough: asking it directly pointed me to the tcp_mem parameter. Its three values are thresholds at which TCP switches between different memory-usage strategies; the unit is pages, i.e. 4 KB each. You can ask GPT for the detailed semantics, so I won't repeat them here. The core point is that once the total memory consumed by TCP exceeds the third value, 3144050 pages (about 12G, or 9.35% of the 128G of RAM), TCP starts dropping connections because it can no longer allocate memory. And the application really does consume a lot of memory per TCP connection, since each request runs to several MB.

Once the cause is known, the fix is simple: raise tcp_mem.

The system remains stable after adjustment

After the kernel adjustment, the number of connections exceeded 50,000 and stayed stable. Looking again at the TCP memory consumption, the number of pages in use ("mem") during normal operation was 4322151, well above the previous limit of 3144050, which further confirms the diagnosis.
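The article does not reproduce the exact commands or the new values it set, so here is a sketch of how the inspection and adjustment typically look; the raised values in the sysctl line are illustrative assumptions, not the author's numbers:

```shell
# Current tcp_mem thresholds (low / pressure / high), in 4 KiB pages:
cat /proc/sys/net/ipv4/tcp_mem
# Current TCP memory in use: the "mem N" field on the "TCP:" line of
# /proc/net/sockstat is also counted in pages.
cat /proc/net/sockstat

# Converting the article's hard limit of 3144050 pages to bytes:
awk 'BEGIN { printf "%.1f GiB\n", 3144050 * 4096 / 1024^3 }'   # ~12.0 GiB

# Raising the limits requires root; the values below are illustrative only:
# sysctl -w net.ipv4.tcp_mem="3144050 4192066 6288100"
# To persist across reboots, put the same line in /etc/sysctl.conf and run sysctl -p.
```

After the change, watching the "mem" field of /proc/net/sockstat confirms whether TCP is staying under the new third threshold.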
Corresponding kernel stack

For the record, the corresponding Linux kernel stack shows that when an allocation would exceed the relevant memory limit, the kernel drops the TCP connection directly.

Summary

Once the bug scene was clear, it took me about 20 minutes to pinpoint the TCP memory bottleneck, and the fix was found very quickly with GPT's help. GPT can greatly speed up this kind of search; I personally feel it can replace search engines to a large extent. However, the prompts you feed it still have to be constructed from the bug scene and from experience. It cannot replace your thinking, but it can greatly accelerate the retrieval of information.