Preface

The "daily bug troubleshooting" series is about troubleshooting simple bugs. In it, the author introduces simple tricks for tracking down bugs while accumulating reference material.

Bug scene

I recently hit a problem where the number of connections on a machine, after climbing to about 45,000, suddenly dropped to a few hundred. In the application this showed up as a large number of connection errors, and the system stopped responding.

Ideas

My first thought was to suspect the code. A quick look showed that the application uses a mature framework rather than hand-rolled connection handling, so the code itself was unlikely to be the culprit.

Monitoring information

With that in mind, I turned to the monitoring data:

- CPU: consumption is high, close to 70%. But CPU starvation generally only slows responses down; it does not match the symptom of connections being dropped outright.
- Bandwidth: utilization is around 50%, which is not high.
- Memory: a lot is in use, with RSS at 26G. Against 128G of RAM, though, that consumption is clearly not a bottleneck.

So none of the three obvious resources had hit a limit. Still, I suspected from the start that memory use might have triggered a more specific limit: only when memory cannot be allocated would a TCP connection error out and then be dropped.
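The three checks above came from the monitoring system, but they can also be spot-checked from the command line. A minimal sketch with standard Linux tools (the specific commands are my choice, not the author's):

```shell
# Quick resource triage on a Linux host: the same three signals the
# article checks -- CPU, bandwidth, and memory.
top -b -n 1 | head -n 5    # CPU utilization snapshot (the "%Cpu(s)" line)
cat /proc/net/dev          # raw per-interface byte counters (sample twice to get a rate)
free -g                    # used vs. total memory in GiB
ps -o rss= -p $$           # RSS of one process in KiB ($$ = this shell, as a placeholder)
```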
TCP monitoring information

When conventional monitoring is no longer enough to analyze the problem, I reach straight for my most effective statistical command for TCP issues. Going carefully through the TCP- and TCP-memory-related lines of its output, I found something very unusual: exactly the memory limit I had guessed at. TCP memory was exhausted, so allocations failed when reading or writing data, and the TCP connections themselves were dropped.

Modify kernel parameters

Because I have read the Linux TCP source code and all of its tunable kernel parameters in detail, I already had some idea about the TCP memory limit. With GPT, knowing the general direction is enough: asking it directly pointed me to the tcp_mem parameter. Its three values are thresholds at which TCP switches between different memory-usage strategies; the unit is pages, i.e. 4 KB each. You can ask GPT for the detailed semantics, so I won't repeat them here. The core point is that once the total memory consumed by TCP exceeds the third value, 3144050 pages (about 12G, or 9.35% of the 128G of RAM), TCP starts dropping connections because it can no longer allocate memory. And the application really does consume a lot of memory per TCP connection, since each request runs to several MB.

Once the cause is known, the fix is simple: raise tcp_mem.

The system remains stable after adjustment

After the kernel adjustment, the number of connections exceeded 50,000 and stayed stable. Looking again at the TCP memory consumption, the number of pages in use ("mem") during normal operation was 4322151, well above the previous limit of 3144050, which further confirms the diagnosis.
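The article does not reproduce the exact commands or the new values it set, so here is a sketch of how the inspection and adjustment typically look; the raised values in the sysctl line are illustrative assumptions, not the author's numbers:

```shell
# Current tcp_mem thresholds (low / pressure / high), in 4 KiB pages:
cat /proc/sys/net/ipv4/tcp_mem
# Current TCP memory in use: the "mem N" field on the "TCP:" line of
# /proc/net/sockstat is also counted in pages.
cat /proc/net/sockstat

# Converting the article's hard limit of 3144050 pages to bytes:
awk 'BEGIN { printf "%.1f GiB\n", 3144050 * 4096 / 1024^3 }'   # ~12.0 GiB

# Raising the limits requires root; the values below are illustrative only:
# sysctl -w net.ipv4.tcp_mem="3144050 4192066 6288100"
# To persist across reboots, put the same line in /etc/sysctl.conf and run sysctl -p.
```

After the change, watching the "mem" field of /proc/net/sockstat confirms whether TCP is staying under the new third threshold.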
Corresponding kernel stack

For the record, the corresponding Linux kernel stack shows that when an allocation would exceed the relevant memory limit, the kernel drops the TCP connection directly.

Summary

Once the bug scene was clear, it took me about 20 minutes to pinpoint the TCP memory bottleneck, and the fix was found very quickly with GPT's help. GPT can greatly speed up this kind of search; I personally feel it can replace search engines to a large extent. However, the prompts you feed it still have to be constructed from the bug scene and from experience. It cannot replace your thinking, but it can greatly accelerate the retrieval of information.