How to quickly troubleshoot data center networks

As a data center network grows, more network devices have to be added and cascaded in multiple layers. Today's data centers usually have a tree topology: a few devices with large forwarding capacity sit at the core, and several layers of devices hang below them (extra layers are needed because no single device has enough ports). Dozens or even hundreds of network devices end up cascaded together. When a fault occurs, quickly finding the faulty device is a problem that troubles many network operations and maintenance staff.

Network equipment in a data center is deployed redundantly. When a failure occurs, service can be restored as soon as the faulty device is found and isolated, and the root cause can then be investigated at leisure. Finding the one faulty device among hundreds, however, is not easy. Troubleshooting usually starts only after the application side reports a problem, and application staff often describe nothing more than a symptom such as "the application cannot be accessed". They do not say which addresses cannot reach which other addresses, and sometimes the information they give is wrong, which greatly delays fault location. Most of the time spent locating a problem goes into sorting out the symptoms. How, then, can a data center network be troubleshot quickly? This article offers an answer.

Starting the analysis from the symptoms reported by the application side is already too late, and it is easy to be misled. Application staff report only what they see, which is often a local symptom that does not reflect the state of the whole network. The network team therefore has to rely on itself: monitor the network well, discover problems through monitoring, quickly find the faulty device, and either isolate it or fix the fault.

Early network monitoring mainly watched device logs and port traffic. This information was often not enough, so problems could not be discovered in time. Many equipment vendors claim that their device logs are very complete, but in practice extreme cases or software bugs can still leave a fault with no log output at all. The fallback is then per-flow traffic statistics. Network staff first have to ask application staff about the symptoms and identify, on site, some IP addresses that are losing packets or are unreachable. They then configure traffic statistics for those flows on every device along the path of the faulty traffic and compare the counters to find the faulty device. Because the network is a tree with many devices at each layer, this is a lot of work. Moreover, not every device supports statistics on every traffic characteristic; where a device lacks support, the statistics are incomplete or inaccurate, which makes the faulty device harder to find. This is how network operations staff have coped over the years.
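The hop-by-hop counter comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `HopCounters` record, the device names, and the idea that per-flow receive and transmit counters have already been read from each device are all hypothetical, and real devices expose such counters in vendor-specific ways.

```python
from typing import NamedTuple, Optional, Sequence


class HopCounters(NamedTuple):
    """Per-flow packet counters read from one device (hypothetical record)."""
    device: str
    rx_packets: int  # packets of the flow received by the device
    tx_packets: int  # packets of the flow forwarded by the device


def first_lossy_hop(path: Sequence[HopCounters], tolerance: int = 0) -> Optional[str]:
    """Walk the path in forwarding order and return the first place packets vanish.

    Returns a device name when the device received more than it forwarded,
    a "link a->b" string when packets disappeared between two devices,
    or None when the counters are consistent along the whole path.
    """
    prev: Optional[HopCounters] = None
    for hop in path:
        # Loss on the link: the upstream device sent more than this one received.
        if prev is not None and prev.tx_packets - hop.rx_packets > tolerance:
            return f"link {prev.device}->{hop.device}"
        # Loss inside the device: it received more than it forwarded.
        if hop.rx_packets - hop.tx_packets > tolerance:
            return hop.device
        prev = hop
    return None


if __name__ == "__main__":
    path = [
        HopCounters("leaf1", 1000, 1000),
        HopCounters("spine1", 1000, 400),
        HopCounters("leaf2", 400, 400),
    ]
    print(first_lossy_hop(path))
```

A small `tolerance` can absorb counters that were read at slightly different moments; the manual effort the article describes lies in collecting these counters from every device, not in the comparison itself.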

The older troubleshooting methods work, but they are inefficient: locating a fault takes a long time, and the impact on business is large. Modern network monitoring is flow-oriented. It watches specific data flows in the network so that, as soon as a flow is interrupted, the fault location can be found quickly. Several emerging monitoring methods, also known as network visualization technologies, are worth describing here; they are among the most effective tools for rapid troubleshooting.

The first is INT (In-band Network Telemetry). INT monitors the network by collecting and reporting network state in the data plane. When a packet enters the first network device, a sampling rule configured on that device samples and mirrors the service-flow packet. The device adds an INT header to the mirrored packet and fills the INT data segment with the switch information to be collected. Every network device along the path processes the packet in the same way, until the last device, the one connected to the destination server, strips the INT header. Each device on the path sends the collected INT data to a remote monitoring server in gRPC messages for parsing and presentation. The INT data carries information such as forwarding delay and device congestion, which the monitoring server can display. When a packet is lost or cannot reach its destination, the monitoring server notices immediately and can narrow down the scope of the problem and the faulty device within seconds.
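How a monitoring server might turn per-hop INT reports into a fault location can be sketched as follows. This is a simplified model, not a real INT collector: the `IntReport` record, its fields, and the assumption that the server already knows the expected path of the flow are all hypothetical, and report transport and parsing are left out entirely.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class IntReport:
    """One per-hop telemetry report as seen by the server (hypothetical shape)."""
    device: str
    seq: int             # identifies the sampled packet the report belongs to
    hop_latency_us: int  # forwarding delay measured on this device


def suspect_segment(
    expected_path: Sequence[str], reports: Iterable[IntReport], seq: int
) -> Optional[Tuple[Optional[str], str]]:
    """Return (last device that reported, first device that did not).

    The sampled packet was lost somewhere between these two devices.
    Returns None when every device on the expected path reported.
    """
    seen = {r.device for r in reports if r.seq == seq}
    for i, device in enumerate(expected_path):
        if device not in seen:
            upstream = expected_path[i - 1] if i > 0 else None
            return (upstream, device)
    return None


def slow_hops(reports: Iterable[IntReport], threshold_us: int) -> list:
    """List the devices whose reported forwarding delay exceeds the threshold."""
    return sorted({r.device for r in reports if r.hop_latency_us > threshold_us})


if __name__ == "__main__":
    path = ["leaf1", "spine1", "leaf2"]
    reports = [IntReport("leaf1", 7, 12), IntReport("spine1", 7, 950)]
    print(suspect_segment(path, reports, seq=7))
    print(slow_hops(reports, threshold_us=500))
```

In this sketch the missing report from `leaf2` narrows the fault to the segment between `spine1` and `leaf2`, while the latency check points at congestion on `spine1` even before any packet is lost.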

The second is ERSPAN (Encapsulated Remote Switch Port Analyzer), a remote traffic-mirroring technology that works across Layer 3 IP networks. ERSPAN copies the packets seen on a source port, encapsulates the copies in GRE (Generic Routing Encapsulation), and sends them to a destination server for analysis. Because the encapsulated packets are ordinary IP packets, they can be delivered to any location reachable by IP routing, so the physical location of the collection server is not restricted. Key traffic from the whole network can thus be forwarded to the monitoring server through ERSPAN, where it is clear at a glance on which part of the network packets are being dropped.
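To make the GRE encapsulation concrete, the sketch below unpacks a mirrored frame on the collector side. It assumes the ERSPAN Type II layout as commonly documented (GRE protocol type 0x88BE, an optional GRE sequence number, then an 8-byte ERSPAN header holding version, VLAN, and a 10-bit session ID); the function name and the returned dictionary are illustrative, and the input is the GRE header onward, with the outer IP header already removed.

```python
import struct

GRE_PROTO_ERSPAN_II = 0x88BE  # GRE protocol type used by ERSPAN Type II (assumed layout)
GRE_FLAG_CHECKSUM = 0x8000
GRE_FLAG_KEY = 0x2000
GRE_FLAG_SEQUENCE = 0x1000


def parse_erspan_ii(gre_packet: bytes) -> dict:
    """Split a GRE/ERSPAN Type II packet into its header fields and mirrored frame."""
    flags, proto = struct.unpack_from("!HH", gre_packet, 0)
    if proto != GRE_PROTO_ERSPAN_II:
        raise ValueError(f"not ERSPAN Type II: GRE protocol 0x{proto:04x}")

    offset = 4
    if flags & GRE_FLAG_CHECKSUM:
        offset += 4  # checksum + reserved
    if flags & GRE_FLAG_KEY:
        offset += 4  # key
    sequence = None
    if flags & GRE_FLAG_SEQUENCE:
        (sequence,) = struct.unpack_from("!I", gre_packet, offset)
        offset += 4

    # ERSPAN header: ver(4) vlan(12) | cos(3) en(2) t(1) session(10) | rsvd(12) index(20)
    word1, word2, _word3 = struct.unpack_from("!HHI", gre_packet, offset)
    offset += 8

    return {
        "sequence": sequence,
        "version": word1 >> 12,
        "vlan": word1 & 0x0FFF,
        "session_id": word2 & 0x03FF,
        "mirrored_frame": gre_packet[offset:],
    }


if __name__ == "__main__":
    gre = struct.pack("!HHI", GRE_FLAG_SEQUENCE, GRE_PROTO_ERSPAN_II, 42)
    erspan = struct.pack("!HHI", (1 << 12) | 100, 5, 0)
    print(parse_erspan_ii(gre + erspan + b"original Ethernet frame"))
```

The session ID tells the collector which mirror session, and therefore which source port, a copy came from, and gaps in the GRE sequence numbers indicate mirrored packets that were lost on the way to the collector.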

The third is sFlow and NetStream. Both are traffic sampling and statistics technologies. NetStream collects more complete data, but it requires dedicated hardware. Once sFlow and NetStream are deployed in the network, the monitoring data is exported to a monitoring server (both normally send their records as UDP datagrams), which computes and sorts the data and displays the results graphically. A problem in any part of the network then shows up on the monitoring server right away. sFlow and NetStream collect only the main fields of the packet header, not the full packet contents, which is quite different from INT and ERSPAN. They are sufficient for most network troubleshooting, unless the application traffic has unusual characteristics that sFlow and NetStream cannot capture. In that case, INT and ERSPAN are the only options.

There is no harm in deploying all three monitoring solutions in the same network. When a fault occurs, the problem can then be analyzed from data collected from several angles. Another important point is to send the collected data to the monitoring server over the management network wherever possible. Otherwise, a problem in the data network may prevent the monitoring data from reaching the monitoring server. In most cases a data network failure does not affect the management network, and all devices remain reachable through it. If a device cannot be reached through the management network during a failure, that device is very likely the fault point.
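Because sFlow and NetStream work from sampled packet headers, the collector has to scale what it sees back up to an estimate of the real traffic. The sketch below shows that scaling step for a fixed 1-in-N sampling rate. The flow key (a tuple of header fields), the function names, and the comparison between two devices are illustrative assumptions, not part of either protocol.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List


def estimate_flow_packets(samples: Iterable[Hashable], sampling_rate: int) -> Dict[Hashable, int]:
    """Estimate per-flow packet counts from 1-in-N sampled headers.

    Each element of `samples` is a flow key built from header fields,
    for example (src_ip, dst_ip). Every sample stands for roughly
    `sampling_rate` real packets.
    """
    if sampling_rate < 1:
        raise ValueError("sampling_rate must be at least 1")
    counts = Counter(samples)
    return {flow: n * sampling_rate for flow, n in counts.items()}


def flows_missing_downstream(
    upstream: Dict[Hashable, int], downstream: Dict[Hashable, int]
) -> List[Hashable]:
    """Flows seen on the upstream device but not at all on the downstream one."""
    return [flow for flow in upstream if flow not in downstream]


if __name__ == "__main__":
    spine = estimate_flow_packets(
        [("10.0.0.1", "10.0.1.9")] * 3 + [("10.0.0.2", "10.0.1.9")], sampling_rate=1000
    )
    leaf = estimate_flow_packets([("10.0.0.1", "10.0.1.9")] * 3, sampling_rate=1000)
    print(spine)
    print(flows_missing_downstream(spine, leaf))
```

Sampling makes the estimates statistical, so a flow with very few packets can be missed by chance; that is one reason a flow with unusual characteristics may need INT or ERSPAN instead.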

With these monitoring methods, detecting a fault as soon as it occurs is not difficult, and the process can be fully automated: when a fault is detected, the monitoring server automatically sends a command to isolate the faulty device, and service recovers automatically. The fault location is thus found, the faulty device isolated, and business restored before the application side even reports a problem. This greatly shortens fault analysis time and limits the impact on business; the business side may not notice the fault at all. How well INT and ERSPAN work in real deployments is still unknown, since both have only recently come into discussion and need to be proven in practice. sFlow and NetStream are relatively mature but are not yet widely used for network troubleshooting, and deserve more promotion for that purpose. With these monitoring technologies, network faults can be resolved quickly, which matters greatly for data center operations and considerably improves operations and maintenance efficiency.
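The automated isolation step could look roughly like the sketch below. Everything here is hypothetical: the article does not say how the monitoring server decides or which command it sends, so the sketch assumes each monitoring source names one suspect device and that `isolate` is a caller-supplied function wrapping whatever mechanism the network actually uses. Requiring agreement between at least two sources is a design choice added here to avoid isolating a healthy device on a single false alarm.

```python
from collections import Counter
from typing import Callable, Mapping, Optional


def pick_device_to_isolate(
    suspects_by_source: Mapping[str, str], min_sources: int = 2
) -> Optional[str]:
    """Return the device named by at least `min_sources` monitoring sources.

    `suspects_by_source` maps a source name (for example "int", "erspan",
    "sflow") to the device that source currently suspects.
    """
    votes = Counter(suspects_by_source.values())
    if not votes:
        return None
    device, count = votes.most_common(1)[0]
    return device if count >= min_sources else None


def auto_isolate(
    suspects_by_source: Mapping[str, str],
    isolate: Callable[[str], None],
    min_sources: int = 2,
) -> Optional[str]:
    """Isolate the agreed suspect, if any, and return its name."""
    device = pick_device_to_isolate(suspects_by_source, min_sources)
    if device is not None:
        isolate(device)  # hypothetical hook: shut ports, raise routing cost, etc.
    return device


if __name__ == "__main__":
    suspects = {"int": "spine1", "erspan": "spine1", "sflow": "leaf2"}
    auto_isolate(suspects, isolate=lambda d: print(f"isolating {d}"))
```

In a real deployment the `isolate` hook would also need to confirm that the redundant peer of the suspect device is healthy before taking the device out of service.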
