The difficulty of operation and maintenance has reached a new level - it does not exist!

What do data centers fear most?

Power outages, network damage...

What do data center operators fear most?

Downtime, unusual failures, upgrades and expansions...

As data centers continue to grow in scale and new technologies keep arriving, the networks that carry data center services have become extremely complex. To keep pace with those services, data center networks are also constantly being updated and changed, which makes operation and maintenance much harder. Downtime incidents are all but inevitable; they not only add to the workload of operation and maintenance personnel but, more importantly, cause huge losses for the data center. Even world-renowned Internet giants regularly receive this kind of "treatment".

Internet giants keep going down, and operation and maintenance has become a problem

In the early morning of March 3, Alibaba Cloud suffered a system outage that left the websites and apps of companies using Alibaba Cloud services unable to function normally. Large numbers of programmers and operation and maintenance personnel had to get out of bed and go to work. According to Shen Jian, a senior architect at 58, the incident lasted about 3 hours and was followed by about 2 hours of observation.

Starting at 3:43 a.m. on May 3, Microsoft Azure suffered a large-scale global outage that lasted nearly 2 hours; service was not fully restored until 5:30 a.m. As a result, major Microsoft services including Microsoft 365, Dynamics, and DevOps all ran into problems.

Starting at 2:58 a.m. on June 3, Google suffered a massive worldwide outage that affected many services built on Google Cloud, including Gmail, YouTube, and Google Drive. Users saw a variety of error messages and were unable to read email, upload YouTube videos, and so on.

On June 25, Amazon confirmed on its official website that its cloud computing services had gone down, affecting network connectivity for some users and multiple AWS regions. The failure was located in AWS US East 1; a total of 33 services were affected, 9 of which were completely out of service.

Frequent downtime incidents make operation and maintenance more difficult

Time and again, downtime incidents have shown how important data center operation and maintenance is, yet they seem unavoidable. Today, as technology advances and the era of the Internet of Everything arrives, data centers have become critical infrastructure. Although data centers have been developing in China for only a little more than ten years, they have evolved from ordinary computer rooms containing just UPS units, air conditioners, and IT equipment into facilities with tens of thousands of cabinets that deliver a full range of services such as the Internet, big data, AI, and cloud services. New technologies such as natural cooling, fan walls, underwater data centers, and liquid-cooled servers are constantly being developed and applied. As a result, operation and maintenance management faces greater challenges, and the difficulty of operation and maintenance has also "reached a new level."

First, hyperscale data centers have brought changes in staffing, organization, and efficiency. In the past, a manual inspection of a data center of up to 10,000 square meters took 2-4 hours. Now, with sites covering hundreds of thousands of square meters, more operation and maintenance personnel are needed, spread across different areas of responsibility, which increases the difficulty and cost of management. Second, voltage levels have risen, and so have safety risks. Operation and maintenance personnel used to work only with low voltage, but power supply equipment, generators, and chillers now all run on high voltage, so maintenance safety requirements are stricter. Third, concentrating capacity at scale also concentrates risk and magnifies the impact of an accident. The outages described above, for example, interrupted services and applications around the world and caused heavy losses, so the pressure on operation and maintenance management has grown.

Reduce human error and improve professional operation and maintenance skills

According to survey data, 70% of data center downtime incidents are caused by human error. As data centers continue to grow, operation and maintenance personnel therefore need to improve their skills and professional level so that they can cope with unexpected events:

  • Establish a complete skill evaluation system that assesses the skills and abilities of operation and maintenance personnel from multiple angles, helping them improve their skills and encouraging active learning and self-improvement.
  • Share operation and maintenance experience online: build an experience database and an online platform for sharing and exchanging experience, and provide channels for online practice and study of operation and maintenance knowledge.
  • Simulate the practical environment online: provide a simulated operation and maintenance practice environment that isolates operational risk and helps staff quickly raise their practical skill level.
  • Assess theoretical skills online: draw on a large question bank covering IT cloud platform components, with regular assessments and randomly selected questions, to assess theoretical knowledge automatically and in real time.
  • Assess practical skills online: build a lightweight online operation and maintenance and programming environment to assess operation and maintenance skills and R&D skills automatically and in real time.
  • Improve efficiency through automated assessment: assess theoretical and practical skills online in a scientific, automated way, making assessment more efficient and ensuring that results reflect ability objectively and fairly.

To make up for the shortcomings of manual operation and maintenance, intelligent operation and maintenance has emerged

Today, the digital age has arrived. The scale and capacity of data centers are growing exponentially, and the complexity and difficulty of operation and maintenance management are growing with them. Operation and maintenance has moved from scripts to tools to platforms, and manual effort has reached its limit, so intelligent operation and maintenance has emerged. More and more data center companies such as Tencent, Huawei, and JD.com are increasing their R&D investment and joining the wave of intelligent operation and maintenance. They combine artificial intelligence with operation and maintenance, applying machine learning to existing operation and maintenance data (logs, monitoring information, application information, etc.) to improve efficiency and gradually replace manual work. I believe that data centers will become more and more intelligent in the future.
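
To give a rough idea of what learning from monitoring data can look like, here is a minimal sketch in Python. It is not how Tencent, Huawei, or JD.com implement intelligent operation and maintenance; it uses a simple rolling statistical baseline as a stand-in for a trained model, and the function name `find_anomalies`, the window size, the threshold, and the sample readings are all hypothetical values chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate sharply from the recent baseline.

    A sample is flagged when it lies more than `threshold` standard
    deviations from the mean of the preceding `window` accepted samples.
    """
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(samples):
        if len(history) == window:
            baseline = mean(history)
            spread = pstdev(history)
            # A perfectly flat window has zero spread; skip the check
            # there to avoid dividing by zero (a known limitation).
            if spread > 0 and abs(value - baseline) / spread > threshold:
                anomalies.append(index)
                continue  # keep the outlier out of the baseline
        history.append(value)
    return anomalies


# Hypothetical CPU utilization readings (percent), one per minute.
cpu_percent = [41, 43, 42, 44, 42, 43, 97, 42, 41, 44]
print(find_anomalies(cpu_percent))  # prints [6]
```

The spike to 97 is flagged automatically, with no one watching a dashboard. A production system would replace the rolling baseline with models trained on historical logs and metrics, but the division of labor is the same: the machine watches the data and people handle only the exceptions.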
