Online troubleshooting guide: The ultimate way to bring your server back to life

Have you ever encountered these headache-inducing scenarios?

  • A 3 a.m. call from operations wakes you up: "The online service is responding extremely slowly!"
  • A big promotional event suddenly crashes: "CPU at 100%, the server can't hold up!"
  • A user complaint arrives: "Why does the system get slower and slower over time?"

As a server-side engineer, these are challenges we must face and solve. But don't worry!

Through this practical guide, you will learn:

  • Practical tips for quickly locating performance bottlenecks
  • Practical experience in dealing with high-concurrency scenarios
  • Essential tools for system tuning and troubleshooting
  • Solutions to common problems such as memory leaks

Let's start practicing and turn each skill into your "killer skill"!

Log analysis tips: Safety first!

Hello everyone! Today's super important tip: take your log's "temperature" before you touch it!

Let's first take a look at how heavy the log file is:

 $ ls -lh /var/log/nginx/access.log
 #   -l                          long format: show detailed information 📋
 #   -h                          human readable: make file sizes easier to read 👀
 #   /var/log/nginx/access.log   path of the file to inspect 📁
 -rw-r--r-- 1 nginx nginx 6.5M Mar 20 15:00 access.log   # Wow, this log is a heavyweight!

Output explanation:

  • -rw-r--r-- : File permissions (the owner can read and write; group and others can only read)
  • nginx nginx: File owner and group
  • 6.5M: File size (displayed in human-readable format)
  • Mar 20 15:00: Last modified time

Why do this? Because...

  • Running cat on a large file is like eating an elephant in one go.
  • The server will be left exhausted and gasping for breath.
  • It may slow the website down for other users.
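The "check the size first" habit can be wrapped in a small helper. This is only a sketch: the function name `peek_log`, the 1 MiB default limit, and the 20-line fallback are assumptions chosen for illustration, not values from any tool.

```shell
# peek_log FILE [MAX_BYTES]
# Print the whole file only when it is small; otherwise show just the tail.
# The name, the 1 MiB default and the 20-line fallback are illustrative.
peek_log() {
    file=$1
    max_bytes=${2:-1048576}
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    size=$(wc -c < "$file")
    size=$((size))    # normalize: some wc implementations pad with spaces
    if [ "$size" -le "$max_bytes" ]; then
        cat "$file"
    else
        echo "[$file is $size bytes; showing the last 20 lines only]" >&2
        tail -n 20 "$file"
    fi
}
```

For example, `peek_log /var/log/nginx/access.log` prints the file only if it is under the limit, and falls back to the last 20 lines otherwise.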

If you find that the log file is too large, we have a little trick:

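 # Move the elephant somewhere else and eat it slowly 🚚
 $ scp /var/log/nginx/access.log test-server:/tmp/
 #   scp                         secure copy: copies files over SSH 🔒
 #   /var/log/nginx/access.log   source: path of the log file to copy 📄
 #   test-server                 destination server: a hostname or an IP address 🖥️
 #   /tmp/                       destination path: the file is copied here 📁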

Common scp options:

  • -r: Copy an entire directory and its contents
  • -P: Specify the SSH port number (uppercase P)
  • -i: Use the specified private key file
  • -v: Display the transfer process in detail
  • -p: Preserve the modification time, access time, and permissions of the original file

Example of use:

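 # Copy over a specific port
 $ scp -P 2222 access.log test-server:/tmp/            # use port 2222 🔌
 # Use a private key file
 $ scp -i ~/.ssh/id_rsa access.log test-server:/tmp/   # specify the private key 🔑
 # Copy an entire directory
 $ scp -r /var/log/nginx/ test-server:/backup/         # copy the whole directory 📂
 # Preserve file attributes
 $ scp -p access.log test-server:/tmp/                 # keep times and permissions ⏰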

Tips: Things to note when using scp

  • Make sure the target server has enough disk space
  • Check if the network connection is stable
  • Pay attention to file permission settings
  • When transferring large files, it is recommended to use the -C parameter to compress the transfer.
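The disk-space tip can be checked before the copy starts. The sketch below assumes a hypothetical helper named `has_room` that reads `df -Pk` output on standard input, so the same function works whether `df` runs locally or on the target through ssh.

```shell
# has_room FILE
# Reads `df -Pk` output on stdin and succeeds only if the reported
# filesystem has more free space than FILE needs.
# The function name is hypothetical; df -P prints one POSIX-format line
# per filesystem, with available 1K blocks in column 4.
has_room() {
    file=$1
    need_kb=$(( ($(wc -c < "$file") + 1023) / 1024 ))
    avail_kb=$(awk 'NR == 2 {print $4}')
    [ "${avail_kb:-0}" -gt "$need_kb" ]
}

# Possible usage (test-server and /tmp/ are the names used in this guide):
#   ssh test-server df -Pk /tmp | has_room access.log \
#       && scp -C access.log test-server:/tmp/
```

Reading from standard input keeps the check separate from how the numbers are obtained, which also makes it easy to try out with canned `df` output.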

Want to sneak a peek at the last few lines of a log? Try this:

 $ tail -n 5 access.log
 #   tail         show the end of a file 📌
 #   -n 5         number of lines to show (5 here) 📏
 #   access.log   the log file to view 📄
 192.168.1.100 GET /api/users 200    # Success! 🎉
 192.168.1.101 POST /api/login 401   # Oops, the login failed 😅
 # ... more access records ...

Detailed explanation of tail command parameters:

  • -n: Specify the number of lines to display
  • -f: Real-time monitoring of file changes (follow mode)
  • -F: Like -f, but follows the file by name and retries if it is deleted or rotated
  • -q: Never display the file name header
  • -v: Always display the file name header

Example of use:

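 # Show the last 10 lines (the default)
 $ tail access.log                # view the 10 most recent records 📜
 # Follow log updates in real time
 $ tail -f access.log             # watch live, like a movie 🎬
 # Follow several files at once
 $ tail -f access.log error.log   # monitor multiple files together 👥
 # Show the last 100 bytes of the file
 $ tail -c 100 access.log         # view by bytes 📊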

Tips for using tail:

  • When monitoring with -f, press Ctrl+C to exit
  • Pipe the output through grep to filter for specific content
  • You can use -n +1 to display the file from the beginning
  • For large files, prefer tail over cat
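Combining two of these tips, tail and grep can be chained so that only the end of a large file is ever searched. This is a minimal sketch; the helper name `recent_matches` and the 1000-line default are assumptions.

```shell
# recent_matches FILE PATTERN [LINES]
# Search only the last LINES lines of FILE (default 1000) for PATTERN,
# instead of scanning the whole file.
# The function name and the default are illustrative.
recent_matches() {
    file=$1
    pattern=$2
    lines=${3:-1000}
    tail -n "$lines" "$file" | grep -e "$pattern"
}

# Example: failed logins among the most recent 500 requests
#   recent_matches access.log ' 401$' 500
```

The pattern `' 401$'` assumes the simplified log format shown earlier, where the status code is the last field on the line.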

Remember: be gentle with your logs, and your logs will be gentle with you!

Let's see who the most active visitors are:

 $ awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -3
     156 192.168.1.100   # A truly loyal user! 🥇
      89 192.168.1.101   # Second place is not bad either! 🥈
      67 192.168.1.102   # Bronze medalist, keep it up! 🥉
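The same counting idiom works for other columns, such as the HTTP status code. The sketch below assumes the simplified four-column format used in this guide's sample lines (IP, method, path, status). In nginx's default "combined" format the status is field 9, so the column is left as a parameter. The name `status_counts` is hypothetical.

```shell
# status_counts FILE [COLUMN]
# Count how often each value appears in COLUMN (default 4, the status code
# in this guide's simplified sample format), most frequent first.
# The function name is illustrative.
status_counts() {
    awk -v col="${2:-4}" '
        { count[$col]++ }
        END { for (code in count) print count[code], code }
    ' "$1" | sort -nr
}

# Example: status_counts access.log      # simplified format
#          status_counts access.log 9    # nginx "combined" format
```

Counting inside awk means the file is read once and only the small summary is sorted.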

Let's take a peek at the server's little secrets:

 $ top
   PID USER   PR  NI    VIRT    RES   SHR S %CPU %MEM  TIME+ COMMAND
  1234 root   20   0  985148  65404 31220 S 25.0  3.2  12:34 nginx 🏃
  5678 mysql  20   0 1258216 326892  8216 S 15.0  8.1   5:67 mysql 🎲
  9012 redis  20   0  162672  29684  4508 S  5.0  0.7   2:45 redis 🔄
 # Look! Every process is busy at work! 💪

Isn’t it more interesting to look at the output this way? There is a little story behind each number waiting for you to discover!

Check the server status:

 $ top

It's like taking your server's temperature:

  • PID: The ID number of each process
  • %CPU: The "body temperature" of the process, its share of CPU time
  • %MEM: The "appetite" of the process, its share of physical memory
  • COMMAND: The name of the process
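Outside the interactive top screen, the same columns can be filtered in a script. The sketch below assumes input in the shape produced by `ps -eo pid,pcpu,pmem,comm` (a header line, then PID, %CPU, %MEM, and COMMAND). The helper name `hot_processes` and the default threshold of 10 are illustrative.

```shell
# hot_processes [THRESHOLD]
# Reads a process table on stdin (header, then PID %CPU %MEM COMMAND) and
# keeps the header plus every row whose %CPU is at or above THRESHOLD.
# The function name and the default threshold are illustrative.
hot_processes() {
    threshold=${1:-10}
    awk -v t="$threshold" 'NR == 1 || $2 + 0 >= t + 0'
}

# Example: ps -eo pid,pcpu,pmem,comm | hot_processes 10
```

Reading from standard input keeps the filter independent of which tool produced the table.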

Tip: Memorizing these commands is as fun as collecting Pokémon!

  • Each command has its own special skill
  • Combined, they are even more powerful
  • Practice makes perfect, so practice often!

See! Isn't it easier to understand what each command does now? Keep it up, you are already a budding ops expert!

Response time analysis: from ordering to serving

Let's analyze the response time with a professional tool:

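 # Measure a website's response time (-o /dev/null discards the page body)
 $ curl -o /dev/null -s -w "\n⏱️ Total time: %{time_total}s\n" https://example.com
 ⏱️ Total time: 0.235s
 # More detailed performance analysis: 100 requests, 10 at a time
 $ ab -n 100 -c 10 https://example.com/
 Test results 📊:
 - Average response: 0.389s ⚡
 - Success rate: 98% ✅
 - Errors: 2 ⚠️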
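A single curl measurement can be noisy, so it helps to take several samples and summarize them. The sketch below assumes one timing value per line on standard input. The helper name `summarize_times` and the output layout are illustrative, not part of curl or ab.

```shell
# summarize_times
# Reads one response time (in seconds) per line on stdin and prints the
# number of samples, the average and the maximum.
# The function name and the output format are illustrative.
summarize_times() {
    awk '
        { sum += $1; if (NR == 1 || $1 > max) max = $1 }
        END { if (NR > 0) printf "samples=%d avg=%.3f max=%.3f\n", NR, sum / NR, max }
    '
}

# Example: five requests against the site used in this guide
#   for i in 1 2 3 4 5; do
#       curl -o /dev/null -s -w '%{time_total}\n' https://example.com/
#   done | summarize_times
```

Here `%{time_total}` is the same curl write-out variable used above; the loop count of five is arbitrary.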

Error log analysis: Find and resolve problems

When something goes wrong in the system, the error log is our best helper. Let's learn some practical log analysis commands:

 # View the error log
 $ grep ERROR /var/log/app.log
 #   grep               search for text in a file 🔎
 #   ERROR              the keyword to search for 🔍
 #   /var/log/app.log   path of the log file to search 📄
 [ERROR] 2024-03-20 15:00:23 database_connection_timeout ⚠️
 [ERROR] 2024-03-20 15:00:25 out_of_memory 💥

Detailed explanation of grep command parameters:

  • -i: Ignore case
  • -n: Display line numbers
  • -r: Recursively search directories
  • -v: Invert the match and display only non-matching lines
  • -c: Display only the number of matching lines

Example of use:

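 # Show line numbers
 $ grep -n ERROR /var/log/app.log   # see which line each error is on 📑
 # Case-insensitive search
 $ grep -i error /var/log/app.log   # matches ERROR, error, and so on 🔤
 # Recursively search all log files
 $ grep -r ERROR /var/log/          # search the whole log directory 📂
 # Count the errors
 $ grep -c ERROR /var/log/app.log   # print only the number of matching lines 🔢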

Let's see how to count error types:

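 $ grep ERROR /var/log/app.log | awk '{print $4}' | sort | uniq -c | sort -nr
 #   grep ERROR ...     filter out the error lines 🔍
 #   awk '{print $4}'   extract column 4 (the error type) ✂️
 #   sort               sort so that identical types are adjacent 📋
 #   uniq -c            collapse duplicates and count occurrences 📊
 #   sort -nr           order by count, highest first
 # This works when the error type is a single token without spaces in column 4.
      15 database_timeout   # the most common error 📊
       8 out_of_memory 📈
       3 network_error 📉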

Tip: Best practices for analyzing error logs

  • Check error logs regularly to detect problems early
  • Use grep's -C option to view the context around an error (for example, -C 3 shows three lines before and after each match)
  • Analyze the error occurrence pattern based on timestamp
  • Create error type statistics report to find common problems
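To look for patterns by timestamp, errors can be bucketed per minute. The sketch below assumes the log layout from the sample lines above (`[ERROR] date time message`), so the date is field 2 and the time is field 3. The helper name `errors_per_minute` is hypothetical.

```shell
# errors_per_minute FILE
# Count ERROR lines per minute, assuming the layout
#   [ERROR] YYYY-MM-DD HH:MM:SS message
# so field 2 is the date and field 3 the time. The name is illustrative.
errors_per_minute() {
    awk '
        /ERROR/ { count[$2 " " substr($3, 1, 5)]++ }
        END { for (minute in count) print minute, count[minute] }
    ' "$1" | sort
}

# Example: errors_per_minute /var/log/app.log
```

A sudden jump in one minute's count is usually a good place to start reading the surrounding lines with `grep -C`.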

Error log analysis flow chart:

Get the logs 📄 --> Filter the errors 🔍 --> Analyze the cause 🤔 --> Fix the problem ✅

Interview points: Log analysis skills

Most popular questions asked by interviewers:

  • How do you quickly locate performance issues?
 # Combine several tools
 $ dstat -cdngy 1    # monitor system resources in real time 📊
 #   -c   CPU statistics 💻
 #   -d   disk statistics 💾
 #   -n   network statistics 🌐
 #   -g   paging statistics 📑
 #   -y   system statistics 📈
 #   1    refresh once per second ⏱️
 $ iotop             # monitor disk I/O 💾
 PID   USER   IO>  DISK READ  DISK WRITE  COMMAND
 1234  mysql  2.1  50.2 M/s   10.1 M/s    mysqld
 5678  nginx  0.8  2.1 M/s    1.2 M/s     nginx
 $ netstat -antp     # view network connections 🌐
 #   -a   show all connections 🌍
 #   -n   show numeric addresses instead of hostnames 🔢
 #   -t   show TCP connections only 🔌
 #   -p   show the owning process 👥
  • How do you handle large log files?
 # Use efficient log analysis methods
 $ zcat large.log.gz | grep ERROR | tail -n 100
 #   zcat large.log.gz   read the compressed file 📦
 #   |                   pipe: pass the output to the next command 📤
 #   grep ERROR          keep only lines containing the keyword ERROR 🔍
 #   tail -n 100         show the last 100 lines 📌
 $ awk '/ERROR/ {print $4}' large.log | sort | uniq -c
 #   awk                 text-processing tool 🛠️
 #   /ERROR/             the pattern to match 🔍
 #   {print $4}          the action to run: print column 4 ✂️
 #   sort                sort the values 📋
 #   uniq -c             collapse duplicates and count them 🔢
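The two commands above can be combined into one streaming pass that handles both plain and gzip-compressed logs. This is a sketch under the same assumption as before, that the error type is a single token in column 4. The helper name `count_error_types` is hypothetical.

```shell
# count_error_types FILE
# Stream FILE once (decompressing on the fly if it ends in .gz) and count
# ERROR lines by the value in column 4, most frequent first.
# Assumes the error type is a single token in column 4; the function name
# is illustrative.
count_error_types() {
    file=$1
    case $file in
        *.gz) gzip -dc "$file" ;;
        *)    cat "$file" ;;
    esac | awk '
        /ERROR/ { count[$4]++ }
        END { for (type in count) print count[type], type }
    ' | sort -nr
}

# Example: count_error_types large.log.gz
```

Because the counting happens inside awk, only the short summary is sorted, and the log is never decompressed to disk.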

Performance analysis tool comparison chart:

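 dstat ➡️ overall system status 📊
 ├── CPU usage 💻
 ├── Disk I/O 💾
 ├── Network traffic 🌐
 └── Memory usage 🧠
 iotop ➡️ disk I/O details 💾
 ├── Read speed 📥
 ├── Write speed 📤
 └── Process information 👥
 netstat ➡️ network connection status 🌐
 ├── TCP/UDP connections 🔌
 ├── Port usage 🚪
 └── Process information 👥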

Tip: Performance analysis best practices

  • First use dstat to obtain the overall system status
  • Use iotop to conduct in-depth analysis when I/O anomalies are found
  • Use netstat to troubleshoot network problems
  • Pay attention to collecting enough sample data
  • Establish benchmark data for comparative analysis

Common performance issues and solutions:

(1) High CPU usage

  • Use top to find high load processes
  • Analyze whether the process has an infinite loop
  • Consider adding more CPU cores or optimizing the code

(2) Disk I/O bottleneck

  • Use iotop to monitor disk reads and writes
  • Check whether there are a large number of small file operations
  • Consider using SSD or optimizing storage strategy

(3) High network latency

  • Use netstat to check the connection status
  • Analyze whether network packets are lost
  • Consider optimizing network configuration or increasing bandwidth
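For the network case, a quick way to read the connection status is to count TCP connections by state. The sketch below assumes Linux-style `netstat -ant` output, where the state is the sixth column. The helper name `conn_states` is hypothetical.

```shell
# conn_states
# Reads `netstat -ant` output on stdin and counts TCP connections per
# state, most frequent first. Assumes the Linux column layout, where the
# state is field 6. The function name is illustrative.
conn_states() {
    awk '
        $1 ~ /^tcp/ { count[$6]++ }
        END { for (state in count) print count[state], state }
    ' | sort -nr
}

# Example: netstat -ant | conn_states
```

A large number of connections stuck in one state, such as TIME_WAIT, is a hint about where to look next.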
