A brief introduction to ZAB protocol in Zookeeper

The full name of the ZAB protocol is Zookeeper Atomic Broadcast Protocol.

Function: The ZAB protocol can be used to synchronize data between active and standby nodes in a cluster to ensure data consistency.

Before explaining the ZAB protocol, we must understand the role of each Zookeeper node.

The role of each Zookeeper node

Leader:

Responsible for processing read and write transaction requests sent by the client. The transaction request here can be understood as this request has the ACID characteristics of the transaction.
Synchronously write transaction requests to other nodes, and ensure the order of transactions.
The status is LEADING.

Follower:

Responsible for processing read requests sent by the client
Forward the write transaction request to the Leader.
Participate in the Leader election.
The status is FOLLOWING.

Observer:

Same as Follower, the only difference is that it does not participate in the Leader election and its status is OBSERING.
Can be used to linearly scale read QPS.

How to choose a Leader during the startup phase?

When Zookeeper is first started, multiple nodes need to find a leader. How do they find one? By voting.

For example, there are two nodes in the cluster, A and B, and the schematic diagram is as follows:

Node A votes for itself first. The voting information includes the node id (SID) and a ZXID, such as (1,0). The SID is configured and unique, and the ZXID is a unique increasing number.
Node B votes for itself first, and the voting information is (2,0).
Then nodes A and B will vote their votes to all nodes in the cluster.
After receiving the voting information from node B, node A checks whether the status of node B is in this round of voting and whether it is in the LOOKING state.
Voting PK: Node A will compare its vote with others' votes. If the ZXID sent by other nodes is larger, it will update its voting information to the voting information sent by other nodes. If the ZXIDs are equal, the SIDs will be compared. Here, the ZXIDs of nodes A and B are the same, and the SID of node B is larger, so node A updates the voting information to (2, 0) and then sends the voting information out again. Node B does not need to update the voting information, but it needs to send out the vote again in the next round.

At this time, the voting information of node A is (2, 0), as shown in the following figure:

Counting votes: In each round of voting, the voting information received by each node is counted to determine whether more than half of the nodes have received the same voting information. The voting information received by node A and node B is (2, 0), and the number is greater than the number of half of the nodes, so node B is selected as the leader.
Update node status: Node A acts as a Follower and updates its status to FOLLOWING; Node B acts as a Leader and updates its status to LEADING.

What should I do if the Leader crashes during operation?

During the operation of Zookeeper, the Leader will remain in the LEADING state until the Leader crashes. At this time, the Leader must be re-elected, and the election process is basically the same as the election process in the startup phase.

Points to note:

The remaining Followers conduct elections, and Observers do not participate in the election.
The zxid in the voting information is from the local disk log file. If the zxid on this node is larger, it will be elected as the leader. If the zxids of the followers are the same, the follower with the larger node id will be elected as the leader.

How to synchronize data between nodes?

Different clients can connect to the primary node or the backup node separately.

When the client sends a read or write request, it does not know whether it is connected to the Leader or the Follower. If the client is connected to the master node and sends a write request, the Leader will execute 2PC (two-phase commit protocol) to synchronize with other Followers and Observers. However, if the client is connected to a Follower and sends a write request, the Follower will forward the write request to the Leader, and then the Leader will perform 2PC to synchronize data to the Follower.

Two-phase commit protocol:

Phase 1: The leader sends a proposal to the follower, and the follower sends an ack response to the leader. If more than half of the acks are received, the next phase begins.
Phase 2: Leader loads data from disk log files into memory, Leader sends commit message to Follower, and Follower loads data into memory.

Let's take a look at the process of Leader synchronizing data:

① The client sends a write transaction request.
② After receiving the write request, the Leader converts it into a "proposal01:zxid1" transaction request and saves it to the disk log file.
③ Send proposal to other followers.
④ After receiving the proposal, the Follower writes the disk log file.

Next, let's see how the Follower handles the proposal transaction request sent by the Leader:

⑤ Follower returns ack to Leader.
⑥ Leader receives more than half of the acks and proceeds to the next stage
⑦ Leader loads the proposal of the log file in the disk into the znode memory data structure.
⑧ Leader sends commit message to all Followers and Observers.
⑨ After receiving the commit message, Follower loads the data on disk into the znode memory data structure.

Now the data of the Leader and Follower are all in the memory and are consistent. The data read by the client from the Leader and Follower are consistent.

How does ZAB achieve sequential consistency?

When the Leader sends a proposal, it actually creates a queue for each Follower and sends the proposal to their respective queues.

The following figure shows the message broadcast process of Zookeeper:

The client sent three write transaction requests, and the corresponding proposals are:

 proposal01 : zxid1
 proposal02 : zxid2
 proposal03 : zxid3

After receiving the request, the Leader puts it into the queue one by one, and then the Follower gets the request from the queue one by one, thus ensuring the order of the data.

Is Zookeeper strongly consistent?

Official definition: sequential consistency.

Strong consistency is not guaranteed, why?

Because after the Leader sends the commit message to all Followers and Observers, they do not complete the commit at the same time.

For example, due to network reasons, different nodes receive commits later, so the submission time is also later, resulting in inconsistent data among multiple nodes. However, after a short period of time, when all nodes commit, the data will be synchronized.

In addition, Zookeeper supports strong consistency, which means manually calling the sync method to ensure that all nodes are committed for success.

Here is a question: If a node fails to commit, will the leader retry? How to ensure data consistency? Welcome to discuss.

Leader downtime data loss issue

First case:

Assume that the Leader has written the message to the local disk but has not yet sent a proposal to the Follower. At this time, the Leader crashes.

Then you need to select a new leader. When the new leader sends a proposal, the zxid auto-increment rule included in it will change:

The upper 32 bits of zxid are incremented by 1 once, and the upper 32 bits represent the version number of the Leader.
The lower 32 bits of zxid are incremented by 1, and continue to increase in size.

When the old Leader recovers, it will become a Follower. When the Leader sends the latest proposal to it, it finds that the high 32 bits of the zxid of the proposal on the local disk are smaller than the proposal sent by the new Leader, so it discards its own proposal.

Second case:

If the Leader successfully sends a commit message to the Follower, but all or some of the Followers have not had time to commit the proposal, that is, load the proposal from the disk into the memory, then the Leader crashes.

Then you need to select the Follower with the largest zxid in the disk log. If the zxids are the same, compare the node IDs and use the one with the larger node ID as the Leader.

This article tries to use plain language + drawings to explain, hoping to inspire everyone.

<<: No exaggeration or criticism! A rational view of the value and application challenges of cyberspace mapping technology

>>: 5G investment steadily declines: CAPEX spending of the three major operators collectively "shifts"

edgeNAT May 1st Promotion: 30% off for annual VPS and 20% off for monthly VPS, top up 500 and get 100 free, Hong Kong/Korea 2G memory package starting from 48 yuan per month

Blog

Shenyang University: All-optical wireless optimizes application experience and creates a "easy-to-use and easy-to-use" campus network

Blog

New challenges for operation and maintenance in 2018: Three experts tell you how to achieve intelligent operation and maintenance

Blog

A brief introduction to ZAB protocol in Zookeeper

The role of each Zookeeper node

How to choose a Leader during the startup phase?

What should I do if the Leader crashes during operation?

How to synchronize data between nodes?

How does ZAB achieve sequential consistency?

Is Zookeeper strongly consistent?

Leader downtime data loss issue

5G New Year's Guide

AlphaVPS: €2.99/month-AMD Ryzen, 1G RAM, 15G NVMe hard drive, 1TB monthly bandwidth, Los Angeles/Bulgaria data center

Working principles of physical layer/data link layer/network layer

Low Power Wide Area Network (LPWAN) Shapes the Future of IoT

When developing online documents, have you solved this technical difficulty?

The 5G competition between China and the United States is heating up. What will the future hold?

Still using OpenFeign? Try this new thing in SpringBoot3!

edgeNAT May 1st Promotion: 30% off for annual VPS and 20% off for monthly VPS, top up 500 and get 100 free, Hong Kong/Korea 2G memory package starting from 48 yuan per month

Shenyang University: All-optical wireless optimizes application experience and creates a "easy-to-use and easy-to-use" campus network

New challenges for operation and maintenance in 2018: Three experts tell you how to achieve intelligent operation and maintenance

Recommend

Cisco ASAP helps you to subvert the traditional architecture by starting the full digital transformation of data center

Finding edge applications on 5G

Chinese companies account for more than 30% of global 5G standard patent declarations

SD-WAN Buyer's Guide: Key Questions Enterprises Need to Ask Vendors and Themselves

Tencent Cloud New Year Special Offer: 2C2G4M Cloud Server 30 Yuan/3 Months or 108 Yuan/Year

Interviewer: Can you tell me what are the commonly used network models?

22 pictures to explain OSPF: the most commonly used dynamic routing protocol

DigitalVM: 50% off on all VPS, 1-10Gbps unlimited traffic starting from $4/month in the US/Japan/Singapore

[Black Friday] spinservers: $49/month - E3-1280v5, 32G memory, 1TB NVme, 10Gbps bandwidth, San Jose/Dallas data center

Top 5 WiFi Network Predictions for 2017

Insurance Geek: Lay a solid foundation and take group insurance to the extreme

HTTP 2.0 protocol interview questions that will make the interviewer tremble

DiyVM: Japan Osaka/US Los Angeles/Hong Kong Shatin CN2 line 2G memory VPS monthly payment starts from 50 yuan

Is blockchain the next big thing? But it can easily fail if you’re not careful