A brief introduction to ZAB protocol in Zookeeper

A brief introduction to ZAB protocol in Zookeeper

The full name of the ZAB protocol is Zookeeper Atomic Broadcast Protocol.

Function: The ZAB protocol can be used to synchronize data between active and standby nodes in a cluster to ensure data consistency.

Before explaining the ZAB protocol, we must understand the role of each Zookeeper node.

The role of each Zookeeper node

Leader:

  • Responsible for processing read and write transaction requests sent by the client. The transaction request here can be understood as this request has the ACID characteristics of the transaction.
  • Synchronously write transaction requests to other nodes, and ensure the order of transactions.
  • The status is LEADING.

Follower:

  • Responsible for processing read requests sent by the client
  • Forward the write transaction request to the Leader.
  • Participate in the Leader election.
  • The status is FOLLOWING.

Observer:

  • Same as Follower, the only difference is that it does not participate in the Leader election and its status is OBSERING.
  • Can be used to linearly scale read QPS.

How to choose a Leader during the startup phase?

When Zookeeper is first started, multiple nodes need to find a leader. How do they find one? By voting.

For example, there are two nodes in the cluster, A and B, and the schematic diagram is as follows:

  • Node A votes for itself first. The voting information includes the node id (SID) and a ZXID, such as (1,0). The SID is configured and unique, and the ZXID is a unique increasing number.
  • Node B votes for itself first, and the voting information is (2,0).
  • Then nodes A and B will vote their votes to all nodes in the cluster.
  • After receiving the voting information from node B, node A checks whether the status of node B is in this round of voting and whether it is in the LOOKING state.
  • Voting PK: Node A will compare its vote with others' votes. If the ZXID sent by other nodes is larger, it will update its voting information to the voting information sent by other nodes. If the ZXIDs are equal, the SIDs will be compared. Here, the ZXIDs of nodes A and B are the same, and the SID of node B is larger, so node A updates the voting information to (2, 0) and then sends the voting information out again. Node B does not need to update the voting information, but it needs to send out the vote again in the next round.

At this time, the voting information of node A is (2, 0), as shown in the following figure:

  • Counting votes: In each round of voting, the voting information received by each node is counted to determine whether more than half of the nodes have received the same voting information. The voting information received by node A and node B is (2, 0), and the number is greater than the number of half of the nodes, so node B is selected as the leader.
  • Update node status: Node A acts as a Follower and updates its status to FOLLOWING; Node B acts as a Leader and updates its status to LEADING.

What should I do if the Leader crashes during operation?

During the operation of Zookeeper, the Leader will remain in the LEADING state until the Leader crashes. At this time, the Leader must be re-elected, and the election process is basically the same as the election process in the startup phase.

Points to note:

  • The remaining Followers conduct elections, and Observers do not participate in the election.
  • The zxid in the voting information is from the local disk log file. If the zxid on this node is larger, it will be elected as the leader. If the zxids of the followers are the same, the follower with the larger node id will be elected as the leader.

How to synchronize data between nodes?

Different clients can connect to the primary node or the backup node separately.

When the client sends a read or write request, it does not know whether it is connected to the Leader or the Follower. If the client is connected to the master node and sends a write request, the Leader will execute 2PC (two-phase commit protocol) to synchronize with other Followers and Observers. However, if the client is connected to a Follower and sends a write request, the Follower will forward the write request to the Leader, and then the Leader will perform 2PC to synchronize data to the Follower.

Two-phase commit protocol:

  • Phase 1: The leader sends a proposal to the follower, and the follower sends an ack response to the leader. If more than half of the acks are received, the next phase begins.
  • Phase 2: Leader loads data from disk log files into memory, Leader sends commit message to Follower, and Follower loads data into memory.

Let's take a look at the process of Leader synchronizing data:

  • ① The client sends a write transaction request.
  • ② After receiving the write request, the Leader converts it into a "proposal01:zxid1" transaction request and saves it to the disk log file.
  • ③ Send proposal to other followers.
  • ④ After receiving the proposal, the Follower writes the disk log file.

Next, let's see how the Follower handles the proposal transaction request sent by the Leader:

  • ⑤ Follower returns ack to Leader.
  • ⑥ Leader receives more than half of the acks and proceeds to the next stage
  • ⑦ Leader loads the proposal of the log file in the disk into the znode memory data structure.
  • ⑧ Leader sends commit message to all Followers and Observers.
  • ⑨ After receiving the commit message, Follower loads the data on disk into the znode memory data structure.

Now the data of the Leader and Follower are all in the memory and are consistent. The data read by the client from the Leader and Follower are consistent.

How does ZAB achieve sequential consistency?

When the Leader sends a proposal, it actually creates a queue for each Follower and sends the proposal to their respective queues.

The following figure shows the message broadcast process of Zookeeper:

The client sent three write transaction requests, and the corresponding proposals are:

 proposal01 : zxid1
proposal02 : zxid2
proposal03 : zxid3

After receiving the request, the Leader puts it into the queue one by one, and then the Follower gets the request from the queue one by one, thus ensuring the order of the data.

Is Zookeeper strongly consistent?

Official definition: sequential consistency.

Strong consistency is not guaranteed, why?

Because after the Leader sends the commit message to all Followers and Observers, they do not complete the commit at the same time.

For example, due to network reasons, different nodes receive commits later, so the submission time is also later, resulting in inconsistent data among multiple nodes. However, after a short period of time, when all nodes commit, the data will be synchronized.

In addition, Zookeeper supports strong consistency, which means manually calling the sync method to ensure that all nodes are committed for success.

Here is a question: If a node fails to commit, will the leader retry? How to ensure data consistency? Welcome to discuss.

Leader downtime data loss issue

First case:

Assume that the Leader has written the message to the local disk but has not yet sent a proposal to the Follower. At this time, the Leader crashes.

Then you need to select a new leader. When the new leader sends a proposal, the zxid auto-increment rule included in it will change:

  • The upper 32 bits of zxid are incremented by 1 once, and the upper 32 bits represent the version number of the Leader.
  • The lower 32 bits of zxid are incremented by 1, and continue to increase in size.

When the old Leader recovers, it will become a Follower. When the Leader sends the latest proposal to it, it finds that the high 32 bits of the zxid of the proposal on the local disk are smaller than the proposal sent by the new Leader, so it discards its own proposal.

Second case:

If the Leader successfully sends a commit message to the Follower, but all or some of the Followers have not had time to commit the proposal, that is, load the proposal from the disk into the memory, then the Leader crashes.

Then you need to select the Follower with the largest zxid in the disk log. If the zxids are the same, compare the node IDs and use the one with the larger node ID as the Leader.

This article tries to use plain language + drawings to explain, hoping to inspire everyone.

<<:  No exaggeration or criticism! A rational view of the value and application challenges of cyberspace mapping technology

>>:  5G investment steadily declines: CAPEX spending of the three major operators collectively "shifts"

Recommend

5G commercialization promotes the scale development of industrial Internet

The Industrial Internet is a network that connect...

MoeCloud: US CN2 GIA line VPS annual payment starts from 299 yuan

MoeCloud also launched a promotion this month, of...

Teach you how to easily obtain local area network devices

[[430847]] Preface With the rapid development of ...

There are three major challenges in data center management

When an enterprise develops to a certain extent, ...

Cisco CEO: 5G will bring unexpected benefits to Cisco

[[278077]] Cisco is primarily known for its switc...

5G needs new Wi-Fi tech to succeed, Cisco says

As the tech industry talks up 5G networks, Cisco ...

The three major telecom operators earned 387 million yuan a day in 2020

On March 25, China Mobile released its 2020 perfo...

Share: Construction skills of integrated wiring system

In the process of implementing the integrated wir...