ZooKeeper (2013)
Part II. Programming with ZooKeeper
Chapter 6. ZooKeeper Caveat Emptor
The previous chapters discussed how to code with ZooKeeper, implementing some basic cases as well as a few advanced ones. In this chapter, we focus on some tricky aspects of ZooKeeper, mostly related to session semantics and ordering. The material covered here might not affect your development, but it is good to be aware of these issues in case you come across any of them.
The structure of this chapter is different from the others: we present it as a mix of issues without a linear flow. Each section is self-contained, so they can be read separately. We have discussed some tricky issues in earlier chapters, but this chapter brings together others that did not fit naturally elsewhere. They are still important issues, because many developers have stumbled upon them.
Using ACLs
Normally you would expect access control to be described in an administration section. However, in ZooKeeper the developer, rather than the administrator, is usually the one who manages access control. This is because access rights must be set every time a znode is created; it does not inherit the access permissions from its parent. Access checks are also done on a per-znode basis. If a client has access to a znode, it can access it even if that client cannot access the parent of the znode.
ZooKeeper controls access using access control lists (ACLs). An ACL contains entries of the form scheme:auth-info, where scheme corresponds to a set of built-in authentication schemes and auth-info encodes the authentication information in some manner specific to the scheme. ZooKeeper enforces security by checking the authorization information submitted by the client process upon access to each znode. If a process has not supplied authentication information, or if this information does not match what is needed to execute a request on a znode, the process will receive a permission error.
To add authentication information to a ZooKeeper handle, issue the addAuthInfo call in the format:
void addAuthInfo(
String scheme,
byte auth[]
)
where:
scheme
The scheme used to authenticate.
auth
The authentication information to be sent to the server. The type of this parameter is byte [], but most of the current schemes take a String, so usually you will convert the String to a byte [] using String.getBytes().
A process can add authentication information using addAuthInfo at any time. Usually, it will be called right after the ZooKeeper handle is created. A process can call this method multiple times to add multiple identities to a ZooKeeper handle.
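For example, using the Java API with the digest scheme described in the next section, a client might attach its identity right after creating the handle. This is only a sketch: the connect string, session timeout, and credentials are placeholders.

// Create a handle and attach a digest identity to it; the server checks this
// identity against znode ACLs on every request.
ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, null);
zk.addAuthInfo("digest", "amy:secret".getBytes());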
Built-in Authentication Schemes
ZooKeeper offers four built-in schemes to handle ACLs. One of them we have been using implicitly through the OPEN_ACL_UNSAFE constant. That ACL uses the world scheme that just lists anyone as the auth-info. anyone is the only auth-info that can be used with the world scheme.
Another special built-in scheme used by administrators is the super scheme. This scheme is never included in any ACL, but it can be used to authenticate to ZooKeeper. A client that authenticates with super will not be restricted by ACLs of any of the znodes. Authentication and Authorization Options has more information about the super scheme.
We will look at the other two schemes in the following example.
When ZooKeeper starts with an empty tree, there is one starting znode: /. This znode is open to everyone. Let’s assume that Amy the administrator is setting up the ZooKeeper service. Amy also creates the /apps znode to be the parent of znodes created for applications that will use the service. She wants to lock the service down a bit, so she sets the ACL for / and /apps to:
digest:amy:Iq0onHjzb4KyxPAp8YWOIC8zzwY=, READ | WRITE | CREATE | DELETE | ADMIN
This ACL has exactly one entry, giving Amy all access rights. Amy has chosen amy as her user ID.
digest is a built-in scheme whose auth-info has the form userid:passwd_digest when setting the ACL and userid:password when calling addAuthInfo. The passwd_digest is a cryptographic digest of the user’s password. In the example ACL, Iq0onHjzb4KyxPAp8YWOIC8zzwY= is the passwd_digest, so when Amy called addAuthInfo, auth would be the byte array corresponding to the string amy:secret.
Amy used DigestAuthenticationProvider as follows to generate a digest for her account amy with password secret:
java -cp $ZK_CLASSPATH \
org.apache.zookeeper.server.auth.DigestAuthenticationProvider amy:secret
....
amy:secret->amy:Iq0onHjzb4KyxPAp8YWOIC8zzwY=
The funny string following amy: is the password digest. This is what we used for the ACL entry. When Amy authenticates to ZooKeeper, she will use digest amy:secret. For example, if Amy is using zkCli.sh to connect to ZooKeeper, she can authenticate using:
[zk: localhost:2181(CONNECTED) 1] addauth digest amy:secret
To avoid writing out the full digests in the following examples, we will simply use XXXXX as a placeholder for the digest.
Amy wants to set up a subtree for a new application called SuperApp that is being developed by Dom the Developer, so she creates /apps/SuperApp and sets the ACL to be:
digest:dom:XXXXX, READ | WRITE | CREATE | DELETE | ADMIN
digest:amy:XXXXX, READ | WRITE | CREATE | DELETE | ADMIN
This ACL is composed of two entries, one for Dom and one for Amy. The entries give full privileges to clients who can produce the password of either dom or amy.
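Using the Java API, an ACL like this can be built and passed to create, as in the following sketch. It assumes an existing handle zk; the passwords are placeholders, and DigestAuthenticationProvider.generateDigest produces the same userid:digest strings as the command-line invocation shown earlier (the NoSuchAlgorithmException it declares is not handled in this fragment).

List<ACL> acl = new ArrayList<ACL>();
acl.add(new ACL(ZooDefs.Perms.ALL,
        new Id("digest", DigestAuthenticationProvider.generateDigest("dom:domsecret"))));
acl.add(new ACL(ZooDefs.Perms.ALL,
        new Id("digest", DigestAuthenticationProvider.generateDigest("amy:secret"))));
// ZooDefs.Perms.ALL is READ | WRITE | CREATE | DELETE | ADMIN
zk.create("/apps/SuperApp", new byte[0], acl, CreateMode.PERSISTENT);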
Note that, according to Dom’s entry in the ACL, he has ADMIN permission to /apps/SuperApp, which means that Dom can remove Amy’s access to /apps/SuperApp by changing the ACL to not include her entry. Of course, Amy has access to the super secret, so she can always access any znode even if Dom removes her access to it.
Dom uses ZooKeeper to store the configuration of his application, so he creates /apps/SuperApp/config to store it, using the ACL that we have used in all of our examples, OPEN_ACL_UNSAFE. Dom thinks that because access is restricted to /apps and /apps/SuperApp, /apps/SuperApp/config is protected. As we will see, we don’t call it UNSAFE for nothing.
Let’s say there is a guest named Gabe who has network access to the ZooKeeper service. Because of the ACLs, Gabe cannot access /apps or /apps/SuperApp. Gabe cannot list the children of /apps/SuperApp, for example. But perhaps Gabe has an idea that Dom uses ZooKeeper for configuration, and config is a rather obvious name for a configuration znode, so he connects to the ZooKeeper service and calls getData on /apps/SuperApp/config. Because of the open ACL used on that znode, Gabe can read it. And it doesn’t stop there: Gabe can change, delete, and even restrict access to /apps/SuperApp/config.
Let’s assume that Dom realizes this and changes the ACL of /apps/SuperApp/config to:
digest:dom:XXXXX, READ | WRITE | CREATE | DELETE | ADMIN
As things progress, Dom gets a new developer, Nico, to help finish SuperApp. Nico needs access to the SuperApp subtree, so Dom changes the ACL of the znodes in that subtree to include Nico. The new ACL is:
digest:dom:XXXXX, READ | WRITE | CREATE | DELETE | ADMIN
digest:nico:XXXXX, READ | WRITE | CREATE | DELETE | ADMIN
WHERE DO DIGEST USERNAMES AND PASSWORDS COME FROM?
You may notice that the usernames and passwords that are used with the digest scheme seem to be appearing from thin air. Actually, that is exactly where they come from. They don’t need to correspond to any real system identity. Usernames can also overlap. There may be another developer named Amy that starts working with Dom and Nico. Dom can add her into ACLs using amy:XXXXX. The only clash would be if both Amys happened to choose the same password, because they would then have access to each other’s files.
Now Dom and Nico will have the access they need to finish working on SuperApp. When the application goes into production, though, Dom and Nico may not want to give the secret passwords to the processes that need to access the ZooKeeper data. So they decide to restrict access to the data based on the network address of the machines that are running the SuperApp processes. All of the machines are on the 10.11.12.0/24 network. They therefore change the ACL of the znodes in the SuperApp subtree to:
digest:dom:XXXXX, READ | WRITE | CREATE | DELETE | ADMIN
digest:nico:XXXXX, READ | WRITE | CREATE | DELETE | ADMIN
ip:10.11.12.0/24, READ
The ip scheme takes the network address and mask. Because it uses the address of the client to do the ACL check, clients do not need to call addAuthInfo with the ip scheme to access a znode using this ACL.
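A sketch of setting such an ACL with the Java API might look like the following, assuming an existing handle zk and keeping XXXXX as a stand-in for the real digests:

List<ACL> acl = new ArrayList<ACL>();
acl.add(new ACL(ZooDefs.Perms.ALL, new Id("digest", "dom:XXXXX")));
acl.add(new ACL(ZooDefs.Perms.ALL, new Id("digest", "nico:XXXXX")));
acl.add(new ACL(ZooDefs.Perms.READ, new Id("ip", "10.11.12.0/24")));
// -1 means "apply regardless of the znode's current ACL version"
zk.setACL("/apps/SuperApp/config", acl, -1);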
Now any ZooKeeper client that connects from the 10.11.12.0/24 network will have read access to znodes in the SuperApp subtree. This authentication scheme assumes that IP addresses cannot be spoofed, which is a rather naive assumption that may not be appropriate for all environments.
SASL and Kerberos
There are a couple of problems with the example in the previous section. First, if new developers join or leave the group, an administrator has to change all the ACLs; it would be nice if we could avoid this by using groups. Second, we also have to change all the ACLs if we want to change any of the passwords of any of the developers. Finally, neither the digest nor the ip scheme is appropriate if the network is not trusted. A scheme called sasl, which ships with ZooKeeper, addresses these issues.
SASL stands for Simple Authentication and Security Layer. It is a framework that abstracts the underlying system of authentication so that applications that use SASL can use any of the various protocols supported by SASL. With respect to ZooKeeper, SASL usually uses Kerberos, which is an authentication protocol that provides the missing features that we mentioned earlier. SASL uses sasl for its scheme name, and the id is the Kerberos ID of the client.
SASL is an extended ZooKeeper scheme, which means it needs to be enabled at runtime with a configuration parameter or a Java system property. To add it through the ZooKeeper configuration file, use the configuration parameter authProvider.XXX; to use a system property instead, use the zookeeper.authProvider.XXX property name. In both cases, XXX can be anything as long as no other authProvider uses the same XXX; usually XXX is a number starting at 0. The value of the parameter or property is org.apache.zookeeper.server.auth.SASLAuthenticationProvider. This will enable SASL.
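For example, with XXX chosen to be 1, either of the following lines enables the SASL provider (they simply spell out the parameter and property described above):

# In the ZooKeeper configuration file:
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider

# Or as a system property when starting the server:
-Dzookeeper.authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider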
Adding New Schemes
Many other possible schemes could be used with ZooKeeper. Making them available is a “simple matter of coding.” The org.apache.zookeeper.server.auth package offers an interface called AuthenticationProvider. If you implement your own provider, you can enable it by putting your new classes in the server’s classpath and creating a Java system property whose key has the zookeeper.authProvider. prefix and whose value is the name of the class that implements AuthenticationProvider.
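As a rough sketch, a trivial provider might look like the following. The interface shown here matches the ZooKeeper 3.4 line, but the scheme name and the matching rule are invented purely for illustration; check AuthenticationProvider in your version before relying on these signatures.

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.data.Id;
import org.apache.zookeeper.server.ServerCnxn;
import org.apache.zookeeper.server.auth.AuthenticationProvider;

public class PassthroughAuthenticationProvider implements AuthenticationProvider {
    public String getScheme() {
        return "passthrough";             // clients would use: addauth passthrough <id>
    }
    public KeeperException.Code handleAuthentication(ServerCnxn cnxn, byte[] authData) {
        // Accept whatever id the client sent and attach it to the connection
        cnxn.addAuthInfo(new Id(getScheme(), new String(authData)));
        return KeeperException.Code.OK;
    }
    public boolean matches(String id, String aclExpr) {
        return id.equals(aclExpr);        // the session id must equal the ACL entry exactly
    }
    public boolean isAuthenticated() {
        return true;                      // ids from this scheme count as authenticated
    }
    public boolean isValid(String id) {
        return id != null && id.length() > 0;
    }
}

To enable it, the class would go on the server’s classpath and a property such as zookeeper.authProvider.2 would be set to its fully qualified name.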
Session Recovery
Suppose your ZooKeeper client crashes and recovers. When it comes back, there are a couple of issues that the application needs to be aware of. First, the state of ZooKeeper might not be the same as it was at the time the client crashed. As time elapses, other clients might have made progress and changed the ZooKeeper state. Consequently, it is recommended that the client not try to persist any cached state that comes from ZooKeeper, but instead use ZooKeeper as the “source of truth” for all coordination state.
For example, in our master-worker implementation, if the primary master crashes and recovers, the ensemble might have failed over in the meantime to a backup master that has assigned tasks. When the first master recovers, it shouldn’t assume that it is still the master or that the list of pending assignments hasn’t changed.
The second important issue is that the client operations to ZooKeeper that were pending at the time the client crashed might have completed. Given that the client crashed before it received a confirmation, ZooKeeper can’t guarantee that the operations have been executed successfully. Consequently, upon recovery, a client might need to perform some cleanup on the ZooKeeper state to complete some operations. For example, if our master crashes before deleting a pending task that has been assigned, it needs to delete this task from ZooKeeper in case our master becomes a primary again.
Although the discussion so far has focused on the case where the client crashes, all the points in this section apply also when the session has simply expired. For a session to expire, it is not necessary for the client to crash. The session may expire due to network issues or other issues, such as garbage collection pauses in Java. In the case of session expiration, the client must take into account that the ZooKeeper state might have changed and that some of its ZooKeeper requests might not have gone through.
Version Is Reset When Znode Is Re-Created
This might sound like a naive observation, but it is necessary to remember that when a znode is deleted and re-created, its version number is reset. An application that tries to make version checks after a znode is re-created might fall into this trap.
Say that a client gets the data of a znode (e.g., /z), changes the data of the znode, and writes it back under the condition that the version is 1. If the znode is deleted and re-created while the client is updating the data, the re-created znode starts over at version 0 and can reach version 1 again through unrelated updates. The client’s conditional write then succeeds because the version matches, but the znode now contains the wrong data.
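As a sketch of this trap (assuming an existing handle zk; update() stands in for whatever local modification the client makes):

Stat stat = new Stat();
byte[] data = zk.getData("/z", false, stat);   // suppose stat.getVersion() returns 1
byte[] newData = update(data);                 // hypothetical local modification
// If /z is deleted and re-created in the meantime, the new znode starts at version 0
// and can reach version 1 again through unrelated updates. The conditional write below
// then succeeds even though /z is no longer the znode that was read.
zk.setData("/z", newData, stat.getVersion());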
Another possible scenario is for changes to a znode to occur by deleting and re-creating the znode; the znode is never changed with setData but is different nevertheless. In this case, checking the version gives no clue about changes to the znode. The znode may change an arbitrary number of times and its version will still be 0.
The sync Call
If clients of an application communicate only by reading and writing to ZooKeeper, the application shouldn’t worry about sync. sync exists because communication outside ZooKeeper may lead to a problem often referred to as a hidden channel, explained in Ordering Guarantees. The idea is that a client cʹ may tell another client c about some change to the ZooKeeper state through a direct channel (e.g., a TCP connection between c and cʹ), but when c reads the ZooKeeper state, it doesn’t observe the change.
This scenario is possible because the server serving a given client might not have processed the change yet. sync is supposed to deal with such situations. sync is an asynchronous call that a client uses before a read operation. Say that a client wants to read the znode that it has heard through a direct channel has changed. The client calls sync, followed by getData:
...
zk.sync(path, voidCb, ctx);
zk.getData(path, watcher, dataCb, ctx);
...
sync takes a path, a void callback implementation, and a context object.
The getData call is the same as before.
The path in the sync call documents the operation it is referring to. Internally, it doesn’t really matter to ZooKeeper. When the server side handles the sync call, it flushes the channel between the leader and the follower that is serving the client c that called sync. Flushing here means that by the time getData returns, it is sure to incorporate any changes that might have happened by the time c calls sync. In the case of a hidden channel, the change communicated to c will have happened before the call to sync. Consequently, when c receives a response from the getData call, it must incorporate the change communicated directly by cʹ. Note that other changes might have happened to the same znode in the meantime, so ZooKeeper only guarantees that the change communicated directly has been incorporated in the response of getData.
There is a caveat to the use of sync, which is fairly technical and deeply entwined with ZooKeeper internals. (Feel free to skip it.) Because ZooKeeper is supposed to serve reads fast and scale for read-dominated workloads, the implementation of sync has been simplified and it doesn’t really traverse the execution pipeline as a regular update operation, like create, setData, or delete. It simply reaches the leader, and the leader queues a response back to the follower that sent it. There is a small chance that the leader thinks that it is the leader l, but doesn’t have support from a quorum any longer because the quorum now supports a different leader, lʹ. In this case, the leader l might not have all updates that have been processed, and the sync call might not be able to honor its guarantee.
The ZooKeeper implementation deals with this issue by making it unlikely that a quorum of followers will abandon a leader without the leader noticing. It does so by having the leader time out for any given follower based on the tickTime, while a follower decides that a leader is gone by receiving a socket exception, which occurs when the TCP connection between them drops. The leader times out a follower much sooner than a TCP connection expiration. However, the corner case for an error is there, even though it has never been observed to our knowledge.
There have been discussions on the mailing list about changing the server handling of sync to traverse the pipeline and eliminate this corner case altogether. Currently, the ZooKeeper implementation relies on reasonable timing assumptions, and consequently no problem is expected.
Ordering Guarantees
Although ZooKeeper officially guarantees the order of client operations during a session, circumstances outside the control of ZooKeeper can still change the way the order appears to a client. There are a few guidelines that the developer needs to be aware of to ensure the expected behavior. We discuss three cases here.
Order in the Presence of Connection Loss
Upon a connection loss event, ZooKeeper cancels pending requests. For synchronous calls the library throws an exception, while for asynchronous calls the library invokes the callbacks with a return code indicating a connection loss. The client library won’t try to resubmit the request once it has indicated to the application that the connection has been lost, so it’s up to the application to resubmit operations that have been canceled. Consequently, the application can rely on the client library to issue all the callbacks, but it cannot rely on ZooKeeper to actually carry out the operation in the event of a connection loss.
To understand what impact this can have on an application, let’s consider the following sequence of events:
1. Application submits a request to execute Op1.
2. Client library detects a connection loss and cancels the pending request to execute Op1.
3. Client reconnects before the session expires.
4. Application submits a request to execute operation Op2.
5. Op2 is executed successfully.
6. Op1 returns with CONNECTIONLOSS.
7. Application resubmits Op1.
In this case, the application submitted Op1 and Op2 in that order and got Op2 to execute successfully before Op1. When the application notices the connection loss in the callback of Op1, it tries to submit that request again. But suppose the client does not successfully reconnect. A new call to submit Op1 will report connection loss again, and there is a risk that the application will enter an unbounded loop of resubmitting the Op1 request without ever reconnecting. To get out of this cycle, the application could set a bound on the number of attempts, or close the handle if reconnecting takes too long.
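One way to implement such a bound is sketched below. This is not taken from our master-worker example code; it uses a synchronous create and assumes an existing handle.

void createWithBoundedRetry(ZooKeeper zk, String path, byte[] data, int maxAttempts)
        throws KeeperException, InterruptedException {
    for (int attempt = 1; ; attempt++) {
        try {
            zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            return;
        } catch (KeeperException.ConnectionLossException e) {
            if (attempt >= maxAttempts) {
                throw e;                  // give up rather than loop forever
            }
            // otherwise fall through and resubmit the request
        } catch (KeeperException.NodeExistsException e) {
            return;                       // the "lost" request may actually have gone through
        }
    }
}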
In some cases, it might be important to make sure that Op1 is executed successfully before Op2. If Op2 depends on Op1 in some way, then to avoid having Op2 executing successfully before Op1, we could simply wait for a successful execution of Op1 before submitting Op2. This is the approach we have taken in most of our master-worker example code, to guarantee that the requests are executed in order. In general, this approach of waiting for the result of Op1 is safe, but it adds a performance penalty because the application needs to wait for the result of one request to submit the next one, instead of having them processed concurrently.
WHAT IF WE GET RID OF CONNECTIONLOSS?
The CONNECTIONLOSS event exists essentially because of the case in which a request is pending when the client loses the connection. Say it is a create request. In such cases, the client doesn’t know whether the request has gone through. The client, however, could ask the server to check whether the request has executed successfully. The server knows what has gone through, due to information cached either in memory or in its logs, so it is feasible to do it this way. If the community eventually changes ZooKeeper to access this server information when reconnecting, we will be able to remove the limitation of not being able to guarantee that a prefix executes successfully because clients will be able to reexecute pending requests when necessary. Until then, developers have to live with this limitation and deal with connection loss events.
Order with the Synchronous API and Multiple Threads
Multithreaded applications are common these days. If you are using the synchronous API with multiple threads, it is important to pay attention to one ordering issue. A synchronous ZooKeeper call blocks until it gets a response. If two or more threads submit synchronous operations to ZooKeeper concurrently, they will each block until they receive a response. ZooKeeper will deliver the responses in order, but it is possible that due to thread scheduling, the result of an operation submitted later will be processed first. If ZooKeeper delivers responses to operations very close to each other in time, you may observe such a scenario.
If different threads submit operations concurrently, odds are that the operations are not directly related and can be executed in any order without causing consistency issues. But if the operations are related, the application client must take the order of submission into consideration when processing the results.
Order When Mixing Synchronous and Asynchronous Calls
Here’s another situation where results may appear to be out of order. Say that you submit two asynchronous operations, Aop1 and Aop2. It doesn’t matter what the operations are, only that they are asynchronous. In the callback of Aop1, you make a synchronous call, Sop1. The synchronous call blocks the dispatch thread of the ZooKeeper client, which causes the application client to receive the result of Sop1 before receiving the result of Aop2. Consequently, the application observes the results of Aop1, Sop1, and Aop2 in that order, which is different from the submission order.
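A sketch of this situation, with made-up paths and an existing handle zk, might look like the following; the callbacks do nothing beyond illustrating the ordering.

// Aop1: asynchronous read whose callback makes a synchronous call (Sop1)
zk.getData("/path1", false, new AsyncCallback.DataCallback() {
    public void processResult(int rc, String path, Object ctx, byte[] data, Stat stat) {
        try {
            zk.exists("/path3", false);   // Sop1: blocks the callback (event) thread
        } catch (KeeperException e) {
        } catch (InterruptedException e) {
        }
    }
}, null);
// Aop2: submitted right after Aop1, but its callback cannot be delivered until
// Aop1's callback returns, so its result is observed after Sop1's
zk.getData("/path2", false, new AsyncCallback.DataCallback() {
    public void processResult(int rc, String path, Object ctx, byte[] data, Stat stat) {
    }
}, null);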
In general, it is not a good idea to mix synchronous and asynchronous calls. There are a few exceptions—for example, when starting up, you may want to make sure that some data is in ZooKeeper before proceeding. It is possible to use Java latches and other such mechanisms in these cases, but one or more synchronous calls might also do the job.
Data and Child Limits
ZooKeeper limits the amount of data transferred for each request to 1MB by default. This limit bounds the maximum amount of data for any given node and the number of children any parent znode can have. The choice of 1MB is somewhat arbitrary, in the sense that there is nothing fundamental in ZooKeeper that prevents it from using a different value, larger or smaller. A limit exists, however, to keep performance high. A znode with very large data, for instance, takes a long time to traverse, essentially stalling the processing pipeline while requests are being executed. The same problem happens if a client executes getChildren on a znode with a large number of children.
The limits ZooKeeper imposes right now for data size and number of children are already quite high, so you should avoid even getting close to these limits. We have enabled large limits to satisfy the requirements of a broader set of applications. If you have a special use case and really need different limits, you can change them as described in Unsafe Options.
Embedding the ZooKeeper Server
Many developers have considered embedding the ZooKeeper server in their applications to hide their dependency on ZooKeeper. By “embedding,” we mean instantiating a ZooKeeper server inside an application. The idea is to make it transparent to the application user that ZooKeeper is in use. Although the idea sounds really appealing (who likes extra dependencies, after all?), it is not really recommended. One issue we have observed with embedding is that if anything goes wrong with ZooKeeper, the user will start seeing log messages related to ZooKeeper. At that point, its use is not transparent any longer, and the application developer might not be the right expert to deal with these problems. Even worse, the availability of the application and ZooKeeper are now coupled: if one exits, the other exits too. ZooKeeper is often used to provide high availability, but embedding it in an application effectively eliminates one of its strongest benefits.
Although we do not recommend embedding ZooKeeper, there is nothing really fundamental that prevents one from doing it. ZooKeeper tests, for example, do it. Consequently, if you really think you need to follow this path, the ZooKeeper tests are a good source of ideas for how to do so.
Takeaway Messages
Programming with ZooKeeper can be tricky at times. At a high level, we call the reader’s attention to ordering guarantees and session semantics. They may seem easy to understand at first, and in a sense they are, but there are important corner cases that developers need to be aware of. Our goal in this chapter was to provide some guidelines and cover some of these corner cases.