
Part III. Administering ZooKeeper

Chapter 10. Running ZooKeeper

ZooKeeper was designed not only to be a great building block for developers, but also to be friendly for operations people. As distributed systems get bigger, managing operations becomes harder and robust administration practices become more important. Our vision was that ZooKeeper would be a standard distributed system component that an operations team could learn and manage well. We have seen from previous examples that a ZooKeeper server is easy to start up, but there are many knobs and dials to keep in mind when running a ZooKeeper service. Our goal in this chapter is to get you familiar and comfortable with the management tools available for running ZooKeeper.

In order for a ZooKeeper service to function correctly, it must be configured correctly. The distributed computing foundation upon which ZooKeeper is based works only when required operating conditions are met. For example, all ZooKeeper voting servers must have the same configuration. It has been our experience that improper or inconsistent configuration is the primary source of operational problems.

A simple example of one such problem happened in the early days of ZooKeeper. A team of early users had written their application around ZooKeeper, tested it thoroughly, and then pushed it to production. Even in the early days ZooKeeper was easy to work with and deploy, so this group pushed their ZooKeeper service and application into production without ever talking to us.

Shortly after the production traffic started, problems started appearing. We received a frantic call from operations saying that things were not working. We repeatedly asked if they had thoroughly tested their solution before putting it into production, and they repeatedly assured us that they had. After piecing together the situation, we realized they were probably suffering from split brain. Finally we asked them to send us their configuration files. Once we saw them, it was clear what had happened: they had tested with a standalone ZooKeeper server, but when they went into production they used three servers to make sure they could tolerate a server failure. Unfortunately, they forgot to change the configuration, so they ended up pushing three standalone servers.

The clients treated all three servers as part of the same ensemble, but the servers themselves acted independently. Thus, three different groups of clients had three different (and conflicting) views of the system. It looked like everything was working fine, but behind the scenes it was chaos.

Hopefully, this example illustrates the importance of ZooKeeper configuration. It takes some understanding of basic concepts, but in reality it is not hard or complicated. The key is to know where the knobs are and what they do. That is what this chapter is about.

Configuring a ZooKeeper Server

In this section we will look at the various knobs that govern how a ZooKeeper server operates. We have already seen a couple of these knobs, but there are many more. They all have default settings that correspond to the most common case, but often these should be changed. ZooKeeper was designed for easy use and operation, and we have succeeded so well that sometimes people get off and running without really understanding their setups. It is tempting to just go with the simplest configuration once everything starts working, but if you spend the time to learn the different configuration options, you will find that you can get better performance and more easily diagnose problems.

In this section we go through each of the configuration parameters, what they mean, and why you might need to use them. It may feel like a bit of a slog, so if you are looking for more exciting information, you may want to skip to the next section. If you do, though, come back at some point to sit down and familiarize yourself with the different options. It can make a big difference in the stability and performance of your ZooKeeper installation.

Each ZooKeeper server takes options from a configuration file named zoo.cfg when it starts. Servers that play similar roles and have the same basic setup can share a file. The myid file in the data directory distinguishes servers from each other. Each data directory must be unique to the server anyway, so it is a convenient place to put the distinguishing file. The server ID contained in the myid file serves as an index into the configuration file, so that a particular ZooKeeper server can know how it should configure itself. Of course, if servers have different setups (for example, they store their transaction logs in different places), each must have its own unique configuration file.

Configuration parameters are usually set in the configuration file. In the sections that follow, these parameters are presented in list form. Many parameters can also be set using Java system properties, which generally have the form zookeeper.propertyName. These properties are set using the -D option when starting the server. Where appropriate, the system property that corresponds to a given parameter will be presented in parentheses. A configuration parameter in a file has precedence over system properties.

Basic Configuration

Some configuration parameters do not have a default and must be set for every deployment. These are:

clientPort

The TCP port that clients use to connect to this server. By default, the server will listen on all of its interfaces for connections on this port unless clientPortAddress is set. The client port can be set to any number and different servers can listen on different ports. The default port is 2181.

dataDir and dataLogDir

dataDir is the directory where the fuzzy snapshots of the in-memory database will be stored. If this server is part of an ensemble, the myid file will also be in this directory.

The dataDir does not need to reside on a dedicated device. The snapshots are written using a background thread that does not lock the database, and the writes to storage are not synced until the snapshot is complete.

Unless the dataLogDir option is set, the transaction log is also stored in this directory. The transaction log is very sensitive to other activity on the same device as this directory. The server tries to do sequential writes to the transaction log because the data must be synced to storage before the server can acknowledge a transaction. Other activity on the device—notably snapshots—can severely affect write throughput by causing disk heads to thrash during syncing. So, best practice is to use a dedicated log device and set dataLogDir to point to a directory on that device.

tickTime

The length of a tick, measured in milliseconds. The tick is the basic unit of measurement for time used by ZooKeeper, and it determines the bucket size for session timeout as described in Servers and Sessions.

The timeouts used by the ZooKeeper ensemble are specified in units of tickTime. This means, in effect, that the tickTime sets the lower bound on timeouts because the minimum timeout is a single tick. The minimum client session timeout is two ticks.

The default tickTime is 3,000 milliseconds. Lowering the tickTime allows for quicker timeouts but also results in more overhead in terms of network traffic (heartbeats) and CPU time (session bucket processing).
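To make this concrete, here is a minimal sketch of a standalone zoo.cfg covering just these basic parameters. The directory paths are placeholders, and dataLogDir (discussed above) is optional but recommended:

tickTime=2000

dataDir=/var/lib/zookeeper/data

dataLogDir=/var/lib/zookeeper/txnlog

clientPort=2181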

Storage Configuration

This section covers some of the more advanced configuration settings that apply to both standalone and ensemble configurations. They do not need to be set for ZooKeeper to function properly, but some (such as dataLogDir) really should be set.

preAllocSize

The number of kilobytes to preallocate in the transaction log files (zookeeper.preAllocSize).

When writing to the transaction log, the server will allocate blocks of preAllocSize kilobytes at a time. This amortizes the file system overhead of allocating space on the disk and updating metadata. More importantly, it minimizes the number of seeks that need to be done.

By default, preAllocSize is 64 megabytes. One reason to lower this number is if the transaction log never grows that large. Because a new transaction log is restarted after each snapshot, if the number of transactions is small between each snapshot and the transactions themselves are small, 64 megabytes may be too big. For example, if we take a snapshot every 1,000 transactions, and the average transaction size is 100 bytes, a 100-kilobyte preAllocSize would be much more appropriate. The default preAllocSize is appropriate for the default snapCount and transactions that average more than 512 bytes in size.
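As a sketch of the arithmetic above, for the hypothetical workload of a snapshot every 1,000 transactions with transactions averaging about 100 bytes, the corresponding settings (remember that preAllocSize is specified in kilobytes) would be roughly:

snapCount=1000

preAllocSize=100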

snapCount

The number of transactions between snapshots (zookeeper.snapCount).

When a ZooKeeper server restarts, it needs to restore its state. Two big factors in the time it takes to restore the state are the time it takes to read in a snapshot, and the time it takes to apply transactions that occurred after the snapshot was started. Snapshotting often will minimize the number of transactions that must be applied after the snapshot is read in. However, snapshotting does have an effect on the server’s performance, even though snapshots are written in a background thread.

By default the snapCount is 100000. Because snapshotting does affect performance, it would be nice if all of the servers in an ensemble were not snapshotting at the same time. As long as a quorum of servers is not snapshotting at once, the processing time should not be affected. For this reason, the actual number of transactions in each snapshot is a random number close to snapCount.

Note also that if snapCount is reached but a previous snapshot is still being taken, a new snapshot will not start and the server will wait another snapCount transactions before starting a new snapshot.

autopurge.snapRetainCount

The number of snapshots and corresponding transaction logs to retain when purging data.

ZooKeeper snapshots and transaction logs are periodically garbage collected. The autopurge.snapRetainCount governs the number of snapshots to retain while garbage collecting. Obviously, not all of the snapshots can be deleted because that would make it impossible to recover a server; the minimum for autopurge.snapRetainCount is 3, which is also the default.

autopurge.purgeInterval

The number of hours to wait between garbage collecting (purging) old snapshots and logs. If set to a nonzero number, autopurge.purgeInterval specifies the period of time between garbage collection cycles. If set to zero, the default, garbage collection will not be run automatically but should be run manually using the zkCleanup.sh script in the ZooKeeper distribution.
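For example, a sketch of a configuration that purges once a day while keeping the minimum of three most recent snapshots (and their transaction logs) would be:

autopurge.snapRetainCount=3

autopurge.purgeInterval=24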

fsync.warningthresholdms

The duration in milliseconds of a sync to storage that will trigger a warning (fsync.warningthresholdms).

A ZooKeeper server will sync a change to storage before it acknowledges the change.

If the sync system call takes too long, system performance can be severely impacted. The server tracks the duration of this call and will issue a warning if it is longer than fsync.warningthresholdms. By default, it’s 1,000 milliseconds.
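For example, to be warned about any sync that takes longer than 50 milliseconds, a sketch of the setting (in zoo.cfg, or equivalently as the system property named above) is:

fsync.warningthresholdms=50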

traceFile

Keeps a trace of ZooKeeper operations by logging them in trace files named traceFile.year.month.day. Tracing is not done unless this option is set (requestTraceFile).

This option is used to get a detailed view of the operations going through ZooKeeper. However, to do the logging, the ZooKeeper server must serialize the operations and write them to disk. This causes CPU and disk contention. If you use this option, be sure to avoid putting the trace file on the log device. Also realize that, unfortunately, tracing does perturb the system and thus may make it hard to re-create problems that happen when tracing is off. Just to make it interesting, the traceFile Java system property has no zookeeper prefix and the property name does not match the name of the configuration variable, so be careful.

Network Configuration

These options place limits on communication between servers and clients. Timeouts are also covered in this section:

globalOutstandingLimit

The maximum number of outstanding requests in ZooKeeper (zookeeper.globalOutstandingLimit).

ZooKeeper clients can submit requests faster than ZooKeeper servers can process them. This will lead to requests being queued at the ZooKeeper servers and eventually (as in, in a few seconds) cause the servers to run out of memory. To prevent this, ZooKeeper servers will start throttling client requests once the globalOutstandingLimit has been reached. But globalOutstandingLimit is not a hard limit; each client must be able to have at least one outstanding request, or connections will start timing out. So, after the globalOutstandingLimit is reached, the servers will read from client connections only if they do not have any pending requests.

To determine the limit of a particular server out of the global limit, we simply divide the value of this parameter by the number of servers. There is currently no smart way implemented to figure out the global number of outstanding operations and enforce the limit accordingly. Consequently, this limit is more of an upper bound on the number of outstanding requests. As a matter of fact, having the load perfectly balanced across servers is typically not achievable, so some servers that are running a bit slower or that are a bit more loaded may end up throttling even if the global limit has not been reached.

The default limit is 1,000 requests. You will probably not need to modify this parameter. If you have many clients that are sending very large requests you may need to lower the value, but we have never seen the need to change it in practice.

maxClientCnxns

The maximum number of concurrent socket connections allowed from each IP address.

ZooKeeper uses flow control and limits to avoid overload conditions. The resources used in setting up a connection are much higher than the resources needed for normal operations. We have seen examples of errant clients that spun while creating many ZooKeeper connections per second, leading to a denial of service. To remedy the problem, we added this option, which will deny new connections from a given IP address if that address has maxClientCnxns active. The default is 60 concurrent connections.

Note that the connection count is maintained at each server. If we have an ensemble of five servers and the default is 60 concurrent connections, a rogue client will randomly connect to the five different servers and normally be able to establish close to 300 connections from a single IP address before triggering this limit on one of the servers.

clientPortAddress

Limits client connections to those received on the given address.

By default, a ZooKeeper server will listen on all its interfaces for client connections. However, some servers are set up with multiple network interfaces, generally one interface on an internal network and another on a public network. If you do not want a server to allow client connections from the public network, set the clientPortAddress to the address of the interface on the private network.

minSessionTimeout

The minimum session timeout in milliseconds. When clients make a connection, they request a specific timeout, but the actual timeout they get will not be less than minSessionTimeout.

ZooKeeper developers would love to be able to detect client failures immediately and accurately. Unfortunately, as Building Distributed Systems with ZooKeeper explained, systems cannot do this under real conditions. Instead, they use heartbeats and timeouts. The timeouts to use depend on the responsiveness of the ZooKeeper client and server machines and, more importantly, the latency and reliability of the network between them. The timeout must be equal to at least the network round trip time between the client and server, but occasionally packets will be dropped, and when that happens the time it takes to receive a response is increased by the retransmission timeout as well as the latency of receiving the retransmitted packet.

By default, minSessionTimeout is two times the tickTime. Setting this timeout too low will result in incorrect detection of client failures. Setting this timeout too high will delay the detection of client failures.

maxSessionTimeout

The maximum session timeout in milliseconds. When clients make a connection, they request a specific timeout, but the actual timeout they get will not be greater than maxSessionTimeout.

Although this setting does not affect the performance of the system, it does limit the amount of time for which a client can consume system resources. By default, maxSessionTimeout is 20 times the tickTime.
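As a sketch: with a tickTime of 2,000 milliseconds, the defaults bound session timeouts between 4,000 and 40,000 milliseconds. To let clients request sessions of up to one minute, you could raise the upper bound explicitly:

tickTime=2000

maxSessionTimeout=60000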

Cluster Configuration

When an ensemble of servers provides the ZooKeeper service, we need to configure each server to have the correct timing and server list so that the servers can connect to each other and detect failures. These parameters must be the same on all the ZooKeeper servers in the ensemble:

initLimit

The timeout, specified in number of ticks, for a follower to initially connect to a leader.

When a follower makes an initial connection to a leader, there can be quite a bit of data to transfer, especially if the follower has fallen far behind. initLimit should be set based on the transfer speed of the network between leader and follower and the amount of data to be transferred. If the amount of data stored by ZooKeeper is particularly large (i.e., if there are a large number of znodes or large data sets) or the network is particularly slow, initLimit should be increased. Because this value is so specific to the environment, there is no default for it. You should choose a value that will conservatively allow the largest expected snapshot to be transferred. Because you may have more than one transfer happening at a time, you may want to set initLimit to twice that expected time. If you set the initLimit too high, it will take longer for initial connections to faulty servers to fail, which can increase recovery time. For this reason it is a good idea to benchmark how long it takes for a follower to connect to a leader on your network with the amount of data you plan on using to find your expected time.

syncLimit

The timeout, specified in number of ticks, for a follower to sync with a leader.

A follower will always be slightly behind the leader, but if the follower falls too far behind—due to server load or network problems, for example—it needs to be dropped. If the leader hasn’t been able to sync with a follower for more than syncLimit ticks, it will drop the follower. Just like initLimit, syncLimit does not have a default and must be set. Unlike initLimit, syncLimit does not depend on the amount of data stored by ZooKeeper; instead, it depends on network latency and throughput. On high-latency networks it will take longer to send data and get responses back, so naturally the syncLimit will need to be increased. Even if the latency is relatively low, you may need to increase the syncLimit because any relatively large transaction may take a while to transmit to a follower.

leaderServes

A “yes” or “no” flag indicating whether or not a leader will service clients (zookeeper.leaderServes).

The ZooKeeper server that is serving as leader has a lot of work to do. It talks with all the followers and executes all changes. This means the load on the leader is greater than that on the follower. If the leader becomes overloaded, the entire system may suffer.

This flag, if set to “no,” can remove the burden of servicing client connections from the leader and allow it to dedicate all its resources to processing the change operations sent to it by followers. This will increase the throughput of operations that change system state. On the other hand, if the leader doesn’t handle any of the client connections itself directly, the followers will have more clients because clients that would have connected to the leader will be spread among the followers. This is particularly problematic if the number of servers in an ensemble is low. By default, leaderServes is set to “yes.”

server.x=[hostname]:n:n[:observer]

Sets the configuration for server x.

ZooKeeper servers need to know how to communicate with each other. A configuration entry of this form in the configuration file specifies the configuration for a given server x, where x is the ID of the server (an integer). When a server starts up, it gets its number from the myid file in the data directory. It then uses this number to find the server.x entry. It will configure itself using the data in this entry. If it needs to contact another server, y, it will use the information in the server.y entry to contact the server.

The hostname is the network name or IP address of server x. There are two TCP port numbers. The first port is used to send transactions, and the second is for leader election. The ports typically used are 2888:3888. If observer is in the final field, the server entry represents an observer.

Note that it is quite important that all servers use the same server.x configuration; otherwise, the ensemble won’t work properly because servers might not be able to establish connections properly.
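For example, a three-server ensemble might list the following entries in every server's configuration file (the hostnames here are placeholders):

server.1=zk1.example.com:2888:3888

server.2=zk2.example.com:2888:3888

server.3=zk3.example.com:2888:3888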

cnxTimeout

The timeout value for opening a connection during leader election (zookeeper.cnxTimeout).

The ZooKeeper servers connect to each other during leader election. This value determines how long a server will wait for a connection to complete before trying again. Leader Elections showed the purpose of this timeout. The default value of 5 seconds is very generous and probably will not need to be adjusted.

electionAlg

The election algorithm.

We have included this configuration option for completeness. It selects among different leader election algorithms, but all have been deprecated except for the one that is the default. You shouldn’t need to use this option.

Authentication and Authorization Options

This section contains the options that are used for authentication and authorization. For information on configuration options for Kerberos, refer to SASL and Kerberos:

zookeeper.DigestAuthenticationProvider.superDigest (Java system property only)

This system property specifies the digest for the “super” user’s password. (This feature is disabled by default.) A client that authenticates as super bypasses all ACL checking. The value of this system property will have the form super:encoded_digest. To generate the encoded digest, use the org.apache.zookeeper.server.auth.DigestAuthenticationProvider utility as follows:

java -cp $ZK_CLASSPATH \

org.apache.zookeeper.server.auth.DigestAuthenticationProvider super:asdf

This command line generates the following encoded digest for the password asdf:

super:asdf->super:T+4Qoey4ZZ8Fnni1Yl2GZtbH2W4=

To start a server using this digest, you can use the following command:

export SERVER_JVMFLAGS

SERVER_JVMFLAGS=-Dzookeeper.DigestAuthenticationProvider.superDigest=super:T+4Qoey4ZZ8Fnni1Yl2GZtbH2W4=

./bin/zkServer.sh start

Now, when connecting with zkCli, you can issue the following:

[zk: localhost:2181(CONNECTED) 0] addauth digest super:asdf

[zk: localhost:2181(CONNECTED) 1]

At this point you are authenticated as the super user and will not be restricted by any ACLs.

UNSECURED CONNECTIONS

The connection between the ZooKeeper client and server is not encrypted, so the super password should not be used over untrusted links. The safest way to use the super password is to run the client using the super password on the same machine as the ZooKeeper server.

Unsafe Options

The following options can be useful, but be careful when you use them. They really are for very special situations. The majority of administrators who think they need them probably don’t:

forceSync

A “yes” or “no” option that controls whether data should be synced to storage (zookeeper.forceSync).

By default, and when forceSync is set to yes, transactions will not be acknowledged until they have been synced to storage. The sync system call is expensive and is the cause of one of the biggest delays in transaction processing. If forceSync is set to no, transactions will be acknowledged as soon as they have been written to the operating system, which usually caches them in memory before writing them to disk. Setting forceSync to no will yield an increase in performance at the cost of recoverability in the case of a server crash or power outage.

jute.maxbuffer (Java system property only)

The maximum size, in bytes, of a request or response. This option can be set only as a Java system property. There is no zookeeper. prefix on it.

ZooKeeper has some built-in sanity checks, one of which is the amount of data that can be transferred for a given znode. ZooKeeper is designed to store configuration data, which generally consists of small amounts of metadata information (on the order of hundreds of bytes). By default, if a request or response has more than 1 megabyte of data, it is rejected as insane. You may want to use this property to make the sanity check smaller or, if you really are insane, increase it.

CHANGING THE SANITY CHECK

Although the size limit specified by jute.maxbuffer is most obviously exceeded with a large write, the problem can also happen when getting the list of children of a znode with a lot of children. If a znode has hundreds of thousands of immediate child znodes whose names average 10 characters in length, the default maximum buffer size will get hit when trying to return the list of children, causing the connection to get reset.
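As a rough sketch of the arithmetic (the per-entry serialization overhead of about four bytes per name is an assumption): 100,000 children with 10-character names come to roughly 100,000 × (10 + 4) bytes ≈ 1.4 MB, which already exceeds the default 1 MB limit.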

skipACL

Skips all ACL checks (zookeeper.skipACL).

There is some overhead associated with ACL checking. This option can be used to turn off all ACL checking. It will increase performance, but will leave the data completely open to any client that can connect to a ZooKeeper server.

readonlymode.enabled (Java system property only)

Setting this value to true enables read-only-mode server support. Clients that request read-only-mode support will be able to connect to a server to read (possibly stale) information even if that server is partitioned from the quorum. To enable read-only mode, a client needs to set canBeReadOnly to true.

This feature enables a client to read (but not write) the state of ZooKeeper in the presence of a network partition. In such cases, clients that have been partitioned away can still make progress and don’t need to wait until the partition heals. It is very important to note that a ZooKeeper server that is disconnected from the rest of the ensemble might end up serving stale state in read-only mode.

Logging

The ZooKeeper server uses SLF4J (the Simple Logging Facade for Java) as an abstraction layer for logging, and by default uses Log4J to do the actual logging. It may seem like overkill to use two layers of logging abstractions, and it is. In this section we will give a brief overview of how to configure Log4J. Although Log4J is very flexible and powerful, it is also a bit complicated. There are whole books written about it; in this section we will just give a brief overview of the basics needed to get things going.

The Log4J configuration file is named log4j.properties and is pulled from the classpath. One disappointing thing about Log4J is that if you don’t have the log4j.properties file in place, you will get the following output:

log4j:WARN No appenders could be found for logger (org.apache.zookeeper.serv ...

log4j:WARN Please initialize the log4j system properly.

That’s it; all the rest of the log messages will be discarded.

Generally, log4j.properties is kept in a conf directory that is included in the classpath. Let’s look at the main part of the log4j.properties file that is distributed with ZooKeeper:

zookeeper.root.logger=INFO, CONSOLE 1

zookeeper.console.threshold=INFO

zookeeper.log.dir=.

zookeeper.log.file=zookeeper.log

zookeeper.log.threshold=DEBUG

zookeeper.tracelog.dir=.

zookeeper.tracelog.file=zookeeper_trace.log

log4j.rootLogger=${zookeeper.root.logger} 2

log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender 3

log4j.appender.CONSOLE.Threshold=${zookeeper.console.threshold} 4

log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout 5

log4j.appender.CONSOLE.layout.ConversionPattern=%d{ISO8601} [myid:%X{myid}] -

...

log4j.appender.ROLLINGFILE=org.apache.log4j.RollingFileAppender 6

log4j.appender.ROLLINGFILE.Threshold=${zookeeper.log.threshold} 7

log4j.appender.ROLLINGFILE.File=${zookeeper.log.dir}/${zookeeper.log.file}

log4j.appender.ROLLINGFILE.MaxFileSize=10MB

log4j.appender.ROLLINGFILE.MaxBackupIndex=10

log4j.appender.ROLLINGFILE.layout=org.apache.log4j.PatternLayout

log4j.appender.ROLLINGFILE.layout.ConversionPattern=%d{ISO8601} [myid:%X{myid}] -

...

1. This first group of settings, which all start with zookeeper., sets up the defaults for this file. They are actually system properties and can be overridden with the corresponding -D JVM options on the Java command line. The first line configures logging. The default set here says that messages below the INFO level should be discarded and messages should be output using the CONSOLE appender. You can specify multiple appenders; for example, you could set zookeeper.root.logger to INFO, CONSOLE, ROLLINGFILE if you wanted to send log messages to both the CONSOLE and ROLLINGFILE appenders.

2. The rootLogger is the logger that processes all log messages, because we do not define any other loggers.

3. This line associates the name CONSOLE with the class that will actually be handling the output of the message—in this case, ConsoleAppender.

4. Appenders can also filter messages. This line states that this appender will ignore any messages below the INFO level, because that is the threshold set in zookeeper.root.logger.

5. Appenders use a layout class to format the messages before they are written out. We use the pattern layout to log the message level, date, thread information, and calling location information in addition to the message itself.

6. The RollingFileAppender will implement rolling log files rather than continually appending to a single log or console. Unless ROLLINGFILE is referenced by the rootLogger, this appender will be ignored.

7. The threshold for the ROLLINGFILE is set to DEBUG. Because the rootLogger filters out all messages below the INFO level, no DEBUG messages will get to the ROLLINGFILE. If you want to see the DEBUG messages, you must also change INFO to DEBUG in zookeeper.root.logger.

Logging can affect the performance of a process, especially at the DEBUG level. At the same time, logging can provide valuable information for diagnosing problems. A useful way to balance the performance cost of detailed logging with the insight that logging gives you is to set the appenders to have thresholds at DEBUG and set the level of the rootLogger to WARN. When the server is running, if you need to diagnose a problem as it is happening, you can change the level of the rootLogger to INFO or DEBUG on the fly with JMX to examine system activity more closely.
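For example, a sketch of that arrangement, using the property names from the log4j.properties file shown above, would be:

zookeeper.root.logger=WARN, CONSOLE, ROLLINGFILE

zookeeper.console.threshold=DEBUG

zookeeper.log.threshold=DEBUG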

Dedicating Resources

As you think about how you will configure ZooKeeper to run on a machine, it is also important to think about the machine itself. For good, consistent performance you will want to have a dedicated log device. This means the log directory must have its own hard drive that is not used by other processes. You don’t even want ZooKeeper to use it for the periodic fuzzy snapshots it does.

You should also consider dedicating the whole machine on which it will run to ZooKeeper. ZooKeeper is a critical component that needs to be reliable. We can use replication to handle failures of ZooKeeper servers, so it is tempting to think that the machines that ZooKeeper runs on do not need to be particularly reliable and can be shared with other processes. The problem is that other processes can greatly increase the probability of a ZooKeeper failure. If another process starts making the disk thrash or uses all the memory or CPU, it will cause the ZooKeeper server to fail, or at least perform very poorly. A particularly problematic scenario is when a ZooKeeper server process runs on the same server as one of the application processes that it manages. If that process goes into an infinite loop or starts behaving badly, it may adversely affect the ZooKeeper server process at the very moment that it is needed to allow other processes to take over from the bad one. Dedicate a machine to run each ZooKeeper server in order to avoid these problems.

Configuring a ZooKeeper Ensemble

The notion of quorums, introduced in ZooKeeper Quorums, is deeply embedded in the design of ZooKeeper. The quorum is relevant when processing requests and when electing a leader in replicated mode. If a quorum of ZooKeeper servers is up, the ensemble makes progress.

A related concept is that of observers, explained in Observers. Observers participate in the ensemble, receiving requests from clients and processing state updates from the servers. The leader, however, does not consider observer acknowledgments when processing requests. The ensemble also does not consider observer notifications when electing a leader. Here we discuss how to configure quorums and ensembles.

The Majority Rules

When an ensemble has enough ZooKeeper servers to start processing requests, we call the set of servers a quorum. Of course, we never want there to be two disjoint sets of servers that can process requests, or we would end up with split brain. We can avoid the split-brain problem by requiring that all quorums have at least a majority of servers. (Note: half of the servers do not constitute a majority; you must have greater than half the number of servers to have a majority.)

When we set up a ZooKeeper ensemble to use multiple servers, we use majority quorums by default. ZooKeeper automatically detects that it should run in replicated mode because there are multiple servers in the configuration (see Cluster Configuration), and it defaults to using majority quorums.

Configurable Quorums

One important property we have mentioned about quorums is that, if one quorum is dissolved and another formed within the ensemble, the two quorums must intersect in at least one server. Majority quorums clearly satisfy this intersection property. Quorums in general, however, are not constrained to majorities, and ZooKeeper allows flexible configuration of quorums. The particular scheme we use is to group servers into disjoint sets and assign weights to the servers. To form a quorum in this scheme, we need a majority of votes from each of a majority of groups. For example, say that we have three groups, each group containing three servers, and each server having weight 1. To form a quorum in this example, we need four servers: two servers from one group and two servers from a different group. In general, the math boils down to the following. If we have G groups, then we need a subset Gʹ of groups such that |Gʹ| > |G|/2. Additionally, for each group g in Gʹ, we need a subset gʹ of the servers in g such that the sum Wʹ of the weights in gʹ is more than half of the total weight W of g (i.e., Wʹ > W/2).

The following configuration option creates a group:

group.x=n[:n]

Enables a hierarchical quorum construction. x is a group identifier and the numbers following the equals sign correspond to server identifiers. The right side of the assignment is a colon-separated list of server identifiers. Note that groups must be disjoint and the union of all groups must be the ZooKeeper ensemble. In other words, every server in the ensemble must be listed once in some group.

Here is an example of nine servers split across three different groups:

group.1=1:2:3

group.2=4:5:6

group.3=7:8:9

In this example all servers have the same weight, and to form a quorum we need two servers from each of two groups, for a total of four servers. With majority quorums, we would need at least five servers to form a quorum. Note that the quorum cannot be formed from any subset of four servers, however: an entire group plus a single server from a different group does not form a quorum.

A configuration like this has a variety of benefits when we are deploying ZooKeeper across different data centers. For example, one group may represent a group of servers running in a different data center, and if that data center goes down, ZooKeeper can keep going.

One way of deploying across three data centers that tolerates one data center going down and uses majorities is to have three servers in each of two data centers and put only one server in the third data center. If any of the data centers becomes unavailable, the other two can form a quorum. This configuration has the advantage that any four servers out of the seven form a quorum. One shortcoming is that the number of servers is not balanced across the data centers. A second shortcoming is that once a data center becomes unavailable, no further server crashes in other data centers are tolerated.

If there are only two data centers available, we can use weights to express a preference, based, for instance, on the number of ZooKeeper clients in each data center. With only two data centers, we can’t tolerate either of the data centers going down if the servers all have equal weight, but we can tolerate one of the two going down if we assign a higher weight to one of the servers. Say that we assign three servers to each of the data centers and we include them all in the same group:

group.1=1:2:3:4:5:6

Because all the servers default to the same weight, we will have a quorum of servers as long as four of the six servers are up. Of course, this means that if one of the data centers goes down, we will not be able to form a quorum even if the three servers in the other data center are up.

To assign different weights to servers, we use the following configuration option:

weight.x=n

Used along with group options, this assigns a weight n to a server when forming quorums. The value n is the weight the server has when voting. A few parts of ZooKeeper require voting, such as leader election and the atomic broadcast protocol. By default, the weight of a server is 1. If the configuration defines groups but not weights, a weight of 1 will be assigned to all servers.

Let’s say that we want one of the data centers, which we will call D1, to still be able to function as long as all of its servers are up even if the other data center is down. We can do this by assigning one of the servers in D1 more weight, so that it can more easily form a quorum with other servers.

Let’s assume that servers 1, 2, and 3 are in D1. We use the following line to assign server 1 more weight:

weight.1=2

With this configuration, we have seven votes total and we need four votes to form a quorum. Without the weight.1=2 parameter, any server needs at least three other servers to form a quorum, but with that parameter server 1 can form a quorum with just two other servers. So if D1 is up, even if the other data center fails, servers 1, 2, and 3 can form a quorum and continue operation.

These are just a couple of examples of how different quorum configurations might impact a deployment. The hierarchical scheme we provide is flexible, and it enables other configurations with different weights and group organizations.

Observers

Recall that observers are ZooKeeper servers that do not participate in the voting protocol that guarantees the order of state updates. To set up a ZooKeeper ensemble that uses observers, add the following line to the configuration files of any servers that are to be observers:

peerType=observer

You also need to add :observer to the server definition in the configuration file of each server, like this:

server.1=localhost:2181:3181:observer

Reconfiguration

Wow, configuration is a lot of work, isn’t it? But you’ve worked it all out, and now you have your ZooKeeper ensemble made up of three different machines. But then a month or two goes by, and you realize that the number of client processes using ZooKeeper has grown and it has become a much more mission-critical service. So you want to grow it to an ensemble of five machines. No big deal, right? You can pull the cluster down late one night, reconfigure everything, and have it all back up in less than a minute. Your users may not even see the outage if your applications handle the Disconnected event correctly. That’s what we thought when we first developed ZooKeeper, but it turns out that things are more complicated.

Look at the scenario in Figure 10-1. Three machines (A, B, and C) make up the ensemble. C is lagging behind a bit due to some network congestion, so it has only seen transactions up to ⟨1,3⟩ (where 1 is the epoch and 3 is the transaction within that epoch, as described in Requests, Transactions, and Identifiers). But A and B are actively communicating, so C’s lag hasn’t slowed down the system. A and B have been able to commit up to transaction ⟨1,6⟩.


Figure 10-1. An ensemble of three servers about to change to five

Now suppose we bring down these machines to add D and E into the mix. Of course, the two new machines don’t have any state at all. We reconfigure A, B, C, D, and E to be one big ensemble and start everything back up. Because we have five machines, we need three machines to form a quorum. C, D, and E are enough for a quorum, so in Figure 10-2 we see what happens when they form a quorum and sync up. This scenario can easily happen if A and B are slow at starting, perhaps because they were started a little after the other three. Once our new quorum syncs up, A and B will sync with C because it has the most up-to-date state. However, our three quorum members will each end up with ⟨1,3⟩ as the last transaction. They never see ⟨1,4⟩, ⟨1,5⟩, and ⟨1,6⟩, because the only two servers that did see them are not part of this new quorum.


Figure 10-2. An ensemble of five servers with a quorum of three

Because we have an active quorum, these servers can actually commit new transactions. Let’s say that two new transactions come in: ⟨2,1⟩ and ⟨2,2⟩. Now, in Figure 10-3, when A and B finally come up and connect with C, who is the leader of the quorum, C welcomes them in and promptly tells them to delete transactions ⟨1,4⟩, ⟨1,5⟩, and ⟨1,6⟩ while receiving ⟨2,1⟩ and ⟨2,2⟩.


Figure 10-3. An ensemble of five servers that has lost data

This result is very bad. We have lost state, and the state of the replicas is no longer consistent with clients that observed ⟨1,4⟩, ⟨1,5⟩, and ⟨1,6⟩. To remedy this, ZooKeeper has a reconfigure operation. This means that administrators do not have to do the reconfiguration procedure by hand and risk corrupting the state. Even better, we don’t even have to bring anything down.

Reconfiguration allows us to change not only the members of the ensemble, but also their network parameters. Because the configuration can change, ZooKeeper needs to move the reconfigurable parameters out of the static configuration file and into a configuration file that will be updated automatically. The dynamicConfigFile parameter links the two files together.

CAN I USE DYNAMIC CONFIGURATION?

This feature is currently only available in the trunk branch of the Apache repository. The target release for trunk is 3.5.0, although nothing really guarantees that the release number will be this one; it might change depending on how trunk evolves. The latest release at the time this book is being written is 3.4.5, and it does not include this feature.

Let’s take an example configuration file that we have been using before dynamic configuration:

tickTime=2000

initLimit=10

syncLimit=5

dataDir=./data

dataLogDir=./txnlog

clientPort=2182

server.1=127.0.0.1:2222:2223

server.2=127.0.0.1:3333:3334

server.3=127.0.0.1:4444:4445

and change it to a configuration file that supports dynamic configuration:

tickTime=2000

initLimit=10

syncLimit=5

dataDir=./data

dataLogDir=./txnlog

dynamicConfigFile=./dyn.cfg

Notice that we have even removed the clientPort parameter from the configuration file. The dyn.cfg file is now going to be made up of just the server entries. We are adding a bit of information, though. Now each entry will have the form:

server.id=host:n:n[:role];[client_address:]client_port

Just as in the normal configuration file, the hostname and ports used for quorum and leader election messages are listed for each server. The role must be either participant or observer. If role is omitted, participant is the default. We also specify the client_port (the server port to which clients will connect) and optionally the address of a specific interface on that server. Because we removed clientPort from the static config file, we need to add it here.

So now our dyn.cfg file looks like this:

server.1=127.0.0.1:2222:2223:participant;2181

server.2=127.0.0.1:3333:3334:participant;2182

server.3=127.0.0.1:4444:4445:participant;2183

These files have to be created before we can use reconfiguration. Once they are in place, we can reconfigure an ensemble using the reconfig operation. This operation can operate incrementally or as a complete (bulk) update.

An incremental reconfig can take two lists: the list of servers to remove and the list of server entries to add. The list of servers to remove is simply a comma-separated list of server IDs. The list of server entries to add is a comma-separated list of server entries of the form found in the dynamic configuration file. For example:

reconfig -remove 2,3 -add \

server.4=127.0.0.1:5555:5556:participant;2184,\

server.5=127.0.0.1:6666:6667:participant;2185

This command removes servers 2 and 3 and adds servers 4 and 5. There are some conditions that must be satisfied in order for this operation to succeed. First, like with all other ZooKeeper operations, a quorum in the original configuration must be active. Second, a quorum in the new configuration must also be active.

RECONFIGURING FROM ONE TO MANY

When we have a single ZooKeeper server, the server runs in standalone mode. This makes things a bit more complicated because a reconfiguration that adds servers not only changes the composition of quorums, but also switches the original server from standalone to quorum mode. At this time we have opted for not allowing reconfiguration for standalone deployments, so to use this feature you will need to start with a configuration in quorum mode.

ZooKeeper allows only one configuration change to happen at a time. Of course, the configuration operation happens very fast, and reconfiguration is infrequent enough that concurrent reconfigurations should not be a problem.

The -file parameter can also be used to do a bulk update using a new membership file. For example, reconfig -file newconf would produce the same result as the incremental operation if newconf contained:

server.1=127.0.0.1:2222:2223:participant;2181

server.4=127.0.0.1:5555:5556:participant;2184

server.5=127.0.0.1:6666:6667:participant;2185

The -members parameter followed by a list of server entries can be used instead of -file for a bulk update.

Finally, all the forms of reconfig can be made conditional. If the -v parameter is used, followed by the configuration version number, the reconfig will succeed only if the configuration is at the current version when it executes. You can get the version number of the current configuration by reading the /zookeeper/config znode or using the config command in zkCli.
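For example, after reading the current version with the config command (or from /zookeeper/config), a conditional removal of server 3 might look like the following sketch; the version number shown here is hypothetical:

reconfig -v 100000000 -remove 3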

MANUAL RECONFIGURATION

If you really want to do reconfiguration manually (perhaps you are using an older version of ZooKeeper), the easiest and safest way to do it is to make one change at a time and bring the ensemble all the way up (i.e., let a leader get established) and down between each change.

Managing Client Connect Strings

We have been talking about ZooKeeper server configuration, but the clients have a bit of related configuration as well: the connect string. The client connect string is usually represented as a series of host:port pairs separated by a comma. The host can be specified either as an IP address or as a hostname. Using a hostname allows for a layer of indirection between the actual IP address of the server and the identifier used to access the server. It allows, for example, an administrator to replace a ZooKeeper server with a different one without changing the setup of the clients.
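For example, a connect string for a three-server ensemble using hostnames (the names here are placeholders) can be passed to zkCli with its -server option:

./bin/zkCli.sh -server zk-a.example.com:2181,zk-b.example.com:2181,zk-c.example.com:2181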

However, this flexibility is limited. The administrator can change the machines that make up the cluster, but not the machines being used by the clients. For example, in Figure 10-4, the ZooKeeper ensemble can be easily changed from a three-server ensemble to a five-server ensemble using reconfiguration, but the clients will still be using three servers, not all five.

There is another way to make ZooKeeper more elastic with respect to the number of servers, without changing client configuration. It is natural to think of a hostname resolving to a single IP address, but in reality a hostname can resolve to multiple addresses. If a hostname resolves to multiple IP addresses, the ZooKeeper client can connect to any of these addresses. In Figure 10-4, suppose the three hostnames zk-a, zk-b, and zk-c resolved to 10.0.0.1, 10.0.0.2, and 10.0.0.3. Now suppose instead you use DNS to configure a single hostname, zk, to resolve to all three IP addresses. You can just change the number of addresses to five in DNS, and any client that subsequently starts will be able to connect to all five servers, as shown in Figure 10-5.


Figure 10-4. Reconfiguring clients from using three servers to using five


Figure 10-5. Reconfiguring clients from using three servers to using five with DNS

There are a couple of caveats to using a hostname that resolves to multiple addresses. First, all of the servers have to use the same client port. Second, hostname resolution currently happens only on handle creation, so clients that have already started do not recognize the new name resolution. It applies only to newly created ZooKeeper clients.

The client connect string can also include a path component. This path indicates the root to use when resolving pathnames. The behavior is similar to the chroot command in Unix, and you will often hear this feature referred to as “chroot” in the ZooKeeper community. For example, if a client specifies the connection string zk:2222/app/superApp when connecting and issues getData("/a.dat", . . .), the client will receive the data from the znode at path /app/superApp/a.dat. (Note that there must be a znode at the specified path. The connect string will not create one for you.)

The motivation for using a path component in a connect string is to allow a single ZooKeeper ensemble to host multiple applications without requiring them to append a prefix to all their paths. Each application can use ZooKeeper as if it were the only application using the ensemble, and administrators can carve up the namespace as they wish. Figure 10-6 shows examples of different connect strings that can be used to root client applications at different points in the tree.


Figure 10-6. Using the connect string to root ZooKeeper client handles

OVERLAPPING CONNECTION STRING

When managing client connection strings, take care that a client connection string never includes hosts from two different ensembles. It’s the quickest and easiest path to split brain.

Quotas

Another configurable aspect of ZooKeeper is quotas. ZooKeeper has initial support for quotas on the number of znodes and the amount of data stored. It allows us to specify quotas based on a subtree and it will track the usage of the subtree. If a subtree exceeds its quota, a warning will be logged, but the operation will still be allowed to continue. At this point ZooKeeper detects when a quota is exceeded, but does not prevent processes from doing so.

Quota tracking is done in the special /zookeeper subtree. Applications should not store their own data in this subtree; instead, it is reserved for ZooKeeper’s use. The /zookeeper/quota znode is an example of this use. To create a quota for the application /application/superApp, create the znode /zookeeper/quota/application/superApp with two children: zookeeper_limits and zookeeper_stats.

The limit on the number of znodes is called the count, whereas the limit on the amount of data is called the bytes. The quotas for both zookeeper_limits and zookeeper_stats are specified as count=n,bytes=m, where n and m are integers. In the case of zookeeper_limits, n and m represent levels at which warnings will be triggered. (If one of them is –1, it will not act as a trigger.) In the case of zookeeper_stats, n and m represent the current number of znodes in the subtree and the current number of bytes in the data of the znodes of the subtree.

QUOTA TRACKING OF METADATA

The quota tracking for the number of bytes in the subtree does not include the metadata overhead for each znode. This metadata is on the order of 100 bytes, so if the amount of data in each znode is small, it is more useful to track the count of the number of znodes rather than the number of bytes of znode data.

Let’s use zkCli to create /application/superApp and then set a quota:

[zk: localhost:2181(CONNECTED) 2] create /application ""

Created /application

[zk: localhost:2181(CONNECTED) 3] create /application/superApp super

Created /application/superApp

[zk: localhost:2181(CONNECTED) 4] setquota -b 10 /application/superApp

Comment: the parts are option -b val 10 path /application/superApp

[zk: localhost:2181(CONNECTED) 5] listquota /application/superApp

absolute path is /zookeeper/quota/application/superApp/zookeeper_limits

Output quota for /application/superApp count=-1,bytes=10

Output stat for /application/superApp count=1,bytes=5

We create /application/superApp with 5 bytes of data (the word “super”). We then set the quota on /application/superApp to be 10 bytes. When we list the quota on /application/superApp we see that we have 5 bytes left of our data quota and that we don’t have a quota for the number of znodes under this subtree, because count is –1 for the quota.

If we issue get /zookeeper/quota/application/superApp/zookeeper_stats, we can access this data directly without using the quota commands in zkCli. As a matter of fact, we can create and delete these files to create and delete quotas ourselves. If we run the following command:

create /application/superApp/lotsOfData ThisIsALotOfData

we should see the following entry in our log:

Quota exceeded: /application/superApp bytes=21 limit=10

Multitenancy

Quotas, some of the throttling configuration options, and ACLs all make it worth considering using ZooKeeper to host multiple tenants. There are some compelling reasons to do this:

§ To provide reliable service, ZooKeeper servers should run on dedicated hardware. Sharing that hardware across multiple applications makes it easier to justify the capital investment.

§ We have found that, for the most part, ZooKeeper traffic is extremely bursty: there are bursts of configuration or state changes that cause a lot of load followed by long periods of inactivity. If the activity bursts between applications are not correlated, making them share a server will better utilize hardware resources. Also remember to account for spikes generated when disconnect events happen. Some poorly written apps may generate more load than needed when processing a Disconnected event.

§ By pooling hardware, we can achieve greater fault tolerance: if two applications that previously had their own ensembles of three servers are moved to a single cluster of five servers, they use fewer servers in total but can survive two server failures rather than just one.

When hosting multiple tenants, administrators will usually divide up the data tree into subtrees, each dedicated to a certain application. Developers can design their applications to take into account that their znodes need to have a prefix, but there is an easier way to isolate applications: using the path component in the connect string, described in Managing Client Connect Strings. Each application developer can write her application as if she has a dedicated ZooKeeper service. Then, if the administrator decides that the application should be deployed under /application/newapp, the application can use host:port/application/newapp rather than just host:port, and it will appear to the application that it is using a dedicated service. In the meantime, the administrator can set up the quota for /application/newapp to also track the space usage of the application.

File System Layout and Formats

We have talked about snapshots and transaction logs and storage devices. This section describes how this is all laid out on the file system. A good understanding of the concepts discussed in Local Storage will come in handy in this section; be prepared to refer back to it.

As we have already discussed, data is stored in two ways: transaction logs and snapshots. Both of these end up as normal files in the local file system. Transaction logs are written during the critical path of transaction processing, so we highly recommend storing them on a dedicated device. (We realize we have said this multiple times, but it really is important for good, consistent throughput and latency.) Not using a dedicated device for the transaction log does not cause any correctness issues, but it does affect performance. In a virtualized environment, for example, dedicated devices might not be available. Snapshots need not be stored on a dedicated device because they are taken lazily in a background thread.

Snapshots are written to the path specified in the dataDir parameter, and transaction logs are written to the path specified in the dataLogDir parameter (if dataLogDir is not set, transaction logs are also written under dataDir). Let’s take a look at the files in the transaction log directory first. If you list the contents of that directory, you will see a single subdirectory called version-2. We have had only one major change to the format of logs and snapshots, and when we made that change we realized that it would be useful to separate data by file versions to more easily handle data migration between versions.

Transaction Logs

Let’s look at a directory where we have been running some small tests, so it has only two transaction logs:

-rw-r--r-- 1 breed 67108880 Jun 5 22:12 log.100000001

-rw-r--r-- 1 breed 67108880 Jul 15 21:37 log.200000001

We can make a couple of observations about these files. First, they are quite large (64 MB each), considering the tests were small. Second, they have a large number as the filename suffix.

ZooKeeper preallocates files in rather large chunks to avoid the metadata management overhead of growing the file with each write. If you do a hex dump of one of these files, you will see that it is full of bytes of the null character (\0), except for a bit of binary data at the beginning. As the servers run longer, the null characters will be replaced with log data.

The log files contain transactions tagged with zxids, but to ease recovery and allow for quick lookup, each log file’s suffix is the first zxid of the log file in hexadecimal. One nice thing about representing the zxid in hex is that you can easily distinguish the epoch part of the zxid from the counter. So, the first file in the preceding example is from epoch 1 and the second is from epoch 2.
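If you ever need to do that split programmatically (say, in a small log-inventory script of your own), the epoch is simply the high 32 bits of the zxid and the counter is the low 32 bits. A tiny illustrative snippet:

public class ZxidParts {
    public static void main(String[] args) {
        long zxid = 0x200000001L;            // first zxid of log.200000001
        long epoch = zxid >>> 32;            // high 32 bits: the epoch
        long counter = zxid & 0xffffffffL;   // low 32 bits: the counter
        System.out.printf("epoch=%d counter=%d%n", epoch, counter);  // epoch=2 counter=1
    }
}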

Of course, it would be nice to be able to see what is in the files. This can be super useful for problem determination. There have been times when developers have sworn up and down that ZooKeeper is losing track of their znodes, only to find out by looking at the transaction logs that a client has actually deleted some.

We can look at the second log file with the following command:

java -cp $ZK_LIBS org.apache.zookeeper.server.LogFormatter version-2/log.200000001

This command outputs the following:

7/15/13... session 0x13...00 cxid 0x0 zxid 0x200000001 createSession 30000

7/15/13... session 0x13...00 cxid 0x2 zxid 0x200000002 create

'/test,#22746573746 ...

7/15/13... session 0x13...00 cxid 0x3 zxid 0x200000003 create

'/test/c1,#6368696c ...

7/15/13... session 0x13...00 cxid 0x4 zxid 0x200000004 create

'/test/c2,#6368696c ...

7/15/13... session 0x13...00 cxid 0x5 zxid 0x200000005 create

'/test/c3,#6368696c ...

7/15/13... session 0x13...00 cxid 0x0 zxid 0x200000006 closeSession null

Each transaction in the log file is output as its own line in human-readable form. Because only change operations result in a transaction, you will not see read transactions in the transaction log.

Snapshots

The naming scheme for snapshot files is similar to the transaction log scheme. Here is the list of snapshots on the server used earlier:

-rw-r--r-- 1 br33d 296 Jun 5 07:49 snapshot.0

-rw-r--r-- 1 br33d 415 Jul 15 21:33 snapshot.100000009

Snapshot files are not preallocated, so the size more accurately reflects the amount of data they contain. The suffix used reflects the current zxid when the snapshot started. As we discussed earlier, the snapshot file is actually a fuzzy snapshot; it is not a valid snapshot on its own until the transaction log is replayed over it. Specifically, to restore a system, you must start replaying the transaction log starting at the zxid of the snapshot suffix or earlier.
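To illustrate that rule (this is not a ZooKeeper tool, just a sketch over the file names shown earlier), the following snippet takes the hex suffixes of the log files and of snapshot.100000009 and picks out the logs that must be replayed: the newest log that starts at or before the snapshot’s zxid, plus every log after it.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LogsToReplay {
    public static void main(String[] args) {
        long snapshotZxid = Long.parseLong("100000009", 16);
        List<Long> logStarts = Arrays.asList(
                Long.parseLong("100000001", 16),
                Long.parseLong("200000001", 16));

        // Newest log that starts at or before the snapshot's zxid.
        long newestAtOrBefore = logStarts.stream()
                .filter(z -> z <= snapshotZxid)
                .max(Long::compare)
                .orElse(Long.MIN_VALUE);

        // That log, plus every later one, must be replayed over the snapshot.
        List<String> needed = logStarts.stream()
                .filter(z -> z >= newestAtOrBefore)
                .map(z -> "log." + Long.toHexString(z))
                .collect(Collectors.toList());

        System.out.println("replay: " + needed);  // [log.100000001, log.200000001]
    }
}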

The files themselves also store the fuzzy snapshot data in binary form. Consequently, there is another tool to examine the snapshot files:

java -cp $ZK_LIBS org.apache.zookeeper.server.SnapshotFormatter version-2/snapshot.100000009

This command outputs the following:

----

/

cZxid = 0x00000000000000

ctime = Wed Dec 31 16:00:00 PST 1969

mZxid = 0x00000000000000

mtime = Wed Dec 31 16:00:00 PST 1969

pZxid = 0x00000100000002

cversion = 1

dataVersion = 0

aclVersion = 0

ephemeralOwner = 0x00000000000000

dataLength = 0

----

/sasd

cZxid = 0x00000100000002

ctime = Wed Jun 05 07:50:56 PDT 2013

mZxid = 0x00000100000002

mtime = Wed Jun 05 07:50:56 PDT 2013

pZxid = 0x00000100000002

cversion = 0

dataVersion = 0

aclVersion = 0

ephemeralOwner = 0x00000000000000

dataLength = 3

----

....

Only the metadata for each znode is dumped. This allows an administrator to figure out things like when a znode was changed and which znodes are taking up a lot of memory. Unfortunately, the data and ACLs don’t appear in this output. Also, remember when doing problem determination that the information from the snapshot must be merged with the information from the log to figure out what was going on.

Epoch Files

Two more small files make up the persistent state of ZooKeeper. They are the two epoch files, named acceptedEpoch and currentEpoch. We have talked about the notion of an epoch before, and these two files reflect the epoch numbers that the given server process has seen and participated in. Although the files don’t contain any application data, they are important for data consistency, so if you are doing a backup of the raw data files of a ZooKeeper server, don’t forget to include these two files.

Using Stored ZooKeeper Data

One of the nice things about ZooKeeper data is that the standalone servers and the ensemble of servers store data in the same way. We’ve just mentioned that to get an accurate view of the data you need to merge the logs and the snapshot. You can do this by copying log files and snapshot files to another machine (like your laptop, for example), putting them in the empty data directory of a standalone server, and starting the server. The server will now reflect the state of the server that the files were copied from. This technique allows you to capture the state of a server in production for later review.

This also means that you can easily back up a ZooKeeper server by simply copying its data files. There are a couple of things to keep in mind if you choose to do this. First, ZooKeeper is a replicated service, so there is redundancy built into the system. If you do take a backup, you need to back up only the data of one of the servers.

It is important to keep in mind that when a ZooKeeper server acknowledges a transaction, it promises to remember the state from that time forward. So if you restore a server’s state using an older backup, you have caused the server to violate its promise. This might not be a big deal if you have just suffered data loss on all servers, but if you have a working ensemble and you move a server to an older state, you risk causing other servers to also start to forget things.

If you are recovering from data loss on all or a majority of your servers, the best thing to do is to grab your latest captured state (from a backup from the most up-to-date surviving server) and copy that state to all the other servers before starting any of them.

Four-Letter Words

Now that we have our server configured and up and running, we need to monitor it. That is where four-letter words come in. We have already seen some examples of four-letter words when we used telnet to see the system status in Running the Watcher Example. Four-letter words provide a simple way to do various checks on the system. The main goal with four-letter words is to provide a very simple protocol that can be used with simple tools, such as telnet and nc, to check system health and diagnose problems. To keep things simple, the output of a four-letter word will be human readable. This makes the words easy to experiment with and use.
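Because the protocol is nothing more than “write four ASCII characters, then read until the server closes the connection,” it is just as easy to issue a word from code as from telnet or nc. Here is a minimal sketch; the host and port are whatever your server uses:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class FourLetterWord {
    // Send a four-letter word and return the raw, human-readable reply.
    static String send(String host, int port, String word) throws Exception {
        try (Socket sock = new Socket(host, port)) {
            OutputStream out = sock.getOutputStream();
            out.write(word.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            InputStream in = sock.getInputStream();
            StringBuilder reply = new StringBuilder();
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) > 0) {
                reply.append(new String(buf, 0, n, StandardCharsets.US_ASCII));
            }
            return reply.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(send("localhost", 2181, "ruok"));  // "imok" if the server is up
        System.out.println(send("localhost", 2181, "stat"));
    }
}

Running it against a healthy server prints imok for ruok, followed by the stat output.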

It is also easy to add new words to the server, so the list has been growing. In this section we will point out some of the commonly used words. Consult the ZooKeeper documentation for the most recent and complete list of words:

ruok

Provides (limited) information about the status of the server. If the server is running, it will respond with imok. It turns out that “OK” is a relative concept, though. For example, the server might be running but unable to communicate with the other servers in the ensemble, yet still report that it is “OK.” For a more detailed and reliable health check, use the stat word.

stat

Provides information about the status of the server and the connections that are currently active. The status includes some basic statistics and whether the server is currently active, if it is a leader or follower, and the last zxid the server has seen. Some of the statistics are cumulative and can be reset using the srst word.

srvr

Provides the same information as stat, except that it omits the connection information.

dump

Provides session information, listing the currently active sessions and when they will expire. This word can be used only on a server that is acting as the leader.

conf

Lists the basic server configuration parameters that the server was started with.

envi

Lists various Java environment parameters.

mntr

Offers more detailed statistics than stat about the server. Each line of output has the format key<tab>value. (The leader will list some additional parameters that apply only to leaders.)

wchs

Lists a brief summary of the watches tracked by the server.

wchc

Lists detailed information on the watches tracked by the server, grouped by session.

wchp

Lists detailed information on the watches tracked by the server, grouped by the znode path being watched.

cons, crst

cons lists detailed statistics for each connection on a server, and crst resets all the connection counters to zero.

Monitoring with JMX

Four-letter words are great for monitoring, but they do not provide a way to control or make changes to the system. ZooKeeper also uses a standard Java management protocol called JMX (Java Management Extensions) to provide more powerful monitoring and management capabilities. There are many books about how to set up and use JMX and many tools for managing servers with JMX; in this section we will use a simple management console called jconsole to explore the ZooKeeper management functionality that is available via JMX.

The jconsole utility is distributed with Java. In practice, a JMX tool such as jconsole is used to monitor remote ZooKeeper servers, but for this part of the exercise we will run it on the same machine as our ZooKeeper servers.

First, let’s start up the second ZooKeeper server (the one with the ID of 2). Then we’ll start jconsole by simply running the jconsole command from the command line. You should see a window similar to Figure 10-7 when jconsole starts.

Notice the process that has “zookeeper” in the name. This is the local process that jconsole has discovered that it can connect to.

Now let’s connect to the ZooKeeper process by double-clicking that process in the list. We will be asked about connecting insecurely because we do not have SSL set up. Clicking the insecure connection button should bring up the screen shown in Figure 10-8.

Figure 10-7. jconsole startup screen

Figure 10-8. The first management window for a process

As we can see from this screen, we can get various interesting statistics about the ZooKeeper server process with this tool. JMX allows customized information to be exposed to remote managers through the use of MBeans (Managed Beans). Although the name sounds goofy, it is a very flexible way to expose information and operations. jconsole lists all the MBeans exposed by the process in the rightmost information tab, as shown in Figure 10-9.

Figure 10-9. jconsole MBeans

As we can see from the list of MBeans, some of the components used by ZooKeeper are also exposed via MBeans. We are interested in the ZooKeeperService, so we will double-click on that list item. We will see a hierarchical list of replicas and information about those replicas. If we open some of the subentries in the list, we will see something like Figure 10-10.

Figure 10-10. jconsole information for server 2

As we explore the information for replica.2 we will notice that it also includes some information about the other replicas, but it’s really just the contact information. Because server 2 doesn’t know much about the other replicas, there is not much more it can reveal about them. Server 2 does know a lot about itself, though, so it seems like there should be more information that it can expose.

If we start up server 1 so that server 2 can form a quorum with server 1, we will see that we get more information about server 2. Start up server 1 and then check server 2 in jconsole again. Figure 10-11 shows some of the additional information that is exposed by JMX. We can now see that server 2 is acting as a follower. We can also see information about the data tree.

Figure 10-11 shows the JMX information for server 1. As we see, server 1 is acting as a leader. One additional operation, FollowerInfo, is available on the leader to list the followers. When we click this button, we see a rather raw list of information about the other ZooKeeper servers connected to server 1.

Figure 10-11. jconsole information for server 1

Up to now, the information we’ve seen from JMX looks prettier than the information we get from four-letter words, but we really haven’t seen any new functionality. Let’s look at something we can do with JMX that we cannot do with four-letter words. Start a zkCli shell. Connect to server 1, then run the following command:

create -e /me "foo"

This will create an ephemeral znode on the server. Figure 10-11 shows that a new informational entry for Connections has appeared in the JMX information for server 1. The attributes of the connection list various pieces of information that are useful for debugging operational issues. This view also exposes two interesting operations: terminateSession and terminateConnection.

The terminateConnection operation will close the ZooKeeper client’s connection to the server. The session will still be active, so the client will be able to reconnect to another server; the client will see a disconnection event but should be able to easily recover from it.

In contrast, the terminateSession operation declares the session dead. The client’s connection with the server will close and the session will be terminated as if it had expired. The client will not be able to connect to another server using the session. Care should be taken when using terminateSession because that operation can cause the session to expire long before the session timeout, so other processes may find out that the session is dead before the process that owns that session finds out.
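These two operations are handy for testing how an application reacts to the two very different events its clients will see. A well-behaved client treats them differently, roughly along the lines of this illustrative Watcher sketch (the class is ours, not part of ZooKeeper):

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;

public class SessionWatcher implements Watcher {
    @Override
    public void process(WatchedEvent event) {
        KeeperState state = event.getState();
        if (state == KeeperState.Disconnected) {
            // What a client sees after terminateConnection (or any lost connection):
            // the session is still alive, so just wait; the client library will
            // reconnect to another server on its own.
        } else if (state == KeeperState.Expired) {
            // What a client sees after terminateSession (or a real expiration):
            // the handle is dead. Create a new ZooKeeper object and rebuild any
            // ephemeral znodes and watches.
        } else if (state == KeeperState.SyncConnected) {
            // Connected (or reconnected); safe to resume operations.
        }
    }
}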

Connecting Remotely

The JMX agent that runs inside of the JVM of a ZooKeeper server must be configured properly to support remote connections. There are a variety of options to configure remote connections for JMX. In this section we show one way of getting JMX set up to see what kind of functionality it provides. If you want to use JMX in production, you will probably want to use another JMX-specific reference to get some of the more advanced security features set up properly.

All of the JMX configuration is done using system properties. The zkServer.sh script that we use to start a ZooKeeper server has support for setting these properties using the SERVER_JVMFLAGS environment variable.

For example, we can access server 3 remotely using port 55555 if we start the server as follows:

SERVER_JVMFLAGS="-Dcom.sun.management.jmxremote.password.file=passwd \

-Dcom.sun.management.jmxremote.port=55555 \

-Dcom.sun.management.jmxremote.ssl=false \

-Dcom.sun.management.jmxremote.access.file=access"

_path_to_zookeeper_/bin/zkServer.sh start _path_to_server3.cfg_

The properties refer to a password and access file. These have a very simple format. Create the passwd file with:

# user password

admin <password>

Note that the password is stored in clear text. For this reason, the password file must be readable and writable only by the owner of the file; if it is not, Java will not start up. Also, we have turned off SSL. That means the password will go over the network in clear text. If you need stronger security, there are much stronger options available to JMX, but they are outside the scope of this book.

For the access file, we are going to give readwrite privileges to admin by creating the file with:

admin readwrite

Now, if we start jconsole on another computer, we can use host:55555 for the remote process location (where host is the hostname or address of the machine running ZooKeeper), and the user admin with the password <password> to connect. If you happen to misconfigure something, jconsole will fail with messages that give little clue about what is going on. Starting jconsole with the -debug option will provide more information about failures.
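jconsole is not the only possible client, either. Any JMX connector can attach with the same credentials; the following sketch uses the standard javax.management.remote API against the default agent URL for port 55555 (the host name and password are placeholders) and prints the ZooKeeperService MBeans it finds.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RemoteJmx {
    public static void main(String[] args) throws Exception {
        // Default agent URL form for com.sun.management.jmxremote.port=55555.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://zkhost:55555/jmxrmi");
        Map<String, Object> env = new HashMap<>();
        env.put(JMXConnector.CREDENTIALS, new String[] { "admin", "<password>" });

        try (JMXConnector connector = JMXConnectorFactory.connect(url, env)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            Set<ObjectName> names = mbsc.queryNames(null, null);
            // Keep only the ZooKeeperService entries seen in jconsole (Figure 10-9).
            names.stream()
                 .filter(n -> n.toString().contains("ZooKeeperService"))
                 .forEach(System.out::println);
        }
    }
}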

Tools

Many tools and utilities come with ZooKeeper or are distributed separately. We have already mentioned the log formatting utilities and the JMX tool that comes with Java. In the contrib directory of the ZooKeeper distribution, you can find utilities to help integrate ZooKeeper into other monitoring systems. A few of the most popular offerings there are:

§ Bindings for Perl and Python, implemented using the C binding.

§ Utilities for visualizing the ZooKeeper logs.

§ A web UI for browsing the cluster nodes and modifying ZooKeeper data.

§ zktreeutil, which comes with ZooKeeper, and guano, which is available on GitHub. These utilities conveniently import and export data to and from ZooKeeper.

§ zktop, also available on GitHub, which monitors ZooKeeper load and presents it in a Unix top-like interface.

§ ZooKeeper Smoketest, available on GitHub. This is a simple smoketest client for a ZooKeeper ensemble; it’s a great tool for developers getting familiar with ZooKeeper.

Of course, this isn’t an exhaustive list, and many of the really great tools for running ZooKeeper are developed and distributed outside of the ZooKeeper distribution. If you are a ZooKeeper administrator, it would be worth your while to try out some of these tools in your environment.

Takeaway Messages

Although ZooKeeper is simple to get going, there are many ways to tweak the service for your environment. ZooKeeper’s reliability and performance also depend on correct configuration, so it is important to understand how ZooKeeper works and what the different parameters do. ZooKeeper can adjust to various network topologies if the timing and quorum configurations are set properly. Although changing the members of a ZooKeeper ensemble by hand is risky, it is a snap with the ZooKeeper reconfig operation. There are many tools available to make your job easier, so take a bit of time to explore what is out there.