Chapter 5. Managing Swift

After a Swift cluster has been installed and deployed, it needs to be managed to meet customer expectations and service level agreements. Since a Swift cluster is made up of several distributed components, it is somewhat different from traditional storage and hence more difficult to manage. There are several tools and mechanisms an administrator can use to manage a Swift cluster effectively. This chapter deals with these aspects in more detail.

Routine management

The Swift cluster consists of several proxy server nodes and storage server nodes. These nodes run many processes and services to keep the cluster up and running and to provide overall availability. General server management tools/applications such as Nagios, which is described later in this chapter, can be run to track the state of the general services, CPU utilization, memory utilization, disk subsystem performance, and so on. Looking at the system logs is a great way to detect impending failures. Along with this, there are some tools to monitor the Swift services in particular, such as Swift Recon, Swift StatsD, Swift Dispersion, and Swift Informant.

Nagios is a monitoring framework that comprises several plugins that can be used to monitor network services (such as HTTP and SSH), processor load, performance, and CPU and disk utilization. It also provides remote monitoring capabilities by running scripts on the monitored system over SSH or SSL connections. Users can write their own plugins, depending on their requirements, to extend these monitoring capabilities. These plugins can be written in several languages such as Perl, Ruby, C++, and Python. Nagios also provides a notification mechanism through which an administrator can be alerted when problems occur on the system. The following figure shows how to integrate a monitoring solution based on Nagios:

[Figure: A Nagios-based monitoring solution integrated with a Swift cluster]

More information on Nagios can be found at www.nagios.org. Next, let us look into the details of Swift monitoring tools.

Swift cluster monitoring

In this section, we describe various tools that are available to monitor the Swift clusters. We also show snapshots from the Vedams Swift monitoring application that integrates data from various Swift monitoring tools.

Swift Recon

Swift Recon is middleware that is configured on the object server node and sits in the data path. A local cache directory, which is used to store the collected metrics, needs to be specified during setup. It comes with the swift-recon command-line tool, which can be used to access and display the various metrics that are being tracked. You can use swift-recon -h to get help with using the swift-recon tool; a few example invocations are shown after the metric lists below.
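A sketch of what this setup typically looks like in /etc/swift/object-server.conf is shown below; the pipeline contents are illustrative and depend on which other middleware you have enabled:

[pipeline:main]
pipeline = healthcheck recon object-server

[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/swift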

Some of the general server metrics that are tracked are as follows:

· Load averages

· The /proc/meminfo data

· Mounted filesystems

· Unmounted drives

· Socket statistics

Along with these, some of the following Swift stats are also tracked:

· MD5 checksums of the account, container, and object rings

· Replication information

· Number of quarantined accounts, containers, and objects
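These values can be queried from the command line with swift-recon; a few illustrative invocations are shown below (option names may vary slightly between Swift releases):

swift-recon object -l          # load averages
swift-recon object -u          # unmounted drives
swift-recon object -q          # quarantined objects
swift-recon object -r          # replication information
swift-recon object --md5       # compare ring MD5 checksums against the local copy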

The following screenshot shows Swift Recon data within the Vedams Swift monitoring application:

[Screenshot: Swift Recon data in the Vedams Swift monitoring application]

Swift Informant

Swift Informant is also middleware and it gives insight into client requests to the proxy server. This software sits in the proxy server's data path and provides the following metrics to the StatsD server:

· Status code for requests to account, container, or object

· Duration of the request and the time until the start_response call was seen

· Bytes transferred in the request

Swift Informant can be downloaded from https://github.com/pandemicsyn/swift-informant.
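Being proxy middleware, it is enabled through the proxy server's WSGI pipeline. The following is only a sketch; the egg name, the position in the pipeline, and the option keys are assumptions based on the project's documentation and should be verified against the repository's README:

pipeline = informant healthcheck cache tempauth proxy-server

[filter:informant]
use = egg:informant#informant
statsd_host = 127.0.0.1
statsd_port = 8125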

The following screenshot displays the Swift Informant data within the Vedams Swift monitoring application:

[Screenshot: Swift Informant data in the Vedams Swift monitoring application]

Swift dispersion tools

This postprocessing tool is used to determine the overall health of a Swift cluster. The swift-dispersion-populate tool is used to distribute objects and containers throughout the Swift cluster in such a way that the objects and containers fall in distinct partitions. Next, the swift-dispersion-report tool is run to determine the health of these objects and containers. In the case of objects, Swift creates three replicas by default for redundancy. If all the replicas of an object are good, then the health of the object is said to be good; the swift-dispersion-report tool helps determine this health for all the objects and containers within the cluster.
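Both tools read their settings from the /etc/swift/dispersion.conf file; a minimal sketch of the workflow, using placeholder TempAuth credentials, looks like this:

[dispersion]
auth_url = http://127.0.0.1:8080/auth/v1.0
auth_user = test:tester
auth_key = testing

swift-dispersion-populate
swift-dispersion-report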

The following screenshot displays the Swift Dispersion data within the Vedams Swift monitoring application:

[Screenshot: Swift Dispersion data in the Vedams Swift monitoring application]

StatsD

Swift services have been instrumented to send statistics (counters and timing data) directly to a configured StatsD server.

A simple StatsD daemon to receive the metrics can be found at https://github.com/etsy/statsd/.

The StatsD metrics are provided in real time and can help identify problems as they occur. The following parameters should be set in the Swift configuration files to enable StatsD logging:

· log_statsd_host

· log_statsd_port

· log_statsd_default_sample_rate

· log_statsd_sample_rate_factor

· log_statsd_metric_prefix

The log_statsd_sample_rate_factor parameter can be adjusted to set the logging frequency. The log_statsd_metric_prefix parameter is configured on a node to prepend this prefix to every metric sent to the StatsD server from that node. If the log_statsd_host entry is not set, this functionality will be disabled.
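For example, a sketch of these entries in the [DEFAULT] section of a storage server's configuration file might look like the following; the host address and metric prefix are placeholder values:

log_statsd_host = 172.168.10.10
log_statsd_port = 8125
log_statsd_default_sample_rate = 1.0
log_statsd_sample_rate_factor = 1.0
log_statsd_metric_prefix = storage-node-1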

The StatsD logs can be sent to a backend Graphite server to display the metrics as graphs. The following screenshot of the Vedams Swift monitoring application represents the StatsD logs as graphs:

[Screenshot: StatsD metrics displayed as graphs in the Vedams Swift monitoring application]

Swift metrics

The Swift source code has metrics logging (counters, timings, and so on) built into it. Some of the metrics sent to the StatsD server from the various Swift services are listed below. They have been classified based on the Create, Read, Update, Delete (CRUD) operations:

Create/PUT:

· account-server.PUT.errors.timing
· account-server.PUT.timing
· container-server.PUT.errors.timing
· container-server.PUT.timing
· object-server.async_pendings
· object-server.PUT.errors.timing
· object-server.PUT.timeouts
· object-server.PUT.timing
· object-server.PUT.<device>.timing
· proxy-server.<type>.client_timeouts
· proxy-server.<type>.client_disconnects
· proxy-server.<type>.<verb>.<status>.timing
· proxy-server.<type>.<verb>.<status>.xfer

Read/GET:

· account-server.GET.errors.timing
· account-server.GET.timing
· container-server.GET.errors.timing
· container-server.GET.timing
· object-server.GET.errors.timing
· object-server.GET.timing
· proxy-server.<type>.client_timeouts
· proxy-server.<type>.<verb>.<status>.timing
· proxy-server.<type>.<verb>.<status>.xfer

Update/POST:

· account-server.POST.errors.timing
· account-server.POST.timing
· container-server.POST.errors.timing
· container-server.POST.timing
· object-server.POST.errors.timing
· object-server.POST.timing
· proxy-server.<type>.<verb>.<status>.timing
· proxy-server.<type>.<verb>.<status>.xfer

Delete/DELETE:

· account-server.DELETE.errors.timing
· account-server.DELETE.timing
· container-server.DELETE.errors.timing
· container-server.DELETE.timing
· object-server.async_pendings
· object-server.DELETE.errors.timing
· object-server.DELETE.timing
· proxy-server.<type>.<verb>.<status>.timing
· proxy-server.<type>.<verb>.<status>.xfer

Logging using rsyslog

It is very useful to collect logs from the various Swift services, and this can be achieved by configuring proxy-server.conf and rsyslog. In order to receive logs from the proxy server, we modify the /etc/swift/proxy-server.conf configuration file by adding the following lines:

log_name = name

log_facility = LOG_LOCALx

log_level = LEVEL

Let's describe the preceding entries: name can be any name that you would like to see in the logs. The letter x in LOG_LOCALx can be any number between zero and seven. The LEVEL parameter can be one of emergency, alert, critical, error, warning, notification, informational, or debug.
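For example, assuming we choose the LOG_LOCAL2 facility (to match the rsyslog rule added later in this section), the entries could be:

log_name = proxy-server
log_facility = LOG_LOCAL2
log_level = DEBUG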

Next, we modify /etc/rsyslog.conf to add the following line of code in the GLOBAL_DIRECTIVES section:

$PrivDropToGroup adm

Also, we create a config file /etc/rsyslog.d/swift.conf and add the following line of code to it:

local2.* /var/log/swift/proxy.log

The preceding line tells syslog that any log written to the LOG_LOCAL2 facility should go to the /var/log/swift/proxy.log file. We then grant the rsyslog user access to the /var/log/swift folder, and restart the proxy and rsyslog services.
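A minimal sketch of these final steps is shown below; the syslog user and group names are assumptions that vary by distribution:

mkdir -p /var/log/swift
chown -R syslog.adm /var/log/swift
chmod -R g+w /var/log/swift
service rsyslog restart
swift-init proxy restart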

Failure management

In this section, we deal with detecting failures and the actions to rectify them. There can be drive, server, zone, or even region failures. As described in the CAP theorem discussion in Chapter 2, OpenStack Swift Architecture, Swift is designed for availability and tolerance to partial failure (where entire parts of the cluster can fail).

Detecting drive failure

Kernel logs are a good place to look for drive failures. The disk subsystem will log warnings or errors that can help an administrator determine whether drives are going bad or have already failed. We can also set up a script on storage nodes to capture drive failure information using the drive audit process described in Chapter 2, OpenStack Swift Architecture, executing the following steps:

1. On each storage node, create a configuration file, swift-drive-audit, in the /etc/swift folder with the following contents:

[drive-audit]
log_facility = LOG_LOCAL0
log_level = DEBUG
device_dir = /srv/node
minutes = 60
error_limit = 2
log_file_pattern = /var/log/kern*
regex_pattern_1 = \berror\b.*\b(sd[a-z]{1,2}\d?)\b
regex_pattern_2 = \b(sd[a-z]{1,2}\d?)\b.*\berror\b

2. Add the following line of code to /etc/rsyslog.d/swift.conf:

local0.* /var/log/swift/drive-audit

3. Restart the rsyslog service using the following command:

service rsyslog restart

4. Restart the Swift services using the following command:

swift-init rest restart

5. The drive failure information will now be stored in the /var/log/swift/drive-audit log file.
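Note that the swift-drive-audit tool has to be run periodically for this log to be populated; one way to do this, sketched here with an assumed install path and the configuration file created in step 1, is a cron entry:

# Run the drive audit every hour (paths are assumptions; adjust to your installation)
0 * * * * /usr/bin/swift-drive-audit /etc/swift/swift-drive-audit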

Handling drive failure

When a drive failure occurs, we can either replace the drive quickly, replace it at a later time, or not replace it at all. If we do not plan to replace the drive immediately, then it is better to unmount it and remove it from the ring. If we decide to replace the drive, then we take out the failed drive and replace it with a good drive, format it, and mount it. We will let the replication algorithm take care of filling this drive with data in order to maintain consistent replicas and data integrity.

Handling node failure

When a storage server in a Swift cluster is experiencing problems, we have to determine whether the problem can be fixed in a short interval, such as a couple of hours, or whether it will take an extended period of time. If the downtime interval is small, we can let the Swift services work around the failure while we debug and fix the issue with the node. Since Swift maintains multiple replicas of data (three by default), there won't be a problem with data availability, but the timings for data access might increase. As soon as the problem is found and fixed and the node is brought back up, the Swift replication services will figure out the missing information, update the nodes, and get them back in sync.

If the node repair time is extended, then it is better to remove the node and all associated devices from the ring. Once the node is brought back online, the devices can be formatted, remounted, and added back to the ring.

The two following commands are useful to remove devices and nodes from the ring:

· To remove a device from the ring, use:

swift-ring-builder <builder-file> remove <ip_address>/<device_name>

For example, swift-ring-builder account.builder remove 172.168.10.52/sdb1.

· To remove a server from the ring, use:

swift-ring-builder <builder-file> remove <ip_address>

For example, swift-ring-builder account.builder remove 172.168.10.52.
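After removing devices or a node, the affected rings need to be rebalanced and the regenerated ring files distributed; a minimal sketch is:

swift-ring-builder account.builder rebalance
swift-ring-builder container.builder rebalance
swift-ring-builder object.builder rebalance
# Copy the resulting *.ring.gz files to /etc/swift on every proxy and storage node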

Proxy server failure

If there is only one proxy server in the cluster and it goes down, then there is a chance that no objects can be uploaded or downloaded by the clients, so this needs immediate attention. This is why it is always a good idea to have a redundant proxy server to increase data availability in the Swift cluster. After identifying and fixing the failure in the proxy server, the Swift services are restarted and access to the object store is restored.

Zone and region failure

When a complete zone fails, it is still possible that the Swift services are not interrupted, because of the high availability configuration that contains multiple storage nodes and multiple zones. The storage servers and drives belonging to the failed zone have to be brought back into service if the failure can be debugged quickly. Otherwise, the storage servers and drives that belong to the zone have to be removed from the ring, and the ring needs to be rebalanced. Once the zone is brought back into service, the drives and storage servers can be added back into the ring and the ring can be rebalanced. In general, a zone failure should be dealt with as a critical issue. In some cases, the top-of-the-rack storage or network switch can fail, thus disconnecting storage arrays and servers from the Swift cluster and leading to zone failures. In these cases, the switch failures have to be diagnosed and rectified quickly.

In a multiregion setup, if there is a region failure, then all requests can be routed to the surviving regions. The servers and drives that belong to the failed region need to be brought back into service quickly to balance the load that is currently being handled by the surviving regions. In other words, this failure should be dealt with as a blocker issue. Latencies may be observed in uploads and downloads due to requests being routed to different regions. Region failures can also occur due to failures in core routers or firewalls. These failures should also be diagnosed and rectified quickly to bring the region back into service.

Capacity planning

As more clients start accessing the Swift cluster, the demand for additional storage will increase. With Swift, this is easy to accomplish; you can simply add more storage nodes and associated proxy servers. This section deals with planning and adding new storage drives as well as storage servers.

Adding new drives

Though adding new drives is a straightforward process, it requires careful planning since this involves rebalancing of the ring. Once we decide to add new drives, we will add these drives to a particular storage server in a zone by formatting and mounting these drives. Next, we will run the swift-ring-builder add commands to add the drives to the ring. Finally, we will run the swift-ring-builder rebalance command to rebalance the ring. The generated .gz ring files need to be distributed to all the storage server nodes. The commands to perform these operations were explained in Chapter 3, Installing OpenStack Swift, in the Formatting and mounting hard disks and Ring setup sections.
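As a quick refresher, a sketch of adding a single drive to the object ring and rebalancing is shown below; the region/zone, IP address, port, device name, and weight are example values:

swift-ring-builder object.builder add r1z2-172.168.10.52:6000/sdc1 100
swift-ring-builder object.builder rebalance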

Often, we end up replacing old drives with bigger and better drives. In this scenario, rather than executing an abrupt move, it is better to slowly start migrating data off the old drive to other drives by reducing the weight of the drive in the ring and repeating this step a few times. Once data has been moved off this drive, it can be safely removed. After removing the old drive, simply insert the new drive and follow the previously mentioned steps to add this drive to the ring.
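This gradual migration can be performed by lowering the drive's weight in steps and rebalancing after each change; for example (the device and weights are illustrative):

swift-ring-builder object.builder set_weight 172.168.10.52/sdb1 50
swift-ring-builder object.builder rebalance
# Repeat with progressively lower weights; once the drive holds no data, remove it
swift-ring-builder object.builder remove 172.168.10.52/sdb1
swift-ring-builder object.builder rebalance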

Adding new storage and proxy servers

Adding new storage and proxy servers is also a straightforward process, where new servers need to be provisioned according to the instructions provided in Chapter 3, Installing OpenStack Swift. Storage servers need to be placed in the right zones, and drives that belong to these servers need to be added to the ring. After rebalancing and distributing the .gz ring files to the rest of the storage servers, the new storage servers are now part of the cluster. Similarly, after setting up a new proxy server, the configuration files and load balancing settings need to be updated. This proxy server is now part of the cluster and can start accepting requests from users.

Migrations

This section deals with hardware and software migrations. The migrations can be to either existing servers or to new servers within a zone or region. As new hardware and software (operating system, packages, or Swift software) becomes available, the existing servers and software need to be migrated to take advantage of faster processor speeds and latest software updates. It is a good idea to upgrade one server at a time and one zone at a time since Swift services can deal with an entire zone being migrated.

The following steps are required to upgrade a storage server node:

1. Execute the following command to stop all the Swift operations running in the background:

swift-init rest stop

2. Gracefully shut down all the Swift services by using the following command:

swift-init {account|container|object} shutdown

3. Upgrade the necessary operating system and system software packages, and install/upgrade the required Swift package. In general, Swift is on a six-month update cycle.

4. Next, create or perform the required changes to the Swift configuration files.

5. After rebooting the server, restart all the required services by executing the following commands:

swift-init {account|container|object} start

swift-init rest start

If there are changes with respect to the drives on the storage server, we have to make sure we update and rebalance the ring.

Once we have completed migration to the new server, we check the log files for proper operation of the server. If the server is operating without any issues, we then proceed to upgrade the next storage server.

Next, we discuss how to upgrade proxy servers. We can make use of the load balancer to isolate the proxy server that we plan to upgrade so that client requests are not sent to this proxy server.

We perform the following steps to upgrade the proxy server:

1. Gracefully shut down the proxy services by using the following command:

swift-init proxy shutdown

2. Upgrade the necessary operating system and system software packages, and install/upgrade the required Swift package.

3. Next, create or perform the required changes to the Swift proxy configuration files.

4. After rebooting the server, restart all the required services by using the following command:

swift-init proxy start

We then have to make sure that we add the upgraded proxy server back into the load balancer pool so that it can start receiving client requests.

After the upgrade, we have to make sure that the proxy server is operating correctly by monitoring the log files.

Summary

In this chapter, you learned how to manage a Swift cluster, the various tools available to monitor and manage the Swift cluster, and the various metrics to determine the health of the cluster. You also learned what actions need to be taken if a component fails in the cluster and how a cluster can be extended by adding new disks and nodes.