MySQL High Availability (2014)
Part I. High Availability and Scalability
One of the key database features that supports both high availability and scalability in an application is replication. Replication is used to create redundancy in the database layer as well as to make copies of the database available for scaling the reads. Part I covers how you can use replication to ensure high availability and how you can scale your system.
Chapter 1. Introduction
Joel looked through the classified ads for a new job. His current job was a good one, and the company had been very accommodating to him while he attended college. But it had been several years since he graduated, and he wanted to do more with his career.
“This looks promising,” he said, circling an advertisement for a computer science specialist working with MySQL. He had experience with MySQL and certainly met the academic requirements for the job. After reading through several other ads, he decided to call about the MySQL job. After a brief set of cursory questions, the human resources manager granted him an interview in two days’ time.
Two days and three interviews later, he was introduced to the company’s president and chief executive officer, Robert Summerson, for his final technical interview. He waited while Mr. Summerson paused during the questions and referred to his notes. So far, they were mostly mundane questions about information technology, but Joel knew the hard questions about MySQL were coming next.
Finally, the interviewer said, “I am impressed with your answers, Mr. Thomas. May I call you Joel?”
“Yes, sir,” Joel said as he endured another uncomfortable period while the interviewer read over his notes for the third time.
“Tell me what you know about MySQL,” Mr. Summerson said before placing his hands on his desk and giving Joel a very penetrating stare.
Joel began explaining what he knew about MySQL, tossing in a generous amount of the material he had read the night before. After about 10 minutes, he ran out of things to talk about.
Mr. Summerson waited a couple of minutes, then stood and offered Joel his hand. As Joel rose and shook Mr. Summerson’s hand, Summerson said, “That’s all I need to hear, Joel. The job is yours.”
“Thank you, sir.”
Mr. Summerson motioned for Joel to follow him out of his office. “I’ll take you back to the HR people so we can get you on the payroll. Can you start two weeks from Monday?”
Joel was elated and couldn’t help but smile. “Yes, sir.”
“Excellent.” Mr. Summerson shook Joel’s hand again and said, “I want you to come prepared to evaluate the configuration of our MySQL servers. I want a complete report on their configuration and health.”
Joel’s elation waned as he drove out of the parking lot. He didn’t go home right away. Instead, he drove to the nearest bookstore. “I’m going to need a good book on MySQL,” he thought.
So, you have decided to take on a large installation and take care of its operation. Well, you are up for some very interesting—as well as rewarding—times.
Compared to running a small site, supporting a large venture requires planning, foresight, experience, and even more planning. As a database administrator for a large venture, you are required to—or will be required to—do things like the following:
§ Provide plans for recovery of business-essential data in the event of a disaster. It is also likely that you will have to execute the procedure at least once.
§ Provide plans for handling a large customer/user base and monitoring the load of each node in the site in order to optimize it.
§ Plan for rapid scale-out in the event the user base grows rapidly.
For all these cases, it is critical to plan for the events in advance and be prepared to act quickly when necessary.
Because not all applications using big sets of servers are websites, we prefer to use the term deployment—rather than the term site or website—to refer to the server that you are using to support some kind of application. This could be a website, but could just as well be a customer relationship management (CRM) system or an online game. The book focuses on the database layer of such a system, but there are some examples that demonstrate how the application layer and the database layer integrate.
You need three things to keep a site responsive and available: backups of data, redundancy in the system, and responsiveness. The backups can restore a node to the state it was in before a crash, redundancy allows the site to continue to operate even if one or more of the nodes stops functioning, and the responsiveness makes the system usable in practice.
There are many ways to perform backups, and the method you choose will depend on your needs. Do you need to recover to an exact point in time? In that case, you have to ensure that you have all that is necessary for performing a point-in-time recovery (PITR). Do you want to keep the servers up while making a backup? If so, you need to ensure that you are using some form of backup that does not disturb the running server, such as an online backup.
Redundancy is handled by duplicating hardware, keeping several instances running in parallel, and using replication to keep multiple copies of the same data available on several machines. If one of the machines fails, it is possible to switch over to another machine that has a copy of the same data.
Together with replication, backup also plays an important role in scaling your system and adding new nodes when needed. If done right, it is even possible to automatically add new slaves at the press of a button, at least figuratively.
What’s This Replication Stuff, Anyway?
If you’re reading this book, you probably have a pretty good idea of what replication is about. It is nevertheless a good idea to review the concepts and ideas.
Replication is used to clone all changes made on a server—called the master server or just master—to another server, which is called the slave server or just slave. This is normally used to create a faithful copy of the master server, but replication can be used for other purposes as well.
The two most common uses of replication are to create a backup of the main server to avoid losing any data if the master crashes and to have a copy of the main server to perform reporting and analysis work without disturbing the rest of the business.
For a small business, this makes a lot of things simpler, but it is possible to do a lot more with replication, including the following:
Support several offices
It is possible to maintain servers at each location and replicate changes to the other offices so that the information is available everywhere. This may be necessary to protect data and also to satisfy legal requirements to keep information about the business available for auditing purposes.
Ensure the business stays operational even if one of the servers goes down
An extra server can be used to handle all the traffic if the original server goes down.
Ensure the business can operate even in the presence of a disaster
Replication can be used to send changes to an alternative data center at a different geographic location.
Protect against mistakes (“oopses”)
It is possible to create a delayed slave by connecting a slave to a master such that the slave is always a fixed period—for example, an hour—behind the master. If a mistake is made on the master, it is possible to find the offending statement and remove it before it is executed by the slave.
One of the two most important uses of replication in many modern applications is that of scaling out. Modern applications are typically very read-intensive; they have a high proportion of reads compared to writes. To reduce the load on the master, you can set up a slave with the sole purpose of answering read queries. By connecting a load balancer, it is possible to direct read queries to a suitable slave, while write queries go to the master.
When using replication in a scale-out scenario, it is important to understand that MySQL replication traditionally has been asynchronous in the sense that transactions are committed at the master server first, then replicated to the slave and applied there. This means that the master and slave may not be consistent, and if replication is running continuously, the slave will lag behind the master.
The advantage of using asynchronous replication is that it is faster and scales better than synchronous replication, but in cases where it is important to have current data, the asynchrony must be handled to ensure the information is actually up-to-date.
Scaling out reads is, however, not sufficient to scale all applications. With growing demands on larger databases and higher write load, it is necessary to scale more than just reads. Managing larger databases and improving performance of large database systems can be accomplished using techniques such as sharding. With sharding, the database is split into manageable chunks, allowing you to increase the size of the database by distributing it over as many servers as you need as well as scaling writes efficiently.
Another important application of replication is ensuring high availability by adding redundancy. The most common technique is to use a dual-master setup (i.e., using replication to keep a pair of masters available all the time, where each master mirrors the other). If one of the masters goes down, the other one is ready to take over immediately.
In addition to the dual-master setup, there are other techniques for achieving high availability that do not involve replication, such as using shared or replicated disks. Although they are not specifically tied to MySQL, these techniques are important tools for ensuring high availability.
So, Backups Are Not Needed Then?
A backup strategy is a critical component of keeping a system available. Regular backups of the servers provide safety against crashes and disasters, which, to some extent, can be handled by replication. Even when replication is used correctly and efficiently, however, there are some things that it cannot handle. You’ll need to have a working backup strategy for the following cases:
Protection against mistakes
If a mistake is discovered, potentially a long time after it actually occurred, replication will not help. In this case, it is necessary to roll back the system to a time before the mistake was introduced and fix the problem. This requires a working backup schedule.
Replication provides some protection against mistakes if you are using a time-delayed slave, but if the mistake is discovered after the delay period, the change will have already taken effect on the slave as well. So, in general, it is not possible to protect against mistakes using replication only—backups are required as well.
Creating new servers
When creating new servers—either slaves for scale-out purposes or new masters to act as standbys—it is necessary to make a backup of an existing server and restore that backup image on the new server. This requires a quick and efficient backup method to minimize the downtime and keep the load on the system at an acceptable level.
In addition to pure business reasons for data preservation, you may have legal requirements to keep data safe, even in the event of a disaster. Not complying with these requirements can pose significant problems to operating the business.
In short, a backup strategy is necessary for operating the business, regardless of any other precautions you have in place to ensure that the data is safe.
What’s With All the Monitoring?
Even if you have replication set up correctly, it is necessary to understand the load on your system and to keep a keen eye on any problems that surface. As business requirements shift due to changing customer usage patterns, it is necessary to balance the system to use resources as efficiently as possible and to reduce the risk of losing availability due to sudden changes in resource utilization.
There are a number of different things that you can monitor, measure, and plan for to handle these types of changes. Here are some examples:
§ You can add indexes to tables that are frequently read.
§ You can rewrite queries or change the structure of databases to speed up execution time.
§ If locks are held for a long time, it is an indication that several connections are using the same table. It might pay off to switch storage engines.
§ If some of your scale-out slaves are hot-processing a disproportionate number of queries, the system might require some rebalancing to ensure that all the scale-out slaves are hit evenly.
§ To handle sudden changes in resource usage, it is necessary to determine the normal load of each server and understand when the system will start to respond slowly because of a sudden increase in load.
Without monitoring, you have no way of spotting problematic queries, hot slaves, or improperly utilized tables.
Is There Anything Else I Can Read?
There is plenty of literature on using MySQL for various jobs, and also a lot of literature about high-availability systems. Here is a list of books that we strongly recommend if you are going to work with MySQL:
MySQL by Paul DuBois (Addison-Wesley)
This is the reference to MySQL and consists of 1,200 pages (really!) packed with everything you want to know about MySQL (and probably a lot that you don’t want to know).
High Performance MySQL, Third Edition by Baron Schwartz, Peter Zaitsev, and Vadim Tkachenko (O’Reilly)
This is one of the best books on using MySQL in an enterprise setting. It covers optimizing queries and ensuring your system is responsive and available.
Scalable Internet Architectures by Theo Schlossnagle (Sams Publishing)
Written by one of the most prominent thinkers in the industry, this is a must for anybody working with systems of scale.
The book uses a Python library developed by the authors (called the MySQL Python Replicant) for many of the administrative tasks. MySQL Python Replicant is available on Launchpad.
In the next chapter, we will start with the basics of setting up replication, so get a comfortable chair, open your computer, and we’ll get started.
Joel was adjusting his chair when a knock sounded from his door.
“Settling in, Joel?” Mr. Summerson asked.
Joel didn’t know what to say. He had been tasked to set up a replication slave on his first day on the job and while it took him longer than he had expected, he had yet to hear his boss’s feedback about the job. Joel spoke the first thing on his mind: “Yes, sir, I’m still trying to figure out this chair.”
“Nice job with the documentation, Joel. I’d like you to write a report explaining what you think we should do to improve our management of the database server.”
Joel nodded. “I can do that.”
“Good. I’ll give you another day to get your office in order. I expect the report by Wednesday, close of business.”
Before Joel could reply, Mr. Summerson walked away.
Joel sat down and flipped another lever on his chair. He heard a distinct click as the back gave way, forcing him to fling his arms wide. “Whoa!” He looked toward his door as he clumsily picked up his chair, thankful no one saw his impromptu gymnastics. “OK, that lever is now off limits,” he said.
 You are not restricted to using a single backup method; you can just as well use a mix of different methods depending on your needs. For each case, however, you have to make a choice of the most appropriate method to do the backup.
 There is an extension called semisynchronous replication as well (see Semisynchronous Replication), but that is a relatively new addition. Until MySQL 5.7.2 DMR, it externalized the transaction before it was replicated, allowing it to be read before it had been replicated and acknowledged requiring some care when being used for high availability.