Troubleshooting Replication - MySQL Troubleshooting (2012)

MySQL Troubleshooting (2012)

Chapter 5. Troubleshooting Replication

I already touched on replication issues in previous chapters, showing how the problems discussed in each chapter can affect replicated environments. This chapter focuses on issues specific to replication itself. These are mostly replication failures due to errors or slowdowns, such as a slave that lags several hours behind the master.

MySQL’s replication is asynchronous. This means the master does not care whether data on the slave is consistent. Although circular multiple master replication can be set up, in a practical sense it is a chain of servers, with each of them serving as a slave and a master at the same time.

MYSQL MULTIPLE MASTER SETUP

To illustrate the concept behind MySQL multiple master replication, let’s consider Figure 5-1.

Two servers that replicate each other

FIGURE 5-1. TWO SERVERS THAT REPLICATE EACH OTHER

Here server A is a master of its slave B, and at the very same time, server B is a master of slave A.

You can add as many servers as you wish to such chains (Figure 5-2).

Circular replication with multiple servers

FIGURE 5-2. CIRCULAR REPLICATION WITH MULTIPLE SERVERS

To troubleshoot such setups, you need to take a single master/slave pair (Figure 5-3).

Focusing on one direction in multimaster replication

FIGURE 5-3. FOCUSING ON ONE DIRECTION IN MULTIMASTER REPLICATION

Then work it out like any simple replication setup. Then take another pair, and so on. I won’t describe this specific case in detail, but just touch on it later in the section Circular Replication and Nonreplication Writes on the Slave.

NOTE

Since version 5.5, MySQL packages include a semi-synchronous replication plug-in. If this plug-in is turned on, the master will wait for a confirmation from one of its slaves that it received and successfully applied each event. This is still not synchronous replication, because the master does not know whether the slave has the same data after applying the event (I discussed this possibility inStatement-Based Replication Issues). Furthermore, if many slaves are connected to the same master, there is no guarantee that data is replicated on all of the slaves.

Troubleshooting semi-synchronous replication is the same as troubleshooting asynchronous replication. The only difference is the effect of its specific options. So I won’t describe issues specific to semi-synchronous replication. If you encounter problems with it, just follow the methods described in Chapter 3.

The MySQL slave runs two threads related to replication: the I/O thread, which handles all the traffic with the master, and the SQL thread, which reruns the events themselves to replicate results on the slave. These threads experience different problems that should be solved using different techniques, so I discuss each in its own section in this chapter.

Displaying Slave Status

Before we start troubleshooting threads, I’ll present an excellent source of replication state information: the SHOW SLAVE STATUS query for slaves.

The following examples were taken when running a slave without errors and a slave whose master has stopped to show how errors would look. I’ll discuss the output in chunks.

mysql> SHOW SLAVE STATUS\G

*************************** 1. row ***************************

Slave_IO_State: Connecting to master

This is the status of the I/O thread. For a running slave, it usually contains Waiting for master to send event:

Slave_IO_State: Waiting for master to send event

Master_Host: 127.0.0.1

Master_User: root

Master_Port: 4041

Connect_Retry: 60

Master_Log_File: mysqld511-bin.000007

The Master_Log_File field shows the name of the master binary logfile. If the slave had an I/O error, the field would be empty.

Read_Master_Log_Pos: 106

106 is the position in the master’s binary log that was read.

Relay_Log_File: mysqld512-relay.000097

Relay_Log_File is the name of the relay logfile, a file on the slave that contains the information transferred from the master’s binary log.

Relay_Log_Pos: 255

255 is the current position in the relay logfile.

Relay_Master_Log_File: mysqld511-bin.000007

Slave_IO_Running: Yes

Slave_IO_Running indicates the basic state of the I/O thread, i.e., whether it is running. This can be either Yes or No.

Slave_SQL_Running: Yes

This time, we see the running status of the SQL thread. Again, it can be either Yes or No.

Replicate_Do_DB:

Replicate_Ignore_DB:

Replicate_Do_Table:

Replicate_Ignore_Table:

Replicate_Wild_Do_Table:

Replicate_Wild_Ignore_Table:

Last_Errno: 0

Last_Error:

Skip_Counter: 0

Exec_Master_Log_Pos: 106

The 106 in Exec_Master_Log_Pos represents the position in the master binary log that was just executed. This can be different from Read_Master_Log_Pos if the slave is behind the master.

Relay_Log_Space: 106

Until_Condition: None

Until_Log_File:

Until_Log_Pos: 0

Master_SSL_Allowed: No

Master_SSL_CA_File:

Master_SSL_CA_Path:

Master_SSL_Cert:

Master_SSL_Cipher:

Master_SSL_Key:

Seconds_Behind_Master: 2

Seconds_Behind_Master shows how far the slave lags behind the master. It contains the number of seconds between the last executed event on the slave and the last event from the master binary log that was replicated into the relay logfile. Ideally this value should be zero. If the slave is not connected to a master, this field contains NULL.

Master_SSL_Verify_Server_Cert: No

Here is output from the stopped slave again:

Last_IO_Errno: 2013

Last_IO_Errno is either the number of the most recent error on the I/O thread or zero if there has been no error since the slave started.

Last_IO_Error: error connecting to master 'root@127.0.0.1:4041'

- retry-time: 60 retries: 86400

These two rows contain the text of the most recent I/O error. In this case, they contain information about why the I/O thread failed.

Last_SQL_Errno: 0

Last_SQL_Errno is either the number of the most recent error on the SQL thread or zero if there has been no error since the slave started.

Last_SQL_Error:

Again, the text of last SQL error. Although there is no error in this example, these two rows can contain information about why the SQL thread failed.

Now that you are familiar with the output of SHOW SLAVE STATUS, we can go on to troubleshooting.

Problems with the I/O Thread

Common I/O errors include:

§ The slave cannot connect to the master.

§ The slave connects to master, but repeatedly disconnects.

§ The slave is far behind master.

When an I/O error happens, the slave status that we saw in the previous section becomes Slave_IO_Running: No and the reason appears in the Last_IO_Errno and Last_IO_Error fields. The error logfile also contains messages about I/O thread failures if log_warnings is set to 1 (the default).

When the slave cannot connect, the first thing to check is whether the replication user has the correct permissions on the master. The replication user (the user you specify as master_user in the CHANGE MASTER query that begins replication) must have the REPLICATION SLAVE privilege on the master. If it does not, just grant such a privilege to this user on the master.

Once you are sure the replication user has the correct permissions, you need to check the network. Use the ping utility to find out whether the master host can be reached. Here is an example:

$ping 192.168.0.4

PING 192.168.0.4 (192.168.0.4): 56 data bytes

64 bytes from 192.168.0.4: icmp_seq=0 ttl=64 time=0.113 ms

64 bytes from 192.168.0.4: icmp_seq=1 ttl=64 time=0.061 ms

^C

--- 192.168.0.4 ping statistics ---

2 packets transmitted, 2 packets received, 0% packet loss

round-trip min/avg/max/stddev = 0.061/0.087/0.113/0.026 ms

If ping fails to connect to the master host, this clearly locates the problem in the network and you need to fix it. You can also use telnet to check whether the MySQL server itself is reachable. Specify the host and port of the master as arguments for the telnet command:

$telnet 192.168.0.4 33511

Trying 192.168.0.4...

Connected to apple.

Escape character is '^]'.

>

5.1.59-debug-log}O&i`(D^,#!\o8h%zY0$`;D^]

telnet> quit

Connection closed.

In this example, the MySQL server was reachable: 5.1.59-debug-log}O&i`(D^,#!\o8h%zY0$`;D^] is its welcome string. If ping works but telnet cannot connect to the server, you need to find out whether the MySQL server is running and whether the port is accessible, that is, whether the slave host can open the master port and whether the master host allows the slave host to connect to this port.

If the preceding tests succeed but the replication IO thread is still stopped, connect using the MySQL command-line client to be sure you can connect to the master using the credentials of the replication user. Here is an example where I successfully establish a connection and determine that the replication user has the right privileges:

$mysql -h 127.0.0.1 -P 33511 -urepl -preplrepl

Welcome to the MySQL monitor. Commands end with ; or \g.

Your MySQL connection id is 6783

Server version: 5.1.59-debug-log Source distribution

Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its

affiliates. Other names may be trademarks of their respective

owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> SELECT user(), current_user();

+----------------+----------------+

| user() | current_user() |

+----------------+----------------+

| repl@localhost | repl@localhost |

+----------------+----------------+

1 row in set (0.13 sec)

mysql> SHOW GRANTS\G

*************************** 1. row ***************************

Grants for repl@localhost: GRANT REPLICATION SLAVE ON

*.* TO 'repl'@'localhost' IDENTIFIED BY PASSWORD

'*17125BDFB190AB635083AF9B26F9E8F00EA128FE'

1 row in set (0.00 sec)

SHOW GRANTS here shows the parameters through which the slave’s replication user can replicate data from the master.

When the slave can connect to the master but repeatedly disconnects, use your operating system tools to check the network. You can use tcpdump or netstat to watch traffic, or even send a large file through the network and watch progress to be sure the network is stable. The goal is to determine whether the connection between the master and slave is being interrupted.

If a connection to the master is established, netstat should print something like:

$netstat -a

Active Internet connections (including servers)

Proto Recv-Q Send-Q Local Address Foreign Address (state)

tcp4 0 0 apple.60344 master.mysql.com.33051 ESTABLISHED

tcpdump would print packets:

$tcpdump -i en1 host master.mysql.com and port 33051

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on en1, link-type EN10MB (Ethernet), capture size 96 bytes

22:28:12.195270 IP master.mysql.com.33051 > apple.60344: P

1752426772:1752426864(92) ack 1474226199 win 91 <nop,nop,timestamp 1939999898

649946687>

22:28:12.195317 IP apple.60344 > master.mysql.com.33051: . ack 92 win 65535

<nop,nop,timestamp 649946998 1939999898>

^C

2 packets captured

37 packets received by filter

0 packets dropped by kernel

This example was taken when I issued a query on the master, and it was successfully replicated.

When the slave is far behind the master, this can be a symptom of a slow network or a load that’s too heavy on the slave. We will return to overloaded slaves later in this chapter when discussing the SQL thread.

To check whether the network is slow, use tcpdump or send large files and watch the times in which packets are transferred. Also check whether MySQL is using all of the bandwidth available to the system on each side. If bandwidth usage is above 80%, you may need to buy faster network hardware. If it isn’t using all of the available bandwidth, check whether other software is using the same network interface and affecting the MySQL server. If other software is getting in the way, move it to a different host or at least a different hardware network interface.

Another error related to the I/O thread is relay log corruption. You would most likely see it as an SQL thread error:

Last_SQL_Errno: 1594

Last_SQL_Error: Relay log read failure: Could not parse relay log event

entry. The possible reasons are: the master's binary log is corrupted (you can

check this by running 'mysqlbinlog' on the binary log), the slave's relay log is

corrupted (you can check this by running 'mysqlbinlog' on the relay log), a

network problem, or a bug in the master's or slave's MySQL code. If you want to

check the master's binary log or slave's relay log, you will be able to know

their names by issuing 'SHOW SLAVE STATUS' on this slave.

I’m discussing this problem in this section instead of the SQL thread section because the real cause of the error could be a failure in the I/O thread that corrupted the relay log earlier. What could be happening is that the SQL thread has encountered the corruption while trying to execute events in the relay log.

In case of such an error, the first thing to do is follow the directions in the error message: check the master’s binary log and the slave’s relay log for corruption using the mysqlbinlog utility. mysqlbinlog converts binary logfiles into a human-readable format. Simply call it like this:

$mysqlbinlog /Users/apple/Applications/mysql-5.1/data511/mysqld511-bin.005071

/*!40019 SET @@session.max_insert_delayed_threads=0*/;

/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;

DELIMITER /*!*/;

# at 4

#110904 16:50:00 server id 511 end_log_pos 106 Start: binlog v 4,

server v 5.1.59-debug-log created 110904 16:50:00

BINLOG '

CIJjTg//AQAAZgAAAGoAAAAAAAQANS4xLjU5LWRlYnVnLWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAA

AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAUwAEGggAAAAICAgC

'/*!*/;

# at 106

#110904 16:50:14 server id 511 end_log_pos 192 Query thread_id=7251

exec_time=0 error_code=0

use test/*!*/;

SET TIMESTAMP=1315144214/*!*/;

SET @@session.pseudo_thread_id=7251/*!*/;

SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=1,

@@session.unique_checks=1, @@session.autocommit=1/*!*/;

SET @@session.sql_mode=0/*!*/;

SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;

/*!\C latin1 *//*!*/;

SET

@@session.character_set_client=8,@@session.collation_connection=8,@@session.

collation_server=33/*!*/;

SET @@session.lc_time_names=0/*!*/;

SET @@session.collation_database=DEFAULT/*!*/;

create table t1(f1 int)

/*!*/;

# at 192

#110904 16:50:20 server id 511 end_log_pos 260 Query thread_id=7251

exec_time=0 error_code=0

SET TIMESTAMP=1315144220/*!*/;

BEGIN

/*!*/;

# at 260

# at 301

#110904 16:50:20 server id 511 end_log_pos 301 Table_map: `test`.`t1`

mapped to number 21

#110904 16:50:20 server id 511 end_log_pos 335 Write_rows: table id 21

flags: STMT_END_F

BINLOG '

HIJjThP/AQAAKQAAAC0BAAAAABUAAAAAAAEABHRlc3QAAnQxAAEDAAE=

HIJjThf/AQAAIgAAAE8BAAAAABUAAAAAAAEAAf/+AQAAAA==

'/*!*/;

# at 335

#110904 16:50:20 server id 511 end_log_pos 404 Query thread_id=7251

exec_time=0 error_code=0

SET TIMESTAMP=1315144220/*!*/;

COMMIT

/*!*/;

# at 404

#110904 16:50:36 server id 511 end_log_pos 451 Rotate to

mysqld511-bin.005072 pos: 4

DELIMITER ;

# End of log file

ROLLBACK /* added by mysqlbinlog */;

/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

In this example I used a valid binary logfile. If the file was corrupted, mysqlbinlog will mention it explicitly:

$mysqlbinlog --verbose --start-position=260 --stop-position=335 \

/Users/apple/Applications/mysql-5.1/data511/mysqld511-bin.000007.corrupted

/*!40019 SET @@session.max_insert_delayed_threads=0*/;

/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;

DELIMITER /*!*/;

ERROR: Error in Log_event::read_log_event(): 'Found invalid event in binary

log', data_len: 102, event_type: 15

ERROR: Could not read a Format_description_log_event event at offset 4; this

could be a log format error or read error.

DELIMITER ;

# End of log file

ROLLBACK /* added by mysqlbinlog */;

/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

I used the row binary log format here to show how row events look when printed. If binlog_format='statement' is used, all events are printed as SQL statements. You can see the SQL representation of row events by using the --verbose option:

$mysqlbinlog --verbose --start-position=260 --stop-position=335 \

/Users/apple/Applications/mysql-5.1/data511/mysqld511-bin.005071

/*!40019 SET @@session.max_insert_delayed_threads=0*/;

/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;

DELIMITER /*!*/;

# at 4

#110904 16:50:00 server id 511 end_log_pos 106 Start: binlog v 4,

server v 5.1.59-debug-log created 110904 16:50:00

BINLOG '

CIJjTg//AQAAZgAAAGoAAAAAAAQANS4xLjU5LWRlYnVnLWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAA

AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAUwAEGggAAAAICAgC

'/*!*/;

# at 260

# at 301

#110904 16:50:20 server id 511 end_log_pos 301 Table_map: `test`.`t1`

mapped to number 21

#110904 16:50:20 server id 511 end_log_pos 335 Write_rows: table id 21

flags: STMT_END_F

BINLOG '

HIJjThP/AQAAKQAAAC0BAAAAABUAAAAAAAEABHRlc3QAAnQxAAEDAAE=

HIJjThf/AQAAIgAAAE8BAAAAABUAAAAAAAEAAf/+AQAAAA==

'/*!*/;

### INSERT INTO test.t1

### SET

### @1=1

DELIMITER ;

# End of log file

ROLLBACK /* added by mysqlbinlog */;

/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

In addition to the --verbose option, I used --start-position and --stop-position to show how to limit mysqlbinlog output to particular positions if the logfile is large.

You can pipe mysqlbinlog output into the MySQL client and have it execute the queries. This works for SBR and Row-Based Replication (RBR), and is useful when you want to debug how binary log events are applied to the slave.

If there is a problem with the master logfile, find out why it happened. First, restart replication manually. Restore events from Exec_Master_Log_Pos to the latest possible position, and apply them manually. Then, wait until Seconds_Behind_Master is 0, and compare the tables on the master and slave.

If a lot of changes are corrupted and finding which rows are modified is practically impossible, you probably need to back up the master, then load the backup on the slave and restart replication. You can replay replication for a single table. Figure out the last point in the binary log where the data was correctly replicated, then set the replicate-wild-ignore-table option and run:

START SLAVE [SQL_THREAD] UNTIL

MASTER_LOG_FILE = 'log_name', MASTER_LOG_POS = log_pos

START SLAVE [SQL_THREAD] UNTIL

RELAY_LOG_FILE = 'log_name', RELAY_LOG_POS = log_pos

where log_pos is the position of the last correct change of that table in either the master binary file or relay logfile. After the slave reaches this position and stops, remove the option replicate-wild-ignore-table and restart the server.

HOW TO CHECK WHETHER TABLES ARE CONSISTENT

There are few ways to check whether tables on a master and a slave are consistent. Here is quick overview of them. Use one or another depending on the problem you hit.

CHECKSUM TABLE

As should be clear from the name, this query returns a table checksum. This MySQL statement does not require additional installation and is always available.

mysql> CHECKSUM TABLE test;

+--------------+------------+

| Table | Checksum |

+--------------+------------+

| test.test | 4220395591 |

+--------------+------------+

1 row in set (0.43 sec)

When you want to check whether the tables on the master and slave have the same data, run the query on both servers and compare results. Make sure that Seconds_Behind_Master is zero and that there is no write activity on the same table on the master while CHECKSUM TABLE is running.

mysqldiff

This is a tool from the MySQL WB Utilities bundle that comes with the MySQL Workbench installation. The tool reads the definition of database objects and compares their definitions using a diff-like method to determine whether two objects are same. Here is an example of its use for troubleshooting replication:

$mysqldiff --server1=root@127.0.0.1:33511 --server2=root@127.0.0.1:33512 \

test.t1:test.t1

# server1 on 127.0.0.1: ... connected.

# server2 on 127.0.0.1: ... connected.

# Comparing test.t1 to test.t1 [PASS]

Success. All objects are the same.

pt-table-checksum

This is part of the Percona Toolkit and the most powerful tool among those discussed here. It connects to a master and a slave and compares whether tables have the same structure and the same data. To do this, the tool creates a table that stores a checksum from the master table. After this value is replicated, a second run of pt-table-checksum checks the data on a slave.

Here is an example of checking replication:

$pt-table-checksum --replicate=test.checksum --create-replicate-table

h=127.0.0.1,P=33511,u=root --databases book

DATABASE TABLE CHUNK HOST ENGINE COUNT CHECKSUM TIME WAIT STAT LAG

book t1 0 127.0.0.1 MyISAM 5 42981178 0 NULL NULL NULL

book ts 0 127.0.0.1 MyISAM 65 aeb6b7a0 0 NULL NULL NULL

This command calculates and saves checksums for each table in the book database. Once the slave is up, we can check whether the tables are the same:

$pt-table-checksum --replicate=test.checksum --replicate-check=2

h=127.0.0.1,P=33511,u=root --databases book

Differences on P=33512,h=127.0.0.1

DB TBL CHUNK CNT_DIFF CRC_DIFF BOUNDARIES

book ts 0 -5 1 1=1

The tool prints the differences it finds, if any. Here we can see that data in the ts table is different on the master and slave, whereas data in t1 is same.

Whatever tool you use, you need to be sure that no change was replicated since the last check was taken on the master. The easiest way to do this is to write-lock the tables you are currently examining.

mysqldiff and pt-table-checksum can do more than what I described here, but the uses I showed are the most important to help diagnose replication failures.

If you haven’t found any problem with the master binary log or find that the relay log is corrupt, it can be a symptom of either a network issue or disk corruption. In both cases, you can reposition the relay log on the slave to the Exec_Master_Log_Pos position and restart it using sequence of queries STOP SLAVE; CHANGE MASTER master_log_pos=Exec_Master_Log_Pos_Value, master_log_file='Relay_Master_Log_File_Value'; START SLAVE, and the relay log will be re-created. If the corruption was a singular occurrence, replication will be up and running again.

But don’t just clean up and ignore whatever could have caused the problem. Check your logs for disk and network problems.

To find out whether the cause was a disk problem, examine the operating system’s logfiles and use tools to check the disk for bad segments. If you find any, fix the disk. Otherwise, you can expect similar failures again.

Network problems can cause corruption in older versions of MySQL. Before versions 5.0.56 and 5.1.24, relay logs could often be corrupted by unstable networks. In versions 5.0.56 and 5.1.24, bug #26489 was fixed, and now this problem is quite rare. Since version 5.6.2, replication checksums were also inserted. This solves the rest of the corruption problems caused by network breakage.

These fixes do not automatically recover corrupted relay logs, but prevent them from becoming corrupted due to either a problem on the master or a network issue. Since version 5.5, the relay-log-recovery option is available, which turns on automatic recovery when the slave restarts.

But even if you use newer versions of MySQL with these fixes, you should check the network. The earlier you find problems with a network, the sooner you can fix them. Even with automatic recovery, resolving network issues takes time and can slow down replication.

In this section, we’ve had to consider some SQL thread errors, even though the causes were on the I/O thread. In the next section, I will discuss SQL thread problems that are not related to the I/O thread.

Problems with the SQL Thread

As I already mentioned in Statement-Based Replication Issues, each slave has a single SQL thread, so all its errors can be tested in a single-threaded MySQL client. Even if you run the multithreaded slave preview, you can always ask it to use a single thread when you’re trying to reproduce an error. If reducing activity to one SQL thread fails to make the problem go away, use the following techniques to fix logic errors on the single thread, then switch to multiple threads again.

It’s easy to re-create a query that caused a slave to fail: just run it using the MySQL command-line utility.

When you get an SQL error on the slave, it stops. SHOW SLAVE STATUS shows the SQL thread error that caused the problem:

Last_SQL_Errno: 1146

Last_SQL_Error: Error 'Table 'test.t1' doesn't exist' on query.

Default database: 'test'.

Query: 'INSERT INTO t1 VALUES(1)'

The error message usually contains the text of the SQL query and the reason why it failed. In this case, the error message explains everything (I dropped the t1 table on the slave to create this example), but in case of doubt, you can try to run same query in the MySQL command-line client and see the results:

mysql> INSERT INTO t1 VALUES(1);

ERROR 1146 (42S02): Table 'test.t1' doesn't exist

The error in this example makes it clear what you need to do to solve the problem: create the table.

mysql> CREATE TABLE t1(f1 INT);

Query OK, 0 rows affected (0.17 sec)

After the table is created, we can restart the slave SQL thread:

mysql> STOP SLAVE SQL_THREAD;

Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> SHOW WARNINGS;

+-------+------+--------------------------------+

| Level | Code | Message |

+-------+------+--------------------------------+

| Note | 1255 | Slave already has been stopped |

+-------+------+--------------------------------+

1 row in set (0.00 sec)

mysql> START SLAVE SQL_THREAD;

Query OK, 0 rows affected (0.10 sec)

mysql> SHOW SLAVE STATUS\G

*************************** 1. row ***************************

Slave_IO_State: Waiting for master to send event

<skipped>

Slave_IO_Running: Yes

Slave_SQL_Running: Yes

<skipped>

Last_Errno: 0

Last_Error:

<skipped>

Last_IO_Errno: 0

Last_IO_Error:

Last_SQL_Errno: 0

Last_SQL_Error:

1 row in set (0.00 sec)

Now the problem is solved and the slave runs successfully again.

When Data Is Different on the Master and Slave

If you have errors that cannot be solved so easily, check whether the table definition is the same on the master and slave. You should also check whether both tables had the same data before the problematic query ran.

NOTE

MySQL replication allows you to have different definitions for a table on the master and slave. If you run such a setup, analyze how the same query runs on these different tables. Also check whether using different storage engines and indexes can affect the final result.

When the SQL thread stops, a very common reason is that the slave’s tables differ from the master. I won’t describe all the reasons for these errors here, but the most common are:

§ The problems we saw in Example 2-1 and elsewhere with concurrent transactional updates that don’t preserve the consistency of data.

§ INSERT ON DUPLICATE KEY UPDATE, which, if run on the slave along with other updates in other connections in a different order from the master, can update the wrong rows and skip rows that were updated on the master.

§ Running concurrent inserts on a MyISAM table on the master without taking into account the existence of a slave.

§ Use of nondeterministic functions.[16]

It is also important to remember that a slave is not crash-proof, so if mysqld crashed, restarting it can repeat the transaction executed just before the crash and thus leave the slave with different data from the master. A similar issue happens when the slave fails while in the middle of updating a nontransactional table, such as MyISAM.

Circular Replication and Nonreplication Writes on the Slave

If you write to a slave outside of replication, you have to care about data consistency. Two good ways to avoid problems are to make sure the writes affect different objects from the changes being replicated and to always use primary keys with different sequences on the master and slave. You can ensure different primary key values by using AUTO_INCREMENT, giving each master a different starting point with the auto_increment_offset option, and setting auto_increment_increment to the number of servers in the replication setup.

Circular replication, where each master is a slave of another master,[17] can be affected by the same issue because such a setup does not prevent writes on the slave that can conflict with the data being replicated.

MySQL allows you to create circular replication setups, but does not guarantee data consistency for such setups due to their asynchronous replication design. You need to take care of data consistency yourself. Therefore, from a troubleshooting perspective, debugging errors caused by circular replication are the same as any other replication problem. In a bidirectional setup, remember that servers can be both masters and slaves of each other. So when you meet an error, determine which of them is playing the master role and which the slave’s, then act correspondingly. You will probably need to test replication with one server playing just the master role and the other the slave’s, then swap roles.

To illustrate this technique, let’s take a simple multimaster setup from MySQL Multiple Master Setup (Figure 5-4).

Two servers that replicate each other

Figure 5-4. Two servers that replicate each other

If you meet a problem in such a setup, take a pair (as in Figure 5-5) and solve the problem as if it were simple master-slave replication. Stop all updates on B while you are working on the problem.

Focusing on one direction in multimaster replication

Figure 5-5. Focusing on one direction in multimaster replication

After the problem is solved, temporarily stop updates on A and turn them on on B, so you have a simple setup like the one shown in Figure 5-6.

Reversing direction in multimaster replication

Figure 5-6. Reversing direction in multimaster replication

If this setup is still experiencing problems, solve them here, too, and then start updates on A again, so you have same circular multimaster setup.

At this point, it is good to analyze why the problem occurred in the first place and fix it before restarting updates on both servers.

The same method would work for any number of servers in the circle.

When setting up circular replication, you need to clearly split queries so that changes from one master don’t interfere with another. Errors can break replication or cause data inconsistency. I won’t describe best practices here, but you can find a detailed overview of how to set up multimaster replication in Chapter 4, “Replication for High Availability,” of MySQL High Availability by Charles Bell et al. (O’Reilly).

§ Good design is crucial to creating a trouble-free circular multimaster replication setup.

Incomplete or Altered SQL Statements

If an error message does not show the full query and the error log does not contain the full query either (which can happen if the query is larger than 1,024 bytes), you need to run the mysqlbinlog utility on either the master binary log or the slave’s relay log to get the full version of the query as the slave got it, and then analyze why it failed.

That can help with statement-based replication, which logs and sends queries in the original human-readable SQL. But what do you do if row format was used? Row events are the same events as queries and can be run through any MySQL client. Use mysqlbinlog with the --verbose option to get the SQL representation of row events.

§ Always use the same query on the slave that was executed when the master binary log event was applied. Using the mysqlbinlog utility to check what query ran will save you time. The binary log sometimes contains a query that is a bit different from what was originally executed on the master, and side effects could be introduced. If you ignore such a side effect, you could spend hours trying to repeat a problem that just can’t be found by running the query as the master ran it.

Different Errors on the Master and Slave

Another status message that can be confusing is “Query caused different errors on master and slave...” The message usually contains the error. Most confusing is when such a message says there was no error on the slave—“Error on slave: ‘no error’ (0).” —but there was one on the master. This can happen, for example, if an error was caused by a trigger on the master but the update on the master table succeeded. The query on the original table will be written to the binary logfile with a note containing the error number for the failed trigger. In this case, if the trigger successfully finished on the slave or the slave’s table does not have a trigger at all, the query on the slave will return no error, and thus you will get such a message.

To quickly fix this issue, skip the error using SET GLOBAL SLAVE_SKIP_COUNTER=1; and continue replication. Don’t forget to find the real cause of the problem to prevent the same error from recurring.

If a problem occurred on the master due to a deadlock in a trigger, fix the slave manually because the master and slave tables can contain different data. To do this, you need to find out which tables contain different data and update the slave to match the master.

Configuration

Another important thing to check is the configuration options of the master and slave. Ideally they should be the same, but sometimes there are good reasons, such as different hardware or the kind of load, to make them different. When they are different and you starting receive SQL errors that cannot be easily explained, check the options that can change server behavior. I described some of them in Chapter 3.

Just as for the single-server setup, when I recommended you run mysqld with the --no-defaults option to know whether it is affected by your custom options, here I recommend copying the master’s options into the slave configuration, so that the slave has exactly the same options as the master. Adjust only those options that differ between servers, such as server_id, which must always be unique. Determine whether the problem is still reproducible. If it is not, you can feel confident that a configuration variable is the reason for the failure. At that point you need only to find the troublesome variable using the techniques described in Haste Makes Waste, and adjust it correspondingly.

§ Always compare the options on the master and slave for differences.

When the Slave Lags Far Behind the Master

Problems with the I/O Thread discussed the situation when Seconds_Behind_Master grows large because the network is flaky. Another reason for huge delays is when the slave performs more slowly than the master.

The slave can be slower than the master if it has slower hardware, smaller buffers, or its own read load that competes with the replicated load for resources. Another possible cause of slowness is when a master executes queries in parallel, but the slave executes all the binary log events in a single thread, one after another.

The first thing you need to do when a slave lags behind the master is to find what is causing this behavior and how it can be improved.

Slow hardware is an easy case and can be solved by buying faster hardware. But before spending the money, analyze whether the master is using all its hardware and whether performance-related options on the slave are optimized. For example, if the master runs in a shared environment and the slave is on a dedicated but slower server, you have a chance to improve its speed. Calculate how many resources the master actually uses and how much the slave’s performance can be improved by tuning configuration parameters.

If the hardware is the same on both servers, or the slave runs on an even faster machine but still lags behind, check differences in performance-related options. Analyze their effect, and tune the slave correspondingly. Configuring the same options on the master and slave is a good start. In this case, you can be sure that the options are good for replicated events, so you should adjusting only those options on the slave that can improve a concurrent load.

The hardest case to solve is a slave whose slowness is caused by performing operations in a single thread while the master executes them in parallel. The only thing you can do in this case, besides trying the multithreaded slave preview, is to upgrade the hardware on the slave and tune performance-related options as much as possible.

In all these cases, you also should analyze the effect of queries that run on the slave concurrently with the slave SQL thread. I described concurrency troubleshooting in Chapter 2.


[16] A deterministic function is one that returns the same result each time it runs with the same input parameters. CONCAT('Hello, ', 'world!') is deterministic, whereas NOW() is not.

[17] Such replication can be also called “multimaster replication” or “bidirectional replication.”