mysql主从宕机恢复步骤

在生产环境中经常会出现slave出现错误，从而发生主从同步故障，此时就需要人工干预了。以下是小生整理出的一个回复思路，欢迎大佬指导，分享更好的方法。

宕机恢复分为几种情况：

1.从库数据一致性要求低

2.从库数据一致性要求高

从库数据要求一致性低：

这种情况比较好解决，由于对数据一致性要求比较低，我们可以先把slave起来从而达到热备份的效果(因为之前有做过全量备份和增量备份，所以不用担心数据丢失)。

以目前master对应的pos作为slave起始的pos。

MariaDB [db1]> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.11
                  Master_User: repluser
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: matser11.000002
          Read_Master_Log_Pos: 1486
               Relay_Log_File: mariadb-relay-bin.000002
                Relay_Log_Pos: 630
        Relay_Master_Log_File: matser11.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1146
                   Last_Error: Error 'Table 'db1.name' doesn't exist' on query. Default database: 'db1'. Query: 'insert into name values(1,"haha")'
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 919
              Relay_Log_Space: 1493
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1146
               Last_SQL_Error: Error 'Table 'db1.name' doesn't exist' on query. Default database: 'db1'. Query: 'insert into name values(1,"haha")'
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 11
1 row in set (0.00 sec)

ERROR: No query specified

1.将业务主库上锁，阻止对数据的更新

MariaDB [db1]> SET AUTOCOMMIT=0;
Query OK, 0 rows affected (0.00 sec)
MariaDB [db1]> lock tables name read;
Query OK, 0 rows affected (0.00 sec)
MariaDB [db1]> show master status;
+-----------------+----------+--------------+------------------+
| File            | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-----------------+----------+--------------+------------------+
| matser11.000002 |     1486 |              |                  |
+-----------------+----------+--------------+------------------+
1 row in set (0.00 sec)

2.回到从库重新做主从

MariaDB [db1]> stop slave ;
Query OK, 0 rows affected (0.00 sec)

MariaDB [db1]> reset slave all;
Query OK, 0 rows affected (0.01 sec)

MariaDB [db1]> show slave status\G;
Empty set (0.00 sec)

ERROR: No query specified
MariaDB [db1]> change master to master_host="192.168.1.11",master_user="repluser",master_password="123qqq...A",master_log_file="matser11.000002",master_log_pos=1486;
Query OK, 0 rows affected (0.01 sec)
MariaDB [db1]> start slave ;
Query OK, 0 rows affected (0.00 sec)

MariaDB [db1]> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.11
                  Master_User: repluser
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: matser11.000002
          Read_Master_Log_Pos: 1486
               Relay_Log_File: mariadb-relay-bin.000002
                Relay_Log_Pos: 528
        Relay_Master_Log_File: matser11.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 1486
              Relay_Log_Space: 824
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 11
1 row in set (0.00 sec)

１）使用LOCK TALBES虽然可以给InnoDB加表级锁，但必须说明的是，表锁不是由InnoDB存储引擎层管理的，而是由其上一层ＭySQL Server负责的，仅当autocommit=0、innodb_table_lock=1（默认设置）时，InnoDB层才能知道MySQL加的表锁，ＭySQL Server才能感知InnoDB加的行锁，这种情况下，InnoDB才能自动识别涉及表级锁的死锁；否则，InnoDB将无法自动检测并处理这种死锁。

２）在用LOCAK TABLES对InnoDB锁时要注意，要将AUTOCOMMIT设为0，否则ＭySQL不会给表加锁；事务结束前，不要用UNLOCAK TABLES释放表锁，因为UNLOCK TABLES会隐含地提交事务；COMMIT或ROLLBACK产不能释放用LOCAK TABLES加的表级锁，必须用UNLOCK TABLES释放表锁，正确的方式见如下语句。

MariaDB [db1]> COMMIT;
Query OK, 0 rows affected (0.00 sec)

MariaDB [db1]> unlock tables;
Query OK, 0 rows affected (0.00 sec)

这种方法小生并不推荐

skip掉相关错误

停止slave服务

SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;

开启slave服务并查看状态

MariaDB [db1]> stop slave;
Query OK, 0 rows affected (0.00 sec)

MariaDB [db1]> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
Query OK, 0 rows affected (0.00 sec)

MariaDB [db1]> start slave;
Query OK, 0 rows affected (0.00 sec)

MariaDB [db1]> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.11
                  Master_User: repluser
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: matser11.000001
          Read_Master_Log_Pos: 1195
               Relay_Log_File: mariadb-relay-bin.000003
                Relay_Log_Pos: 528
        Relay_Master_Log_File: matser11.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 1195
              Relay_Log_Space: 2057
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 11
1 row in set (0.00 sec)

这里跳过的是一个事务。当然，也可以跳过多个事务，但要谨慎，毕竟，你并不知道跳过的是什么事务。

建议：可反复执行上述步骤，仔细查看导致从库不能同步的语句。有的时候，阻止从库的事务太多，这种方法就显得略为低效。

可分析主库日志的事务，来确定SQL_SLAVE_SKIP_COUNTER的合适值

数据库一致性要求高

这种情况比较难受，因为生产环境还在继续，你必须以最快的方式恢复到最新的数据。

旧数据可用：

查看宕机时的偏移量或者最后一条sql语句(可根据时间查看)，并找出原因，加以解决

MariaDB [db1]> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.11
                  Master_User: repluser
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: matser11.000001
          Read_Master_Log_Pos: 1385
               Relay_Log_File: mariadb-relay-bin.000003
                Relay_Log_Pos: 528
        Relay_Master_Log_File: matser11.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1146
                   Last_Error: Error 'Table 'db1.name' doesn't exist' on query. Default database: 'db1'. Query: 'insert into name values(13,"lala")'
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 1195
              Relay_Log_Space: 2247
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1146
               Last_SQL_Error: Error 'Table 'db1.name' doesn't exist' on query. Default database: 'db1'. Query: 'insert into name values(13,"lala")'
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 11
1 row in set (0.00 sec)

ERROR: No query specified
[root@slave mysql]# mysqlbinlog mariadb-relay-bin.000003 
/*!*/;
# at 595
#200821 19:28:18 server id 11  end_log_pos 1358     Query    thread_id=14    exec_time=0    error_code=0
use `db1`/*!*/;
SET TIMESTAMP=1598009298/*!*/;
insert into name values(13,"lala")
/*!*/;
# at 691
#200821 19:28:18 server id 11  end_log_pos 1385     Xid = 512
COMMIT/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

将业务主库上锁，阻止对数据的更新

MariaDB [db1]> SET AUTOCOMMIT=0;
Query OK, 0 rows affected (0.00 sec)
MariaDB [db1]> lock tables name read;
Query OK, 0 rows affected (0.00 sec)

将缺失数据导出

[root@master mysql]# date
2020年 08月 21日 星期五 19:57:42 CST
[root@master mysql]# mysqlbinlog --start-datetime="20-08-21 19:28:18"  --stop-datetime="20--8-21 19:57:42" matser11.000001 > data.sql
[root@master mysql]# scp data.sql root@192.168.1.12:/root

恢复数据并查看

[root@slave mysql]# mysql -uroot -p < /root/data.sql
[root@slave mysql]# mysql
....

恢复主从(这里最好重做主从，由于我的是缺失库导致sql线程失败，所以重启slave就ok了)

MariaDB [db1]> stop slave;
MariaDB [db1]> start slave;
MariaDB [db1]> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.11
                  Master_User: repluser
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: matser11.000001
          Read_Master_Log_Pos: 5185
               Relay_Log_File: mariadb-relay-bin.000004
                Relay_Log_Pos: 1098
        Relay_Master_Log_File: matser11.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 5185
              Relay_Log_Space: 5097
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 11
1 row in set (0.00 sec)

ERROR: No query specified

解开主库的锁

MariaDB [db1]> COMMIT;
Query OK, 0 rows affected (0.00 sec)

MariaDB [db1]> unlock tables;
Query OK, 0 rows affected (0.00 sec)

旧数据不可用：

清空原来的主从

同上

使用最新一次的全量备份和增量备份先恢复大量数据。

[root@slave mysql]# mysql -uroot -p < /mysql/all/data.sql
[root@slave mysql]# mysql -uroot -p < /mysql/name/data.sql

最后再合适的时机将业务主库上锁，阻止对数据的更新。

同上

利用binlog日志导出剩余的数据进行恢复。(此时的起始值应该是上次增量备份结束的时间)

同上

恢复主从

同上

解开主库的锁

优化方案

什么？运维小哥没做备份？(不想混了？)嫌麻烦？

那强烈推荐PXC集群。