MySQL slave stuck on a single bin log + bin log position for 17+ hours

tl;dr: Replication is stalled at a specific binary log and position, and I don't know why.

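I have a MySQL replication setup running MySQL 5.5.

This replication setup has no history of falling behind and has always been solid.

This morning, I noticed that the slave was 17 hours behind the master.

After more research, it looks like a problem with the SQL_Thread.

According to the slave (via SLAVE STATUS), the current master log file is mysql-bin.001306 @ position 20520499. This is in line with the MASTER STATUS output on the master.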

However, SLAVE STATUS shows that Relay_Master_Log_File is currently mysql-bin.001302, with an Exec_Master_Log_Pos of 36573336. Neither Relay_Master_Log_File nor Exec_Master_Log_Pos has advanced in the time I have been monitoring them this morning.
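
For anyone wanting to confirm that the SQL thread is stuck rather than just slow, sampling the slave a few times is one option. This is only a sketch using standard statements, run on the slave:

```sql
-- Run on the slave a few minutes apart and compare the two positions.
-- Read_Master_Log_Pos should keep growing (the IO thread is still fetching),
-- while Exec_Master_Log_Pos stays put if the SQL thread is stuck on one event group.
SHOW SLAVE STATUS\G

-- The replication threads appear with User = 'system user';
-- the State column shows what the SQL thread is currently doing.
SHOW FULL PROCESSLIST;
```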

Looking at the bin logs on the master, this is the statement located at mysql-bin.001302@36573336:

    # at 36573053
    #170221 14:33:48 server id 1 end_log_pos 36573130 Query thread_id=96205677 exec_time=0 error_code=0
    SET TIMESTAMP=1487716428/*!*/;
    BEGIN
    /*!*/;
    # at 36573130
    # at 36573213
    #170221 14:33:48 server id 1 end_log_pos 36573213 Table_map: `database-name`.`table-name` mapped to number 5873
    #170221 14:33:48 server id 1 end_log_pos 36573309 Write_rows: table id 5873 flags: STMT_END_F
    ### INSERT INTO `database-name`.`table-name`
    ### SET
    ###   @1='xxxxxxxx'
    ###   @2=6920826
    ###   @3='xxxxxxxx'
    ###   @4='GET'
    ###   @5='address'
    ###   @6=2017-02-21 14:40:24
    ###   @7=2017-02-21 14:40:24
    # at 36573309
    #170221 14:33:48 server id 1 end_log_pos 36573336 Xid = 1668637037
    COMMIT/*!*/;
    # at 36573336

Around this time yesterday, I did execute a few large queries to migrate data into a new table. The process looked a bit like this:

    mysql> insert into tmp_table ( select <rows> from origin table ); -- 44 million rows
    mysql> insert into dest_table ( select * from tmp_table ); -- 44 million rows

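Neither of the two tables has a primary or unique key, which I have read could be a problem. However, while the database + table shown in the binlog entry above is the destination table, the insert record shown is not one that was generated during the migration.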
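
To check which tables are exposed to this, and which binary log format is in use, something along these lines should work on MySQL 5.5. It is a sketch, not a complete audit: it only reports the presence of PRIMARY KEY or UNIQUE constraints, and the list of excluded schemas is an assumption about a typical install.

```sql
-- Which format is the master writing to the binary log? (ROW, STATEMENT or MIXED)
SHOW VARIABLES LIKE 'binlog_format';

-- List base tables that have neither a PRIMARY KEY nor a UNIQUE constraint.
SELECT t.table_schema, t.table_name
FROM information_schema.tables AS t
LEFT JOIN information_schema.table_constraints AS c
       ON c.table_schema = t.table_schema
      AND c.table_name = t.table_name
      AND c.constraint_type IN ('PRIMARY KEY', 'UNIQUE')
WHERE t.table_type = 'BASE TABLE'
  AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema')
  AND c.constraint_name IS NULL;
```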

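If you've gotten this far, you deserve internet points.

At this point, I'm not sure what else to consider or where else to look for the reason the log is stalled. Any insight is appreciated.

Thanks.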

For reference, here is the MASTER STATUS and SLAVE STATUS output as of the time of this post:

MASTER STATUS

    mysql> show master status;
    +------------------+----------+--------------+------------------+
    | File             | Position | Binlog_Do_DB | Binlog_Ignore_DB |
    +------------------+----------+--------------+------------------+
    | mysql-bin.001306 | 20520499 |              |                  |
    +------------------+----------+--------------+------------------+
    1 row in set (0.00 sec)

SLAVE STATUS

    mysql> show slave status \G
    *************************** 1. row ***************************
    Slave_IO_State: Waiting for master to send event
    Master_Host: master-host
    Master_User: replication-user
    Master_Port: 3306
    Connect_Retry: 60
    Master_Log_File: mysql-bin.001306
    Read_Master_Log_Pos: 20520499
    Relay_Log_File: relay-bin.002601
    Relay_Log_Pos: 36573482
    Relay_Master_Log_File: mysql-bin.001302
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
    Replicate_Do_DB:
    Replicate_Ignore_DB:
    Replicate_Do_Table:
    Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
    Replicate_Wild_Ignore_Table:
    Last_Errno: 0
    Last_Error:
    Skip_Counter: 0
    Exec_Master_Log_Pos: 36573336
    Relay_Log_Space: 3565987462
    Until_Condition: None
    Until_Log_File:
    Until_Log_Pos: 0
    Master_SSL_Allowed: No
    Master_SSL_CA_File:
    Master_SSL_CA_Path:
    Master_SSL_Cert:
    Master_SSL_Cipher:
    Master_SSL_Key:
    Seconds_Behind_Master: 63435
    Master_SSL_Verify_Server_Cert: No
    Last_IO_Errno: 0
    Last_IO_Error:
    Last_SQL_Errno: 0
    Last_SQL_Error:
    Replicate_Ignore_Server_Ids:
    Master_Server_Id: 1
    1 row in set (0.00 sec)

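I was on the right track with the large query transactions from yesterday.

After migrating the data, I executed a DELETE statement on the original table to get rid of the rows I had migrated.

These tables are just full of tracking data and therefore do not have any primary or unique keys.

Because of how ROW-based replication works, the slave does not execute the same DELETE statement that was executed on the master. Instead, it executes a DELETE for each individual row, which ends up looking something like this: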

 DELETE FROM table WHERE colA=foo AND colB=bar AND colC=baz....etc

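And since there is no index that matches that query, the single-threaded replication SQL thread executed (or rather, was trying to execute) 40+ million delete statements. That takes a very long time to run because of all the scanning needed to identify each row (the table was about 80 million rows at the time).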
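
A rough way to see why each of those per-row lookups is expensive is to EXPLAIN an equivalent SELECT (on 5.5, EXPLAIN only accepts SELECT). The table name, column names, and values below are placeholders for illustration, not my real schema:

```sql
-- Hypothetical table and columns, mirroring the per-row DELETE shown above.
EXPLAIN
SELECT *
FROM tracking_table
WHERE colA = 'foo'
  AND colB = 'bar'
  AND colC = 'baz';
```

On a table with no usable index, I would expect the plan to report `type: ALL` with `key: NULL`, which is a full table scan for every row the slave has to locate.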

In the end, I dealt with this by stopping the slave threads (STOP SLAVE), skipping a single slave transaction (SET GLOBAL sql_slave_skip_counter = 1;), and restarting the slave threads (START SLAVE).
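
Written out as a sequence on the slave, those steps look like the following. The final status check is my addition, as a sanity check rather than a required step:

```sql
-- Run on the slave.
STOP SLAVE;

-- Skip the next event group; for a transactional table that is the whole
-- transaction, here the huge row-based DELETE.
SET GLOBAL sql_slave_skip_counter = 1;

START SLAVE;

-- Exec_Master_Log_Pos should start moving again and
-- Seconds_Behind_Master should begin to drop.
SHOW SLAVE STATUS\G
```

Keep in mind that skipping a transaction means the slave never applies it, so the affected table will differ from the master until it is repaired.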

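This left my master and slave out of sync on the table in question, but I was able to leverage the nature of row-based replication to bring them back in sync by executing the following on the master: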

    mysql> CREATE TABLE table_tmp LIKE table; -- same schema as 'table' (compare SHOW CREATE TABLE table;); 'table' is a placeholder name
    mysql> RENAME TABLE table TO table_bak, table_tmp TO table;
    mysql> INSERT INTO table ( SELECT * FROM table_bak );
    mysql> DROP TABLE table_bak;

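Since the DELETE had already been executed on the master, the INSERT here only inserted the records I wanted to keep (the deleted ones were gone). And since row-based replication inserts each row individually rather than executing the same INSERT INTO … SELECT statement, the slave's table was populated with only the desired data. The subsequent DROP TABLE statement then drops the old table on the slave without having to address each row individually.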
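
Once the slave has caught up, a quick comparison on both servers can confirm the table is back in sync. This is a minimal sketch with a placeholder table name. Checksums can differ between servers for reasons other than data (different versions or row formats, for example), so I would treat a mismatch as a prompt to look closer rather than as proof of drift:

```sql
-- Run on both master and slave after the slave has caught up.
-- 'tracking_table' is a placeholder for the rebuilt table.
SELECT COUNT(*) FROM tracking_table;

-- Optional stronger check; reads the whole table, so it is slow on large tables.
CHECKSUM TABLE tracking_table;
```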

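The caveat here is that the master's version of the table was still 30 to 40 million rows, so the insert and the corresponding replication end up locking up your slave for a little while (repeating the problem above). It is a much shorter stall, though (about 20 minutes in my case), because MySQL does not have to scan the table for rows to delete.

I hope this can help someone in the future. Sorry it's long-winded; I hope it was informative and helpful.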