MySQL主库xtrabackup备份导致应用异常

昨天有位同学问我:有台主库tps一下子跌没了,应用写入全部报错,她提供几个时间点给我。tps一下子跌没,我条件反射:莫非有全局读锁。开始排查
第一步:
查看数据监控 mysql thread running 23:47分左右开始上涨,到第二天01:17恢复正常。其他指标变化不是非常明显

又得到非常有用信息:因备库问题,在主库16:00启动xtrabackup,这时感觉找到问题的根源
第二步:
查看slow,一条子查询跑了23659S、flush跑了 5216S,这条子查询应该在19:00左右跑起,备份应该在23:45左右跑flush tables with read lock

# Time: 121225  1:09:54
# User@Host: xxx[xxx] @  [xxxx.xxxx.xxxx.xxxx]
# Query_time: 23659.989155  Lock_time: 0.000104 Rows_sent: 0  Rows_examined: 47501046480
SET timestamp=1356368994;
select id, parent_biz_order_id, buyer_id, status, type, gmt_modify from ttt where parent_biz_order_id 
in (select parent_biz_order_id from ttt where type='哈哈');
# Time: 121225  1:13:25
# User@Host: root[root] @ localhost []
# Query_time: 5216.430881  Lock_time: 0.000000 Rows_sent: 0  Rows_examined: 0
use mysql;
SET timestamp=1356369205;
FLUSH TABLES WITH READ LOCK;

第三步:
查看备份信息:备份脚本实际跑完(备份未成功)时间是在第二天00:02:15

+--------------+---------------------+---------------------+
| db_name      | start_time          | end_time |
+--------------+---------------------+---------------------+
| xxxxxxxxx | 2012-12-24 17:40:56 | 2012-12-25 00:02:15 |

从备份日志看出,23:46:24 xtrabackup触发flush tables with read lock,等了900秒后,因为close tables失败,最终导致备份失败。触发flush tables时间点+900S=xtrabackup脚本跑完时间end_time
备份脚本报错:
121224 23:46:24 innobackupex-1.5.1: Starting to lock all tables…
innobackupex-1.5.1: Error: Connection to mysql child process (pid=18366) timedout. (Time limit of 900 seconds exceeded. You may adjust time limit by editing the value of parameter “$mysql_response_timeout” in this script.) while waiting for reply to MySQL request: ‘FLUSH TABLES WITH READ LOCK;’ at /usr/bin/innobackupex-1.5.1 line 336.

xtrabackup源码解释:

# flush tables with read lockmysql_check();--没有问题
        mysql_lockall()if!$option_no_lock;--在执行flush tables with read lock时,mysql_send函数执行超时900S,备份失败
 
# timeout in seconds for a reply from mysqlmy$mysql_response_timeout=900;
 
sub mysql_send {my$request=shift;
 $current_mysql_request=$request;print MYSQL_WRITER "$request\n";mysql_check();$current_mysql_request='';}
 
sub mysql_lockall {$now= current_time();printSTDERR"$now  $prefix Starting to lock all tables...\n";
 
    mysql_send "USE mysql;";mysql_send "SET AUTOCOMMIT=0;";if(compare_versions($mysql_server_version,'4.0.22')==0|| compare_versions($mysql_server_version,'4.1.7')==0){# MySQL server version is 4.0.22 or 4.1.7mysql_send "COMMIT;";mysql_send "FLUSH TABLES WITH READ LOCK;";}else{# MySQL server version is other than 4.0.22 or 4.1.7mysql_send "FLUSH TABLES WITH READ LOCK;";mysql_send "COMMIT;";}write_binlog_info;write_slave_info if$option_slave_info;
 $now= current_time();printSTDERR"$now  $prefix All tables locked and flushed to disk\n";}

把整个过程连贯起来描述:

1、16:00 xtracbackup启动,因为xtrabackup会先copy innodb data file
2、19:00 查询sql跑起
3、23:45 xtrabackup加flush tables with read lock,跑成功这条语句要做三件事:1、加全局锁 2、关闭open tables 3、打上阻塞commit的flag。因为加全局锁成功,但在关闭open tables时,因为长查询,一直hang住
4、23:45以后,所有的写入都会被阻塞,应用不停创建连接,thread running升高
5、第二天1:09:54 sql跑完
6、第二天1:17:25thread running恢复正常

小结:还有几个问题没有解释,flush tables with read lock失败,thread running为什么没有在flush失败时间点下降?slow.log为什么记得5216S?

觉得文章有用?立即: 和朋友一起 共学习 共进步!

猜您喜欢

文章评论: “MySQL主库xtrabackup备份导致应用异常

发表评论

电子邮件地址不会被公开。 必填项已用 * 标注

*

您可以使用这些 HTML 标签和属性: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>