昨天有位同学问我:有台主库tps一下子跌没了,应用写入全部报错,她提供几个时间点给我。tps一下子跌没,我条件反射:莫非有全局读锁。开始排查
第一步:
查看数据监控 mysql thread running 23:47分左右开始上涨,到第二天01:17恢复正常。其他指标变化不是非常明显
又得到非常有用信息:因备库问题,在主库16:00启动xtrabackup,这时感觉找到问题的根源
第二步:
查看slow,一条子查询跑了23659S、flush跑了 5216S,这条子查询应该在19:00左右跑起,备份应该在23:45左右跑flush tables with read lock
# Time: 121225 1:09:54
# User@Host: xxx[xxx] @ [xxxx.xxxx.xxxx.xxxx]
# Query_time: 23659.989155 Lock_time: 0.000104 Rows_sent: 0 Rows_examined: 47501046480
SET timestamp=1356368994;
select id, parent_biz_order_id, buyer_id, status, type, gmt_modify from ttt where parent_biz_order_id
in (select parent_biz_order_id from ttt where type='哈哈');
# Time: 121225 1:13:25
# User@Host: root[root] @ localhost []
# Query_time: 5216.430881 Lock_time: 0.000000 Rows_sent: 0 Rows_examined: 0
use mysql;
SET timestamp=1356369205;
FLUSH TABLES WITH READ LOCK;
第三步:
查看备份信息:备份脚本实际跑完(备份未成功)时间是在第二天00:02:15
+--------------+---------------------+---------------------+
| db_name | start_time | end_time |
+--------------+---------------------+---------------------+
| xxxxxxxxx | 2012-12-24 17:40:56 | 2012-12-25 00:02:15 |
从备份日志看出,23:46:24 xtrabackup触发flush tables with read lock,等了900秒后,因为close tables失败,最终导致备份失败。触发flush tables时间点+900S=xtrabackup脚本跑完时间end_time
备份脚本报错:
121224 23:46:24 innobackupex-1.5.1: Starting to lock all tables…
innobackupex-1.5.1: Error: Connection to mysql child process (pid=18366) timedout. (Time limit of 900 seconds exceeded. You may adjust time limit by editing the value of parameter “$mysql_response_timeout” in this script.) while waiting for reply to MySQL request: ‘FLUSH TABLES WITH READ LOCK;’ at /usr/bin/innobackupex-1.5.1 line 336.
xtrabackup源码解释:
# flush tables with read lockmysql_check();--没有问题
mysql_lockall()if!$option_no_lock;--在执行flush tables with read lock时,mysql_send函数执行超时900S,备份失败
# timeout in seconds for a reply from mysqlmy$mysql_response_timeout=900;
sub mysql_send {my$request=shift;
$current_mysql_request=$request;print MYSQL_WRITER "$request\n";mysql_check();$current_mysql_request='';}
sub mysql_lockall {$now= current_time();printSTDERR"$now $prefix Starting to lock all tables...\n";
mysql_send "USE mysql;";mysql_send "SET AUTOCOMMIT=0;";if(compare_versions($mysql_server_version,'4.0.22')==0|| compare_versions($mysql_server_version,'4.1.7')==0){# MySQL server version is 4.0.22 or 4.1.7mysql_send "COMMIT;";mysql_send "FLUSH TABLES WITH READ LOCK;";}else{# MySQL server version is other than 4.0.22 or 4.1.7mysql_send "FLUSH TABLES WITH READ LOCK;";mysql_send "COMMIT;";}write_binlog_info;write_slave_info if$option_slave_info;
$now= current_time();printSTDERR"$now $prefix All tables locked and flushed to disk\n";}
把整个过程连贯起来描述:
1、16:00 xtracbackup启动,因为xtrabackup会先copy innodb data file
2、19:00 查询sql跑起
3、23:45 xtrabackup加flush tables with read lock,跑成功这条语句要做三件事:1、加全局锁 2、关闭open tables 3、打上阻塞commit的flag。因为加全局锁成功,但在关闭open tables时,因为长查询,一直hang住
4、23:45以后,所有的写入都会被阻塞,应用不停创建连接,thread running升高
5、第二天1:09:54 sql跑完
6、第二天1:17:25thread running恢复正常
小结:还有几个问题没有解释,flush tables with read lock失败,thread running为什么没有在flush失败时间点下降?slow.log为什么记得5216S?
觉得文章有用?立即:
和朋友一起 共学习 共进步!
猜您喜欢
HI