侧边栏壁纸
  • 累计撰写 35 篇文章
  • 累计创建 14 个标签
  • 累计收到 0 条评论

目 录CONTENT

文章目录

测试环境故障排查

子曰
2023-08-24 / 0 评论 / 0 点赞 / 679 阅读 / 1,820 字 / 正在检测是否收录...

1. 背景描述

环境:测试环境

相关库:eo_os

研发同学企业微信告警:1. 连接被拒绝 2. 查不到写入的数据

2. 相关日志

通过主机日志、数据库错误日志可以看到发生了oom 。

# 数据库虚拟机日志
# grep -i oom /var/log/messages
Aug 23 14:46:21 uat14-dqa212 kernel: zabbix_agent2 invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
Aug 23 14:46:21 uat14-dqa212 kernel: oom_kill_process+0x22c/0x440
Aug 23 14:46:21 uat14-dqa212 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Aug 23 14:46:21 uat14-dqa212 kernel: oom_reaper: reaped process 794 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
# 故障的数据库日志
(root@uat14-dqa212 2023-08-23 15:30:05 ~)
# tail -n 20 /data/logs/services/mysql/mysqld.log
2023-08-22T19:18:06.331064+08:00 19464703 [Warning] [MY-013526] [InnoDB] Resource not available to create threads for parallel scan. Falling back to single thread mode.
2023-08-22T19:18:06.331092+08:00 19464599 [Warning] [MY-013526] [InnoDB] Resource not available to create threads for parallel scan. Falling back to single thread mode.
2023-08-22T20:06:28.364290+08:00 19490974 [Warning] [MY-013526] [InnoDB] Resource not available to create threads for parallel scan. Falling back to single thread mode.
2023-08-22T20:06:28.364878+08:00 19490975 [Warning] [MY-013526] [InnoDB] Resource not available to create threads for parallel scan. Falling back to single thread mode.
2023-08-23T10:28:06.694028+08:00 19711940 [Warning] [MY-013526] [InnoDB] Resource not available to create threads for parallel scan. Falling back to single thread mode.
2023-08-23T10:28:06.694288+08:00 19711931 [Warning] [MY-013526] [InnoDB] Resource not available to create threads for parallel scan. Falling back to single thread mode.
2023-08-23T14:46:08.436678+08:00 0 [ERROR] [MY-000000] [connection_h] Error log throttle:          5 'Can't create thread to handle new connection' error(s) suppressed
2023-08-23T14:46:10.187339+08:00 0 [ERROR] [MY-010249] [Server] Can't create thread to handle new connection(errno= 11)
2023-08-23T14:46:21.694109+08:00 0 [Warning] [MY-010140] [Server] Could not increase number of max_open_files to more than 60000 (request: 150000)
2023-08-23T14:46:21.694127+08:00 0 [Warning] [MY-010142] [Server] Changed limits: table_open_cache: 14995 (requested 30000)
2023-08-23T14:46:21.946318+08:00 0 [Warning] [MY-010915] [Server] 'NO_ZERO_DATE', 'NO_ZERO_IN_DATE' and 'ERROR_FOR_DIVISION_BY_ZERO' sql modes should be used with strict mode. They will be merged with strict mode in a future release.
2023-08-23T14:46:21.948571+08:00 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.20) starting as process 17847
2023-08-23T14:46:22.039578+08:00 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2023-08-23T14:46:25.043805+08:00 1 [System] [MY-013577] [InnoDB] InnoDB initialization has ended.
2023-08-23T14:46:26.482279+08:00 0 [System] [MY-011323] [Server] X Plugin ready for connections. Socket: '/var/run/mysqld/mysqlx.sock' bind-address: '::' port: 33060
2023-08-23T14:46:30.062817+08:00 0 [System] [MY-010229] [Server] Starting XA crash recovery...
2023-08-23T14:46:30.093220+08:00 0 [System] [MY-010232] [Server] XA crash recovery finished.
2023-08-23T14:46:33.237451+08:00 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2023-08-23T14:46:33.294944+08:00 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.20'  socket: '/data/run/services/mysql/mysql.sock'  port: 61106  MySQL Community Server - GPL.
2023-08-23T14:46:33.294983+08:00 0 [System] [MY-013292] [Server] Admin interface ready for connections, address: '127.0.0.1'  port: 61105
# orchestrator的钩子服务日志
[2023-08-23 14:46:29.053] [main.go:47] [main.main] [ERROR] ------------------split line ----------------------
[2023-08-23 14:46:29.056] [mysql.go:26] [mysql.SelectDomain_dao.func1] [INFO] db result: commit
[2023-08-23 14:46:29.056] [logic.go:37] [logic.(*Info).Run] [INFO] =====【data】=====
[2023-08-23 14:46:29.056] [logic.go:38] [logic.(*Info).Run] [INFO] RWDomain: mysql-web-rw.db.uat14.dom
[2023-08-23 14:46:29.056] [logic.go:39] [logic.(*Info).Run] [INFO] RODomain: mysql-web-ro.db.uat14.dom
[2023-08-23 14:46:29.056] [logic.go:40] [logic.(*Info).Run] [INFO] OldMaster: 10.0.14.142
[2023-08-23 14:46:29.056] [logic.go:41] [logic.(*Info).Run] [INFO] NewMaster: 10.0.14.141
[2023-08-23 14:46:29.056] [logic.go:42] [logic.(*Info).Run] [INFO] Port: 61106

[2023-08-23 14:46:29.122] [dnsapi.go:48] [logic.(*Info).dnsapi_get] [INFO] dnsapi_get success. hosts:
[2023-08-23 14:46:29.122] [dnsapi.go:50] [logic.(*Info).dnsapi_get] [INFO] [10.0.14.142]
[2023-08-23 14:46:29.230] [dnsapi.go:102] [logic.(*Info).dnsapi_update] [INFO] dnsapi_update success. domain: [mysql-web-rw.db.uat14.dom] old: [10.0.14.142] new: [10.0.14.141]
[2023-08-23 14:46:29.250] [dnsapi.go:48] [logic.(*Info).dnsapi_get] [INFO] dnsapi_get success. hosts:
[2023-08-23 14:46:29.250] [dnsapi.go:50] [logic.(*Info).dnsapi_get] [INFO] [10.0.14.141]
[2023-08-23 14:46:29.250] [logic.go:77] [logic.(*Info).Run] [INFO] mysql-web-ro.db.uat14.dom a record <= 1, so do not to del.
[2023-08-23 14:46:29.251] [logic.go:108] [logic.(*Info).CheckPort] [ERROR] 10.0.14.142:61106 closed
[2023-08-23 14:46:29.251] [logic.go:93] [logic.(*Info).Run] [ERROR] self.CheckPort failed, so don't kill conn on old master.
[2023-08-23 14:46:29.252] [mysql.go:89] [mysql.UpdateInfo] [INFO] sql: update info set ip = ? where domain= ? and rw = 1 and isdel=0; RowsAffected: 1
[2023-08-23 14:46:29.253] [mysql.go:120] [mysql.UpdateInfo] [INFO] sql: update info set isdel = 1 where domain= ? and ip = ? and rw = 0 and isdel=0 RowsAffected: 1
[2023-08-23 14:46:29.271] [mysql.go:67] [mysql.UpdateInfo.func1] [INFO] db result: commit
[2023-08-23 14:46:29.271] [logic.go:101] [logic.(*Info).Run] [INFO] UpdateInfo done.
# 业务侧应用日志
# 第一部分
Aug 23, 2023 @ 14:46:30.859	message:{"artisan":"time-response-consumer-new-topic","trace_id":"","span_id":"","level":"ERROR","ts":"2023-08-23 14:46:29.529731","msg":"monitor-error-data-to-db-exception","context":{"msg":"SQLSTATE[HY000] [2002] Connection refused (SQL: select * from `eeo_course_class` where `class_id` = 428155 limit 1)","file":"/data/apps/business/eeo_classroom_business/vendor/illuminate/database/Connection.php","line":712}}

# 第二部分
Aug 23, 2023 @ 14:57:59.405	message:{"artisan":"time-response-consumer-new-topic","trace_id":"","span_id":"","level":"ERROR","ts":"2023-08-23 14:57:50.818676","msg":"classroom-listen-time-response class not found","context":{"data":{"class_id":"428303","member_uid":"412140","time_stamp":"1692773869","time_type":"1","platform_type":"0","os_type":"3","identity":"4","client_id":"0"}}}

时间线梳理:

14:46:08.436678 数据库报错,无法新建连接,此时内存已经到临界点了

14:46:21 oom,数据库开始重启

14:46:29.053 钩子程序发现数据库错误,开始切换域名

14:46:29.230 域名切换完成

14:46:33.294983 数据库重启完成

3. 问题分析

3.1 连接被拒绝

从日志中可以看到,连接被拒绝的时间点,数据库oom,在重启,此时域名还未切换,数据库也未重启完成,所以会出现连接拒绝,等到数据库域名切换完成,数据库连接恢复正常,没有报错了。

3.2 主从数据不一致

image-20230823180055620

关键点在这里,我们用了自增id,并且有实际的业务意义,主从切换之前,数据写入old master,此时还没同步到从库,主oom,所以从库缺少几条数据。

新业务连到new master后写入数据,是从当前表的最大自增id开始写数据的,所以拿old master的id 去查new master是查不出数据的,除非new master的自增id追上old master(此时的追上并不是说从old master库拉了数据,而是新业务的写入,自增id追到了老主)。

从业务上观察,可以看到,是有一些数据不一致的,在主从架构恢复的时候,数据是有主键冲突的错误报出来的。

0

评论区