KingbaseES V8R6 Cluster in Practice: the data Directory That Got "Migrated"
While handling an issue on a project some time ago, I ran into a production environment where someone had mv'ed the database's data directory into a separately mounted storage directory without stopping the cluster services first; after the cluster broke, the recovery attempt reported that the control file was inconsistent.
Diagnosing it was straightforward: ps shows the data directory path the server was started with, and ls -al shows where that path really lives; the discrepancy between the two reveals the problem.
Because it was a cluster environment, the data redundancy provided by high availability could be used to fix it. Below I reproduce the problem in a local environment.
1. Check the cluster status (identify the primary node):
[v8r6c7b12@192-168-218-218 ~]$ repmgr service status
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+-------+---------+--------------------
1 | node1 | primary | * running | | running | 10598 | no | n/a
2 | node2 | standby | running | node1 | running | 10178 | no | 0 second(s) ago
[v8r6c7b12@192-168-218-218 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:0c:29:55:4f:29 brd ff:ff:ff:ff:ff:ff
inet 192.168.218.218/24 brd 192.168.218.255 scope global noprefixroute ens32
valid_lft forever preferred_lft forever
inet 192.168.218.223/24 scope global secondary ens32:3
valid_lft forever preferred_lft forever
inet6 fe80::f9a9:abea:88a5:180a/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 52:54:00:1f:9c:81 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
valid_lft forever preferred_lft forever
4: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master virbr0 state DOWN group default qlen 1000
link/ether 52:54:00:1f:9c:81 brd ff:ff:ff:ff:ff:ff
Note: out of habit, after running the cluster status command I also run ip a and check the database's virtual (floating) IP as an extra confirmation of which node is the primary.
2. Brute-force mv the primary's data directory somewhere else
[v8r6c7b12@192-168-218-218 ~]$ ps -ef|grep kingbase|grep data
v8r6c7b+ 9986 1 0 20:39 ? 00:00:00 /home/v8r6c7b12/cluster/fcj/clusterdb/kingbase/bin/kingbase -D /home/v8r6c7b12/cluster/fcj/clusterdb/kingbase/data
[v8r6c7b12@192-168-218-218 ~]$ mkdir /home/v8r6c7b12/test/
[v8r6c7b12@192-168-218-218 ~]$ cp -af /home/v8r6c7b12/cluster/fcj/clusterdb/kingbase/data /home/v8r6c7b12/test/
[v8r6c7b12@192-168-218-218 ~]$ mv /home/v8r6c7b12/cluster/fcj/clusterdb/kingbase/data /home/v8r6c7b12/cluster/fcj/clusterdb/kingbase/data20250320
[v8r6c7b12@192-168-218-218 ~]$ ln -s /home/v8r6c7b12/test/data/ /home/v8r6c7b12/cluster/fcj/clusterdb/kingbase/data
[v8r6c7b12@192-168-218-218 ~]$
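At this point the ps plus ls -al check described in the introduction already exposes the problem. Roughly like this against the paths above (the listing is reconstructed for illustration rather than captured verbatim):
[v8r6c7b12@192-168-218-218 ~]$ ls -al /home/v8r6c7b12/cluster/fcj/clusterdb/kingbase/ | grep data
lrwxrwxrwx  1 v8r6c7b12 v8r6c7b12   26 Mar 20 21:16 data -> /home/v8r6c7b12/test/data/
drwx------ 23 v8r6c7b12 v8r6c7b12 4096 Mar 20 21:16 data20250320
ps still reports the original -D path under kingbase/, while ls -al shows that data there is now only a symlink to a copy taken while the server was running.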
3. Manually run a checkpoint, then restart the cluster:
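The checkpoint itself is not captured in the transcript below; it would typically be issued through ksql right before the restart, roughly as follows (a sketch only: the system user and test database are the defaults of this local install and may differ in your environment):
[v8r6c7b12@192-168-218-218 ~]$ ksql -U system -d test -c "checkpoint;"
CHECKPOINT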
[v8r6c7b12@192-168-218-218 ~]$ sys_monitor.sh restart
2025-03-20 21:17:07 Ready to stop all DB ...
Service process "node_export" was killed at process 16369
Service process "postgres_ex" was killed at process 16370
Service process "node_export" was killed at process 14897
Service process "postgres_ex" was killed at process 14898
2025-03-20 21:17:10 begin to stop repmgrd on "[192.168.218.218]".
2025-03-20 21:17:11 repmgrd on "[192.168.218.218]" stop success.
2025-03-20 21:17:11 begin to stop repmgrd on "[192.168.218.219]".
2025-03-20 21:17:11 repmgrd on "[192.168.218.219]" stop success.
2025-03-20 21:17:11 begin to stop DB on "[192.168.218.219]".
waiting for server to shut down .... done
server stopped
2025-03-20 21:17:12 DB on "[192.168.218.219]" stop success.
2025-03-20 21:17:12 begin to stop DB on "[192.168.218.218]".
waiting for server to shut down ............................................................... failed
sys_ctl: server does not shut down
2025-03-20 21:18:13 DB on "[192.168.218.218]" stop success.
2025-03-20 21:18:13 Done.
2025-03-20 21:18:13 Ready to start all DB ...
2025-03-20 21:18:13 begin to start DB on "[192.168.218.218]".
sys_ctl: another server process might be running; trying to start the server anyway
waiting for server to start .............. stopped waiting
sys_ctl: could not start server
Examine the log output.
2025-03-20 21:18:24 execute to start DB on "[192.168.218.218]" failed.
2025-03-20 21:18:24 Start DB on localhost(192.168.218.218) failed, will do nothing and exit.
The primary failed to start. The database log shows the following error stack and related messages:
2025-03-20 21:18:40.075 CST,,,18825,,67dc15b0.4989,1,,2025-03-20 21:18:40 CST,,0,LOG,00000,"database system was shut down at 2025-03-20 21:17:12 CST",,,,,,,,,""
2025-03-20 21:18:40.075 CST,,,18825,,67dc15b0.4989,2,,2025-03-20 21:18:40 CST,,0,LOG,00000,"invalid primary checkpoint record",,,,,,,,,""
2025-03-20 21:18:40.075 CST,,,18825,,67dc15b0.4989,3,,2025-03-20 21:18:40 CST,,0,PANIC,XX000,"could not locate a valid checkpoint record",,,,,,,,,""
2025-03-20 21:18:40.076 CST,,,18825,,67dc15b0.4989,4,,2025-03-20 21:18:40 CST,,0,LOG,00000,"kingbase ran into a problem it couldn't handle,it needs to be shutdown to prevent damage to your data",,,,,,,,,""
2025-03-20 21:18:40.082 CST,,,18825,,67dc15b0.4989,5,,2025-03-20 21:18:40 CST,,0,WARNING,01000,"
ERROR: -----------------------stack error start-----------------------
ERROR: TIME: 2025-03-20 21:18:40.076144+08
ERROR: 1 18825 0x7f3b979b85a5 debug_backtrace (backtrace.so)
ERROR: 2 18825 0x7f3b979b864a <symbol not found> (backtrace.so)
ERROR: 3 18825 0x7f3ba3d615d0 <symbol not found> (libpthread.so.0)
ERROR: 4 18825 0x7f3ba2131207 gsignal (libc.so.6)
ERROR: 5 18825 0x7f3ba21328f8 abort (libc.so.6)
ERROR: 6 18825 0x94095a errfinish + 0x5e80e1aa
ERROR: 7 18825 0x559d47 StartupXLOG + 0x5e427597
ERROR: 8 18825 0x79fa91 StartupProcessMain + 0x5e66d2e1
ERROR: 9 18825 0x567265 AuxiliaryProcessMain + 0x5e434ab5
ERROR: 10 18825 0x79b3f9 StartChildProcess + 0x5e668c49
ERROR: 11 18825 0x79e439 PostmasterMain + 0x5e66bc89
ERROR: 12 18825 0x4b34fd main + 0x5e380d4d
ERROR: 13 18825 0x7f3ba211d3d5 __libc_start_main (libc.so.6)
ERROR: 14 18825 0x4b359b _start + 0x5e3962bb
",,,,,,,,,""
2025-03-20 21:18:42.276 CST,,,18823,,67dc15b0.4987,1,,2025-03-20 21:18:40 CST,,0,LOG,00000,"startup process (PID 18825) was terminated by signal 6: Aborted",,,,,,,,,""
2025-03-20 21:18:42.276 CST,,,18823,,67dc15b0.4987,2,,2025-03-20 21:18:40 CST,,0,LOG,00000,"aborting startup due to startup process failure",,,,,,,,,""
2025-03-20 21:18:42.353 CST,,,18823,,67dc15b0.4987,3,,2025-03-20 21:18:40 CST,,0,LOG,00000,"database system is shut down",,,,,,,,,""
Given my limited depth and the urgency at the time, I did not chase the problem with gdb. Seeing the errors "invalid primary checkpoint record" and "could not locate a valid checkpoint record", I went to look at the sys_control file: its last modification time, size, and related details. With the sys_controldata tool I inspected the data directory that matched the standby's, and then the other backup data directory, the one located alongside the symlink's target whose modification time was the current time. Combining that with the sys_controldata output on the standby and the last modification times of the files under data/base, I concluded that the root cause was migrating the data directory without stopping the database service first. A problem like this is a once-in-a-career sight, and I spent a long time in local testing without being able to reproduce the exact scenario.
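For reference, the comparison boils down to something like the commands below (a sketch only: I am assuming sys_controldata accepts the data directory as a positional argument, the same way PostgreSQL's pg_controldata does, and that the control file lives at global/sys_control; the paths are those of the local reproduction above):
[v8r6c7b12@192-168-218-218 ~]$ /home/v8r6c7b12/cluster/fcj/clusterdb/kingbase/bin/sys_controldata /home/v8r6c7b12/test/data | grep -i checkpoint
[v8r6c7b12@192-168-218-218 ~]$ /home/v8r6c7b12/cluster/fcj/clusterdb/kingbase/bin/sys_controldata /home/v8r6c7b12/cluster/fcj/clusterdb/kingbase/data20250320 | grep -i checkpoint
[v8r6c7b12@192-168-218-218 ~]$ ls -al /home/v8r6c7b12/test/data/global/sys_control /home/v8r6c7b12/cluster/fcj/clusterdb/kingbase/data20250320/global/sys_control
Comparing the checkpoint fields and the control files' modification times across the two directories (and against the standby) shows which copy the server was actually writing to and which one went stale, which is what pointed to the un-stopped migration as the root cause.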
All attempts to fix the problem on the primary ended in failure. It then occurred to me to rebuild the standby first, manually run a switchover afterwards to promote the original standby to the new primary, and finally re-run the clone on the original primary to bring the cluster back. The specific steps:
1) Stop the standby database service: /opt/Kingbase/ES/V8/Server/bin/sys_ctl -D /opt/Kingbase/ES/V8/data -m fast stop
2) Rename the data directory: mv /opt/Kingbase/ES/V8/data /opt/Kingbase/ES/V8/data.bak
3) Rebuild the standby: repmgr -h <primary IP> -Uesrep -desrep --fast-checkpoint standby clone
4) Manually switch primary and standby: repmgr standby switchover
5) Stop the new standby (the original primary) database service: /opt/Kingbase/ES/V8/Server/bin/sys_ctl -D /opt/Kingbase/ES/V8/data -m fast stop
6) Rebuild the standby: repmgr -h <primary IP> -Uesrep -desrep --fast-checkpoint standby clone
7) After the rebuild, start the standby service: /opt/Kingbase/ES/V8/Server/bin/sys_ctl -D /opt/Kingbase/ES/V8/data start
8) Register the standby (or have it follow the primary): repmgr -h <primary IP> -Uesrep -desrep -F standby register/follow
9) Check the cluster: repmgr cluster show or repmgr service status
With that, this absurd problem was finally resolved. Because I could not reproduce locally the exact operation that was performed back then, I cannot attach screenshots of a faithful reproduction and can only lay out the concrete commands above as a recovery approach. Hopefully nobody in their right mind would ever perform such a baffling operation.
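Note: if a data directory genuinely has to be moved onto dedicated storage, the sane order is the opposite of what happened here, and it belongs in a maintenance window. A minimal sketch, assuming sys_monitor.sh stop/start control the whole cluster the same way the restart above does, and with /new_storage standing in for the real mount point:
[v8r6c7b12@192-168-218-218 ~]$ sys_monitor.sh stop
[v8r6c7b12@192-168-218-218 ~]$ mv /home/v8r6c7b12/cluster/fcj/clusterdb/kingbase/data /new_storage/data
[v8r6c7b12@192-168-218-218 ~]$ ln -s /new_storage/data /home/v8r6c7b12/cluster/fcj/clusterdb/kingbase/data
[v8r6c7b12@192-168-218-218 ~]$ sys_monitor.sh start
Stop everything first, move the directory with the service down, make the old path resolve to the new location (or repoint -D), and only then start the cluster again.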