


  • Logs
  • Views
  • WDR reports
  • Error codes
  • Core files
  • ffic logs



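Type | Path | File name | Description
System log | $GAUSSLOG/cm/cm_agent | system_call-current.log | cm_agent component log. Records informational messages about the arbitration commands issued by CM Server. Files without the "current" marker are historical logs; files with it are the current log.
System log | $GAUSSLOG/cm/om_monitor | om_monitor-%Y-%m-%d_%H%M%S-current.log | om_monitor component log. Records the liveness of etcd and CM Agent. Files without the "current" marker are historical logs; files with it are the current log.
WAL log | <instance data directory>/pg_xlog | 24 hexadecimal digits, for example 00000001000000000000001C | The content depends on the type of transaction being recorded. After a system crash, the WAL can be used for recovery.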
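The 24-digit WAL file name can be read as three 8-digit fields. The sketch below assumes the PostgreSQL-style layout (timeline ID, log ID, segment ID), which the table does not spell out; `split_wal_name` is a hypothetical helper name, not a GaussDB tool.

```shell
# Split a 24-hex-digit WAL segment file name into three 8-digit fields.
# Assumption: PostgreSQL-style layout (timeline, log id, segment id).
split_wal_name() {
    name=$1
    # Reject anything that is not exactly 24 hexadecimal characters.
    [ "${#name}" -eq 24 ] || return 1
    case $name in
        *[!0-9A-Fa-f]*) return 1 ;;
    esac
    timeline=$(printf '%s' "$name" | cut -c1-8)
    logid=$(printf '%s' "$name" | cut -c9-16)
    segid=$(printf '%s' "$name" | cut -c17-24)
    printf 'timeline=%s log=%s seg=%s\n' "$timeline" "$logid" "$segid"
}

# Example with the file name from the table above.
split_wal_name 00000001000000000000001C
```

A name that fails the length or hex check makes the function return non-zero, which helps when filtering a pg_xlog directory listing that also contains non-segment files.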

📖 Typical order for checking logs: cm_ctl/gs_guc/gs_ctl logs => om_monitor logs => cm_agent logs => cm_server logs => database run logs

Failures to reach the CM Server on another node, as recorded in the cm_agent log:

2025-01-09 09:38:21.181 tid=365425 AgentConnServerMain_2 ASYN ERROR: connect to cm server failed! The 1st of cm server node id is = 2, listenCount(1: 1).
2025-01-09 09:38:21.181 tid=365424 AgentConnServerMain_1 ASYN ERROR: 309: connect to cm_server failed, host= port=30200 localhost= connect_timeout=1 node_id=2 node_name= remote_type=7. could not connect to server:
        Is the server running on host "" and accepting
        TCP/IP connections on port 30200?
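When the same failure repeats many times, it can help to reduce each error to the fields that matter. A minimal sketch, assuming the log line format shown in the excerpt above; `cm_connect_failures` is a hypothetical helper that reads log text on stdin.

```shell
# Summarize "connect to cm_server failed" errors from cm_agent log text on stdin.
# Prints one "date time node_id=N port=P" line per matching error.
cm_connect_failures() {
    awk '/ERROR:/ && /connect to cm_server failed/ {
        node = ""; port = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^node_id=/) { node = $i; sub(/^node_id=/, "", node) }
            if ($i ~ /^port=/)    { port = $i; sub(/^port=/, "", port) }
        }
        print $1, $2, "node_id=" node, "port=" port
    }'
}
```

Usage would look like `cm_connect_failures < /path/to/cm_agent.log`, where the path is a placeholder for the actual cm_agent log file.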


2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: idx is 0, primary(1) ha heartbeat is 0.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: cmserver on node(1) is down, heartbeat_of_primary=0, and then choose to promte primary.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: g_pre_agent_conn_count=1, primaryNodeId is 1, curInstIdx is 1, g_cmRole is 2, g_delayTimeout is 13.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: idx=0, node=1, instId is 1, heartbeat=0, etcdHeartbeat=50895, primaryNodeId=1.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: idx=1, node=2, instId is 2, heartbeat=6, etcdHeartbeat=49822, primaryNodeId=1.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: minNodeId=2, nodeIndex=1, currentNode=2, role=2, minNodeIdForCmId=2.
2025-01-09 09:38:27.198 tid=3522231 ETCD_HA ASYN LOG: find the min cm id=2, the cm could be the best primary.
2025-01-09 09:38:27.199 tid=3522231 ETCD_HA ASYN LOG: cm_delay_arbitrate_time_out End: server_node_index = 1, g_pre_agent_conn_count=1, g_cmRole is 2.
2025-01-09 09:38:27.199 tid=3522231 ETCD_HA ASYN LOG: [CmSetPrimary2Etcd]: node(2) role is 2, ready to promote.
2025-01-09 09:38:27.203 tid=3522231 ETCD_HA ASYN LOG: [CmSetPrimary2Etcd]: node(2) last role is 2, promote to primary.
2025-01-09 09:38:27.203 tid=3522231 ETCD_HA ASYN LOG: [CmSetPrimary2Etcd]: pre_agent_count is 1, node(2) cm role is 2, to primary.


2025-01-09 09:38:29.070 tid=3522281 HA_MAIN ASYN LOG: current node is 2, change it's role to primary.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: cm_server_current_role is 1. cm_server_last_role is 2.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: Promoted to PRIMARY. Do variable reset and reload.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: cms change to primary, will clean finish redo time.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: cms change to primary, will clean cma fault time.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: cms change to primary, will clean switchover command.
2025-01-09 09:38:29.071 tid=3522281 HA_MAIN ASYN LOG: Setting arbitration_majority_reelection_timeout to 10.


2025-01-09 09:38:34.062 tid=3522280 AGENT_WORKER ASYN LOG: [ChangeDnPrimaryMemberIndex]: line 54: instd(6002) instTypePur is (1: Primary), instTypeSor is (2: Standby), peerInstId is 0.
2025-01-09 09:38:34.062 tid=3522280 AGENT_WORKER ASYN LOG: [ChangeDnPrimaryMemberIndex]: 67: instance(6001) static role(Primary) will change to be Standby.
2025-01-09 09:38:34.062 tid=3522280 AGENT_WORKER ASYN LOG: [ChangeDnPrimaryMemberIndex]: 60: instance(6002) static role(Standby) will change to be Primary.
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: [KeyEvent: KEY_EVENT_FAILOVER] [Instance: 6002] [Details: Failover message has sent to instance 6002, term 103, sendFailoverTimes is 0.]
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: line 185: new arbitra node 0, instanceId 6001, static_role 2=Standby, local_dynamic_role 0=Unknown, local_term=0, local_redo_finished = 0, local_last_xlog_location=0/0, local_db_state 0=Unknown, local_sync_state=0, build_reason 0=Normal, double_restarting=0, group_term=103, sendFailoverTimes=0
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: line 185: new arbitra node 1, instanceId 6002, static_role 1=Primary, local_dynamic_role 2=Standby, local_term=4, local_redo_finished = 1, local_last_xlog_location=0/44FA5010, local_db_state 2=Need repair, local_sync_state=0, build_reason 2=Disconnected, double_restarting=0, group_term=103, sendFailoverTimes=0
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: line 185: new arbitra node 2, instanceId 6003, static_role 2=Standby, local_dynamic_role 2=Standby, local_term=4, local_redo_finished = 1, local_last_xlog_location=0/44FA5010, local_db_state 2=Need repair, local_sync_state=0, build_reason 2=Disconnected, double_restarting=0, group_term=103, sendFailoverTimes=0
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN WARNING: this cluster has no coordinator, no need to notify cn.
2025-01-09 09:38:34.069 tid=3522280 AGENT_WORKER ASYN LOG: [SendFailoverQuarm], line 2139: Failover message has sent to instance 6002 in reduce standy condition(0), local promoting.
2025-01-09 09:38:34.069 tid=3522274 AGENT_IO ASYN LOG: cmserver send msg to node 2, msgtype: MSG_CM_AGENT_FAILOVER


2025-01-09 09:38:38.435 tid=3522266 Monitor ASYN LOG: arbitration_majority_reelection_timeout elapsed into 0. Majority re-election enabled now.
2025-01-09 09:38:38.435 tid=3522266 Monitor ASYN LOG: instance(6001) heartbeat timeout, heartbeat:11, threshold:6
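In a long log, the arbitration milestones are easier to follow once the routine lines are filtered out. The sketch below keeps only lines containing phrases that appear in the excerpt above (key events, promotion messages, heartbeat timeouts); the pattern list is an assumption drawn from this one excerpt, and `cm_key_events` is a hypothetical helper name.

```shell
# Keep only arbitration milestones from CM log text on stdin.
# The patterns are taken from the sample lines above and are not exhaustive.
cm_key_events() {
    grep -E 'KeyEvent: |Promoted to PRIMARY|promote to primary|heartbeat timeout'
}
```

Usage would look like `cm_key_events < /path/to/cm_server.log`, with the path as a placeholder. Extending the pattern list is the natural way to adapt this to other incidents.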

CM Agent and etcd startup records in the om_monitor log after the failed node recovers:

2025-01-09 09:39:02.288 tid=2826  LOG: The CM Agent startup check is complete: cluster_manual_start=0, agent_config_file_r=1, agent_binary_file_x=1, config_change_flag=0, previous_status=0, start_count=0.
2025-01-09 09:39:02.288 tid=2826  LOG: cm_agent start, pid is 2827
[cm_agent]: cmserverNum is 3, and cmserver info is [0 node:1, cmserverId:1, cmServerIndex:0], [1 node:2, cmserverId:2, cmServerIndex:1], [2 node:3, cmserverId:3, cmServerIndex:2], .
 2025-01-09 09:39:02.383 tid=2826  LOG: run check etcd log-outputs command: /gauss/app/cluster/core/app/bin/etcd --help | grep "\-\-log-outputs" success
2025-01-09 09:39:02.384 tid=2826  LOG: run etcd command: umask=`umask`;umask 0077;/gauss/app/cluster/core/app/bin/etcd  -name ...


2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: auto switchover instanceid=6001, wait_seconds=120.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: [SetSwitchoverInSwitchoverProcess], instd(6001) localRole is (2: Standby), cmd[cmdPur(1: Primary), cmdSour(2: Standby), cmdPur(0: Unknown), peerIdx: 6002] timeout is 120, delayTime is 0.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: dn instanceid=6003, primaryFlushLocation=00000000/45BDE228, standbyReplayLocation=00000000/45BDE228, curGap=00000000/00000000, maxGap=00000001/2C000000.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: dn instanceid=6001, primaryFlushLocation=00000000/45BDE228, standbyReplayLocation=00000000/45BDE228, curGap=00000000/00000000, maxGap=00000001/2C000000.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: [SwitchoverDone]: inst(6001) is doing switchover.
2025-01-09 14:24:31.568 tid=3522276 CTL_WORKER ASYN LOG: the balance state is 3 by DN.
2025-01-09 14:24:31.951 tid=3522280 AGENT_WORKER ASYN LOG: instId(6002) may be doing switchover, switchoverInstId is 6001.
2025-01-09 14:24:31.951 tid=3522280 AGENT_WORKER ASYN LOG: [Primary], 6002: another instance (6001) is doing[3/11], pendStatus count is 1, cannot to do arbitrate.
2025-01-09 14:24:32.008 tid=3522279 AGENT_WORKER ASYN LOG: send switchover to instance(6001) for [1/4] times.
2025-01-09 14:24:32.008 tid=3522279 AGENT_WORKER ASYN LOG: [KeyEvent: KEY_EVENT_SWITCHOVER] [Instance: 6001] [Details: send switchover message, node=1, instance=6001]
2025-01-09 14:24:32.008 tid=3522273 AGENT_IO ASYN LOG: cmserver send msg to node 1, msgtype: MSG_CM_AGENT_SWITCHOVER


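  • pg_stat_activity: shows the state of each session on the current instance.
  • pg_thread_wait_status: shows the wait events of each thread on the current instance.
  • pg_locks: shows the lock state on the current instance.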


select datname,sessionid,usename,application_name,client_addr,client_hostname,state,query_id,query from pg_stat_activity;

select * from pg_thread_wait_status where wait_status<>'none';

select locktype,database,relation,virtualxid,transactionid,objid,sessionid,mode,granted from pg_locks;



select * from snapshot.snapshot;

select create_wdr_snapshot();

select * from pg_node_env;   -- check the current node name
\o /home/omm/wdr_20241122_node.html
select generate_wdr_report(504,505,'all','node','dn_6001');   
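The report step can be parameterized so the snapshot IDs, node name, and output file are not hand-edited each time. A minimal sketch, assuming the begin snapshot ID must be smaller than the end snapshot ID; `wdr_report_script` is a hypothetical helper that only prints the gsql script and does not connect to the database.

```shell
# Emit a gsql script that writes a node-level WDR report between two snapshots.
# Arguments: begin snapshot id, end snapshot id, node name, output file.
wdr_report_script() {
    begin=$1; end=$2; node=$3; out=$4
    # Snapshot ids must be non-empty integers.
    case $begin in ''|*[!0-9]*) return 1 ;; esac
    case $end in ''|*[!0-9]*) return 1 ;; esac
    # Assumption: the begin snapshot must precede the end snapshot.
    [ "$begin" -lt "$end" ] || return 1
    printf '\\o %s\n' "$out"
    printf "select generate_wdr_report(%s,%s,'all','node','%s');\n" "$begin" "$end" "$node"
}
```

The output could then be piped into gsql, for example `wdr_report_script 504 505 dn_6001 /home/omm/wdr_20241122_node.html | gsql -d postgres -p <port>`, where the database name and port are placeholders for the local connection settings.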


📖 For explanations of GaussDB error codes, see: https://support.huaweicloud.com/errorcode-dws/dws_08_0003.html



🕷 Enabling core files has some performance impact, and the impact is greater when processes crash frequently.

Check whether core dump files are enabled:

gs_guc check -Z datanode -N all -I all -c "enable_bbox_dump"

Enable core dump files:

gs_guc set -Z datanode -N all -I all -c "enable_bbox_dump=on"

mkdir /gauss/corefiles
chmod 750 /gauss/corefiles
gs_guc set -Z datanode -N all -I all -c "bbox_dump_path='/gauss/corefiles'"

gs_guc set -Z datanode -N all -I all -c "bbox_dump_count=4"
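Before relying on the dump path, it may be worth confirming that the directory really exists with the intended permissions on each node. A minimal sketch; `check_core_dir` is a hypothetical helper, and the expected `drwxr-x---` string simply mirrors the `chmod 750` used above.

```shell
# Check that a core dump directory exists, is writable by the current user,
# and has mode 750 (shown by ls as drwxr-x---).
check_core_dir() {
    dir=$1
    [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
    [ -w "$dir" ] || { echo "not writable: $dir"; return 1; }
    perms=$(ls -ld "$dir" | cut -c1-10)
    [ "$perms" = "drwxr-x---" ] || { echo "unexpected mode $perms: $dir"; return 1; }
    echo "ok: $dir"
}
```

Running `check_core_dir /gauss/corefiles` as the database OS user on every node would catch a missing or mis-permissioned directory before a crash needs it.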




gs_guc check -Z datanode -N all -I all -c "enable_ffic_log"


gs_guc set -Z datanode -N all -I all -c "enable_ffic_log=on"



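  • The signal that caused the database process to fail;
  • The call stack of the faulting thread;
  • The unique SQL ID of the SQL statement that triggered the process failure;
  • CPU register contents at the time of the failure;
  • The program counter (PC) of the faulting thread;
  • Memory mapping information at the time of the failure;
  • Database parameter settings.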


