【Manager Handbook for Distributed AntDB-T】Data Collection Tasks and Node Monitoring Tasks
Data collection tasks
After a data collection task is started, the agent on each host collects the corresponding information and stores it in the relevant adbmgr tables. See the section on related tables for a description of these tables.
Configure host resource collection task
add job usage_for_host (interval= 60,command = 'select monitor_get_hostinfo();');
Task description: Every 60 seconds, collect host information, including CPU, memory, disk, network, and other dimensions.
Configure the database resource collection task
add job usage_for_adb (interval= 60,command = 'select monitor_databaseitem_insert_data();');
Task description: Every 60 seconds, collect database information, including database size, archive information, commit and rollback rates, streaming replication latency, long transactions, and other information.
Configure database performance index collection task
add job tps_for_adb (interval= 60,command = 'select monitor_databasetps_insert_data();');
Task description: Every 60 seconds, collect database TPS and QPS information.
Configure database slow SQL monitoring task
add job slowlog_for_adb (interval= 60,command = 'select monitor_slowlog_insert_data();');
Task description: Every 60 seconds, collect slow SQL information. This task requires the pg_stat_statements plugin.
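Because the slow SQL job depends on a statistics plugin, it can help to confirm the plugin is available before adding the job. The queries below are a minimal sketch using the standard PostgreSQL catalog pg_extension. They assume the plugin is the upstream PostgreSQL extension pg_stat_statements and that the queries are run in a database on a coordinator node rather than in adbmgr; neither assumption is stated in this handbook.

```sql
-- Assumption: run against a coordinator database, not against adbmgr.
-- Returns one row if the extension has been created in the current database.
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pg_stat_statements';

-- In upstream PostgreSQL the module must also be preloaded at server start;
-- this shows whether it appears in the preload list.
SHOW shared_preload_libraries;
```

If the first query returns no rows, the extension has not been created in that database and the slow SQL job will most likely have nothing to collect.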
Node monitoring tasks
Configure the coordinator monitoring task
add job mon_coord (interval = 5, status = true,command ='select monitor_handle_coordinator()' );
Task description: Every 5 seconds, check whether any coordinator node has failed. If so, retry the connection; after three failed retries, remove the failed coordinator node from the cluster.
Configure gtmcoord monitoring task
add job mon_gtmcoord (interval = 5,status=true,command='select monitor_handle_gtmcoord()');
Task description: Every 5 seconds, check whether gtmcoord has failed. If so, retry the connection; after three failed retries, perform the failover gtmcoord operation.
Configure datanode monitoring task
add job mon_datanode (interval = 5,status=true,command='select monitor_handle_datanode()');
Task description: Every 5 seconds, check whether any datanode master has failed. If so, retry the connection; after three failed retries, perform the failover datanode operation. Each run of the job handles one failed datanode master.
All of the tasks above use default parameters. To override the default values, pass them explicitly in the command, as follows:
add job mon_datanode (interval = 5,status=true,command='select monitor_handle_datanode('''',true,5,3,20)');
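The run of four single quotes in the command above follows ordinary SQL string quoting: the command value is itself a single-quoted string, so each quote inside it has to be doubled. The sketch below only illustrates that quoting rule on any PostgreSQL-compatible session; it does not describe what the individual arguments of monitor_handle_datanode mean, which this section does not specify.

```sql
-- Inside a single-quoted literal, '' stands for one quote character,
-- so '''' in the literal becomes '' (an empty string argument) in the stored text.
SELECT 'select monitor_handle_datanode('''',true,5,3,20)' AS stored_command;
-- stored_command: select monitor_handle_datanode('',true,5,3,20)
```

The same doubling applies to any other string argument placed inside the command value of add job.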
Note:
After a node monitoring task is added, it blocks the stop all and start all operations.
While the status of a monitoring task is true, both operations fail with the following message:
postgres=# stop all mode fast;
ERROR: on job table, the content of job "mon_coord" includes "monitor_handle_coordinator" string and its status is "on"; you need do "ALTER JOB "mon_coord" (STATUS=false);" to alter its status to "off" or set "adbmonitor=off" in postgresql.conf of ADBMGR to turn all job off which can be made effect by mgr_ctl reload
HINT: try "list job" for more information
postgres=# start all;
ERROR: on job table, the content of job "mon_coord" includes "monitor_handle_coordinator" string and its status is "on"; you need do "ALTER JOB "mon_coord" (STATUS=false);" to alter its status to "off" or set "adbmonitor=off" in postgresql.conf of ADBMGR to turn all job off which can be made effect by mgr_ctl reload
HINT: try "list job" for more information
postgres=#
Workaround: Temporarily set the status of the node monitoring task to false; see the Suspend Task section below.
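The sequence suggested by the error message can be sketched as follows. The ALTER JOB form is copied from the message above and the job names are the ones used in this section. Setting STATUS=true afterwards to resume monitoring is an assumption made by symmetry and is not shown in the output above.

```sql
-- Suspend every node monitoring job that was added in this section.
ALTER JOB "mon_coord" (STATUS=false);
ALTER JOB "mon_gtmcoord" (STATUS=false);
ALTER JOB "mon_datanode" (STATUS=false);

-- With the jobs off, the job check should no longer reject these operations.
stop all mode fast;
start all;

-- Assumption: resume monitoring by switching the jobs back on.
ALTER JOB "mon_coord" (STATUS=true);
ALTER JOB "mon_gtmcoord" (STATUS=true);
ALTER JOB "mon_datanode" (STATUS=true);
```

As the hint in the error output suggests, list job can be used before and after to check the status of each job.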