Self-healing example
After the self-healing function is turned on, adbmgr responds to a node failure by trying to pull the node up (restart it) or switch over, so that business continuity is maintained.
Let's kill a datanode master node and see whether it recovers automatically.
• Kill the node
Select dn2_1 to kill:
postgres=# monitor datanode master dn2_1;
nodename | nodetype | status | description | host | port | recovery | boot time
----------+-----------------+--------+-------------+--------------+-------+----------+-------------------------------
dn2_1 | datanode master | t | running | 10.21.20.175 | 52541 | false | 2019-10-16 16:20:06.225503+08
(1 row)
[antdb@intel175 ~]$ ps xuf|grep dn2_1
antdb 35846 0.0 0.0 112712 980 pts/56 S+ 16:54 0:00 \_ grep --color=auto dn2_1
antdb 11456 0.0 0.0 442624 92208 ? S 16:20 0:00 /data/danghb/app/adb50/bin/postgres --datanode -D /data/danghb/data/adb50/d1/dn2_1 -i
antdb 12788 0.0 0.0 358948 6908 ? Ss 16:22 0:00 \_ adbmgr: antdb doctor node monitor dn2_1
[antdb@intel175 ~]$ kill -9 11456
postgres=# monitor datanode master dn2_1;
WARNING: datanode master dn2_1 recovery status is unknown
nodename | nodetype | status | description | host | port | recovery | boot time
----------+-----------------+--------+-------------+--------------+-------+----------+-----------
dn2_1 | datanode master | f | not running | 10.21.20.175 | 52541 | unknown | unknow
(1 row)
• Observe the node status
After waiting for a few seconds, observe the node status again:
postgres=# monitor datanode master dn2_1;
nodename | nodetype | status | description | host | port | recovery | boot time
----------+-----------------+--------+-------------+--------------+-------+----------+-------------------------------
dn2_1 | datanode master | t | running | 10.21.20.175 | 52541 | false | 2019-10-16 16:55:10.935821+08
(1 row)
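Instead of re-running the command by hand, the check can be polled from a script. The following is a minimal sketch, not part of AntDB: parse_monitor_row and wait_until_running are hypothetical helper names, and fetch_output is an assumed callable that returns the text printed by the monitor command (how that text is obtained, for example through psql connected to adbmgr, is left to the caller).

```python
import time
from typing import Callable, Optional


def parse_monitor_row(output: str, nodename: str) -> Optional[dict]:
    """Return the table row for `nodename` from `monitor` output, or None."""
    header = None
    for line in output.splitlines():
        if "|" not in line:
            continue  # prompt, WARNING, "----+----" separator, "(1 row)"
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first table line holds the column names
        elif cells[0] == nodename:
            return dict(zip(header, cells))
    return None


def wait_until_running(
    fetch_output: Callable[[], str],  # assumption: returns monitor output text
    nodename: str,
    timeout: float = 60.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Poll until the node's status column is 't'; return None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        row = parse_monitor_row(fetch_output(), nodename)
        if row is not None and row.get("status") == "t":
            return row
        if time.monotonic() >= deadline:
            return None
        sleep(interval)
```

The returned dictionary uses the column names from the output as keys, so row["boot time"] can be compared with the value seen before the kill to confirm that the node was actually restarted.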
The node has recovered and the process information can be seen:
[antdb@intel175 ~]$ ps xuf|grep dn2_1
antdb 36484 0.0 0.0 112712 980 pts/56 S+ 16:55 0:00 \_ grep --color=auto dn2_1
antdb 36441 1.8 0.0 442624 92212 ? S 16:55 0:00 /data/danghb/app/adb50/bin/postgres --datanode -D /data/danghb/data/adb50/d1/dn2_1 -i
antdb 12788 0.0 0.0 359084 7664 ? Ss 16:22 0:00 \_ adbmgr: antdb doctor node monitor dn2_1
Corresponding adbmgr log information:
2019-10-16 16:55:03.315 CST,,,12788,,5da6d32e.31f4,6,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, PQerrorMessage:server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
",,,,,,,,,""
2019-10-16 16:55:05.818 CST,,,12788,,5da6d32e.31f4,7,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, PQerrorMessage:could not connect to server: Connection refused
Is the server running on host ""10.21.20.175"" and accepting
TCP/IP connections on port 52541?
",,,,,,,,,""
2019-10-16 16:55:10.824 CST,,,12788,,5da6d32e.31f4,8,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, PQerrorMessage:could not connect to server: Connection refused
Is the server running on host ""10.21.20.175"" and accepting
TCP/IP connections on port 52541?
",,,,,,,,,""
2019-10-16 16:55:10.824 CST,,,12788,,5da6d32e.31f4,9,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, node crashed",,,,,,,,,""
2019-10-16 16:55:10.826 CST,,,12788,,5da6d32e.31f4,10,,2019-10-16 16:22:06 CST,12/6,651,LOG,00000,"antdb doctor node monitor dn2_1, try to startup node",,,,,,,,,""
2019-10-16 16:55:11.044 CST,,,12788,,5da6d32e.31f4,11,,2019-10-16 16:22:06 CST,12/6,651,LOG,00000,"start dn2_1 /data/antdb/data/adb50/d1/dn2_1 successfully",,,,,,,,,""
2019-10-16 16:55:11.086 CST,,,12788,,5da6d32e.31f4,12,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, startup node successfully",,,,,,,,,""
2019-10-16 16:55:11.086 CST,,,12788,,5da6d32e.31f4,13,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, reset node monitor",,,,,,,,,""
2019-10-16 16:55:11.092 CST,,,12788,,5da6d32e.31f4,14,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, node running normally",,,,,,,,,""
You can see that node dn2_1 is back to normal about 8 seconds after the first connection failure was logged (16:55:03 to 16:55:11), and the recovery process does not require manual intervention.
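The elapsed time can also be computed from the log timestamps instead of being read off by eye. The sketch below is an illustration only: recovery_seconds is a hypothetical helper, and SAMPLE_LOG is a shortened copy of three of the log lines shown above. It measures from the first CONNECT_FAIL entry to the following "node running normally" entry, and it ignores the time zone suffix on the assumption that all entries share one zone.

```python
import re
from datetime import datetime
from typing import Optional

# A new log entry starts with a millisecond timestamp; lines that do not
# match are continuation lines of a multi-line error message.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")


def recovery_seconds(log_text: str) -> Optional[float]:
    """Seconds from the first CONNECT_FAIL to the next 'node running normally'."""
    first_fail = None
    for line in log_text.splitlines():
        match = TIMESTAMP.match(line)
        if match is None:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if first_fail is None and "CONNECT_FAIL" in line:
            first_fail = stamp
        elif first_fail is not None and "node running normally" in line:
            return (stamp - first_fail).total_seconds()
    return None  # no failure seen, or the node never reported normal


# Shortened copy of the adbmgr log entries above (assumption: same zone).
SAMPLE_LOG = '''2019-10-16 16:55:03.315 CST,,,12788,,5da6d32e.31f4,6,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, PQerrorMessage:server closed the connection unexpectedly
2019-10-16 16:55:10.824 CST,,,12788,,5da6d32e.31f4,9,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, node crashed",,,,,,,,,""
2019-10-16 16:55:11.092 CST,,,12788,,5da6d32e.31f4,14,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, node running normally",,,,,,,,,""'''

if __name__ == "__main__":
    print(recovery_seconds(SAMPLE_LOG))
```

For the sample entries this gives roughly 7.8 seconds. Note that it measures from the first failure detected by the monitor, not from the moment kill -9 was issued, which is not recorded in this log.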