【Manager Handbook for Distributed AntDB-T】Quick Recovery of the Standby Node


News-2023-09-01

Asiainfo Anhui Technologies

Quick recovery of the standby node

After the master fails and the failover command promotes a slave to master, there are two ways to add a slave to the new master.

Method 1: Add a brand-new slave node with the append command. This requires data synchronization to make the new slave consistent with the master.

Method 2: Add the original master, which was removed during failover, back to the cluster as a slave node.

The second method takes less time because the original master already holds most of the data, so only a small amount of data needs to be synchronized after it rejoins the cluster. This is what rewind is for: quickly restoring the standby node. See the rewind chapter for details.

Note: To use the rewind feature, wal_log_hints and full_page_writes must both be set to on for the datanodes.

set datanode all(wal_log_hints=on, full_page_writes=on);
rewind datanode slave dm1;
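
Before running rewind, it can help to confirm that both parameters are really set to on. The following is a minimal Python sketch, not part of AntDB: it scans postgresql.conf-style text and reports which of the two parameters are not explicitly set to on. The function names are hypothetical, and the parsing is deliberately simplified: it assumes no # characters inside quoted values and accepts only on, true, yes and 1 as true values. A parameter that is absent from the text is reported as well, even if the server default would be on.

```python
import re

# Parameters that the note above says must be on before rewind can be used.
REQUIRED = ("wal_log_hints", "full_page_writes")
TRUE_VALUES = {"on", "true", "yes", "1"}


def parse_conf(text):
    """Return {name: value} for postgresql.conf-style text (last value wins)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (simplified)
        if not line:
            continue
        match = re.match(r"^([A-Za-z_][A-Za-z0-9_.]*)\s*=?\s*(.*)$", line)
        if match:
            name, value = match.groups()
            settings[name.lower()] = value.strip().strip("'\"").lower()
    return settings


def missing_rewind_settings(conf_text):
    """List the required parameters that are not explicitly set to on."""
    settings = parse_conf(conf_text)
    return sorted(
        name for name in REQUIRED if settings.get(name) not in TRUE_VALUES
    )


if __name__ == "__main__":
    sample = "wal_log_hints = on   # needed for rewind\nfull_page_writes = off\n"
    print(missing_rewind_settings(sample))  # ['full_page_writes']
```

An empty result means both parameters are explicitly on in the text that was checked; it does not prove that the running node has reloaded them.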

For example:

postgres=# add datanode slave dn1_1 for dn1_3 (host=adb01,port=52531,path='/data/antdb/data/adb50/d1/dn1_1');
ADD NODE
postgres=# rewind datanode slave dn1_1;
NOTICE: pg_ctl restart datanode slave "dn1_1"
NOTICE: 10.21.20.175, pg_ctl restart -D /data/antdb/data/adb50/d1/dn1_1 -Z datanode -m fast -o -i -w -c -l /data/antdb/data/adb50/d1/dn1_1/logfile
NOTICE: wait max 90 seconds to check datanode slave "dn1_1" running normal
NOTICE: pg_ctl stop datanode slave "dn1_1" with fast mode
NOTICE: 10.21.20.175, pg_ctl stop -D /data/antdb/data/adb50/d1/dn1_1 -Z datanode -m fast -o -i -w -c
NOTICE: wait max 90 seconds to check datanode slave "dn1_1" stop complete
NOTICE: update gtmcoord master "gcn1" pg_hba.conf for the rewind node dn1_1
NOTICE: update gtmcoord slave "gcn2" pg_hba.conf for the rewind node dn1_1
NOTICE: update datanode master "dn1_3" pg_hba.conf for the rewind node dn1_1
NOTICE: update datanode slave "dn1_2" pg_hba.conf for the rewind node dn1_1
NOTICE: on datanode master "dn1_3" execute "checkpoint"
NOTICE: 10.21.20.175, /data/antdb/app/adb50/bin/pg_controldata '/data/antdb/data/adb50/d2/dn1_3' | grep 'Minimum recovery ending location:' |awk '{print $5}'
NOTICE: receive msg: {"result":"0/0"}
NOTICE: 10.21.20.175, /data/antdb/app/adb50/bin/pg_controldata '/data/antdb/data/adb50/d2/dn1_3' |grep 'Min recovery ending loc' |awk '{print $6}'
NOTICE: receive msg: {"result":"0"}
NOTICE: 10.21.20.175, adb_rewind --target-pgdata /data/antdb/data/adb50/d1/dn1_1 --source-server='host=10.21.20.175 port=52533 user=antdb dbname=postgres' -T dn1_1 -S dn1_3
NOTICE: receive msg: servers diverged at WAL location 0/40001B0 on timeline 1
rewinding from last common checkpoint at 0/4000140 on timeline 1
Done!

NOTICE: refresh mastername of datanode slave "dn1_1" in the node table
NOTICE: set parameters in postgresql.conf of datanode slave "dn1_1"
NOTICE: refresh recovery.conf of datanode slave "dn1_1"
NOTICE: pg_ctl start -Z datanode -D /data/antdb/data/adb50/d1/dn1_1 -o -i -w -c -l /data/antdb/data/adb50/d1/dn1_1/logfile
NOTICE: 10.21.20.175, pg_ctl start -Z datanode -D /data/antdb/data/adb50/d1/dn1_1 -o -i -w -c -l /data/antdb/data/adb50/d1/dn1_1/logfile
NOTICE: refresh datanode master "dn1_3" synchronous_standby_names='1 (dn1_2,dn1_1)'
 mgr_failover_manual_rewind_func
---------------------------------
 t
(1 row)
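
The two WAL locations reported by adb_rewind in the output above (0/40001B0 and 0/4000140) can be compared numerically. The sketch below assumes that they follow the usual PostgreSQL LSN notation, two hexadecimal numbers separated by a slash, where the first is the high 32 bits and the second the low 32 bits; that adb_rewind uses the same notation is an assumption based on how the values look. The helper names are hypothetical.

```python
def lsn_to_int(lsn):
    """Convert an LSN such as '0/40001B0' to a byte position in the WAL."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff(later, earlier):
    """Number of bytes of WAL between two LSNs."""
    return lsn_to_int(later) - lsn_to_int(earlier)


if __name__ == "__main__":
    diverged = "0/40001B0"    # "servers diverged at WAL location ..."
    checkpoint = "0/4000140"  # "rewinding from last common checkpoint at ..."
    print(lsn_diff(diverged, checkpoint))  # 112
```

In this example the last common checkpoint lies only 112 bytes of WAL before the point where the two nodes diverged, which fits the article's point that the old master needs little resynchronization.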

postgres=# list node dn1_3;
 name  | host  |      type       | mastername | port  | sync_state |               path               | initialized | incluster | readonly | zone
-------+-------+-----------------+------------+-------+------------+----------------------------------+-------------+-----------+----------+-------
 dn1_3 | adb01 | datanode master |            | 52533 |            | /data/antdb/data/adb50/d2/dn1_3  | t           | t         | f        | local
(1 row)

postgres=# list node dn1_1;
 name  | host  |      type      | mastername | port  | sync_state |               path               | initialized | incluster | readonly | zone
-------+-------+----------------+------------+-------+------------+----------------------------------+-------------+-----------+----------+-------
 dn1_1 | adb01 | datanode slave | dn1_3      | 52531 | potential  | /data/antdb/data/adb50/d1/dn1_1  | t           | t         | f        | local
(1 row)
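
The sync_state of potential for dn1_1 is consistent with the setting synchronous_standby_names='1 (dn1_2,dn1_1)' shown in the rewind output. In stock PostgreSQL, a value of the form "N (a,b,...)" is priority based: the first N connected standbys in list order are synchronous and the remaining connected ones are potential. The sketch below models that rule in Python; that AntDB applies the same semantics is an assumption, and the function name is hypothetical.

```python
import re


def classify_standbys(setting, connected):
    """Map each connected standby to 'sync' or 'potential' for a
    priority-based value such as '1 (dn1_2,dn1_1)' or 'FIRST 1 (a,b)'."""
    match = re.fullmatch(
        r"\s*(?:FIRST\s+)?(\d+)\s*\((.*)\)\s*", setting, flags=re.IGNORECASE
    )
    if match is None:
        raise ValueError(f"unsupported synchronous_standby_names: {setting!r}")
    remaining = int(match.group(1))
    names = [n.strip() for n in match.group(2).split(",") if n.strip()]
    states = {}
    for name in names:  # list order is priority order
        if name not in connected:
            continue
        if remaining > 0:
            states[name] = "sync"
            remaining -= 1
        else:
            states[name] = "potential"
    return states


if __name__ == "__main__":
    print(classify_standbys("1 (dn1_2,dn1_1)", {"dn1_2", "dn1_1"}))
    # {'dn1_2': 'sync', 'dn1_1': 'potential'}
```

Under this rule the rejoined node dn1_1 would become the synchronous standby only if dn1_2, which is listed first, were disconnected.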

Note that in the latest version, if the self-healing doctor feature is configured, none of the above operations requires manual intervention.
