Capacity expansion background
Distributed database usually faces massive data scenarios, and the continuous incremental data tends to reach the storage bottleneck in advance, at this time we need to add nodes to the cluster for expansion operation to ensure the healthy and stable operation of the database cluster.
Capacity expansion preparation
The expansion process will take up more network and IO resources, if the database operation cannot be reduced during the expansion period, some parameters need to be adjusted appropriately to ensure the successful completion of the expansion.
I. Appropriate adjustment of the number of wal keep according to the database load.
set datanode mastere all (wal_keep_size = 102400);
Although wal_keep_size is configured, in many cases, due to the huge amount of data in master node, the number of wal generated during the expansion may far exceed 102GB. From a more secure perspective, it is recommended to turn on archive mode.
set datanode master all (archive_mode = "on");
set datanode master dm0 (archive_command = "rsync -a %p /data2/antdb/data/arch/dm0/%f");
2.The default value of wal_sender_timeout is 60 seconds. The server will disconnect replication connections that are inactive for more than the specified number of milliseconds.
If, for some reason, the chain building fails within 60s during the establishment of stream replication, the whole expansing process will be affected.
It is recommended to adjust this parameter to 0 during the expansion period and keep waiting for connections to be established.
After the expansion is finished, then revert to the default 60s.
set datanode all (wal_sender_timeout = 0);
Expansion process
Add a new host
Suppose there are currently two servers in the cluster and a new server is added through expansion.
The actual expansion scenario is mostly multiplying capacity, for example, 2 servers to 4 servers, so that the data can continue to be balanced.
[local]:6432 antdb@postgres=# list host;
name | user | port | protocol | agentport | address | adbhome
-------+-------+------+----------+-----------+-----------------+----------------
adb01 | antdb | 22 | ssh | 8432 | 192.168.149.131 | /opt/app/antdb
adb02 | antdb | 22 | ssh | 8432 | 192.168.149.132 | /opt/app/antdb
add host Adds a new server.
[local]:6432 antdb@postgres=# add host adb03(port=22,protocol='ssh',adbhome='/opt/app/antdb',address="192.168.149.133",agentport=8432,user='antdb');
ADD HOST
**Note:** The new host should also be configured with the same environment variables as the other hosts.
deploy the database binaries to the new server.
[local]:6432 antdb@postgres=# deploy adb03;
hostname | status | description
----------+--------+-------------
adb03 | t | success
Start the agent process of the new server.
[local]:6432 antdb@postgres=# start agent adb03;
hostname | status | description
----------+--------+-------------
adb03 | t | success
At this point, the new server has been added to the cluster.
Add datanode
Create a datanode master on the new server.
[local]:6432 antdb@postgres=# add datanode master dn3_1(host='adb03', port=14332,path = '/home/antdb/data/dn3_1');
ADD NODE
You need to modify the corresponding parameters according to the actual situation.
Relocate data
Expand the data in dn1_1 to the new dn3_1
expand datanode master dn1_1 to dn3_1;
If there are multiple nodes to be expanded here, you can execute the command here several times separately, e.g.:
expand datanode master dn1_1 to dn3_1;
expand datanode master dn2_1 to dn4_1;
Activate new node
In this step, activate all new nodes that were expanded in the previous step.
expand activate;
At this point, the expansion work has been completed, some data has been redistributed to the new nodes, and the newly inserted table data will be distributed to the new datanode according to the slicing rules.
Data removal
Redundant data will be generated during the data replication and relocation process, and the redundant data is marked for removal by the following commands.
expand clean;
At this time, the expansion has been completed, and users can check the status of the cluster through the monitor all command.
For example:
-- Add New Host
add host adb03(port=22,protocol='ssh',adbhome='/opt/app/antdb',address="192.168.149.133",agentport=8432,user='antdb');
add host adb04(port=22,protocol='ssh',adbhome='/opt/app/antdb',address="192.168.149.134",agentport=8432,user='antdb');
-- deploy
deploy adb03 password '123456';
deploy adb04 password '123456';
-- add nodal
add datanode master dn3_1(host='adb03', port=14332,path = '/home/antdb/data/dn3_1');
add datanode master dn3_1(host='adb04', port=14332,path = '/home/antdb/data/dn4_1');
-- expand
expand datanode master dn1_1 to dn3_1;
expand datanode master dn2_1 to dn4_1;
-- activate
expand activate;
-- clean
expand clean;
Handle expansion exceptions
When the user fails to expand the capacity due to improper operation, you can execute the command recover according to the prompt.
expand datanode exception
• basebackup failed
• Solution: Execute "expand recover basebackup fail src to dst", if the command fails, manually delete the data directory and re-execute the expand datanode command.
• Failed to update mgr information
• Solution: Execute the command "expand recover basebackup success src to dst".
expand activate exception
• Failed to promote the target datanode
• Solution: Delete the datanode directory and execute the expand command again.
• Failed to restart the datanode
• Solution: Start the datanode manually
• Failed to update pgxc_node
• Solution: Execute the "expand activate recover dopromote success dst" command