1.Overview
AntDB is a home-grown distributed database with MPP architecture, highly compatible with Oracle syntax, which is widely used in many industries such as communication, finance and transportation. In the process of using AntDB, users often delete data or update data by mistake due to misoperation or application bugs, which affects the normal use of business. Almost all databases will encounter similar problems, and most databases will provide a “Data Flashback” tool, which can be used to quickly recover misoperated data.
According to the type of misoperation, data flashback can be divided into DML data flashback and DDL data flashback. DML data flashback is mainly for scenarios like insert, update and delete , while DDL data flashback is mainly for scenarios like drop table and truncate table, and AntDB can support both DML and DDL data flashback. This article will mainly introduce the implementation details of DML data flashback.
2.Solution Selection
There are two mainstream ways to implement DML data flashback in AntDB. The first one is WAL log-based data flashback, which has two prerequisites: the table structure has not been changed and enough WAL logs are kept, and the WAL logs are parsed for data recovery with the table structure metadata. The second one is implemented based on MVCC mechanism, AntDB’s update and delete records are not really deleted, but generate new records and mark the old ones, so that the old records can be retrieved by using the multi-version information of the records and transaction submission information before vacuum execution. Both approaches have their own applicable scenarios and drawbacks.
WAL log-based data flashback requires keeping a large number of WAL log files, which can take up a lot of storage space. When the number of WAL files is large, it needs to traverse all WAL logs, which is time consuming, and the table cannot do DDL operations. It has the advantage of simple logic, easier in technical implementation, and there are readily available open source implementations in the community for reference.
The MVCC-based mechanism, on the other hand, needs to ensure that vacuum does not clear unused multi-version tuples, and vacuum-related parameters need to be set carefully. Since the multi-version tuple cannot be cleared and released in time, it will also bring the problem of taking up a lot of space. However, the MVCC-based implementation is more complex and requires deeper kernel code changes, so after investigating the advantages and disadvantages of the two solutions, we choose the WAL log-based DML data flashback implementation.
3.Implementation Principle
3.1 WAL Log Composition
The WAL log consists of a number of Records, which is divided into different types, as shown in Figure 1:
Figure 1: types of WAL Record
The underlying storage of AntDB is a Heap table. From the perspective of data flashback, only the Heap table and transaction-related types need to be concerned, other Record types do not affect the result of data flashback. The transaction-related Record records the status of the transaction, commit or abort, and the data of the transaction in abort state does not need to be flashed back, so only the transaction in commit state needs to be concerned.
Heap type Record records the transaction data generated by insert, update, and delete operations, which can be further subdivided according to the DML type, as shown in Figure 2:
Figure 2: subtypes of WAL Record
3.2 Parsing Process
DML data flashback supports flashback by table name, time interval, transaction number XID and LSN interval. From the internal implementation, all these flashback methods are eventually converted into an LSN interval containing the starting LSN and ending LSN, and if a table name is specified, it will be filtered by table name when parsing WAL Record records. In Figure 3, lsn3~lsn4 are the LSN interval to be parsed.
Figure 3: Diagram of LSN interval
The WAL log-based data flashback requires two important parameters: wal_level and full_page_writes. The first parameter wal_level which needs to be set to replica or logical, minimal level saves less information and is not enough to recover data. The second parameter full_page_writes needs to be set to on. AntDB will write the full page for the first update to the WAL log after each checkpoint. This mechanism can ensure no data loss under abnormal scenarios such as downtime, for example, if a crash occurs during a dirty page swipe and the original page is found to be corrupted during recovery, then the data can be recovered through full page inside the WAL. The Full Page Write (FPW) page holds the raw data of the page, which is essential for parsing the WAL data. some Record recording its own data is not sufficient for parsing out the complete forward and reverse SQL, and needs to rely on the raw data provided by the FPW. The FPW writes after one checkpoint, and when flashing back data, the LSN interval selected should be before an available checkpoint. In the code implementation, even if the LSN interval is after a checkpoint, it will search forward until an available checkpoint is searched.
After the LSN interval of WAL is determined, the WAL Record within this interval can be parsed. According to the different Record types, the original SQL and the flashback SQL are stitched together with the metadata and FPW and recorded into the result table. The overall parsing process is shown in Figure 4:
Figure 4: WAL parsing process
In the process of parsing WAL logs, the parser uses transactions as the basic unit for parsing. A transaction corresponds to a TransactionEntry object, and a transaction can contain multiple SQL statements, each of which corresponds to an element of the change chain table. On transaction submission, all elements of the change chain table are stitched together to produce the forward SQL text, as well as the reverse SQL text.
3.3 Toast Table Parsing
A difficult part of the WAL parsing process is the toast table. AntDB uses the toast mechanism to store tables with large fields. Toast is called The OverSized Attribute Storage Technique. The main idea of the toast mechanism is to use additional toast tables to store large field data, avoiding a record spanning multiple pages, and storing the oid of the corresponding toast table and the chunk_id of its data rows in the original field storage area.
The reltoastrelid field of the pg_class table records the oid of its corresponding toast table, and the table name of the toast table is suffixed with the oid of its parent table. For example, a regular table t with oid 10022077 and reltoastrelid 10022080, the corresponding toast table for table t is: pg_toast_10022077 and oid 10022080. The regular table is distinguished from the toast table in pg_class by the field relkind, which is 'r' for the regular table and 't' for the toast table.
In terms of a DML operation performed on a regular table involving large fields, the order of wal written to that table and its corresponding toast table is as follows:
· Insert operation- insert toast table records first, and then write to the regular table
· Update operation- insert toast table records first, then delete toast table records, and finally update the regular table
· Delete operation- delete the regular table records first, then delete toast table records
As you can see above, both insert and update operate on the toast table first, and then operate on the regular table, while delete is the opposite, which is crucial for handling the flashback of toast table data in the delete scenario.
3.4 Flashback Data Storage
The SQL parsed from the WAL logs will be stored in a table, the structure of which is shown in Table 1:
Table 1: parsing result table
Field Name | Field Type | Field Description | Remarks |
sqlno | int | SQL Serial Number | The number of SQL in a transaction |
xid | bigint | transaction number |
|
topxid | bigint | parent transaction number | the parent transaction number of the subtransaction |
sqlkind | int | SQL Type |
|
minerd | bool | successfully parsed or not |
|
timestamp | timestampTz | committed time of transaction |
|
op_text | text | Original SQL |
|
undo_text | text | Flashback SQL |
|
complete | bool | completed or not |
|
schema | text | schema name |
|
relation | text | table name |
|
start_lsn | pg_lsn | start LSN |
|
commit_lsn | pg_lsn | transaction commission LSN |
|
op_text is the original SQL statement and undo_text is the flashback SQL statement. If you want to recover the data quickly, you need to export undo_text and execute it once in reverse chronological order to recover the data for this period.
3.Usage Example
Create a table t, insert 3 rows of records, and then delete all the records of table t, as shown in Figure 5:
Figure 5: operation of deleting data by mistake
Execute the flashback function wal2sql of DML data, specifying the table name t as the parameter. It is able to see the results of the DML data flashback from the wal2sql_contents table after the execution is completed. op_text indicates the original SQL, and undo_text indicates the reverse flashback SQL, as shown in Figure 6:
Figure 6: operation of data flashback
Improvements
AntDB’s DML data flashback tool solves a series of scenario-related problems during the development process, mainly including toast table parsing, subtransaction parsing, Oracle compatibility issues; it also solves scenario-related performance problems, such as performance problems in massive table scenarios and performance problems in large data volume scenarios. AntDB’s DML data flashback tool not only supports standalone version, but also supports data flashback in AntDB distributed cluster scenario.
(1)Toast table parsing problem is mainly due to the processing order of the original table and toast table in the delete scenario is different from that in the insert and update scenarios, which requires special treatment, otherwise it will lead to incorrect parsing results of toast table in the delete scenario.
(2)Sub-transaction is a scenario that AntDB uses more often. The parsing of sub-transaction needs to strictly consider the relationship between parent and child transactions and the transaction commit status, and the accuracy of sub-transaction should be ensured when parsing sub-transaction, and no part of sub-transaction should be lost.
(3)Compatibility with Oracle is a major feature of AntDB. AntDB is highly compatible with Oracle syntax and storage types, such as rowid of Oracle type. Data flashback needs to consider compatibility with Oracle to ensure that tables created and data generated under Oracle syntax can be flashed back correctly.
(4)Data flashback needs to support distributed cluster.With AntDB’s distributed architecture, data is stored in different data nodes and DML data flashback can support data flashback of all data nodes. In the use method, you only need to connect to any CN node to execute the data flashback command, which is exactly the same as the standalone version in the use method, and it is easy to use.
(5)The performance improvement of data flashback mainly includes the performance improvement of execution in massive table scenario, for instance, the database contains more than 1 million tables with huge amount of metadata, and this scenario needs to ensure that the parsing performance will not have a big drop. In addition, the performance of execution is optimized in large transaction scenario. 1 transaction containing a large number of SQL and large fields involving a large number of toast table parsing, is required to ensure that the performance will not be decreased too much.
Conclusion
This article introduces a way of AntDB to implement DML data flashback based on WAL logs, discusses its implementation principle, and makes a lot of improvements and enhancements based on the community open source implementation. Data flashback is a very important tool, although usually used infrequently, but in an emergency can save data. Data flashback is not limited to one way, AntDB will explore MVCC-based multi-version control data flashback in the future to provide more implementation paths for data recovery.
About AntDB
AntDB was founded in 2008. On the core system of operators, AntDB provides online services for more than 1 billion users in 24 provinces across the country. With product features such as high performance, elastic expansion and high reliability, AntDB can process one million core communications transactions per second at peak, ensuring the continuous and stable operation of the system for nearly ten years, and is successfully implemented for commercial purpose in communication, finance, transportation, energy, Internet of Things and other industries.