
Pax test source -380 #60

Open
tuhaihe wants to merge 378 commits into 380-commit-test from pax-test-source

Conversation

@tuhaihe tuhaihe (Owner) commented Apr 14, 2025

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


jiaqizho and others added 30 commits April 11, 2025 13:57
In the previous encoder and decoder design, the encoder and decoder
allowed the caller to use a streaming interface. This means that the caller
could decode only part of a row. But this also has disadvantages:

  1. Not all encoding algorithms support it
  2. No scenario actually requires streaming decoding
  3. It increases code complexity

In the current commit:

  - change the streaming encoding function to a blocking function
  - keep the original buffer intact when the caller uses the encoder
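
A minimal sketch of the blocking interface described above (class and
method names here are hypothetical, not the actual PAX API):

```
#include <cstddef>
#include <vector>

// Hypothetical blocking encoder: Encode() consumes the whole input in one
// call instead of streaming chunks, and treats the caller's buffer as
// read-only so the original buffer is kept intact.
class BlockingEncoder {
 public:
  virtual ~BlockingEncoder() = default;
  // Encode the entire buffer in one blocking call; `src` is not modified.
  virtual std::vector<char> Encode(const char *src, size_t len) = 0;
};
```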
Change the repo address in the git submodule to GitLab.
Previously, we saved the OIDs of compare functions to validate
whether a comparison is meaningful. However, the OIDs of
the compare functions vary for different data types.
This commit saves the opfamily for validation instead. We assume that
the compare functions are always consistent as long as the opfamily
is the same.
When the execution engine is vectorized, PAX's current ORC storage
format always requires a format conversion before it can be used.

So PAX has derived a new storage format: orc_vec

orc_vec is friendlier to vectorized execution engines.
In most scenarios, column data can be filled into a recordbatch
without copying, which also means that we do not need to
perform a format conversion.

Compared with orc, the differences of orc_vec are:

  - Non-fixed-length columns implement offset streaming instead of length streaming
  - There is no datum header for non-fixed-length columns
  - Fixed-length columns are filled with nulls
The support functions include:

  1. Build PartitionKey from PAX saved catalog
  2. Build PartitionDesc by re-using pg code
  3. Validate partition bounds and check overlap
  4. Sort the partition bounds before saving
  5. Map the TupleTableSlot to the partition
  6. Provide info about whether partitions are continuous
  7. Add UDF to dump partition ranges
  8. Support EVERY syntax for range partition
Added support for OpExpr(AND) in sparse filtering

- Flatten the OpExpr(AND)
- Split `NOTNULL` and `OpExpr` handling into static methods
To support indexes (typically btree), we should be clear about two things:
1. Define the order of micro partitions, used by CTID
2. Define how to interpret the CTID

* For the first issue, we use a sequential number to name the micro partition.
  Each micro partition has a unique number as its filename. The order
  of micro partitions is the same as the sequential number.
* The CTID is defined as (BlockNumber: 24 bits, TupleOffset: 23 bits). The CTID
  has 48 bits, but one bit is used to make the CTID valid for all
  available offsets. See the details in AOTupleId.
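
As an illustration only (the authoritative layout is in AOTupleId; these
helper names and exact bit positions are assumptions), packing a 24-bit
block number and a 23-bit offset into 48 bits can look like this:

```
#include <cstdint>

constexpr uint64_t kBlockBits = 24;
constexpr uint64_t kOffsetBits = 23;
static_assert(kBlockBits + kOffsetBits + 1 == 48, "CTID is 48 bits");

// The low bit is always set so the resulting ItemPointer offset is never
// zero (offset 0 is invalid for PostgreSQL item pointers).
inline uint64_t PackCtid(uint32_t block, uint32_t offset) {
  return (static_cast<uint64_t>(block) << (kOffsetBits + 1)) |
         (static_cast<uint64_t>(offset) << 1) | 1u;
}

inline uint32_t CtidBlock(uint64_t ctid) {
  return static_cast<uint32_t>(ctid >> (kOffsetBits + 1));
}

inline uint32_t CtidOffset(uint64_t ctid) {
  return static_cast<uint32_t>((ctid >> 1) & ((1u << kOffsetBits) - 1));
}
```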
The group offset may be incorrect in some scenarios.

This can cause some data to be read incorrectly.
- Remove the `current_row_offset_`
- Support non-sequential group reads
After PAX supports CTID, "fake" CTIDs are no longer returned to the vectorized execution engine.
Implement a partition writer in the table and micro-partition layers, so
that a write in PAX can be divided into multiple files, which makes
min/max statistics more effective.

  - Support merging MicroPartitionWriter and MicroPartitionStats
  - Support a partition writer used to implement the partition_by option
`GetTuple` will be used by the index. `ReadTable` can only read row by row,
while `GetTuple` can quickly locate a row by offset and read it.

  - The `MicroPartitionReader` interface supports `GetTuple`
  - The `Group` interface supports `GetTuple`
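
A minimal sketch of the contrast (interface and type names here are
simplified stand-ins for the real PAX classes):

```
#include <cstddef>

struct TupleSlot;  // stand-in for the executor's tuple slot type

// ReadTuple() advances row by row for sequential scans; GetTuple() jumps
// straight to the row at `offset`, which is what index lookups need.
class ReaderSketch {
 public:
  virtual ~ReaderSketch() = default;
  virtual bool ReadTuple(TupleSlot *slot) = 0;                // sequential
  virtual bool GetTuple(TupleSlot *slot, size_t offset) = 0;  // random access
};
```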
The block number of the CTID is mapped to the file name
when ENABLE_LOCAL_INDEX is set. The sequence number is maintained
in pg_pax_fastsequence.
Add an implementation for the index build range scan.
Note: we don't support partial index builds. Partial index builds are
only used by the brin index now, and supporting them would be ugly:
because the current micro partition is opened after beginscan is
called, a partial index build would have to check and close the current
micro partition reader and open another one.
Some file FDs will not be closed after a transaction abort.

Use the resource owner (resowner) to close file FDs when a transaction
aborts. Otherwise, the FDs will be leaked.
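
A sketch of the resowner hookup, assuming a hypothetical registry of
open fds (RegisterResourceReleaseCallback is the real PostgreSQL hook;
the pax_open_fds registry and function names are illustrative):

```
extern "C" {
#include "postgres.h"
#include "utils/resowner.h"
}
#include <unistd.h>
#include <vector>

static std::vector<int> pax_open_fds;  // hypothetical registry of open fds

// Called by the resource owner machinery at end of (sub)transaction.
static void PaxReleaseFds(ResourceReleasePhase phase, bool isCommit,
                          bool isTopLevel, void *arg) {
  (void) isTopLevel;
  (void) arg;
  if (phase != RESOURCE_RELEASE_AFTER_LOCKS || isCommit)
    return;  // only clean up on abort
  for (int fd : pax_open_fds)
    close(fd);
  pax_open_fds.clear();
}

// Register once at extension load time, e.g. in _PG_init().
static void PaxRegisterFdCleanup(void) {
  RegisterResourceReleaseCallback(PaxReleaseFds, /* arg = */ nullptr);
}
```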
The vec format does not store a val datum with its header, but stores the raw buffer directly.
So we don't need to align the raw buffer.
A group that has been merged will lose its group stats after the merge.

We should use the original group stats when a merge happens.
Catch the PG error when a long jump happens.

This way, we can see more error information in psql, which helps us debug.
OrcWriter supports zero columns, so we should not `assert(natts > 0)`.
PAX did not handle file permission management.

Passing the correct flags and permissions to `open()` can prevent accidental overwrites.

The reader uses O_RDONLY to open files, and the writer uses O_CREAT|O_WRONLY|O_EXCL.
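
A minimal sketch of the flag split (paths and modes are illustrative):

```
#include <fcntl.h>
#include <sys/stat.h>

// O_EXCL makes the writer fail with EEXIST instead of silently
// overwriting an existing micro partition file.
int OpenForRead(const char *path) {
  return open(path, O_RDONLY);
}

int OpenForWrite(const char *path) {
  return open(path, O_CREAT | O_WRONLY | O_EXCL, S_IRUSR | S_IWUSR);
}
```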
`index_unique_check` is used to check whether a tuple exists for a given tid obtained from an index.

The current commit implements the `index_unique_check` AM function for the PAX index.
We copy the full regress tests from CBDB. To run the tests, enter the command:

CBDB_DIR=<cbdb_src_dir> USE_PGXS=1 make installcheck
Replace the `missing attrno` logic with `slot_getmissingattrs`.

The row filter reader should deal with the `missing attrno` problem.
The `Open` call in `OpenMicroPartition` did not pass the file flag.
After enabling partition options, pax would get an `IO Error` in `MergeGroup`,
because the open flag in the table reader was `write`.
pax.so should have RUNPATH set, not RPATH (with GNU ld, linking with
-Wl,--enable-new-dtags emits DT_RUNPATH instead of DT_RPATH). The value
should be the same as the value for shared libraries in cbdb, i.e.
$ORIGIN:$ORIGIN/..
Variable definitions with initialization in a header file will cause redefinition errors in release mode.
Move the system catalog pg_catalog.pg_pax_tables to pg_ext_aux.pg_pax_tables
in the PAX extension. The relation oid is fixed, so we can easily and cleanly
remove the database objects that a pax table depends on.

Now that the system cache of pg_pax_tables is gone, there may be a small
performance loss. If performance is a concern, we could support a dynamic
syscache, like a custom object class, and register our pg_pax_tables in the
kernel to use the syscache.
Adapt TransformColumnEncodingClauses to the new implementation
    This PR is combined because the current PAX repo moved to another place.

    PAX is a column-based storage for CBDB. It's mainly optimized for OLAP.
    PAX is implemented as an extension of a table access method.
    The storage format is heavily similar to Apache ORC. The data is split
    into groups of rows, and each group of rows is stored column by column. So
    a bunch of values of the same type can have a much better compression ratio
    compared with row-based compression.

    A table in PAX is composed of a set of micro partitions. Each micro partition
    is a data file, like an Apache ORC file. Writing tuples adds one or more
    micro partition files in PAX. The association between the PAX table and micro
    partitions is recorded in pg_ext_aux.pg_pax_tables and a pax-specific auxiliary
    relation, called `pg_ext_aux.pg_pax_blocks_<oid>`. pg_ext_aux.pg_pax_tables
    records the mapping from a pax table to its auxiliary relation.
    The auxiliary relation stores a list of micro partitions, so a scan of a pax
    table knows where the data files come from.

    Co-authored-by: Hao Wu <gfphoenix78@gmail.com>
    Co-authored-by: zhoujiaqi <zhoujiaqi@hashdata.cn>
    Co-authored-by: gongxun <gongxun@hashdata.cn>
    Co-authored-by: Max Yang <yangyu@hashdata.cn>
The pax tests share some common methods, such as CreateCTupleSlot, which
is used to create test tuples.

In the current changes, they have been uniformly replaced with a single set of interfaces.
gongxun0928 and others added 28 commits April 11, 2025 13:57
Add isolation2 tests in pax CI

FIXME: ignore gdd test cases
Different tuples in the same micro partition can't be
updated in pax, which is very different from heap
tables. We'll review and possibly rewrite the gdd tests
later.
This commit fixes 3 things in PAX:
1. Support downloading artifact files from ci for isolation2
2. Remove generated sql files and answer files for isolation2
3. s/-- start_ignore flaky test/-- start_ignore/ in cbdb_parallel
…ory log.

When we use a tablespace for the first time in the database, the database
directory may not exist. We need to create the directory when we replay
the xlog for the first time.
Implement the manifest API through the current auxiliary table.

Currently, PAX's catalog can be divided into two implementation methods:
- Auxiliary table: borrows the MVCC capabilities provided by the heap table
- Manifest: an independent MVCC implementation
The previous implementation uses an auxiliary heap table to
record the meta info for all micro partitions.

In this commit, we adapt the existing code to the manifest
API. The next step is to add an implementation of the manifest file
to manage the meta info for all micro partitions.
After the memory management commit (PAX: Refactor memory management
to allow thread-safe scan, 7c0c6c9), pax_make_toast cannot determine
whether the current datum is empty.

If compression fails, the current tts_value should not be set to empty.
PAXPY relied on a CBDB install before it could be built.

In the current changes, PAXPY no longer relies on a CBDB install before
building, but directly links storageformat.so. However, since PAXPY is
not tested in CI, PAXPY may not be updated when some APIs of PAX
itself change. Consider adding it to CI later.
In an insert transaction, after obtaining a block_id from pax_fastsequence,
the block_id file and block_id.toast file will be created when file_system
->Open() calls are successful. When the transaction aborts, since pax_fastsequence
is implemented as update-in-place, whether the block_id increases depends on
whether the data has been flushed to disk. As the transaction aborts and xlog
won't be flushed, the next allocation may either reuse the current block_id
or use the next one.

Due to this non-deterministic behavior, the data file and toast file created
earlier may become orphaned files when the block_id is reused.

Therefore, O_CREAT | O_TRUNC is specified: if the file does not exist,
it will be created; if it exists, its previous content will be truncated
to avoid orphaned files.
Add a marker file indicating successful compilation. Do not depend only on the .cpp files; that may create an incomplete pax.so file.
In the current change, vec.max_batch is no longer used to determine the number of rows
returned by a record batch.

For the class VecAdapter, however, the range interface is still retained. For PAX, the cost of
splitting by range is small. If the range interface of the class VecAdapter is no longer needed
in the future, the related interface parameters will also be removed.
We can't call pax to make a dataset in vectorization, so we work around it by passing the context.

Co-Author: Dongxiao Song songdongxiao@hashdata.cn
The internal partition feature of PAX is no longer used,
so we remove it from PAX.

After this commit, the reloptions 'partition_by'
and 'partition_ranges' are also removed.
1. Due to the change of kernel behavior, there are many changes in the execution
   plan generated by orca.
2. Disable optimizer_trace_fallback to avoid orca fallback generating unstable
   output.
Several bugs are found and fixed in this commit:
1. Wrong results when filtering bpchar values with the bloom filter:
    calculation and testing now ignore trailing spaces in bpchar bytes.
2. Fix index counting when copying bit values to a buffer: the index
    counter is increased whether or not the current value is null.
3. Run the sparse filter with group stats before reading the group.
4. Guard pax_enable_sparse_filter when initializing ParallelScanDesc.

Besides fixing the above issues, the pax tests now run in two passes: one
turns off vectorization, while the other turns it on.
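
For item 1, the fix amounts to ignoring trailing spaces before hashing a
bpchar value into the bloom filter and before testing membership; a
sketch (the helper name is hypothetical):

```
#include <cstddef>

// Trailing spaces are insignificant for bpchar comparisons, so they must
// not contribute to the bytes that are hashed or tested.
static size_t BpcharTrimmedLen(const char *data, size_t len) {
  while (len > 0 && data[len - 1] == ' ')
    --len;
  return len;
}
```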
This commit adds a new manifest implementation for the catalog. The new
implementation uses manifest files (regular files); see the third type
below. The interface of the manifest API lives in
contrib/pax_storage/src/cpp/catalog/manifest_api.h

We have 3 implementations for pax catalog:
1. Use the original pax catalog directly, i.e. call the catalog functions
    in pax code. No intermediate interface is introduced. The catalog
    table pg_ext_aux.pg_pax_tables is required.
    Set USE_MANIFEST_API=OFF USE_PAX_CATALOG=ON to enable it.
2. Use the original pax catalog through manifest API. All catalog access
    is done through the manifest API. The manifest API is implemented
    by the original pax catalog. pg_ext_aux.pg_pax_tables is also
    required.
    Set USE_MANIFEST_API=ON USE_PAX_CATALOG=ON to enable it.
3. Use manifest files to manage the catalog for PAX through the manifest API.
    All catalog access is done through the manifest API. The original
    catalog pg_ext_aux.pg_pax_tables is no longer required. The per-table
    auxiliary table is also changed from storing micro partition info
    to storing the path of the manifest file.
    Set USE_MANIFEST_API=ON USE_PAX_CATALOG=OFF to enable it.

Each pax table now uses a single manifest file to store the catalog
describing all micro partition info. The design disallows concurrent
writes, i.e. insert/delete/update. To avoid concurrent writes,
a heavy lock must be taken before writing. The steps for accessing the
catalog are (see the sketch below):
1. Build the auxiliary table name from the oid of the pax table.
2. Open the auxiliary table of the pax table, and fetch the path
    of the manifest file. The auxiliary table has only one effective
    tuple.
3. Open the manifest file and deserialize its content into a JSON object.
4. Access the manifest API and return results from the internal manifest
    JSON object.
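
A hedged sketch of that read path (every name below is an illustrative
stand-in; the real interface is declared in manifest_api.h):

```
#include <string>

using Json = std::string;  // stand-in for a real JSON object type

// Step 1: derive the auxiliary table name from the pax table's oid.
static std::string AuxTableNameFor(unsigned int pax_oid) {
  return "pg_ext_aux.pg_pax_blocks_" + std::to_string(pax_oid);
}

// Step 2: the auxiliary table holds one effective tuple with the path.
static std::string FetchManifestPath(const std::string &aux_table) {
  return "/stand-in/manifest/path/for/" + aux_table;
}

// Step 3: read the manifest file and deserialize it into a JSON object.
static Json DeserializeManifest(const std::string &path) {
  return Json{"{}"};  // stand-in for real file I/O and JSON parsing
}

// Steps 1-3 combined; step 4 then answers catalog queries from the
// returned JSON object.
Json OpenManifestForRead(unsigned int pax_oid) {
  return DeserializeManifest(FetchManifestPath(AuxTableNameFor(pax_oid)));
}
```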
This commit adds Apache license 2.0 to all
source files and header files in PAX.
In the PAX `PORC` format, length streaming is used to record the length of each
DATUM in a non-fixed-length column. The composition of the length streaming is
equivalent to a length array whose size equals the number of rows.
When reading a non-fixed-length column, PAX needs to use the length array
to compute the offset array in advance; the offset array helps the
format reader quickly locate a middle row.

In commit "Performance/improve pax insert performance", PAX no longer builds
the offset array during the write phase, which actually breaks the column
abstraction: only some column methods distinguish between read and write paths.

In the current commit, the length streaming of PAX is changed to offset streaming.

- On the read path, non-fixed-length columns no longer need to build the offset array
- On the read path, using only the offset array is more cache-friendly
- On the write path, only the offset array needs to be built, and the performance
  is comparable to building the length array

Offset streaming also has a disadvantage: its compression rate
is likely to be lower than that of length streaming. Currently, PAX does not support
DELTA encoding. Once DELTA encoding is supported, this disadvantage may be resolved.
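
The relationship between the two streamings is a prefix sum; a small
illustration (not PAX code):

```
#include <cstdint>
#include <vector>

// offsets[i] is the start of row i; offsets[n] is the total data size.
// With offset streaming this array is stored directly, so locating row i
// is data + offsets[i] with length offsets[i + 1] - offsets[i], and no
// per-read conversion from lengths is needed.
std::vector<uint32_t> LengthsToOffsets(const std::vector<uint32_t> &lengths) {
  std::vector<uint32_t> offsets(lengths.size() + 1, 0);
  for (size_t i = 0; i < lengths.size(); ++i)
    offsets[i + 1] = offsets[i] + lengths[i];
  return offsets;
}
```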
PAX no longer supports object storage in lightning.
The implementation of RemoteFileSystem is moved
from lightning to cloud to support object storage
access through the abstract file API defined in PAX.

Although pax doesn't support RemoteFileSystem, we still
disallow using the dfs tablespace for PAX tables.
1. The toast table of the auxiliary table should also be in the pg_ext_aux namespace
2. Use GetCatalogSnapshot() as the snapshot when querying auxiliary tables
A doc directory has been added to the PAX project, which will
contain documentation for the modules in PAX:

- Introduction
- Project description
- Metadata
- Storage format
- Toast
- Clustering
- Filter
After commits 588f5c9 and ca9379e, the access method adds two callbacks
that must be implemented:

  - relation_get_block_sequences: Returns the block sequences contained
    in this relation. See BlockSequence for details. Currently used by BRIN.
  - relation_get_block_sequence: Determines the block sequence in which
    the logical heap 'blkNumber' falls. See BlockSequence for details.
    Currently used by BRIN.

Currently, PAX does not support the brin index, so these AM methods have been
added in the current commit but have not been implemented. After CBDB
cherry-picks the complete changes of the brin index, consider making PAX
support the brin index.
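
Until then, the two callbacks can only be stubbed out; a hedged sketch
(the signatures are assumptions modeled on the description above, and
PaxBlockSequence is a stand-in for the kernel's BlockSequence):

```
extern "C" {
#include "postgres.h"
#include "storage/block.h"
#include "utils/rel.h"
}

struct PaxBlockSequence {  // stand-in for the kernel's BlockSequence
  BlockNumber start;
  BlockNumber nblocks;
};

static PaxBlockSequence *
pax_relation_get_block_sequences(Relation rel, int *num_sequences) {
  (void) rel;
  (void) num_sequences;
  ereport(ERROR, (errmsg("BRIN is not supported on PAX tables yet")));
  return NULL;  // not reached
}

static void
pax_relation_get_block_sequence(Relation rel, BlockNumber blkNumber,
                                PaxBlockSequence *sequence) {
  (void) rel;
  (void) blkNumber;
  (void) sequence;
  ereport(ERROR, (errmsg("BRIN is not supported on PAX tables yet")));
}
```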
A previous commit (8cf1aba) removed the `am->swap_relation_files` call in the
function `swap_relation_files`. This causes problems in the table-rewrite
case for custom AMs (like PAX).

Example:
```
CREATE TABLE list_parted (a numeric, b int, c int8) PARTITION BY list (a) using pax;
CREATE TABLE sub_parted PARTITION OF list_parted for VALUES in (1) PARTITION BY list (b);

CREATE TABLE sub_part1(b int, c int8, a numeric) DISTRIBUTED BY (a);
ALTER TABLE sub_parted ATTACH PARTITION sub_part1 for VALUES in (1);

CREATE TABLE sub_part2(b int, c int8, a numeric) distributed by (a);
ALTER TABLE sub_parted ATTACH PARTITION sub_part2 for VALUES in (2);

INSERT into list_parted VALUES (2,5,50);
INSERT into list_parted VALUES (3,6,60);
INSERT into sub_parted VALUES (1,1,60);
INSERT into sub_parted VALUES (1,2,10);

ALTER TABLE list_parted SET DISTRIBUTED BY (c);
select * from list_parted; -- wrong result
```

The `ALTER TABLE ... SET DISTRIBUTED BY` rewrites the data into a temp table and
exchanges the temp table's relfilenode with the original table's. But without
the `am->swap_relation_files` call, some metadata or data won't be swapped.
After CBDB reverted the '64-bit relfilenode', PAX still needs to adapt to the change.
In PAX, the naming convention for visibility maps is: <blocknum>_<generation>_<tag>.visimap
- `blocknum` is the current data file name
- `generation` is the current visimap generation number. Each deletion against this data file
   increases the generation number by 1
- `tag` is the current transaction id. This field is used to ensure the uniqueness of the
  visimap file name.

When USE_ASSERT_CHECKING is undefined, `generation` cannot be incremented. So if
we update the same row twice in the same transaction, PAX will open the same
`.visimap` file.
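
The convention can be read directly off the file name; a sketch of the
construction (the helper itself is hypothetical):

```
#include <cstdint>
#include <cstdio>
#include <string>

// Builds "<blocknum>_<generation>_<tag>.visimap" per the convention above.
std::string VisimapFileName(uint32_t blocknum, uint32_t generation,
                            uint32_t xid_tag) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%u_%u_%u.visimap",
                blocknum, generation, xid_tag);
  return buf;
}
```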
PAX was based on an older Cloudberry version with copied/modified
regression tests in `contrib/pax_storage/src/test/regress/`

Updated the PAX regression tests to align with the latest version and
fixed failing cases:

- Synchronized PAX regression tests with the current test suite (src/test/regress/)
- Fixed ORCA plan differences caused by cherry-picking features:
  Dynamic Index/Bitmap/Seq Scan, multi-groupset, query parameters,
  and so on
- Resolved planner plan diffs
- Addressed result diffs by marking unsupported test cases
PAX was based on an older Cloudberry version with copied/modified
isolation2 tests in `contrib/pax_storage/src/test/isolation2/`

Updated the PAX isolation2 tests to align with the latest version and
fixed failing cases:

- Synchronized PAX isolation2 tests with the current test suite (src/test/isolation2/)
- Changed the `uao` test cases, which used to run with AO/AOCS, to PAX
- Removed unused test cases (like checks of gp_aoseg, gp_fastsequence, ...)
- Fixed the plan diffs
The extension vectorization is not open source yet, and the
open source version of PAX has removed vectorization-related test cases.
Previously PAX used the internal gitlab repository as a submodule.
Now it has switched to using the github repository.
tuhaihe pushed a commit that referenced this pull request Feb 27, 2026
Add the upgrade job in the pipeline

This job tests the extension upgrading
from 1.0 to 2.0.
tuhaihe pushed a commit that referenced this pull request Mar 23, 2026
Add the upgrade job in the pipeline

This job tests the extension upgrading
from 1.0 to 2.0.