Feature: introduce a high-performance hybrid row-columnar storage engine (2/4)#1042

Merged
jiaqizho merged 97 commits intoapache:mainfrom
jiaqizho:pax-split-380-commit-2
Apr 14, 2025

Conversation

@jiaqizho
Contributor

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


@jiaqizho jiaqizho force-pushed the pax-split-380-commit-2 branch from 0d1519f to 285b3c9 Compare April 10, 2025 12:16
gongxun0928 and others added 27 commits April 14, 2025 21:20
`micro_partition_stats` should be included in paxformat.

Generate micro_partition_stats.pb.h and micro_partition_stats.pb.cc when building libpaxformat.so.
…ablespace case.

Problem: The PAX file path directory structure is not consistent in the set-new-tablespace case. Currently, if the source file directory is empty, both the data copy and the creation of the destination directory are skipped. The problem surfaces the next time a SET TABLESPACE DDL is executed: if the source directory is empty, the assert in ListDirectory fires and raises an exception.

Fix: Create the destination PAX directory even if the source directory is empty, which keeps the source/destination directory structure consistent across SET TABLESPACE DDL executions.

```
std::vector<std::string> LocalFileSystem::ListDirectory(
    const std::string &path) const {
  ...

  Assert(filepath != NULL && filepath[0] != '\0');

  dir = opendir(filepath);
  CBDB_CHECK(dir, cbdb::CException::ExType::kExTypeFileOperationError);

  ...
}

```

```
,sx1,"ERROR","XX000","ERROR: RelationCopyData (pax_access_handle.cc:194)",,,,,,"ALTER TABLE ALL IN TABLESPACE regress_tblspace_renamed SET TABLESPACE pg_default;",0,,"pax_access_handle.cc",194,"Stack trace:
1    0x7f6aa6f1f8e8 libpostgres.so errstart + 0x208
2    0x7f6a998f2fc5 pax.so _ZN3pax17CCPaxAccessMethod16RelationCopyDataEP12RelationDataPK11RelFileNode + 0x9d
3    0x7f6aa6b2c889 libpostgres.so <symbol not found> + 0xa6b2c889
4    0x7f6aa6b3d3ca libpostgres.so <symbol not found> + 0xa6b3d3ca
5    0x7f6aa6b3e76f libpostgres.so AlterTableMoveAll + 0x2bf
6    0x7f6aa6dbfb39 libpostgres.so <symbol not found> + 0xa6dbfb39
7    0x7f6aa6dbe4e9 libpostgres.so standard_ProcessUtility + 0x169
8    0x7f6aa6dbcb13 libpostgres.so <symbol not found> + 0xa6dbcb13
9    0x7f6aa6dbcc64 libpostgres.so <symbol not found> + 0xa6dbcc64
10   0x7f6aa6dbd3cb libpostgres.so PortalRun + 0x2bb
11   0x7f6aa6db75c4 libpostgres.so <symbol not found> + 0xa6db75c4
12   0x7f6aa6dbac9a libpostgres.so PostgresMain + 0x1fda
13   0x7f6aa6d031d3 libpostgres.so <symbol not found> + 0xa6d031d3
14   0x7f6aa6d04305 libpostgres.so PostmasterMain + 0xe45
15   0x4017c0 postgres main (main.c:198)
16   0x7f6aa5fdb555 libc.so.6 __libc_start_main + 0xf5
17   0x4019c4 postgres <symbol not found> + 0x4019c4
```

Test:

```
postgres=# \!rm -rf /tmp/t_pax73
postgres=# \!mkdir /tmp/t_pax73
postgres=# drop tablespace test73;
ERROR:  tablespace "test73" does not exist
postgres=# create tablespace test73 location '/tmp/t_pax73';
CREATE TABLESPACE
postgres=# create table t_pax73(a int, b int) using pax;
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'a' as the Cloudberry Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
postgres=# insert into t_pax73 select i ,i from generate_series(1,1000) i;
INSERT 0 1000
postgres=# alter table t_pax73 set tablespace test73;
ALTER TABLE
postgres=# insert into t_pax73 select i ,i from generate_series(1,1000) i;
INSERT 0 1000
postgres=# select count(1) from t_pax73;
 count
-------
  2000
(1 row)

postgres=# alter table t_pax73 set tablespace pg_default;
ALTER TABLE
postgres=# insert into t_pax73 select i ,i from generate_series(1,1000) i;
INSERT 0 1000
postgres=# alter table t_pax73 set tablespace test73;
ALTER TABLE
postgres=# select count(1) from t_pax73;
 count
-------
  3000
(1 row)
```
In previous commits, PAX could already create encoded columns.

The current commit allows users to specify the ENCODING options
of a column using the ENCODING syntax.

Example:
```
CREATE TABLE t1 (
    c1 int ENCODING (compresstype=zstd),
    c2 char ENCODING (compresstype=zlib),
    c3 char) using pax;
```

More cases can be found in `src/test/regress/sql/am_encoding.sql`.
In previous commits, PAX could already create encoded columns
and specify the ENCODING options of a column.

The current commit allows users to specify the encoding type
in reloptions. The column ENCODING options will inherit from the
reloptions.

example:
```
CREATE TABLE t1 (c1 int, c2 char, c3 char) using pax with (COMPRESSTYPE=zstd, compresslevel=5);
```

More cases can be found in `src/test/regress/sql/am_encoding.sql`.
`ERROR:  operator 1209 is not a member of opfamily 4056 (lsyscache.c:164)`

Caused by `LIKE` clauses.

Example:
```
select * from table where text_type_column like 'Unknown%';
```
The data file may be empty in our current implementation. In this case,
the micro partition stats are still initialized, but the typid is not yet set.
The typid assert should only fire once the typid has been set by inserting
the first tuple.

We relax this constraint to allow empty data files now.
The previous TAM callback scan_analyze_next_tuple was not implemented correctly
with respect to appendonly tables. This patch fixes the behavior and implements
the callback semantically identically to appendonly tables.

The snapshot may be NULL in a scan, so remove the Assert from MicroPartitionIterator.
There were some logical errors in RLE encoding:
  - several types in `RLE` share the context and need to be reset uniformly.
  - fixed the `ENABLE_DEBUG` name
  - removed an unreasonable assert
If the user often reads the same columns of the same table, the storage engine
no longer needs to fetch the data from disk when there is a cache.

- Added a cache interface
- Added a Plasma implementation

Use Plasma to cache the PAX columns:
- read from the cache if the column is already cached
- write to the cache when the column is read from disk

Also added a GUC to enable/disable the cache.
ERROR:  cannot unpin a segment that is not pinned (dsm.c:983)  (seg2 slice17 127.0.1.1:7004 pid=7331) (dsm.c:983)

In the scan path, if there is no `delete` operator in the target list, the block bitmap should not be built.

GetBlockNumber is not a process-safe function; it should not be called unless the caller is in an update or delete operator.
In PAX, the bitmap was implemented as a byte array.

Changing it to a bit array reduces memory usage and improves performance.
The current commit encapsulates `elog`.

Logging needs to be reimplemented to be more robust.
When reading in a PAX file, multiple IOs were needed to read the file's meta info:

  - Read the length of the postscript
  - Read the postscript
  - Read the file footer
  - Read the meta if it exists

In fact, we don't need multiple IOs to read the tail structure.

  - Support a single 32K read to load post_script + footer + meta
  - Split writer/reader into different files
PAX changed from multiple stripes (groups) to a single stripe (group).

But when PAX supports sparse filtering, dividing into multiple groups
makes the min/max statistical distribution within a single group more even.
In sparse filtering, even if a whole file is not filtered out, individual
groups may still be. So we decided to add multiple groups back.

- support multiple groups in the micro-partition writer and reader
- refactor reader code
Create a new auxiliary table `pg_pax_fastsequence` for PAX
file number allocation for every single PAX table.
`pg_pax_fastsequence_index` is created on column objid as
well, which improves lookup efficiency for the PAX file
fastsequence id.
The vectorized execution engine does not require a real CTID; CTID is only used to distinguish different rows.

In the current commit, PAX supports returning a CTID, but the returned CTID is only unique within the SEQ SCAN operator.
…r ..."

hashdata/arrow changed the core data structure ArrowRecordBatch in
47dd3a96f8ebcb87 from
```c
struct ArrowRecordBatch {
  struct ArrowArray *batch;
  struct ArrowSchema *schema;
};
```
to
```c
struct ArrowRecordBatch {
  struct ArrowArray batch;
  struct ArrowSchema schema;
};
```

This commit adapts pax to the above change.
The min-max filter assumed that the left type is the same as the right type.
That's not true in some cases. For example:
where d_date between '1999-4-01' and
  (cast('1999-4-01' as date) + '60 days'::interval)

The left type is date, while the right type in the second comparison is
timestamp. The comparison functions are not the same.

NOTE: The right type is kept in ScanKeyData.sk_subtype.
Export comm/bitmap.h. For the header files in the comm directory, keep the exports minimal; export all the header files in the storage directory.
`type_align` in `pg_type` should be honored in two ways:
- address alignment: the datum address returned must be aligned to type_align
- datum padding: the datum must be padded to type_align

A non-fixed-length column without encoding would not follow address alignment in some cases.
So we need padding before the data part; then, after reading from disk, the address is always aligned.
The current commit is a supplement to "Feature: micro-partition support multi groups"

- VEC reader supports reading by group
- Support the new bitmap implementation in the vec adapter
Refactor the signature of scan_begin_extractcolumns to contain the execution
context, so that the scan node is able to filter data by evaluating
expressions at a low level. A low-level filter may cost less than extracting
all columns of a row and filtering it out in the upper execution node.
PAX already supports gmock (google mock), but gmock cannot mock
private functions or replace C function pointers.

In PAX's unit tests, some CBDB functions always need to be called, but it is
difficult to make such CBDB functions return the values required by a test case.

So PAX introduces a new library named `cpp-stub`. cpp-stub can replace some global
C functions or private functions with PAX implementations, making test cases easier to write.
`test_main` still needs to link `libarrow.so`.
Add two reloptions to support partitioning: partition_by
and partition_ranges. Their syntax is
partition_by = "..."

The raw string is stored in pg_class.reloptions like
any other option, but we also store a transformed
structure (data type pg_node_tree) in pg_pax_table.partitionspec.
Constant values are not directly saved in PartitionBoundSpec.
For range partitions, lowerdatums and upperdatums save
lists of PartitionRangeDatums, not Const. The transform
normally happens after the parser, but we transform directly here.

Now, the partitionspec in pg_pax_tables is exactly the same
as a normal pg_class.relpartbound for partitioned tables.
wuhao and others added 27 commits April 14, 2025 21:20
Pax uses standard new/delete to manage memory for C++ objects, but
implements the global operator functions for new/delete. This commit
replaces all new/delete with template functions for a later refactor.
We unify how C++ objects are managed in the new template functions.
This has two major advantages:
1. The memory management may change in the future; we can then change
    it all in one place.
2. We no longer use the global operator functions.
1. delete the zstd submodule and use the same zstd dependency as cloudberrydb
2. update cpp-stub (removing the submodules inside it)
3. merge dependencies into the same cmake file and check dependencies in advance
When pax writes data, two levels of statistics are generated:
file-level statistics and group-level statistics.

Previously, WriteTuple updated both levels. In fact, we only need
to update the group-level statistics; the file-level statistics can
then be produced by merging them.
The current lighting pipeline uses CBDB_BUILD_TYPE to define the build type.
CTupleSlot was originally used to encapsulate some operations on TupleTableSlot.

In fact, it is an extra object: PG already has enough functions to
modify TupleTableSlot.

In the current version, this object not only costs additional memory but also
makes the interface more complex, so the current change removes it.
Remove the BUILD_PAX_FORMAT option and compile paxformat.so and pax.so at the same
time by default. When compiling paxformat.so, add -DBUILD_PAX_FORMAT.
BuildPaxFilePath is not a thread-safe function, and it is called multiple times in the write/read path.

Change to getting the relation path before calling BuildPaxFilePath; when we need to build
the file path, we then don't need to rebuild the relation path every time.
The original namespace of the generated protobuf classes conflicted
with official ORC, which could make PAX code use the wrong class.
We use a different namespace to avoid this.
${prefix} is the database install path chosen at configure time. We
install libpax.so, libpaxformat.so, and the headers to ${prefix} so that
other extensions can link against them.
The file name was previously generated from a uuid, which means
the data files are unordered and the ctid is hard to define.
The ctid used by DELETE/UPDATE was built temporarily in dynamic
shared memory.
However, manipulating dynamic shared memory is unfriendly to
parallel scan, because we use the pg functions to operate on it.
We will no longer maintain the old behavior of using a uuid as
the file name; instead, the file name becomes the block number,
which forms part of the ctid.
PAX always calculates min/max values in the micro partitions,
so we can apply these values to filter whole micro partitions.
Previously there was no way to bypass the filtering process.

However, there are cases where we need to disable the filtering,
e.g. when investigating bugs.

The GUC is on by default; it must be turned off manually.
We should use the macro defined by the vectorization extension rather than
a hard-coded flag.
The member `IndexFetchTableData base_` MUST be the first field of class
PaxIndexScanDesc. The class object MUST be able to convert to the postgres
struct IndexFetchTableData. On the other hand, the class PaxIndexScanDesc
is not allowed to have any virtual function.
`make -j` uses an unlimited number of jobs, which may exhaust memory.
If the current footer length is too long, a stack overflow may occur.

1. use scoped_ptr to release the buffer and avoid a memory leak
2. fix a compile error
pax_itemptr.cc is unused in libpaxformat.so but caused an unexpected compile error.
The GUC name should not contain a dot `.`, because setting a non-existent
GUC name with a dot doesn't raise an error. That is confusing if the user
misspells the GUC name.

```
gpadmin=# set pax.abcdefg = on;
SET
gpadmin=# set pax_abcdefg = on;
ERROR:  unrecognized configuration parameter "pax_abcdefg"
```
This commit adds several basic regression tests for PAX.
When the ICW test runs, these basic tests run as well, so
basic DDL and DML queries are covered.
When running the configure script, we can pass `--enable-pax` to enable PAX
support, or `--disable-pax` to disable it. The default behavior
is to disable PAX support.
After pax implements some of the PG min/max operators, we no longer allow calling detoast inside pax operators.
We need to perform the detoast operation before storing the min/max value or a non-fixed-length datum.
The GUC `pax_scan_reuse_buffer_size` was marked as must-sync from
QD to QE. This commit marks the GUC as not needing sync, for two
reasons:
1. The GUC does not need to be forcibly synced to the QE. A normal `SET`
    will also set the GUC on the QE, and it is perfectly valid for the
    GUC values to differ between QD and QEs.
2. The previous behavior required that PAX be loaded before GUCs are
    set from the QD on the QE. If the PAX module is not loaded, syncing
    `pax_scan_reuse_buffer_size` before loading pax.so causes an
    undefined-GUC error.

```
FATAL:  unrecognized configuration parameter "pax_scan_reuse_buffer_size"
```
PAX needs to define StdRdOptions instead of just vl_len,
because many places in CBDB assume that a relation's options can be cast to StdRdOptions.
`DataBuffer` is the basic buffer manager of pax.

Making frequently called methods inline reduces some call overhead.
When cbdb runs `alter table` on a PAX table, `PaxObjectAccessHook` is called.

sub_id will then not be 0, because `attrnum` is not 0.
The compress level in the group footer was always 0, because pax did not set it.
We should record this field even if it is not used during reading.
Without recording the compress level, a written file is unaware of the compress
level of the group.
In the current implementation, the length of a single tuple field will not exceed 2GB,
so a single element of the lengths stream can use int32 to represent the length, saving
4 bytes per variable-length field. In a large wide-table test with 500
variable-length columns and 1 million tuples, 2GB of storage space was saved.
…lemented sk_strategy

If the current operator is not supported as a pax operator, the all_null and has_null
flags were lost in the file-level statistics.
Also, pax should not Assert(false) when it accepts an unimplemented sk_strategy.

The current change also implements the bpchar operator.
@jiaqizho jiaqizho force-pushed the pax-split-380-commit-2 branch from 285b3c9 to 3396ce7 Compare April 14, 2025 13:20
@jiaqizho
Contributor Author

2/4 part of #1002.

Merging a pull request using the "Rebase and merge" option is limited to 100 commits. PR 1002 contains 380 commits.

@jiaqizho jiaqizho merged commit f967400 into apache:main Apr 14, 2025
22 checks passed