
Feature: introduce a high-performance hybrid row-columnar storage engine (3/4)#1043

Merged
jiaqizho merged 98 commits into apache:main from jiaqizho:pax-split-380-commit-3
Apr 14, 2025

Conversation

@jiaqizho
Contributor

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


@jiaqizho jiaqizho force-pushed the pax-split-380-commit-3 branch from dcac80a to 8fd5438 on April 10, 2025 12:17
jiaqizho and others added 27 commits April 14, 2025 22:25
If the current TupleTableSlot is reused, PAX will hit Assert(false) in ExecStoreVirtualTuple.
When PAX uses TablePartitionWriter to insert data, the file merging operation is performed in Close().

If the current table already has indexes built, PAX can only use delete-update to update the indexes.

If the current table is not indexed, PAX assumes that the CTID in the TupleTableSlot (set via SetBlockNumber + SetTupleOffset) has no meaning, which means we can directly call writer->MergeTo to speed up the merge process.
Add back PAX unittests into CI
There is a big difference between numeric as defined in Arrow and numeric as defined in PG. Arrow has only two decimal types, decimal128 and decimal256, which fix the number of numeric digits.

To be compatible with the PG format and make vectorized processing faster, PAX stores numerics in the adapted 128-bit format defined by Arrow.

Users can use the reloption numeric_vec_storage to enable this storage method.
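The 128-bit representation can be sketched as an unscaled integer plus a fixed decimal scale, as in Arrow's decimal128 layout. This is an illustrative sketch only; the scale value and function names are hypothetical, not PAX's actual code:

```python
from decimal import Decimal

SCALE = 4  # hypothetical fixed scale chosen for the column

def to_decimal128(value: Decimal, scale: int = SCALE) -> int:
    """Encode a Decimal as an Arrow-style decimal128: a 128-bit integer plus a fixed scale."""
    unscaled = int(value.scaleb(scale))          # shift the decimal point right by `scale`
    assert -(1 << 127) <= unscaled < (1 << 127)  # must fit in 128 bits
    return unscaled

def from_decimal128(unscaled: int, scale: int = SCALE) -> Decimal:
    """Decode the unscaled 128-bit integer back into a Decimal."""
    return Decimal(unscaled).scaleb(-scale)

print(to_decimal128(Decimal("12.3456")))  # 123456
print(from_decimal128(123456))            # 12.3456
```

A fixed scale is what makes the layout vectorization-friendly: every value in the column is a plain 128-bit integer, so comparisons and sums need no per-value exponent handling.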
When an exception happens in CreateMicroPartitionWriter, the MicroPartitionWriter in TableWriter may be left as nullptr.

If the caller then calls TableWriter->Close(), it will hit a nullptr error.
…mestamptz types

In the mapping from PG types to Arrow types, the implementation of ConvSchemaAndDataToVec in the pax extension needs to be consistent with the implementation of the PGTypeToArrowID function in the vectorization extension. Arrow::timestamp is used to represent the timestamp/timestamptz/time types in PG.
When the access method function swap_relation_files is called, PAX should swap the fast sequence.
The PAX aux table uses the oid (7064) as its namespace, which is one of the system table namespaces.

After a PAX table is created, the aux table's tablespace is set to the invalid oid (0), no matter whether the current tablespace is pg_default or not. This is because the tablespace of the PAX aux table is unchangeable (system table permission).

Whenever a PAX aux table is created, its index is also created. But the index of the aux table is set to the same tablespace as the PAX table (the current tablespace), so the user cannot successfully move the index to another tablespace.
We use the build directory generated by make to avoid compiling again.
…'s less than 64

Check the value of gp_interconnect_queue_depth and warn if it is less than 64. Motion performance is poor when the gp_interconnect_queue_depth value is small.
When executing SQL containing SPI logic, ReleaseCurrentSubTransaction is called at the end of SPI. At this time FdHandleAbortCallback is called, and we should not release the parent's resources, so we need to check the owner of the resource.
Once PAX clusters an index in a transaction, the function CPaxCopyPaxBlockEntry may cause the tuple in the old aux table to change. So we should copy the tuple rather than directly use the old one.

Also, we should update the fast sequence when PAX clusters indexes.
In PAX, there are two file-splitting rules:

  1. Split by number of tuples
  2. Split by file size

But every time a stripe in the file is written, the physical size of the current file is reset, which breaks the file-size rule.
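The two rules can be sketched as follows; the class and threshold names are hypothetical, and the comment marks where the reported bug (resetting the physical size per stripe) would break the second rule:

```python
class FileSplitter:
    """Illustrative sketch of PAX's two file-splitting rules (names hypothetical)."""

    def __init__(self, max_tuples: int, max_file_size: int):
        self.max_tuples = max_tuples
        self.max_file_size = max_file_size
        self.tuples = 0
        self.file_size = 0  # must accumulate across stripes, not be reset per stripe

    def on_stripe_written(self, stripe_tuples: int, stripe_bytes: int) -> bool:
        """Return True when the current file should be closed and a new one opened."""
        self.tuples += stripe_tuples
        self.file_size += stripe_bytes  # the bug was resetting this on every stripe
        return (self.tuples >= self.max_tuples
                or self.file_size >= self.max_file_size)

s = FileSplitter(max_tuples=1000, max_file_size=8 << 20)
print(s.on_stripe_written(100, 4 << 20))  # False: neither limit reached yet
print(s.on_stripe_written(100, 5 << 20))  # True: accumulated 9 MiB >= 8 MiB
```

If the size were reset after each stripe, only the tuple-count rule could ever fire, and files could grow far past the intended size limit.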
There are some statistical functions for tables in PG, such as the number of tuples inserted, the number of tuples read, etc. This type of statistics also needs to be updated in PAX.

Note that these PG statistics do not affect the results of ANALYZE.
The null test should be checked before inspecting column-specific attributes. `attno == 0` is the special case for the null test: it checks whether all columns are all null or all not null.
Special handling for C locale comparison.
In TPCH or TPCDS, if PAX uses the default file-splitting configuration, the data files become too small, which hurts read performance.

In other cases, the current file-splitting configuration may not be a problem.

The current change turns the file-splitting configuration into a GUC, which allows users to configure it differently for different cases, and the default value of the file-splitting configuration is changed to a larger value.

Also, PAX will no longer set a default encoding for columns.
Fix incorrect micro partition file skip logic: if scan_key->sk_collation is not equal to attr->attcollation, do not skip the micro partition file.
The CustomObjectClass in PAX did not implement the callbacks object_type_desc and object_identity_parts, which are used in trigger calls.

If an event trigger is registered, CBDB cannot drop the PAX table, because the event trigger tries to get the object description when the table is dropped. Since PAX did not implement the callback functions, an error was raised during this process.
Under CBDB execution, for the vec_numeric type, we need to perform format conversion.

The newly allocated numeric was not freed.
Add an array of projection column indexes to quickly skip unnecessary columns when reading projected columns. For tables with a large number of columns, this modification brings good performance improvements when reading a small number of columns.
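The idea can be sketched as follows (names are hypothetical, not the actual PAX API): compute the projected column indexes once, so the per-row path touches only the needed columns instead of testing every column:

```python
def build_proj_index(projection: list) -> list:
    """Precompute, once per scan, the indexes of the projected columns."""
    return [i for i, keep in enumerate(projection) if keep]

def read_row(row: list, proj_index: list) -> list:
    """Per-row read touches only the precomputed indexes, not all columns."""
    return [row[i] for i in proj_index]

# Wide table where only 2 of 5 columns are projected:
projection = [True, False, False, True, False]
idx = build_proj_index(projection)
print(idx)                                       # [0, 3]
print(read_row(["a", "b", "c", "d", "e"], idx))  # ['a', 'd']
```

For a table with hundreds of columns and a query projecting a handful, the per-row loop shrinks from O(total columns) to O(projected columns).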
1. Tuples in storage_am may be marked for deletion: for vectorized queries, we need to filter out invisible tuple data based on the visibility map.
2. The BPCHAR type no longer triggers an assert and is processed like varchar.
Currently, we use the AM handler's oid to judge whether the AM supports vectorization, and generate the vectorized plan if it does.

For new AMs defined in plugins, we would need to access the oid through the plugin's interface, which is inconvenient because of the dependency. We use the AM callback 'scan_flags' instead of the AM oid, and each new AM in a plugin only needs to return the flags representing the supported features.
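A minimal sketch of the flags approach; the callback name 'scan_flags' comes from the commit, but the flag constants and function names below are hypothetical:

```python
# Hypothetical feature flags an AM could advertise via its scan_flags callback.
SCAN_SUPPORT_VECTORIZATION = 1 << 0
SCAN_SUPPORT_COLUMN_ORIENTED = 1 << 1

def pax_scan_flags() -> int:
    """A plugin AM returns the features it supports, with no oid lookup needed."""
    return SCAN_SUPPORT_VECTORIZATION | SCAN_SUPPORT_COLUMN_ORIENTED

def can_vectorize(flags: int) -> bool:
    """The planner tests the flag instead of matching a known AM oid."""
    return bool(flags & SCAN_SUPPORT_VECTORIZATION)

print(can_vectorize(pax_scan_flags()))  # True
print(can_vectorize(0))                 # False: AM advertises nothing
```

The design benefit is that the planner never needs to know which AMs exist; any plugin that returns the right flag gets a vectorized plan.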
PAX should be preloaded, so the GUCs can be set in the QD before QE processes are created.
PAX has a very different storage model and IO model from the heap table. The gist/brin indexes can't work for PAX, and there may be more unsupported index types.

We only guarantee that PAX supports btree indexes. Disabling creation of other index types on PAX tables avoids wrong results from unsupported indexes.
Before storage_am used the row filter reader, PAX did not need to build the CTID in `update` SQL, because storage_am fetched all tuples and built the CTID itself.

But after storage_am switched to the row filter reader, it no longer fetches all tuples in `update` SQL. Then, without PAX setting the CTID, storage_am cannot generate the CTID by itself.
Must deal with the projection info if the current projection indexes have not been built.
jiaqizho and others added 27 commits April 14, 2025 22:26
In GetTuple, if multiple tuples are fetched, PAX calls CountNulls multiple times. CountNulls loops over the values and accumulates the number of NULLs.

With the current change, PAX pre-calculates the number of null values at each position, so only one pass is needed when we call GetTuple.
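The pre-calculation is essentially a prefix sum over the null bitmap; a minimal sketch with hypothetical names, not the actual PAX code:

```python
def build_null_prefix(null_bitmap: list) -> list:
    """One pass: prefix[i] = number of NULLs among the first i entries."""
    prefix, total = [0], 0
    for is_null in null_bitmap:
        total += 1 if is_null else 0
        prefix.append(total)
    return prefix

def nulls_before(prefix: list, pos: int) -> int:
    """O(1) lookup instead of re-scanning the bitmap for every fetched tuple."""
    return prefix[pos]

bitmap = [False, True, True, False, True]
prefix = build_null_prefix(bitmap)
print(nulls_before(prefix, 3))  # 2 NULLs among the first 3 entries
print(nulls_before(prefix, 5))  # 3 NULLs in total
```

With the prefix array built once per block, fetching N tuples costs N constant-time lookups rather than N partial re-scans of the bitmap.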
When a length-0, all-null bpchar column occurs, we previously filled in a nullptr.

This caused the vectorized execution engine to not handle it correctly.
1. refactor filesystem interface
2. add dfs_tablespace support
The vec adapter has become more bloated as the logic has grown.

It mainly contains three parts of logic:
1. Convert PAX columns (in memory) to a RecordBatch
2. Convert the buffer in a PAX column (PORC format) to an arrow::array
3. Convert the buffer in a PAX column (PORC_VEC format) to an arrow::array

With the current changes, these parts are split into different files and are independent of each other.
PAX previously used Plasma to cache column-projected data.
However, this did not work well, mainly for the following reasons:

  1. Plasma is not an efficient solution.
  2. Different data filtering schemes cause cache invalidation.
  3. The page cache already caches data on the read path.

Coupled with the lack of maintenance across multiple versions, PAX no longer supports the Plasma caching solution.
This commit fixes several issues with PAX visimap handling and adds test cases for all kinds of column types, for both porc and porc_vec:

  1. Fix scanning tuples for storage_format=porc_vec
  2. Add test cases for all column types

Co-authored-by: jiaqizho <zhoujiaqi@hashdata.cn>
These changes enable PAX to support ICW with vectorization. The current test is modified from contrib/vectorization/src/test/regress/parallel_schedule_aocs. The reason greenplum_schedule/parallel_schedule is not used in PAX is that those schedules in the vectorization directory still do not run with vectorization.

The current testing changes include the following parts:
1. Plan changes: AOCS was run in the original test, and AOCS does not perform index scan/index-only scan by default, which causes most of the plan diffs.
2. Renamed sql/out files: in the previous test set, the suffix was always _aocs.sql/_aocs.out, and some files named 1_aocs.out are invalid (should be aocs_1.out). The current changes restore all file names.
3. Adapt the test to the pax-test:* CI job.
…ot -1

The fix is for the case where typlen is not -1 but typbyval is not true, for example the address type.
The exceptions in PAX were simple, relying only on the exception type to convey information.

But when debugging, we often need some accompanying contextual information to quickly check the current error.

Therefore, with the current changes, the exception is strengthened, and all the contextual information we need is added to it.
Enable the storage format porc_vec ICW tests.
The current test cases run after the porc format tests.
In the previous PAX toast design, we uniformly named the toast files generated by PAX <blockname>.toast. This allowed PAX to check whether a toast file exists via toast_file->exist().

Also, the implementation of PAX toast was unaware of storage_am/pg.

But after PAX was connected to gopher, the exist() interface became unusable: if the current file does not exist, gopher does throw -> try...catch -> return false, which is very inefficient.

This forces PAX to record a toast_exist field (in the aux table) to avoid calling the exist() interface.
The current change is mainly support for `storage_am`. For now, PAX can use the COUNT statistics to bypass `count(*)`.

In the future, if PAX supports `custom scan`, we can also bypass part of the `count/sum` SQL.
They are copied from CBDB, but each time we add a catalog table, we would have to modify all of them.

All catalogs have been checked in CBDB, and these checks are unnecessary; just ignore them to keep things simple.

Authored-by: Zhang Mingli avamingli@gmail.com
Fix some issues in the current changes:

  1. existexttoast was not queried in the aux table.
  2. The sum stats may be wrong after merging groups.

Also, the current changes re-enable the SQL tests defined in the pax-tests target.
After PAX supports `min/max/count/sum` statistics per group/file, every update/delete in a single block invalidates part of the file-level statistics (count and sum).

The current change updates the file-level statistics after an update/delete happens.

Note: the group-level statistics will never be updated.
After PAX supports the sum/count stats, the pb stats combine function (which is provided to storage_am) must also be updated. The current change splits the MicroPartitionStatisticsInfoCombine function into these parts:

  - The PrepareStatisticsInfoCombine function checks that the pb stats struct is valid.
  - The CommStatisticsInfoCombine function combines the required fields: count/hasnull/allnull.
  - The MinMaxStatisticsInfoCombine function combines the min/max.
  - The SumStatisticsInfoCombine function combines the sum.
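The combine steps can be sketched as follows; the dict fields are hypothetical stand-ins for the pb stats structs, not the actual protobuf schema:

```python
def comm_combine(a: dict, b: dict) -> dict:
    """Combine the required fields: counts add, hasnull ORs, allnull ANDs."""
    return {
        "count": a["count"] + b["count"],
        "hasnull": a["hasnull"] or b["hasnull"],
        "allnull": a["allnull"] and b["allnull"],
    }

def minmax_combine(a: dict, b: dict) -> dict:
    """The combined min/max is the extremum over both groups."""
    return {"min": min(a["min"], b["min"]), "max": max(a["max"], b["max"])}

def sum_combine(a: dict, b: dict) -> dict:
    """Sums simply add across groups."""
    return {"sum": a["sum"] + b["sum"]}

g1 = {"count": 10, "hasnull": False, "allnull": False, "min": 1, "max": 7, "sum": 40}
g2 = {"count": 5, "hasnull": True, "allnull": False, "min": 3, "max": 9, "sum": 25}
combined = {**comm_combine(g1, g2), **minmax_combine(g1, g2), **sum_combine(g1, g2)}
print(combined)  # count 15, hasnull True, allnull False, min 1, max 9, sum 65
```

Splitting the combine into these independent steps mirrors the refactor: each statistic kind can be validated and merged on its own.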
After PAX supports toast storage, pax_dump also needs the toast file to be specified, to ensure that we can correctly parse the data part.

The current change does not support parsing toast datums, but supports supplying toast files in order to open PAX files that have toast.
The remote file can't use kReadWriteMode.

Added a write_only flag to make sure the current file is write-only.
In some customer environments, we may not be allowed to get/access the customer's data files, or even use the shell.

Therefore, the current changes also support using a UDF to dump user data. At the same time, the UDF can connect to object storage to support storage_am debugging, which is not supported in pax_dump.
PAX does not support brin/gist/spgist indexes. These index types assume that the distribution of data is managed by PAGE, which causes the index itself to not meet expectations.

The current changes no longer allow PAX to create these indexes.
The pax table supports the cluster syntax based on the btree index
and behaves the same as the aocs table.
Support rewriting multiple files in order according to the z-order curve.

Because the cluster implementation of Postgres depends on an index, we support two implementations in the PAX AM: index cluster and column-based z-order cluster. Only one of them can be executed at a time.

The default sorting of the z-order cluster is in ascending order of z-value.

```
-- zorder cluster
create table t1(c1 int, c2 int) using pax with(cluster_columns='c1');
insert into t1 select i,i from generate_series(10,1,-1) i;
table t1;
cluster t1;
table t1;
```
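The z-value used for sorting is a bit-interleaving (Morton code) of the cluster columns. A minimal two-column sketch, illustrative only and not the actual PAX implementation:

```python
def z_value(x: int, y: int, bits: int = 32) -> int:
    """Interleave the bits of two non-negative ints: x takes the even
    bit positions, y the odd ones, giving a point on the z-order curve."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

rows = [(3, 5), (0, 0), (2, 2), (1, 7)]
# Cluster = rewrite the rows in ascending z-value order:
clustered = sorted(rows, key=lambda r: z_value(r[0], r[1]))
print(clustered)  # [(0, 0), (2, 2), (3, 5), (1, 7)]
```

Sorting by the interleaved value keeps rows that are close in *both* columns physically close in the file, which is what lets min/max statistics prune files for filters on either cluster column.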
This would cause a potential memory leak.
The operators for varchar do not exist in pg_operator.dat, but varchar has the same operators as text. This is because the oper() function picks the type that can be cast to.

The current change supports the varchar min/max operators in PAX.
Unlike the `delete` keyword in C++, `cbdb::pfree` does not allow nullptr to be passed in.

Therefore, the current commit checks whether the pointer is nullptr before calling `cbdb::pfree`.
The type of ptblockname is changed to int, because table block file naming is implemented via a self-incrementing id, and using an int index for queries is more efficient.
@jiaqizho jiaqizho force-pushed the pax-split-380-commit-3 branch from 8fd5438 to a159ae5 on April 14, 2025 14:26
@jiaqizho jiaqizho merged commit 90a96f5 into apache:main Apr 14, 2025
22 checks passed
@jiaqizho
Contributor Author

3/4 part of #1002.

Merging a pull request using the "Rebase and merge" option is limited to 100 commits. PR 1002 contains 380 commits.
