Feature: introduce a high-performance hybrid row-columnar storage engine (4/4) by jiaqizho · Pull Request #1044 · apache/cloudberry

jiaqizho · 2025-04-10T11:35:10Z

Fixes #ISSUE_Number

What does this PR do?

Type of Change

Bug fix (non-breaking change)
New feature (non-breaking change)
Breaking change (fix or feature with breaking changes)
Documentation update

Breaking Changes

Test Plan

Unit tests added/updated
Integration tests added/updated
Passed make installcheck
Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Followed contribution guide
Added/updated documentation
Reviewed code for security implications
Requested review from cloudberry committers

Additional Context

CI Skip Instructions

reshke · 2025-04-11T21:03:56Z

Here is problems i get with compilations:


/home/reshke/cloudberry/contrib/pax_storage/src/cpp/storage/micro_partition.cc: In member function ‘virtual std::unique_ptr<pax::ColumnStatsProvider> pax::MicroPartitionReaderProxy::GetGroupStatsInfo(size_t)’:
/home/reshke/cloudberry/contrib/pax_storage/src/cpp/storage/micro_partition.cc:94:19: error: moving a temporary object prevents copy elision [-Werror=pessimizing-move]
   94 |   return std::move(reader_->GetGroupStatsInfo(group_index));
      |          ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/reshke/cloudberry/contrib/pax_storage/src/cpp/storage/micro_partition.cc:94:19: note: remove ‘std::move’ call
[ 37%] Building CXX object src/cpp/CMakeFiles/test_main.dir/storage/micro_partition_stats_updater.cc.o
[ 37%] Building CXX object src/cpp/CMakeFiles/test_main.dir/storage/micro_partition_udf.cc.o
cc1plus: all warnings being treated as errors
make[4]: *** [src/cpp/CMakeFiles/test_main.dir/build.make:496: src/cpp/CMakeFiles/test_main.dir/storage/micro_partition.cc.o] Error 1
make[4]: *** Waiting for unfinished jobs....
[ 38%] Building CXX object src/cpp/CMakeFiles/bench_main.dir/storage/micro_partition_file_factory.cc.o
/home/reshke/cloudberry/contrib/pax_storage/src/cpp/storage/micro_partition.cc: In member function ‘virtual std::unique_ptr<pax::ColumnStatsProvider> pax::MicroPartitionReaderProxy::GetGroupStatsInfo(size_t)’:
/home/reshke/cloudberry/contrib/pax_storage/src/cpp/storage/micro_partition.cc:94:19: error: moving a temporary object prevents copy elision [-Werror=pessimizing-move]
   94 |   return std::move(reader_->GetGroupStatsInfo(group_index));
      |          ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/reshke/cloudberry/contrib/pax_storage/src/cpp/storage/micro_partition.cc:94:19: note: remove ‘std::move’ call```

/home/reshke/cloudberry/contrib/pax_storage/src/cpp/storage/micro_partition_iterator.cc: In member function ‘virtual pax::MicroPartitionMetadata pax::internal::MicroPartitionInfoIterator::Next()’:
/home/reshke/cloudberry/contrib/pax_storage/src/cpp/storage/micro_partition_iterator.cc:167:19: error: moving a temporary object prevents copy elision [-Werror=pessimizing-move]
167 | return std::move(ToValue(tuple));
| ~~~~~~~~~^~~~~~~~~~~~~~~~
/home/reshke/cloudberry/contrib/pax_storage/src/cpp/storage/micro_partition_iterator.cc:167:19: note: remove ‘std::move’ call
[ 55%] Building CXX object src/cpp/CMakeFiles/bench_main.dir/access/paxc_rel_options.cc.o
/home/reshke/cloudberry/contrib/pax_storage/src/cpp/storage/micro_partition_iterator.cc: In member function ‘virtual pax::MicroPartitionMetadata pax::internal::MicroPartitionInfoParallelIterator::Next()’:
/home/reshke/cloudberry/contrib/pax_storage/src/cpp/storage/micro_partition_iterator.cc:378:19: error: moving a temporary object prevents copy elision [-Werror=pessimizing-move]
378 | return std::move(ToValue(tuple));
| ~~~~~~~~~^~~~~~~~~~~~~~~~
/home/reshke/cloudberry/contrib/pax_storage/src/cpp/storage/micro_partition_iterator.cc:378:19: note: remove ‘std::move’ call
[ 56%] Building CXX object src/cpp/CMakeFiles/bench_main.dir/access/pax_access_handle.cc.o

jiaqizho · 2025-04-14T02:20:30Z

Here is problems i get with compilations:

HI @reshke , would you like change the contrib/pax_storage/CMakeLists.txt:L8? Already have a commit(not push yet) change the -Werror relatived logical.

reshke · 2025-04-14T05:31:28Z

Here is problems i get with compilations:

HI @reshke , would you like change the contrib/pax_storage/CMakeLists.txt:L8? Already have a commit(not push yet) change the -Werror relatived logical.

So, this already fixed? Ok. Locally I removed std::move - and it compiles ok.

Next step, i get errors:


reshke@yezzey-cbdb-bench:~/cloudberry$ createdb
psql
reshke@yezzey-cbdb-bench:~/cloudberry$ psql
psql (14.4, server 14.4)
Type "help" for help.

reshke=# set allow_system_table_mods to true;
SET
reshke=# \i /usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:3: NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'relid' as the Apache Cloudberry data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:3: ERROR:  relation "pg_pax_tables" already exists
DELETE 0
UPDATE 1
UPDATE 0
UPDATE 10
UPDATE 1
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:15: ERROR:  relation "pg_pax_tables_relid_index" already exists
UPDATE 1
UPDATE 1
UPDATE 1
UPDATE 1
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:27: ERROR:  duplicate key value violates unique constraint "pg_proc_oid_index"
DETAIL:  Key (oid)=(7600) already exists.
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:28: ERROR:  duplicate key value violates unique constraint "pg_am_name_index"
DETAIL:  Key (amname)=(pax) already exists.
COMMENT
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:30: ERROR:  duplicate key value violates unique constraint "pg_proc_oid_index"
DETAIL:  Key (oid)=(7601) already exists.
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:31: ERROR:  duplicate key value violates unique constraint "pg_proc_oid_index"
DETAIL:  Key (oid)=(7602) already exists.
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:32: ERROR:  duplicate key value violates unique constraint "pg_type_oid_index"
DETAIL:  Key (oid)=(7603) already exists.
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:34: NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'objid' as the Apache Cloudberry data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:34: ERROR:  relation "pg_pax_fastsequence" already exists
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:35: ERROR:  relation "pg_pax_fastsequence_objid_idx" already exists
UPDATE 0
UPDATE 1
UPDATE 0
UPDATE 1
UPDATE 10
UPDATE 1
UPDATE 1
UPDATE 1
UPDATE 1
reshke=# select ^C
reshke=# select * from pg_proc where oid = 7600;
 oid  |       proname       | pronamespace | proowner | prolang | procost | prorows | provariadic | prosupport | prokind | prosecdef | proleakproof | proisstrict | proretset | provolatile | proparallel |
pronargs | pronargdefaults | prorettype | proargtypes | proallargtypes | proargmodes | proargnames | proargdefaults | protrftypes |       prosrc        |   probin    | prosqlbody | proconfig | proacl | pr
odataaccess | proexeclocation
------+---------------------+--------------+----------+---------+---------+---------+-------------+------------+---------+-----------+--------------+-------------+-----------+-------------+-------------+-
---------+-----------------+------------+-------------+----------------+-------------+-------------+----------------+-------------+---------------------+-------------+------------+-----------+--------+---
------------+-----------------
 7600 | pax_tableam_handler |           11 |       10 |      13 |       1 |       0 |           0 | -          | f       | f         | f            | t           | f         | s           | u           |
       1 |               0 |        269 | 2281        |                |             |             |                |             | pax_tableam_handler | $libdir/pax |            |           |        | n
            | a
(1 row)

But creating/inserting into pax table works after this, so maybe this errors do not change much.

jiaqizho · 2025-04-14T05:55:47Z

So, this already fixed? Ok. Locally I removed std::move - and it compiles ok.

Next step, i get errors:


reshke@yezzey-cbdb-bench:~/cloudberry$ createdb
psql
reshke@yezzey-cbdb-bench:~/cloudberry$ psql
psql (14.4, server 14.4)
Type "help" for help.

reshke=# set allow_system_table_mods to true;
SET
reshke=# \i /usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:3: NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'relid' as the Apache Cloudberry data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:3: ERROR:  relation "pg_pax_tables" already exists
DELETE 0
UPDATE 1
UPDATE 0
UPDATE 10
UPDATE 1
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:15: ERROR:  relation "pg_pax_tables_relid_index" already exists
UPDATE 1
UPDATE 1
UPDATE 1
UPDATE 1
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:27: ERROR:  duplicate key value violates unique constraint "pg_proc_oid_index"
DETAIL:  Key (oid)=(7600) already exists.
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:28: ERROR:  duplicate key value violates unique constraint "pg_am_name_index"
DETAIL:  Key (amname)=(pax) already exists.
COMMENT
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:30: ERROR:  duplicate key value violates unique constraint "pg_proc_oid_index"
DETAIL:  Key (oid)=(7601) already exists.
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:31: ERROR:  duplicate key value violates unique constraint "pg_proc_oid_index"
DETAIL:  Key (oid)=(7602) already exists.
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:32: ERROR:  duplicate key value violates unique constraint "pg_type_oid_index"
DETAIL:  Key (oid)=(7603) already exists.
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:34: NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'objid' as the Apache Cloudberry data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:34: ERROR:  relation "pg_pax_fastsequence" already exists
psql:/usr/local/cbdb/share/postgresql/cdb_init.d/pax-cdbinit--1.0.sql:35: ERROR:  relation "pg_pax_fastsequence_objid_idx" already exists
UPDATE 0
UPDATE 1
UPDATE 0
UPDATE 1
UPDATE 10
UPDATE 1
UPDATE 1
UPDATE 1
UPDATE 1
reshke=# select ^C
reshke=# select * from pg_proc where oid = 7600;
 oid  |       proname       | pronamespace | proowner | prolang | procost | prorows | provariadic | prosupport | prokind | prosecdef | proleakproof | proisstrict | proretset | provolatile | proparallel |
pronargs | pronargdefaults | prorettype | proargtypes | proallargtypes | proargmodes | proargnames | proargdefaults | protrftypes |       prosrc        |   probin    | prosqlbody | proconfig | proacl | pr
odataaccess | proexeclocation
------+---------------------+--------------+----------+---------+---------+---------+-------------+------------+---------+-----------+--------------+-------------+-----------+-------------+-------------+-
---------+-----------------+------------+-------------+----------------+-------------+-------------+----------------+-------------+---------------------+-------------+------------+-----------+--------+---
------------+-----------------
 7600 | pax_tableam_handler |           11 |       10 |      13 |       1 |       0 |           0 | -          | f       | f         | f            | t           | f         | s           | u           |
       1 |               0 |        269 | 2281        |                |             |             |                |             | pax_tableam_handler | $libdir/pax |            |           |        | n
            | a
(1 row)

But creating/inserting into pax table works after this, so maybe this errors do not change much.

not finished yet, It is still under discussion. If you are interested, you can submit the PR after the current PR merge (after pax merge). :)

You don't need exec the pax-cdbinit--1.0.sql, the sql script will be exec if pax enabled. So u can direct use the PAX as table access method.

reshke · 2025-04-14T06:07:55Z

I don't get it much. So, pax itself is contrib module (= optional), but is pre-installed in every database? Thats a little bit strange, more clean solution would be a contrib module that modifies catalog only on install (when CREATE EXTENSION)

reshke · 2025-04-14T07:24:20Z

Here is patch for pretty-print PAX_INSERT filenode name in waldump

Before:

reshke@yezzey-cbdb-bench:~/cloudberry/contrib/pax_storage$ pg_waldump /home/reshke/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/pg_wal/00000001000000000000000D | grep pax
rmgr: pax         len (rec/tot): 1573014/1573014, tx:        887, lsn: 0/35515F90, prev 0/35515EE0, desc: PAX_INSERT PAX_INSERT, filename = 8, offset = 0, dataLen = 1572960

After:

reshke@yezzey-cbdb-bench:~/cloudberry$ pg_waldump /home/reshke/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/pg_wal/00000001000000000000000D | grep pax
rmgr: pax         len (rec/tot): 1573014/1573014, tx:        887, lsn: 0/35515F90, prev 0/35515EE0, desc: PAX_INSERT PAX_INSERT, filename = base/17068/16389.8, offset = 0, dataLen = 1572960

patch:

reshke@yezzey-cbdb-bench:~/cloudberry$ git diff contrib/pax_storage/src/cpp/storage/wal/paxc_desc.c
diff --git a/contrib/pax_storage/src/cpp/storage/wal/paxc_desc.c b/contrib/pax_storage/src/cpp/storage/wal/paxc_desc.c
index c1457fc4787..81b398f1762 100644
--- a/contrib/pax_storage/src/cpp/storage/wal/paxc_desc.c
+++ b/contrib/pax_storage/src/cpp/storage/wal/paxc_desc.c
@@ -30,6 +30,7 @@
 #else
 #include "storage/wal/paxc_desc.h"
 #endif
+#include "common/relpath.h"

 void pax_rmgr_desc(StringInfo buf, XLogReaderState *record) {
   char *rec = XLogRecGetData(record);
@@ -38,14 +39,22 @@ void pax_rmgr_desc(StringInfo buf, XLogReaderState *record) {
   switch (info) {
     case XLOG_PAX_INSERT: {
       char filename[MAX_PATH_FILE_NAME_LEN];
+      char *relpathPart;
+      size_t relpathPartSz;

       char *rec = XLogRecGetData(record);
       xl_pax_insert *xlrec = (xl_pax_insert *)rec;

       Assert(xlrec->target.file_name_len < MAX_PATH_FILE_NAME_LEN);

-      memcpy(filename, rec + SizeOfPAXInsert, xlrec->target.file_name_len);
-      filename[xlrec->target.file_name_len] = '\0';
+      relpathPart = relpathbackend(xlrec->target.node, InvalidBackendId, MAIN_FORKNUM);
+      relpathPartSz = strlen(relpathPart);
+
+      memcpy(filename, relpathPart, relpathPartSz);
+
+      memcpy(filename + relpathPartSz, ".", 1);
+      memcpy(filename + relpathPartSz + 1, rec + SizeOfPAXInsert, xlrec->target.file_name_len);
+      filename[relpathPartSz + xlrec->target.file_name_len + 1] = '\0';

       int32 bufferLen = XLogRecGetDataLen(record) - SizeOfPAXInsert -
                         xlrec->target.file_name_len;

I will open PR later, because current PR is already too big.

jiaqizho · 2025-04-14T11:42:56Z

Here is patch for pretty-print PAX_INSERT filenode name in waldump

Before:

reshke@yezzey-cbdb-bench:~/cloudberry/contrib/pax_storage$ pg_waldump /home/reshke/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/pg_wal/00000001000000000000000D | grep pax
rmgr: pax         len (rec/tot): 1573014/1573014, tx:        887, lsn: 0/35515F90, prev 0/35515EE0, desc: PAX_INSERT PAX_INSERT, filename = 8, offset = 0, dataLen = 1572960

After:

reshke@yezzey-cbdb-bench:~/cloudberry$ pg_waldump /home/reshke/cloudberry/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0/pg_wal/00000001000000000000000D | grep pax
rmgr: pax         len (rec/tot): 1573014/1573014, tx:        887, lsn: 0/35515F90, prev 0/35515EE0, desc: PAX_INSERT PAX_INSERT, filename = base/17068/16389.8, offset = 0, dataLen = 1572960

patch:

reshke@yezzey-cbdb-bench:~/cloudberry$ git diff contrib/pax_storage/src/cpp/storage/wal/paxc_desc.c
diff --git a/contrib/pax_storage/src/cpp/storage/wal/paxc_desc.c b/contrib/pax_storage/src/cpp/storage/wal/paxc_desc.c
index c1457fc4787..81b398f1762 100644
--- a/contrib/pax_storage/src/cpp/storage/wal/paxc_desc.c
+++ b/contrib/pax_storage/src/cpp/storage/wal/paxc_desc.c
@@ -30,6 +30,7 @@
 #else
 #include "storage/wal/paxc_desc.h"
 #endif
+#include "common/relpath.h"

 void pax_rmgr_desc(StringInfo buf, XLogReaderState *record) {
   char *rec = XLogRecGetData(record);
@@ -38,14 +39,22 @@ void pax_rmgr_desc(StringInfo buf, XLogReaderState *record) {
   switch (info) {
     case XLOG_PAX_INSERT: {
       char filename[MAX_PATH_FILE_NAME_LEN];
+      char *relpathPart;
+      size_t relpathPartSz;

       char *rec = XLogRecGetData(record);
       xl_pax_insert *xlrec = (xl_pax_insert *)rec;

       Assert(xlrec->target.file_name_len < MAX_PATH_FILE_NAME_LEN);

-      memcpy(filename, rec + SizeOfPAXInsert, xlrec->target.file_name_len);
-      filename[xlrec->target.file_name_len] = '\0';
+      relpathPart = relpathbackend(xlrec->target.node, InvalidBackendId, MAIN_FORKNUM);
+      relpathPartSz = strlen(relpathPart);
+
+      memcpy(filename, relpathPart, relpathPartSz);
+
+      memcpy(filename + relpathPartSz, ".", 1);
+      memcpy(filename + relpathPartSz + 1, rec + SizeOfPAXInsert, xlrec->target.file_name_len);
+      filename[relpathPartSz + xlrec->target.file_name_len + 1] = '\0';

       int32 bufferLen = XLogRecGetDataLen(record) - SizeOfPAXInsert -
                         xlrec->target.file_name_len;

I will open PR later, because current PR is already too big.

cool , cc @gongxun0928

Dict encoding is an experimental feature. `arrow` allows passing dictionary encoding into record batch middle. After the dict encoding column is filled in the record batch, the join time can be effectively reduced.

Current change support PAX direct convect the dictionary encoding into the arrow dictionary part.

The parallel scanning of pax is similar to appendonly table. It also uses a single block file as the scanning unit and implements parallel scanning of files by accessing the auxiliary table in parallel.

if the update strategy of pax table is mark deletion, when a tuple in a block file is marked for deletetion, in the bitmap scan, all data after the tuple which marked in the block will be discarded.

The target of this commit is to allow thread-safe scan. To achieve the target, we refactor in mainly ways: 1. Use smart pointer to manage C++ objects, not memory context any more. We also remove the global placement new/delete operator functions. Other C++ code will not be affected by the unexpected global placement new/delete operator functions, unexpectedly. 2. Some top-level objects will be released in the end of statements. 3. pg functions that release some resources can't be called in the destructor functions. The previous destructor functions of C++ objects are optionally called if the exception happens. All memory is released by memory context, and all opened files are closed by resource owners. Now all memory allocated for c++ objects will be freed from the top-level objects(PaxScanDesc/ PaxIndexScanDesc) to its class members if exception happens.

In pax filter, after we obtain the right value type from expr, we will try to construct a scankey through the brinindex method. The scan key saves the funcid/left type/right type which will used to do call the min/max functions. In fact, building scankey through brinindex method may be incomplete. For example, the bool type cannot construct the scankey. In the current changes, brinindex is removed and we only verified the opfuncid(in expr) exist in pg_operator.

In the customer case: customer will use the varchar store as enum(few set of value in the column). So in current changes, bloom filter is supported in PAX. The enum case will supporbe ted as a subset of bloom filter. bloom filter can also do more filter. In addition, the current changes support filtering of ScalarArrayOpExpr, currently only bloom filter is supported for processing ScalarArrayOpExpr. There are plans to handle different *opexpr with supported filter types in PAX.

…espace to pax PAX disabled stmts such as vacuum full/alter table t1 set tablespace/alter database d1 set tablespace. These logics should not affect HashData Cloud, so they are moved to the PAX hook because PAX AM and Storage AM will not be loaded at the same time.

Context: Scanning pax table could only run in a single thread, file by file, because scan is driven by get_next_tuple of Table Access Method(TAM). The scanning model introduces latency when fetching a batch of tuples if IO operation is needed. On the other hand, vectorization-based execution may run parallelly by threads. So, pax is expected to scan tuples safely in threads. Design: To support parallel scan, new APIs are introduced, to avoid TAM. There are two key concepts: * DatasetInterface It represents scanning a pax table, providing methods to manage the required resources. There are 3 main methods: arrow::Status Initialize() void Release() arrow::Result<arrow::dataset::FragmentIterator> GetFragmentsImpl() Initialize() and Release() are expected to be called in the main thread, they may call some pg functions to acquire or release resource. It's disallowed to call these functions in non-main threads. The third method returns an iterator for fragment. Each fragment corresponds to a data file. A worker thread fetches an item from the iterator and scanning tuples, batch by batch. * FragemntInterface FragmentInterface represents scanning a single data file of a set of data files in pax table. It contains only one useful method to return a RecordBatch iterator: arrow::Result<arrow::RecordBatchIterator> ScanBatchesAsyncImpl() Tuples of a data file are return batch by batch, as required by upstream. All RecordBatches in the iterator come from the same data file. NOTE: it's UNSAFE to call batch_iterator->Next() in parallel. Co-authored-by: Hao Wu [gfphoenix78@gmail.com] Co-authored-by: yangkaidi [yangkaidi@hashdata.cn]

In the current PR, some of pax operators are expanded to make them close to the operators defined by pg_operator.

in commit 56e1084109d50aeff572773dda6048115b2365ad, which set enable_incremental_sort to off and changed src/test/regress cases, apply those changes into vectorization and pax_storge In commit 2caf296e89174c690d6819aab96c379ff90a4e12, the catalog changed, and cbdb_parallel test modified. Fix test parallel_retrieve_cursor/explain: The stats info of pg_class maybe changged when the autovacuum running, and can let the explain command generate different plan. So analyze pg_class before explain command to make the case stable.

Implement a lexical sorting method to optimize single-column sorting with the same prefix

Parallel scan uses a set of new APIs to scan pax tables. Predicates are passed by arrow::compute::Expression. Some data types are not supported in the first edition. This commit supports string types(text/varchar/char) and time types (time/timestamp/timestamptz).

Add icw test for pax table: 1. Fixed the test case with optimizor=off. 2. Add some new testcase in pax_schedule 3. Some test cases with optimizor=off that do not match the answer file are skipped and marked as Unstable 4. Some test cases with optimizor=on that do not match the answer file are skipped and marked as Orca

…oredump `gopherCloseFile` may fail, which may cause data to fail to be written to the object storage. We need to check its return value, also we should move the free of the ufile pointer out of function UFileClose so that we can get the reason for the close error.

After PAX removed the overloads of `new/delete`, we no longer need to mock `Palloc/Palloc0/Pfree` to ensure that there is no coredump happen in `__attribute__((constructor))` function(defined in `protobufer`).

The previous implementation uses an auxiliary heap table to record the meta info for all micro partitions. In this commit, we'll adapt the existing code to the manifest API. The next step is to add implementation of manifest file to manage meta info for all micro partitions.

After the memory management commit(PAX: Refactor memory management to allow thread-safe scan), pax_make_toast cannot determine whether the current datum is empty. If compress fails, the current tts_value should not be set to empty. After the memory management commit(7c0c6c9), pax_make_toast cannot determine whether the current datum is empty. If compress fails, the current tts_value should not be set to empty.

PAXPY relies on CBDB install before it can be built. In the current changes, PAXPY no longer relies on CBDB install before building, but directly links storageformat.so.However, since PAXPY is not tested in CI, PAXPY may not have changed when some APIs of PAX itself are changed. Consider adding it to CI later.

In an insert transaction, after obtaining a block_id from pax_fastsequence, the block_id file and block_id.toast file will be created when file_system ->Open() calls are successful. When the transaction aborts, since pax_fastsequence is implemented as update-in-place, whether the block_id increases depends on whether the data has been flushed to disk. As the transaction aborts and xlog won't be flushed, the next allocation may either reuse the current block_id or use the next one. Due to this non-deterministic behavior, the data file and toast file created earlier may become orphaned files when the block_id is reused. Therefore, O_CREAT | O_TRUNC is specified: if the file does not exist, it will be created; if it exists, its previous content will be truncated to avoid orphaned files.

Add a marker file indicating successful compilation. Do not only depends on the cpp files. It may create an incomplete pax.so file

In the current change, vec.max_batch is no longer use to determine the number of rows returned by record batch. But for the class VecAdapter, the range interface is still retained. For the PAX, the cost of splitting by range is small. If the range interface of the class VecAdapter is no longer needed in the future, the related interface parameters will also be removed.

Can't call pax to make dataset in vectorization, so we work around by passing context. Co-Author: Dongxiao Song songdongxiao@hashdata.cn

Internal partition of PAX is no longer used, so we remove this feature from PAX. After this commit, the reloption 'partition_by' and 'partition_ranges' are also removed.

1. Due to the change of kernel behavior, there are many changes in the execution plan generated by orca. 2. Disable optimizer_trace_fallback to avoid orca fallback generating unstable output.

Several bugs are found and fixed in this commit: 1. Wrong result by filtering bpchar values with bloomfilter, Calculating and testing ignores trailing spaces for bpchar bytes. 2. Fix index counting when copying bit values to a buffer. Increasing the index counter no matter whether the current value is null or not. 3. Run sparse filter with group stats before reading the group. 4. Guard pax_enable_sparse_filter when initializing ParallelScanDesc. Besides fixing the above issue, the pax tests run two pass. One turns off vectorization, while the other one turns on vectorization.

This commit adds a new manifest implementation for catalog. The new implementation used manifest files(regular file), see the third type below. The interface of manifest API is saved in contrib/pax_storage/src/cpp/catalog/manifest_api.h We have 3 implementations for pax catalog: 1. Use the original pax catalog directly, i.e. call the catalog functions in pax code. No intermediate interface is introduced. The catalog table pg_ext_aux.pg_pax_tables is required. Set USE_MANIFEST_API=OFF USE_PAX_CATALOG=ON to enable it. 2. Use the original pax catalog through manifest API. All catalog access is done through the manifest API. The manifest API is implemented by the original pax catalog. pg_ext_aux.pg_pax_tables is also required. Set USE_MANIFEST_API=ON USE_PAX_CATALOG=ON to enable it. 3. Use manifest files to manage catalog for PAX through manifest API. All catalog access is done through the manifest API. The original catalog pg_ext_aux.pg_pax_tables is no more required. The per-table auxiliary table is also changed from storing micro partition info to saving the path of manifest file path. Set USE_MANIFEST_API=ON USE_PAX_CATALOG=OFF to enable it. Each pax table now uses a single manifest file to store the catalog indicating all micro partition info. The design disallows to write, i.e. insert/delete/update, concurrently. To avoid concurrent write, a heavy lock must be taken before writing. The steps of accessing catalog is: 1. Build the auxiliary table name by the oid of pax table. 2. Open the auxiliary table of the pax table, and fetch the path for manifest file. The auxiliary table has only one effective tuple. 3. Open the manifest file and deserialize content to json object. 4. Access manifest API and return result by the internal manifest json object.

This commit adds Apache license 2.0 to all source files and header files in PAX.

In PAX `PORC` format, length streaming is used to record the length of each DATUM in a non-fixed-length column. The composition of length streaming is equivalent to the length array, and its size equal to the number of rows. When reading thenon-fixed-length column, PAX needs to use the length array to calculate the offset array in advance, the offset array can help the format reader quickly locate the middle row. In commit "Performance/improve pax insert performance", PAX no longer builds offsets array during the write phase, which actually breaks the assumption of column: only some column methods distinguish between read and write paths. In the current commit, the length streaming of PAX is changed to the offset streaming. - On the read path, non-fixed-length column no need to build the offset array - On the read path, only using the offset array is more cache-friendly - On the write path, only the offset array needs to be built. And the performance is comparable to building the length array Offset streaming also has disadvantages: the compression rate of offset streaming is likely to be lower than that of length streaming. Currently, PAX does not support DELTA encoding. Once DELTA encoding is supported, this disadvantages may can resolved.

PAX is no longer support object storage in lightning. The implementation of RemoteFileSystem is moved from lightning to cloud to support object storage access by the abstract file API defined in PAX. Despite pax doesn't support RemoteFileSystem, we still disallow to use the dfs tablespace for PAX table.

1. the toast table of the auxiliary table should also be in the pg_aux_ext namespace 2. use GetCatalogSnapshot() as a snapshot when querying auxiliary tables

A doc directory has been added to the PAX project, which will contain documentation for the modules in PAX. - Introduce - Project description - Meta data - Storage format - Toast - Clustering - Filter

After commit(588f5c9) and commit(ca9379e). Access method adds two callbacks that must be implemented. - relation_get_block_sequences: Returns the block sequences contained in this relation. See BlockSequence for details. Currently used by BRIN. - relation_get_block_sequence: Determines the block sequence in which the logical heap 'blkNumber' falls. See BlockSequence for details. Currently used by BRIN. Currently, PAX does not support brin index. So these AM method have been added in the current commit but have not been implemented. After CBDB cherry-picks the complete changes of brin index, consider making PAX support Brin index.

In prev commit(8cf1aba) removed the `am->swap_relation_files` in function `swap_relation_files`. This will cause problems in rewrite table case for the custom AM(like PAX). ex. ``` CREATE TABLE list_parted (a numeric, b int, c int8) PARTITION BY list (a) using pax; CREATE TABLE sub_parted PARTITION OF list_parted for VALUES in (1) PARTITION BY list (b); CREATE TABLE sub_part1(b int, c int8, a numeric) DISTRIBUTED BY (a); ALTER TABLE sub_parted ATTACH PARTITION sub_part1 for VALUES in (1); CREATE TABLE sub_part2(b int, c int8, a numeric) distributed by (a); ALTER TABLE sub_parted ATTACH PARTITION sub_part2 for VALUES in (2); INSERT into list_parted VALUES (2,5,50); INSERT into list_parted VALUES (3,6,60); INSERT into sub_parted VALUES (1,1,60); INSERT into sub_parted VALUES (1,2,10); ALTER TABLE list_parted SET DISTRIBUTED BY (c); select * from list_parted; -- wrong result ``` The `Alter DISTRIBUTED BY` will rewrite the data into a temp table and exchange the relfilenode in temp table with origin table. But without `am->swap_relation_files` call, some of meta or data won't be swap.

After CBDB revert the '64 bit relfilenode', PAX still need adapt the change.

In PAX, The naming convention for visibility maps is: <blocknum>_<generation>_<tag>.visimap - `blocknum` is the current data file name - `generation` is the current visimap generation number. Each deletion of this data file will increase the generation number by 1 - `tag` is the current transaction id. This field is used to ensure the uniqueness of the visimap file name. When the USE_ASSERT_CHECKING is undefined, `generation` cannot be incremented. So if we in the same transaction and update same row twice, then PAX will open the same `.visimap` file.

PAX was based on an older Cloudberry version with copied/modified regression tests in `contrib/pax_storage/src/test/regress/` Updated PAX regression tests to align with the latest version and fixed failed cases: - Synchronized PAX regression tests with current test suite(src/test/regress/) - Fixed ORCA plan differences caused by cherry-picking features: Dynamic Index/Bitmap/Seq Scan, multi-groupset, query parameters, and so on - Resolved planner plan diff - Addressed result diff by marking unsupported test cases

PAX was based on an older Cloudberry version with copied/modified isolation2 tests in `contrib/pax_storage/src/test/isolation2/` Updated PAX isolation2 tests to align with the latest version and fixed failed cases: - Synchronized PAX isolation2 tests with current test suite(src/test/isolation2/) - Change the `uao` test case to PAX which is used to running with AO/AOCS - Remove the unused test cases(like check the gp_aoseg, gp_fastsequence ...) - Fix the plan diffs

The extension vectorization is not open source yet, and the open source version of PAX has removed vectorization-related test cases.

Previously PAX used the internal gitlab repository as a submodule. Now it has switched to using the github repository.

Manifest metadata implementation in PAX is still an experimental feature. Current commit changes the default catalog implementation to the auxiliary table implementation. Also, PAX tests are added to cloudberry's CI

Removed some useless file and change fix the pax icw_test in github CI Also changed the `Cloudberry Database` to `Apache Cloudberry`

jiaqizho · 2025-04-15T01:32:58Z

4/4 part of #1002.

Merging a pull request using the "Rebase and merge" option is limited to 100 commits. PR 1002 contains 380 commits.

jiaqizho force-pushed the pax-split-380-commit-4 branch from 69561d1 to 7354771 Compare April 10, 2025 12:19

tuhaihe self-requested a review April 14, 2025 11:57

tuhaihe approved these changes Apr 14, 2025

View reviewed changes

my-ship-it approved these changes Apr 14, 2025

View reviewed changes

jiaqizho force-pushed the pax-split-380-commit-4 branch from 7354771 to 6313096 Compare April 14, 2025 14:27

jiaqizho and others added 18 commits April 14, 2025 23:25

PAX: Support dict encoding

3eaa823

Dict encoding is an experimental feature. `arrow` allows passing dictionary encoding into record batch middle. After the dict encoding column is filled in the record batch, the join time can be effectively reduced.

PAX: support record batch return with dictionary

e0fb69f

Current change support PAX direct convect the dictionary encoding into the arrow dictionary part.

Regularly synchronize CBDB and update CI

8d5b7d7

PAX: Support parallel-scan

1828479

The parallel scanning of pax is similar to appendonly table. It also uses a single block file as the scanning unit and implements parallel scanning of files by accessing the auxiliary table in parallel.

bugfix: tuples loss issue in bitmap index scan

618910b

if the update strategy of pax table is mark deletion, when a tuple in a block file is marked for deletetion, in the bitmap scan, all data after the tuple which marked in the block will be discarded.

PAX: support almost min/max oper

1ec45c1

In the current PR, some of pax operators are expanded to make them close to the operators defined by pg_operator.

PAX: Support lexical cluster

b01c17d

Implement a lexical sorting method to optimize single-column sorting with the same prefix

Regularly synchronize CBDB and update CI

27fd8d1

PAX: Remove the gtest mock Palloc/Palloc0/Pfree

1d11128

After PAX removed the overloads of `new/delete`, we no longer need to mock `Palloc/Palloc0/Pfree` to ensure that there is no coredump happen in `__attribute__((constructor))` function(defined in `protobufer`).

gfphoenix78 and others added 26 commits April 14, 2025 23:26

build: pax supports incremental compilation.

9b2f8f6

Add a marker file indicating successful compilation. Do not only depends on the cpp files. It may create an incomplete pax.so file

Update vectoried scan interface

c333632

Can't call pax to make dataset in vectorization, so we work around by passing context. Co-Author: Dongxiao Song songdongxiao@hashdata.cn

PAX: Remove internal partition support

62c57ca

Internal partition of PAX is no longer used, so we remove this feature from PAX. After this commit, the reloption 'partition_by' and 'partition_ranges' are also removed.

Fix the failure case in icw test.

5e7e34c

1. Due to the change of kernel behavior, there are many changes in the execution plan generated by orca. 2. Disable optimizer_trace_fallback to avoid orca fallback generating unstable output.

Add Apache license to pax files

f4fd7ea

This commit adds Apache license 2.0 to all source files and header files in PAX.

PAX: fix failture case in isolation2 testcase

5033b5e

1. the toast table of the auxiliary table should also be in the pg_aux_ext namespace 2. use GetCatalogSnapshot() as a snapshot when querying auxiliary tables

PAX: Add doc/* and update README

4ff00a7

A doc directory has been added to the PAX project, which will contain documentation for the modules in PAX. - Introduce - Project description - Meta data - Storage format - Toast - Clustering - Filter

PAX: Adapt to 32-bit relnode

9d5c9aa

After CBDB revert the '64 bit relfilenode', PAX still need adapt the change.

PAX: Remove vectorization related tests

e0613c1

The extension vectorization is not open source yet, and the open source version of PAX has removed vectorization-related test cases.

PAX: update submodule

d8f8146

Previously PAX used the internal gitlab repository as a submodule. Now it has switched to using the github repository.

PAX: Change the default catalog to auxiliary table and enable CI

56c57e8

Manifest metadata implementation in PAX is still an experimental feature. Current commit changes the default catalog implementation to the auxiliary table implementation. Also, PAX tests are added to cloudberry's CI

PAX: fix icw_test in github CI

247e475

Removed some useless file and change fix the pax icw_test in github CI Also changed the `Cloudberry Database` to `Apache Cloudberry`

jiaqizho force-pushed the pax-split-380-commit-4 branch from 6313096 to 247e475 Compare April 14, 2025 15:26

jiaqizho merged commit 26b9cda into apache:main Apr 14, 2025
26 checks passed

reshke mentioned this pull request Apr 15, 2025

Show pax relation relfilenode in pg_waldump #1048

Merged

12 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: introduce a high-performance hybrid row-columnar storage engine (4/4)#1044

Feature: introduce a high-performance hybrid row-columnar storage engine (4/4)#1044
jiaqizho merged 86 commits intoapache:mainfrom
jiaqizho:pax-split-380-commit-4

jiaqizho commented Apr 10, 2025

Uh oh!

reshke commented Apr 11, 2025 •

edited

Loading

Uh oh!

jiaqizho commented Apr 14, 2025

Uh oh!

reshke commented Apr 14, 2025

Uh oh!

jiaqizho commented Apr 14, 2025

Uh oh!

reshke commented Apr 14, 2025

Uh oh!

reshke commented Apr 14, 2025

Uh oh!

jiaqizho commented Apr 14, 2025

Uh oh!

Uh oh!

jiaqizho commented Apr 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

Conversation

jiaqizho commented Apr 10, 2025

What does this PR do?

Type of Change

Breaking Changes

Test Plan

Impact

Checklist

Additional Context

CI Skip Instructions

Uh oh!

reshke commented Apr 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jiaqizho commented Apr 14, 2025

Uh oh!

reshke commented Apr 14, 2025

Uh oh!

jiaqizho commented Apr 14, 2025

Uh oh!

reshke commented Apr 14, 2025

Uh oh!

reshke commented Apr 14, 2025

Uh oh!

jiaqizho commented Apr 14, 2025

Uh oh!

Uh oh!

jiaqizho commented Apr 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

reshke commented Apr 11, 2025 •

edited

Loading