
Feature: introduce a high-performance hybrid row-columnar storage engine (1/4)#1041

Merged
jiaqizho merged 99 commits into apache:main from jiaqizho:pax-split-380-commit-1
Apr 14, 2025

Conversation

@jiaqizho
Contributor

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


@jiaqizho
Contributor Author

1/4 part of #1002.

Merging a pull request using the "Rebase and merge" option is limited to 100 commits. PR 1002 contains 380 commits.

@jiaqizho jiaqizho changed the title [skip ci]Feature: introduce a high-performance hybrid row-columnar storage engine (1/4) Feature: introduce a high-performance hybrid row-columnar storage engine (1/4) Apr 10, 2025
@edespino edespino self-requested a review April 10, 2025 13:20
Contributor

@edespino edespino left a comment


As situations may arise where we encounter PRs with large numbers of commits, we should first discuss the preferred approach on the dev list. I am marking this with "Request changes" to allow us to consider an alternate approach on the dev list.

@reshke
Contributor

reshke commented Apr 11, 2025

As situations may arise where we encounter PRs with large numbers of commits, we should first discuss the preferred approach on the dev list. I am marking this with "Request changes" to allow us to consider an alternate approach on the dev list.

Yes, but this is a contrib-only change; there are no in-core changes here. So maybe we should merge this as-is?

jiaqizho and others added 21 commits April 14, 2025 19:56
Core code structure:
  - Add extension-related code in src/data/
  - Implement C++ core modules in src/cpp/

Development environment:
  - Integrate Cpplint for code style validation
  - Set up pre-commit Git hooks

Common components:
  - Add shared utilities in src/cpp/comm/
  - This contains the glue layer from PAX to CBDB

Build system:
  - Add CMakeLists.txt for project configuration
  - Set up basic build configurations

Others:
  - Added .gitignore

Co-Author: Hao Wu <gfphoenix78@gmail.com>
Co-Author: Gong Xun <gongxun@hashdata.cn>
…faces

In the current commit, readers and writers for relations and files are implemented.

Micro-partition level (file operations):
  - Added file writer/reader interfaces
  - Handle file metadata and checksum validation

Table/relation level (relation operations):
  - Holds the micro-partition writer/reader
  - Used to implement the access method interface

Co-Author: Hao Wu <gfphoenix78@gmail.com>
Co-Author: Gong Xun <gongxun@hashdata.cn>
Added a git submodule for google/googletest, which includes both Google Test (gtest) and Google Mock (gmock).

  - Google Test (gtest) is a framework for writing C++ test programs.
    It provides various features to simplify the process of writing and
    organizing tests.

  - Google Mock (gmock) extends gtest by allowing developers to mock C++
    objects and functions, facilitating more comprehensive testing scenarios.

Updated CMakeLists.txt to include gtest and gmock.
Refactor and extend the `file_system.h` interface to support local file operations.

  - Implement the local file operations in `local_file_system.cc`
  - Ensured error handling and resource management are robust and consistent.
  - Included unit tests to validate the correctness of the new implementations.
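The refactored interface described above can be sketched roughly as follows. This is a minimal illustration, not the actual `file_system.h` API; all class and method names here are assumptions:

```cpp
#include <cassert>
#include <cstdio>
#include <memory>
#include <stdexcept>
#include <string>

// Minimal abstract file interface; names are illustrative, not the real PAX API.
class File {
 public:
  virtual ~File() = default;
  virtual size_t Write(const void *buf, size_t n) = 0;
  virtual size_t Read(void *buf, size_t n) = 0;
  virtual void Close() = 0;
};

class FileSystem {
 public:
  virtual ~FileSystem() = default;
  virtual std::unique_ptr<File> Open(const std::string &path, const char *mode) = 0;
};

// Local implementation backed by stdio, with error handling on open and
// RAII cleanup so the handle is always released.
class LocalFile : public File {
 public:
  explicit LocalFile(std::FILE *f) : f_(f) {}
  ~LocalFile() override { Close(); }
  size_t Write(const void *buf, size_t n) override { return std::fwrite(buf, 1, n, f_); }
  size_t Read(void *buf, size_t n) override { return std::fread(buf, 1, n, f_); }
  void Close() override {
    if (f_) { std::fclose(f_); f_ = nullptr; }
  }
 private:
  std::FILE *f_;
};

class LocalFileSystem : public FileSystem {
 public:
  std::unique_ptr<File> Open(const std::string &path, const char *mode) override {
    std::FILE *f = std::fopen(path.c_str(), mode);
    if (!f) throw std::runtime_error("cannot open " + path);
    return std::make_unique<LocalFile>(f);
  }
};
```

The virtual base keeps callers format-agnostic, so a remote or in-memory file system could be substituted later without touching reader/writer code.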
With this PR, we can now insert tuples into PAX storage. The insert path is
"CPaxAccess -> CPaxInsert -> TableWriter -> MicroPartitionWriter (an implementation for the original ORC file) -> OrcFileWriter".

Changes:
- Add access method handle layer, only support tuple_insert
- Add CPaxInserter implementation for tuple_insert
- Add orc_native_micro_partition implementation, only support orc_writer

Testing:
- unit tests TBD
- manually tested the feature in my test environment
In the previous commit, we implemented the interfaces of micro-partition and table.

In the current commit, connect the access method to the table interface.

   - add table meta and micro partition meta
   - add orc reader for table scan
CBDB can now successfully build libpostgres.so.

The PAX extension can link against libpostgres.so to use some kernel functions.
The current commit changes the PAX build script to link libpostgres.so.

In addition, the macro PAX_INDEPENDENT_MODE has been removed; this macro is
no longer used to decide whether PAX relies on CBDB to compile.
In commit "Feature: support table scan in pax storage", we did
not pass the block_id into `WriterOptions`, which caused the
`block_id` to be invalid.
Truncate table support for the PAX AM handler
1. relation_set_new_filenode = pax_relation_set_new_filenode, // p0 truncate-related delete
2. relation_nontransactional_truncate = pax_relation_nontransactional_truncate, // p0 truncate-related delete

Issue:
TODO1: cleanup related micro-partition data with pending-delete.
TODO2: Truncate table with index case
When CBDB creates a PAX relation, the data size of the
relation needs to be known. The current commit adds a
column to the catalog to count the data size of the
current relation.

  - table `pg_pax_blocks` add the ptblocksize attribute to record the size of micro partition block
  - callback summary will record the block size when micro partition writer close
  - adjust clang-format style, make the code more readable
Since the PAX extension can be built separately from CBDB, we should get
the include path independently instead of hard-coding the file path.

Use the pg_config option to provide the include path. Since pg_config can
provide other information, like --libdir, --sharedir, etc., we
provide a function GET_PG_CONFIG here for future convenience.

Approved-by: wuhao <wuhao@hashdata.cn>

Max Yang <yangyu@hashdata.cn>
In PAX, dml_init/dml_fini hooks are used to initialize the
context required for Insert.

But in the previous commits, `InitDmlState` and `FinishDmlState`
were called regardless of whether the relation's AM is PAX.
`RelationIsPax` has been added in the current changes to ensure
that `InitDmlState` and `FinishDmlState` are only called
when the AM is PAX.
Access method functions in TableAmRoutine may be either C functions
or C++ functions. A C++ function must take care of C++ exceptions
and convert them to standard PG ERRORs, so the caller functions in
PG know to raise an error that is caught by the upper handler.

As a simple rule, all static functions in CCPaxAccessMethod are the
outermost C++ functions; they must convert C++ exceptions to PG ERRORs.
Some access methods are better written as C functions that manipulate
catalogs and avoid involving C++ exceptions; these are grouped in class
PaxAccessMethod.
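The boundary rule above can be sketched like this. It is a simplified model: `elog_error` stands in for PG's `ereport(ERROR, ...)`, and the function names are hypothetical, not the actual CCPaxAccessMethod entry points:

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for PG's error reporting; real code would call
// ereport(ERROR, ...) here instead of recording a string.
static std::string last_error;
static int elog_error(const std::string &msg) { last_error = msg; return -1; }

// Inner C++ logic that may throw.
static int InsertTupleImpl(int value) {
  if (value < 0) throw std::invalid_argument("negative tuple value");
  return value * 2;
}

// Outermost boundary: convert any C++ exception into a PG-style error,
// mirroring the rule that the outermost C++ functions must never let an
// exception escape into C callers.
extern "C" int PaxInsertTuple(int value) noexcept {
  try {
    return InsertTupleImpl(value);
  } catch (const std::exception &e) {
    return elog_error(e.what());
  } catch (...) {
    return elog_error("unknown C++ exception");
  }
}
```

The `noexcept` on the boundary function documents (and enforces) that nothing can propagate past it into C code.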
`RelationSize/EstimateRelSize` is the interface used to size the current relation.

Before this commit, we added `ptblocksize` to the auxiliary table,
which is used to count the size of the current PAX table. This commit implements
the `RelationSize` and `EstimateRelSize` interfaces.
1. Implemented pax table rescan interface pax::CCPaxAccessMethod::ScanRescan
2. Implemented table meta pax::TableMetadata::Iterator::Seek, which supports the micro-partition file seek functionality.
3. Implemented unit test seek_iterator in TableMetadataTest.

Approved-by: gongxun <gongxun@hashdata.cn>

Tony Ying <yinglinhu@hashdata.cn>
Format the header and cleanup the code, also removed unneeded wrapper functions.

The order of header introduction in PAX should be sorted alphabetically.
`ScanAnalyzeNextBlock/ScanAnalyzeNextTuple/ScanSampleNextBlock/ScanSampleNextTuple`
are AM interfaces used for analyze and sampling.

In PAX, the seek interface is implemented, including the following parts:
- file layer supports `seek` function
- micro-partition and table layer support `seektuple` function
Implemented reloptions in PAX. Note that some of the reloptions
in the current commit have not been implemented: currently PAX does
not support reloptions such as compress/storage_format. These
reloptions will be implemented in the future.
In PAX, two namespaces are defined:
- pax namespace: can be called directly in the C++ environment and may generate exceptions.
- paxc namespace: these methods may produce a long jump (from ereport(ERROR)), and the long jump must be converted into an exception before they can be called from C++.

The current commit fixes some incorrect namespace definitions and some issues with mixed use of pax/paxc methods.
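The long-jump-to-exception conversion can be modeled in miniature with setjmp/longjmp. This is a simplified sketch, not PAX code: `paxc_open_catalog` plays the role of a paxc function that escapes via `ereport(ERROR)`, and all names are illustrative:

```cpp
#include <cassert>
#include <csetjmp>
#include <stdexcept>
#include <string>

// Simplified model of the pax/paxc split. In real PG code, ereport(ERROR)
// performs the longjmp; here paxc_report-style behavior is simulated.
static std::jmp_buf pg_error_ctx;
static std::string pg_error_msg;

// "paxc" side: C-style code that escapes via longjmp, like ereport(ERROR).
static void paxc_open_catalog(bool fail) {
  if (fail) {
    pg_error_msg = "catalog open failed";
    std::longjmp(pg_error_ctx, 1);
  }
}

// "pax" side: wrap the longjmp-ing call and rethrow as a C++ exception, so
// C++ callers see normal exception semantics (and destructors run correctly
// above this frame).
static void pax_open_catalog(bool fail) {
  if (setjmp(pg_error_ctx) != 0)
    throw std::runtime_error(pg_error_msg);
  paxc_open_catalog(fail);
}
```

The key point is that a longjmp must never unwind through C++ frames with live destructors; the wrapper converts it to a `throw` at the earliest C++ frame.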
Implemented bulk insert in PAX. Bulk insert is currently only used in copy.

```
-- create pax table for testing
CREATE TEMP TABLE x (
        a serial,
        b int,
        c text not null default 'stuff',
        d text,
        e text
) using pax;

-- copy table from stdin

COPY x (a, b, c, d, e) from stdin;
10006        22        32        42        52
10007        23        33        43        53
10008        24        34        44        54
10009        25        35        45        55
10010        26        36        46        56
\.

select * from x;

create table x1 using pax as select * from x;

select * from x1;
   a   | b  | c  | d  | e
-------+----+----+----+----
 10014 | 25 | 35 | 45 | 55
 10013 | 24 | 34 | 44 | 54
 10011 | 22 | 32 | 42 | 52
 10012 | 23 | 33 | 43 | 53
 10015 | 26 | 36 | 46 | 56
(5 rows)

```
Why do we need to port the ORC format into PAX?

  - PAX will support different storage formats in the future.
  - PAX needs some common interfaces: memory processing, encoding/decoding, profiling, ...
  - Adapting liborc is not a good choice:

    * If `orc_file_stream.h` uses the file interface as the liborc output backend, it will face a huge amount of IOPS pressure.
    * If `orc_file_stream.h` uses memory as the liborc output backend, it will make multiple unnecessary memory copies.

In the current PR, some abstracted interfaces have been defined:

  - buffer(pax_buffer.h): Used to receive and process data in memory

    * Provide `a working pointer` which can help the caller handle memory better.
    * Support zero copy or create by itself.
    * Easier to manage the life cycle of large blocks of memory.

  - column(pax_column.h): abstraction of column memory slices for processing column data

    * Contains `fixed-length column`, `non-fixed-length column`, and `column sets`
      * `fixed-length column` and `non-fixed-length column` correspond to the memory-to-disk structure.
      * `column sets` is used to manage multiple columns under the same schema.
    * The common processing interface of PAX can be unified through the column structure: PAX can use the column structure as input to do encoding/decoding without considering the storage format.
    * Fewer memory copies; zero copy is supported for non-fixed-length types, and more zero-copy support will follow.

Also in the current PR, PAX chose ORC as the first supported storage format.

  - Full ORC format support
  - Full ORC write/read support
  - ORC may not be fully supported in the future (for better performance and architecture)
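The fixed-length vs. non-fixed-length column split can be sketched as below. This is an illustration of the idea only; the actual `pax_column.h` interfaces and names differ:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Illustrative base class; real PaxColumn has a much richer interface.
class PaxColumn {
 public:
  virtual ~PaxColumn() = default;
  virtual size_t Rows() const = 0;
};

// Fixed-length column: one contiguous buffer, datum offset = row * width.
class FixedColumn : public PaxColumn {
 public:
  explicit FixedColumn(size_t width) : width_(width) {}
  void Append(const void *datum) {
    const auto *p = static_cast<const uint8_t *>(datum);
    data_.insert(data_.end(), p, p + width_);
  }
  const void *Get(size_t row) const { return data_.data() + row * width_; }
  size_t Rows() const override { return data_.size() / width_; }
 private:
  size_t width_;
  std::vector<uint8_t> data_;
};

// Non-fixed-length column: a data buffer plus a separate length stream.
class NonFixedColumn : public PaxColumn {
 public:
  void Append(const std::string &datum) {
    data_.insert(data_.end(), datum.begin(), datum.end());
    lengths_.push_back(datum.size());
  }
  std::string Get(size_t row) const {
    size_t off = 0;
    for (size_t i = 0; i < row; ++i) off += lengths_[i];
    return std::string(data_.begin() + off, data_.begin() + off + lengths_[row]);
  }
  size_t Rows() const override { return lengths_.size(); }
 private:
  std::vector<char> data_;
  std::vector<size_t> lengths_;
};
```

Because both shapes share one base, encoding/decoding code can consume columns uniformly without knowing the storage format, which is the unification the commit message describes.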
jiaqizho and others added 27 commits April 14, 2025 19:56
Added new types of DataBuffer: UntreatedDataBuffer and TreatedDataBuffer.
  - UntreatedDataBuffer acts like a sliding window used to consume a batch of the buffer.
  - TreatedDataBuffer cleanly distinguishes the consumed area of the buffer from the unconsumed area.
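The sliding-window idea can be sketched as follows; this is a toy model, not the real DataBuffer API, and all names are assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of a sliding-window buffer that separates consumed ("treated")
// bytes from unconsumed ("untreated") ones.
class SlidingBuffer {
 public:
  void Push(const std::vector<char> &chunk) {
    data_.insert(data_.end(), chunk.begin(), chunk.end());
  }
  // Bytes not yet consumed.
  size_t Untreated() const { return data_.size() - consumed_; }
  // Consume a batch, advancing the window.
  std::vector<char> Consume(size_t n) {
    if (n > Untreated()) n = Untreated();
    std::vector<char> out(data_.begin() + consumed_, data_.begin() + consumed_ + n);
    consumed_ += n;
    return out;
  }
  // Drop the treated region to reclaim memory.
  void Compact() {
    data_.erase(data_.begin(), data_.begin() + consumed_);
    consumed_ = 0;
  }
 private:
  std::vector<char> data_;
  size_t consumed_ = 0;
};
```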
- Introduce encoding and compression interfaces
  - Defined the encoding and compression interfaces in pax column
  - Support ORC RLEv2, a streaming encoding with 4 encoding types
  - Support ZSTD and zlib
- pax column changes
  - Split pax column out of storage/
  - Split PaxColumn and PaxColumns

The implementation of RLEv2 is quite different from ORC's, mainly in the following aspects:

- Uses a state machine to track the current encoding stream state.
- Less memory usage and fewer memory copies during encoding.
- During decoding, when a null bitmap is passed, memory copying is reduced.
kExTypeFileOperationError was defined without an error message in exception_names.
This caused out-of-bounds memory access when kExTypeFileOperationError was caught by CBDB_CATCH_COMM().
PAX RLE decoding now supports templates:
- int1/int2/int4 columns can keep a DataBuffer<T> as the data part after RLE decoding.
- Reduces some memory usage/copying during decoding.
Feature: add PAX projection filter functionality.

  * Filter read data by filtering with column projection info passed in PG kernel.
  * Support sequential data read optimization in case sequential column found in projection.

For a PAX table of 700M written into a single pax file:

  * For the single-column projection case 1/8 (e.g. select a from table), the time spent in seqscan is about 30% of that without PAX column filtering; efficiency is improved by about 70% for this case.
  * For the sequential-column projection case 4/8 (select a,b,c,d from table), the time spent in seqscan is about 50% of that without column projection filtering; efficiency is improved by about 50% for this case.
build pax release version with `cmake -DENBALE_DEBUG=off ..`

also won't build `gtest target` in release build
…f cbdb for testing

The pax extension uses the 1X_STABLE_CP_FEATURE_PAX branch of cbdb for testing,
so that pax can provide paxformat.so for storage_am testing, or it can be
tested with an independent extension.
Problem: The projection_info is not always available in the Filter class; for example,
scan analyze will not call beginScan to init the column information.
The current column projection design is very problematic:
  - Splits out a large number of public functions with unclear logic
  - Correctness problems
  - Destroys the original easy-to-maintain interface

There are too many problems in the projection implementation,
so I reverted orc.cc and orc.h to the latest version
and reimplemented a new one.

About the new column projection implementation:
  - Uses a bool * to filter out columns for non-essential reads
  - Format independent, non-invasive to the MicroPartition interface
  - Also changed the gtest in orc_test.cc (too many problems in the last projection PR)
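The bool-array approach can be sketched in a few lines. This is an illustration under assumed names (the real reader works on column streams, not in-memory vectors):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sketch of bool-array projection: the reader only materializes columns
// whose slot in the mask is true. Column storage is simplified to strings.
std::vector<std::string> ReadProjected(
    const std::vector<std::vector<std::string>> &columns,
    const bool *projection, size_t row) {
  std::vector<std::string> out;
  for (size_t col = 0; col < columns.size(); ++col) {
    if (projection[col])                 // skip non-essential columns entirely
      out.push_back(columns[col][row]);
  }
  return out;
}
```

A plain `bool *` keeps the mechanism format-independent: any micro-partition format can consult the same mask without changing the MicroPartition interface.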
In CBDB, there is already a corresponding function implementation.

Removed the implementation of BuildPaxDirectoryPath and BuildPaxFilePath from the filesystem interface.
Reconstruct the directory structure of the pax extension, abstracting it into an access method layer, a table format layer, and a storage format layer.
A lower layer does not have any dependency on an upper layer.
In this way, the storage format layer can be compiled independently and provided to other extensions.
Introduce the new column interfaces PaxEncodingColumn/PaxNonFixedEncodingColumn, which create encoding/compression columns.

  - PaxEncodingColumn adds encoding/compression support for the fixed-length pax column.
  - PaxNonFixedEncodingColumn adds compression support for the non-fixed-length pax column.
    - For a non-fixed column, there is no encoding support
  - Added a subclass named PaxIntColumn
    - It's an int* encoding column.
    - The default encoding method is `kTypeRLEV2`
Remove `pax.cc` from the compilation of `libpaxformat.so`.

`pax.cc` belongs to the table layer; in paxformat, only the API of the micro-partition layer is provided.

Also add the orc proto src files when building libpaxformat.so.
`*EncodingColumn*`, as inherited classes of `*Column*`, will encode and compress the current column before serialization.

`PaxEncodingColumn/PaxNonFixedEncodingColumn` replaced `PaxCommColumn/PaxNonFixedColumn` as the default columns.
The analyze callback implementation uses SeekTuple to skip tuples.
It has three drawbacks:
  1. The low-level storage reader needs to implement the Seek feature, which complicates
     the low-level code.
  2. The SeekTuple function is useless in normal queries.
  3. The implementation is poor in that it doesn't use the buffer well.

To skip the tuples, we now simply ignore the tuple that we've read
until the tuple matches the count.

Besides the above rewrite, the commit also removes some useless functions.
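The skip-by-reading replacement can be sketched as a simple loop over a reader; the reader type and names here are toy stand-ins for the micro-partition reader:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the skip-by-reading approach that replaced SeekTuple: read and
// discard tuples until the target index is reached, then return that tuple.
template <typename Reader>
bool SkipToTuple(Reader &reader, size_t target, int *out) {
  int tuple;
  size_t count = 0;
  while (reader.ReadTuple(&tuple)) {
    if (count++ == target) { *out = tuple; return true; }
  }
  return false;  // stream had fewer than target + 1 tuples
}

// Toy reader over a vector, standing in for the micro-partition reader.
struct VecReader {
  std::vector<int> data;
  size_t pos = 0;
  bool ReadTuple(int *out) {
    if (pos >= data.size()) return false;
    *out = data[pos++];
    return true;
  }
};
```

The trade-off: sampling does a little extra reading, but the storage reader no longer needs a Seek feature at all.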
Removed the `seek*` interface in PAX.

PAX no longer requires seek methods for analyze/sampling, but use `ReadTuple`.
This commit adds statistics info for micro partitions in the auxiliary table.
The stats currently contain {(allnull, hasnull), [minimum, maximum]}.
The stats info will be used like a brin index to skip scanning a whole micro
partition file if possible. More stats info may be added later to make
filtering more efficient.
```
postgres=# vacuum FULL vacumm_test;
VACUUM


(gdb) p src_path
$3 = "base/13261/122884_pax/7f76cdaa-07d0-49f1-9e1b-ce1ccfb78712"
(gdb) p dst_path
$4 = "base/13261/122886_pax/7f76cdaa-07d0-49f1-9e1b-ce1ccfb78712"
(gdb) n
```
The filter in PAX can do the sparse filtering by min/max statistics.

1. Support micro-partition-level filter
2. Cleanup the code for iterator and pax_aux_table
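The min/max sparse-filter check can be sketched as below; the struct layout and names are illustrative, not the actual PAX stats format:

```cpp
#include <cassert>

// Sketch of per-micro-partition stats: {(allnull, hasnull), [min, max]}.
struct BlockStats {
  bool all_null;
  bool has_null;
  int min;
  int max;
};

// Returns true if a scan for `value == x` can skip this block entirely:
// either every value is null, or x lies outside the stored [min, max] range.
bool CanSkipEquals(const BlockStats &s, int x) {
  if (s.all_null) return true;     // no non-null value can match
  return x < s.min || x > s.max;   // predicate range cannot overlap stats
}
```

Skipping is decided per micro-partition file, so one cheap comparison against the auxiliary table can avoid reading an entire file.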
Some PAX gtest cases use a fixed path under /tmp, which may cause
file permission issues.

The current change switches it to a relative path (relative to the execution path).

In addition, some test cases did not delete their test files during the teardown stage,
which caused problems when the tests were executed repeatedly.
- Add GetRangeBuffer/GetRangeNonNullRows in pax column.
  - Also added tests for these interfaces
- Changed SplitTupleNumbers to split tuple numbers using 16384 * 10
- Changed some interfaces to support vec
In the vectorized executor, we need to return a batch of rows (for a single column).

In PAX, data needs to be transformed into a record batch, because the organization of
column data is different from a record batch:

- Fixed-length columns: PAX does not pad null datums to full length, but a record batch
  requires null padding by length.
- Non-fixed-length columns: PAX stores the datum header and uses a length stream;
  a record batch requires removing the header and using an offset array.
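The two transforms can be sketched as below. Names are illustrative and the datum type is simplified to `int`; the real code works on raw column buffers:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-length columns: PAX stores only non-null datums plus a null bitmap;
// a record batch needs every slot padded out to full width.
std::vector<int> PadNulls(const std::vector<int> &packed,
                          const std::vector<bool> &is_null, int pad = 0) {
  std::vector<int> out;
  size_t src = 0;
  for (bool null : is_null)
    out.push_back(null ? pad : packed[src++]);
  return out;
}

// Non-fixed-length columns: PAX keeps a length stream; a record batch wants
// a prefix-sum offset array with n + 1 entries.
std::vector<size_t> LengthsToOffsets(const std::vector<size_t> &lengths) {
  std::vector<size_t> offsets{0};
  for (size_t len : lengths)
    offsets.push_back(offsets.back() + len);
  return offsets;
}
```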
Uncaught C++ exceptions must not propagate from PAX into CBDB.

- All AM functions should perform a common catch, or the stack info will be lost
- The ereport inside CBDB_CATCH_COMM made CBDB_FINALLY not work
- Moved the CBDB_CATCH_COMM logic into CBDB_CATCH_DEFAULT
Unlike `cbdb::pfree`, the `delete` operator checks whether the pointer is null.

So we don't need to check whether the current pointer is null when calling `delete`.
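This relies on a standard C++ guarantee: `delete` on a null pointer is a no-op. A tiny demonstration with an instrumented destructor (names hypothetical):

```cpp
#include <cassert>

// Counts destructor calls so the no-op behavior is observable.
static int destroyed = 0;
struct Tracked {
  ~Tracked() { ++destroyed; }
};

void FreeTracked(Tracked *p) {
  delete p;  // safe even when p == nullptr; no explicit null check required
}
```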
The target stats_generate_protobuf was guarded by if (BUILD_GTEST AND NOT BUILD_PAX_FORMAT).

If pax is built without GTEST, the target stats_generate_protobuf will be missing.
OrcIteratorReader is no longer applicable after multiple version iterations.

Iterators exist in the table layer, so there is no need to use iterators in the micro-partition layer.
The vectorized condition in `MicroPartitionReader` complicates the logic.

- Removed the function `ReadVecTuple`, which was specialized logic
- Added a new MicroPartitionReader named `PaxVecReader` to adapt reads for the vectorized version
@jiaqizho jiaqizho force-pushed the pax-split-380-commit-1 branch from 46d1fa9 to bba60fd Compare April 14, 2025 11:56
@jiaqizho
Contributor Author

As situations may arise where we encounter PRs with large numbers of commits, we should first discuss the preferred approach on the dev list. I am marking this with "Request changes" to allow us to consider an alternate approach on the dev list.

Per the discussion on the dev list, I haven't seen any response so far, so we decided to merge PAX via the split PRs.

@jiaqizho jiaqizho merged commit 22aeaed into apache:main Apr 14, 2025
22 checks passed

7 participants