
Feature: introduce a high-performance hybrid row-columnar storage engine (1/4)#1041

Merged
jiaqizho merged 99 commits into apache:main from jiaqizho:pax-split-380-commit-1
Apr 14, 2025

Conversation

@jiaqizho
Contributor

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


@jiaqizho
Contributor Author

1/4 part of #1002.

Merging a pull request using the "Rebase and merge" option is limited to 100 commits. PR 1002 contains 380 commits.

@jiaqizho jiaqizho changed the title [skip ci]Feature: introduce a high-performance hybrid row-columnar storage engine (1/4) Feature: introduce a high-performance hybrid row-columnar storage engine (1/4) Apr 10, 2025
@edespino edespino self-requested a review April 10, 2025 13:20
Contributor

@edespino edespino left a comment


As situations may arise where we encounter PRs with large numbers of commits, we should first discuss the preferred approach on the dev list. I am marking this with "Request changes" to allow us to consider an alternate approach on the dev list.

@reshke
Contributor

reshke commented Apr 11, 2025

As situations may arise where we encounter PRs with large numbers of commits, we should first discuss the preferred approach on the dev list. I am marking this with "Request changes" to allow us to consider an alternate approach on the dev list.

Yes, but this is a contrib-only change; there are no in-core changes here. So maybe we should merge this as-is?

jiaqizho and others added 21 commits April 14, 2025 19:56
Core code structure:
  - Add extension-related code in src/data/
  - Implement C++ core modules in src/cpp/

Development environment:
  - Integrate Cpplint for code style validation
  - Set up pre-commit Git hooks

Common components:
  - Add shared utilities in src/cpp/comm/
  - This contains the glue layer from PAX to CBDB

Build system:
  - Add CMakeLists.txt for project configuration
  - Set up basic build configurations

Others:
  - Added .gitignore

Co-Author: Hao Wu <gfphoenix78@gmail.com>
Co-Author: Gong Xun <gongxun@hashdata.cn>
…faces

In the current commit, readers and writers for relations and files are implemented.

Micro-partition level (file operations):
  - Added file writer/reader interfaces
  - Handle file metadata and checksum validation

Table/relation level (relation operations):
  - Holds the micro-partition writer/reader
  - Used to implement the access method interface

Co-Author: Hao Wu <gfphoenix78@gmail.com>
Co-Author: Gong Xun <gongxun@hashdata.cn>
Added a git submodule for google/googletest, which includes both Google Test (gtest) and Google Mock (gmock).

  - Google Test (gtest) is a framework for writing C++ test programs.
    It provides various features to simplify the process of writing and
    organizing tests.

  - Google Mock (gmock) extends gtest by allowing developers to mock C++
    objects and functions, facilitating more comprehensive testing scenarios.

Updated CMakeLists.txt to include gtest and gmock.
Refactor and extend the `file_system.h` interface to support local file operations.

  - Implement the local file operations in `local_file_system.cc`
  - Ensured error handling and resource management are robust and consistent.
  - Included unit tests to validate the correctness of the new implementations.
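The refactored interface described above can be sketched roughly as follows. This is a minimal illustration, not the actual `file_system.h` API; all class and method names here are assumptions:

```cpp
#include <cassert>
#include <cstdio>
#include <memory>
#include <stdexcept>
#include <string>

// Minimal abstract file interface; names are illustrative, not the real PAX API.
class File {
 public:
  virtual ~File() = default;
  virtual size_t Write(const void *buf, size_t n) = 0;
  virtual size_t Read(void *buf, size_t n) = 0;
  virtual void Close() = 0;
};

class FileSystem {
 public:
  virtual ~FileSystem() = default;
  virtual std::unique_ptr<File> Open(const std::string &path, const char *mode) = 0;
};

// Local implementation backed by stdio, with error handling on open and
// RAII cleanup so the handle is always released.
class LocalFile : public File {
 public:
  explicit LocalFile(std::FILE *f) : f_(f) {}
  ~LocalFile() override { Close(); }
  size_t Write(const void *buf, size_t n) override { return std::fwrite(buf, 1, n, f_); }
  size_t Read(void *buf, size_t n) override { return std::fread(buf, 1, n, f_); }
  void Close() override {
    if (f_) { std::fclose(f_); f_ = nullptr; }
  }
 private:
  std::FILE *f_;
};

class LocalFileSystem : public FileSystem {
 public:
  std::unique_ptr<File> Open(const std::string &path, const char *mode) override {
    std::FILE *f = std::fopen(path.c_str(), mode);
    if (!f) throw std::runtime_error("cannot open " + path);
    return std::make_unique<LocalFile>(f);
  }
};
```

The virtual base keeps callers format-agnostic, so a remote or in-memory file system could be substituted later without touching reader/writer code.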
With this PR, we can now insert tuples into PAX storage. The insert path is
"CPaxAccess -> CPaxInsert -> TableWriter -> MicroPartitionWriter (an implementation for the original ORC file) -> OrcFileWriter".

Changes:
- Add access method handle layer, only support tuple_insert
- Add CPaxInserter implementation for tuple_insert
- Add orc_native_micro_partition implementation, only support orc_writer

Testing:
- unit tests TBD
- manually tested the feature in my test environment
In the previous commit, we implemented the interfaces of micro-partition and table.

In the current commit, connect the access method to the table interface.

   - add table meta and micro partition meta
   - add orc reader for table scan
CBDB can now successfully build libpostgres.so.

The PAX extension can link against libpostgres.so to use some kernel functions.
The current commit changes the PAX build script to link libpostgres.so.

In addition, the macro PAX_INDEPENDENT_MODE has been removed; this macro is
no longer used to decide whether PAX relies on CBDB to compile.
In commit "Feature: support table scan in pax storage", we did
not pass the block_id into `WriterOptions`, which caused the
`block_id` to be invalid.
Truncate table support for the PAX AM handler
1. relation_set_new_filenode = pax_relation_set_new_filenode, // p0 truncate-related delete
2. relation_nontransactional_truncate = pax_relation_nontransactional_truncate, // p0 truncate-related delete

Issue:
TODO1: cleanup related micro-partition data with pending-delete.
TODO2: Truncate table with index case
When CBDB creates a PAX relation, the data size of the
relation needs to be known. The current commit adds a
column to the catalog to count the data size of the
current relation.

  - table `pg_pax_blocks` add the ptblocksize attribute to record the size of micro partition block
  - callback summary will record the block size when micro partition writer close
  - adjust clang-format style, make the code more readable
Since the PAX extension can be built separately from CBDB, we should get
the include path independently instead of hard-coding the file path.

Use the pg_config option to provide the include path. Since pg_config can
provide other information, like --libdir, --sharedir, etc., we
provide a function GET_PG_CONFIG here for future convenience.

Approved-by: wuhao <wuhao@hashdata.cn>

Max Yang <yangyu@hashdata.cn>
In PAX, dml_init/dml_fini hooks are used to initialize the
context required for Insert.

But in the previous commits, `InitDmlState` and `FinishDmlState`
were called regardless of whether the relation's AM is PAX.
`RelationIsPax` has been added in the current changes to ensure
that `InitDmlState` and `FinishDmlState` are only called
when the AM is PAX.
Access method functions in TableAmRoutine may be either C functions
or C++ functions. A C++ function must take care of C++ exceptions
and convert them to standard PG ERRORs, so the caller functions in
PG know to raise an error that is caught by the upper handler.

As a simple rule, all static functions in CCPaxAccessMethod are the
outermost C++ functions; they must convert C++ exceptions to PG ERRORs.
Some access methods are better written as C functions that manipulate
catalogs and avoid involving C++ exceptions; these are grouped in class
PaxAccessMethod.
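The boundary rule above can be sketched like this. It is a simplified model: `elog_error` stands in for PG's `ereport(ERROR, ...)`, and the function names are hypothetical, not the actual CCPaxAccessMethod entry points:

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for PG's error reporting; real code would call
// ereport(ERROR, ...) here instead of recording a string.
static std::string last_error;
static int elog_error(const std::string &msg) { last_error = msg; return -1; }

// Inner C++ logic that may throw.
static int InsertTupleImpl(int value) {
  if (value < 0) throw std::invalid_argument("negative tuple value");
  return value * 2;
}

// Outermost boundary: convert any C++ exception into a PG-style error,
// mirroring the rule that the outermost C++ functions must never let an
// exception escape into C callers.
extern "C" int PaxInsertTuple(int value) noexcept {
  try {
    return InsertTupleImpl(value);
  } catch (const std::exception &e) {
    return elog_error(e.what());
  } catch (...) {
    return elog_error("unknown C++ exception");
  }
}
```

The `noexcept` on the boundary function documents (and enforces) that nothing can propagate past it into C code.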
`RelationSize/EstimateRelSize` is the interface used to size the current relation.

Before this commit, we added `ptblocksize` to the auxiliary table,
which is used to count the size of the current PAX table. This commit implements
the `RelationSize` and `EstimateRelSize` interfaces.
1. Implemented pax table rescan interface pax::CCPaxAccessMethod::ScanRescan
2. Implemented table meta pax::TableMetadata::Iterator::Seek, which supports the micro-partition file seek functionality.
3. Implemented unit test seek_iterator in TableMetadataTest.

Approved-by: gongxun <gongxun@hashdata.cn>

Tony Ying <yinglinhu@hashdata.cn>
Format the header and cleanup the code, also removed unneeded wrapper functions.

The order of header introduction in PAX should be sorted alphabetically.
`ScanAnalyzeNextBlock/ScanAnalyzeNextTuple/ScanSampleNextBlock/ScanSampleNextTuple`
are AM interfaces used for analyze and sampling.

In PAX, the seek interface is implemented, including the following parts:
- file layer supports `seek` function
- micro-partition and table layer support `seektuple` function
Implemented reloptions in PAX. Note that some of the reloptions
in the current commit have not been implemented: currently PAX does
not support reloptions such as compress/storage_format. These
reloptions will be implemented in the future.
In PAX, two namespaces are defined:
- pax namespace: can be called directly in the C++ environment and may generate exceptions.
- paxc namespace: these methods may produce a long jump (from ereport(ERROR)), and the long jump must be converted into an exception before they can be called from C++.

The current commit fixes some incorrect namespace definitions and some issues with mixed use of pax/paxc methods.
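The long-jump-to-exception conversion can be modeled in miniature with setjmp/longjmp. This is a simplified sketch, not PAX code: `paxc_open_catalog` plays the role of a paxc function that escapes via `ereport(ERROR)`, and all names are illustrative:

```cpp
#include <cassert>
#include <csetjmp>
#include <stdexcept>
#include <string>

// Simplified model of the pax/paxc split. In real PG code, ereport(ERROR)
// performs the longjmp; here paxc_report-style behavior is simulated.
static std::jmp_buf pg_error_ctx;
static std::string pg_error_msg;

// "paxc" side: C-style code that escapes via longjmp, like ereport(ERROR).
static void paxc_open_catalog(bool fail) {
  if (fail) {
    pg_error_msg = "catalog open failed";
    std::longjmp(pg_error_ctx, 1);
  }
}

// "pax" side: wrap the longjmp-ing call and rethrow as a C++ exception, so
// C++ callers see normal exception semantics (and destructors run correctly
// above this frame).
static void pax_open_catalog(bool fail) {
  if (setjmp(pg_error_ctx) != 0)
    throw std::runtime_error(pg_error_msg);
  paxc_open_catalog(fail);
}
```

The key point is that a longjmp must never unwind through C++ frames with live destructors; the wrapper converts it to a `throw` at the earliest C++ frame.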
Implemented bulk insert in PAX. Bulk insert is currently only used in copy.

```
-- create pax table for testing
CREATE TEMP TABLE x (
        a serial,
        b int,
        c text not null default 'stuff',
        d text,
        e text
) using pax;

-- copy table from stdin

COPY x (a, b, c, d, e) from stdin;
10006        22        32        42        52
10007        23        33        43        53
10008        24        34        44        54
10009        25        35        45        55
10010        26        36        46        56
\.

select * from x;

create table x1 using pax as select * from x;

select * from x1;
   a   | b  | c  | d  | e
-------+----+----+----+----
 10014 | 25 | 35 | 45 | 55
 10013 | 24 | 34 | 44 | 54
 10011 | 22 | 32 | 42 | 52
 10012 | 23 | 33 | 43 | 53
 10015 | 26 | 36 | 46 | 56
(5 rows)

```
Why do we need to port the ORC format into PAX?

  - PAX will support different storage formats in the future.
  - PAX needs some common interfaces: memory processing, encoding/decoding, profiling, ...
  - Adapting liborc is not a good choice:

    * If `orc_file_stream.h` uses the file interface as the liborc output backend, it will face a huge amount of IOPS pressure.
    * If `orc_file_stream.h` uses memory as the liborc output backend, it will make multiple unnecessary memory copies.

In the current PR, some abstracted interfaces have been defined:

  - buffer(pax_buffer.h): Used to receive and process data in memory

    * Provide `a working pointer` which can help the caller handle memory better.
    * Support zero copy or create by itself.
    * Easier to manage the life cycle of large blocks of memory.

  - column(pax_column.h): abstraction of column memory slices for processing column data

    * Contains `fixed-length column`, `non-fixed-length column`, and `column sets`
      * `fixed-length column` and `non-fixed-length column` correspond to the memory-to-disk structure.
      * `column sets` is used to manage multiple columns under the same schema.
    * The common processing interface of PAX can be unified through the column structure: PAX can use the column structure as input to do encoding/decoding without considering the storage format.
    * Fewer memory copies; zero copy is supported for non-fixed-length types, and more zero-copy support will follow.

Also in the current PR, PAX chose ORC as the first supported storage format.

  - Full ORC format support
  - Full ORC write/read support
  - ORC may not be fully supported in the future (for better performance and architecture)
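The fixed-length vs. non-fixed-length column split can be sketched as below. This is an illustration of the idea only; the actual `pax_column.h` interfaces and names differ:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Illustrative base class; real PaxColumn has a much richer interface.
class PaxColumn {
 public:
  virtual ~PaxColumn() = default;
  virtual size_t Rows() const = 0;
};

// Fixed-length column: one contiguous buffer, datum offset = row * width.
class FixedColumn : public PaxColumn {
 public:
  explicit FixedColumn(size_t width) : width_(width) {}
  void Append(const void *datum) {
    const auto *p = static_cast<const uint8_t *>(datum);
    data_.insert(data_.end(), p, p + width_);
  }
  const void *Get(size_t row) const { return data_.data() + row * width_; }
  size_t Rows() const override { return data_.size() / width_; }
 private:
  size_t width_;
  std::vector<uint8_t> data_;
};

// Non-fixed-length column: a data buffer plus a separate length stream.
class NonFixedColumn : public PaxColumn {
 public:
  void Append(const std::string &datum) {
    data_.insert(data_.end(), datum.begin(), datum.end());
    lengths_.push_back(datum.size());
  }
  std::string Get(size_t row) const {
    size_t off = 0;
    for (size_t i = 0; i < row; ++i) off += lengths_[i];
    return std::string(data_.begin() + off, data_.begin() + off + lengths_[row]);
  }
  size_t Rows() const override { return lengths_.size(); }
 private:
  std::vector<char> data_;
  std::vector<size_t> lengths_;
};
```

Because both shapes share one base, encoding/decoding code can consume columns uniformly without knowing the storage format, which is the unification the commit message describes.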
jiaqizho and others added 27 commits April 14, 2025 19:56
Added new types of DataBuffer: UntreatedDataBuffer and TreatedDataBuffer.
  - UntreatedDataBuffer acts like a sliding window used to consume a batch of the buffer.
  - TreatedDataBuffer cleanly distinguishes the consumed area of the buffer from the unconsumed area.
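The sliding-window idea can be sketched as follows; this is a toy model, not the real DataBuffer API, and all names are assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of a sliding-window buffer that separates consumed ("treated")
// bytes from unconsumed ("untreated") ones.
class SlidingBuffer {
 public:
  void Push(const std::vector<char> &chunk) {
    data_.insert(data_.end(), chunk.begin(), chunk.end());
  }
  // Bytes not yet consumed.
  size_t Untreated() const { return data_.size() - consumed_; }
  // Consume a batch, advancing the window.
  std::vector<char> Consume(size_t n) {
    if (n > Untreated()) n = Untreated();
    std::vector<char> out(data_.begin() + consumed_, data_.begin() + consumed_ + n);
    consumed_ += n;
    return out;
  }
  // Drop the treated region to reclaim memory.
  void Compact() {
    data_.erase(data_.begin(), data_.begin() + consumed_);
    consumed_ = 0;
  }
 private:
  std::vector<char> data_;
  size_t consumed_ = 0;
};
```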
- Introduce encoding and compression interfaces
  - Defined the encoding and compression interfaces in pax column
  - Support ORC RLEv2, a streaming encoding with 4 encoding types
  - Support ZSTD and zlib
- pax column changes
  - Split pax column out of storage/
  - Split PaxColumn and PaxColumns

The implementation of RLEv2 is quite different from ORC's, mainly in the following aspects:

- Uses a state machine to track the current encoding stream state.
- Less memory usage and fewer memory copies during encoding.
- During decoding, when a null bitmap is passed, memory copying is reduced.
kExTypeFileOperationError was defined without an error message in exception_names.
This caused out-of-bounds memory access when kExTypeFileOperationError was caught by CBDB_CATCH_COMM().
PAX RLE decoding now supports templates:
- int1/int2/int4 columns can keep a DataBuffer<T> as the data part after RLE decoding.
- Reduces some memory usage/copying during decoding.
Feature: add PAX projection filter functionality.

  * Filter read data by filtering with column projection info passed in PG kernel.
  * Support sequential data read optimization in case sequential column found in projection.

For a PAX table of 700M written into a single pax file:

  * For the single-column projection case 1/8 (e.g. select a from table), the time spent in seqscan is about 30% of that without PAX column filtering; efficiency is improved by about 70% for this case.
  * For the sequential-column projection case 4/8 (select a,b,c,d from table), the time spent in seqscan is about 50% of that without column projection filtering; efficiency is improved by about 50% for this case.
build pax release version with `cmake -DENBALE_DEBUG=off ..`

also won't build `gtest target` in release build
…f cbdb for testing

The pax extension uses the 1X_STABLE_CP_FEATURE_PAX branch of cbdb for testing,
so that pax can provide paxformat.so for storage_am testing, or it can be
tested with an independent extension.
Problem: The projection_info is not always available in the Filter class; for example,
scan analyze will not call beginScan to init the column information.
The current column projection design is very problematic:
  - Splits out a large number of public functions with unclear logic
  - Correctness problems
  - Destroys the original easy-to-maintain interface

There are too many problems in the projection implementation,
so I reverted orc.cc and orc.h to the latest version
and reimplemented a new one.

About the new column projection implementation:
  - Uses a bool * to filter out columns for non-essential reads
  - Format independent, non-invasive to the MicroPartition interface
  - Also changed the gtest in orc_test.cc (too many problems in the last projection PR)
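The bool-array approach can be sketched in a few lines. This is an illustration under assumed names (the real reader works on column streams, not in-memory vectors):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sketch of bool-array projection: the reader only materializes columns
// whose slot in the mask is true. Column storage is simplified to strings.
std::vector<std::string> ReadProjected(
    const std::vector<std::vector<std::string>> &columns,
    const bool *projection, size_t row) {
  std::vector<std::string> out;
  for (size_t col = 0; col < columns.size(); ++col) {
    if (projection[col])                 // skip non-essential columns entirely
      out.push_back(columns[col][row]);
  }
  return out;
}
```

A plain `bool *` keeps the mechanism format-independent: any micro-partition format can consult the same mask without changing the MicroPartition interface.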
In CBDB, there is already a corresponding function implementation.

Removed the implementation of BuildPaxDirectoryPath and BuildPaxFilePath from the filesystem interface.
Reconstruct the directory structure of the pax extension, abstracting it into an access method layer, a table format layer, and a storage format layer.
A lower layer does not have any dependency on an upper layer.
In this way, the storage format layer can be compiled independently and provided to other extensions.
Introduce the new column interfaces PaxEncodingColumn/PaxNonFixedEncodingColumn, which create encoding/compression columns.

  - PaxEncodingColumn adds encoding/compression support for the fixed-length pax column.
  - PaxNonFixedEncodingColumn adds compression support for the non-fixed-length pax column.
    - For a non-fixed column, there is no encoding support
  - Added a subclass named PaxIntColumn
    - It's an int* encoding column.
    - The default encoding method is `kTypeRLEV2`
Remove `pax.cc` from the compilation of `libpaxformat.so`.

`pax.cc` belongs to the table layer; in paxformat, only the API of the micro-partition layer is provided.

Also add the orc proto src files when building libpaxformat.so.
`*EncodingColumn*`, as inherited classes of `*Column*`, will encode and compress the current column before serialization.

`PaxEncodingColumn/PaxNonFixedEncodingColumn` replaced `PaxCommColumn/PaxNonFixedColumn` as the default columns.
The analyze callback implementation uses SeekTuple to skip tuples.
It has three drawbacks:
  1. The low-level storage reader needs to implement the Seek feature, which complicates
     the low-level code.
  2. The SeekTuple function is useless in normal queries.
  3. The implementation is poor in that it doesn't use the buffer well.

To skip the tuples, we now simply ignore the tuple that we've read
until the tuple matches the count.

Besides the above rewrite, the commit also removes some useless functions.
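The skip-by-reading replacement can be sketched as a simple loop over a reader; the reader type and names here are toy stand-ins for the micro-partition reader:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the skip-by-reading approach that replaced SeekTuple: read and
// discard tuples until the target index is reached, then return that tuple.
template <typename Reader>
bool SkipToTuple(Reader &reader, size_t target, int *out) {
  int tuple;
  size_t count = 0;
  while (reader.ReadTuple(&tuple)) {
    if (count++ == target) { *out = tuple; return true; }
  }
  return false;  // stream had fewer than target + 1 tuples
}

// Toy reader over a vector, standing in for the micro-partition reader.
struct VecReader {
  std::vector<int> data;
  size_t pos = 0;
  bool ReadTuple(int *out) {
    if (pos >= data.size()) return false;
    *out = data[pos++];
    return true;
  }
};
```

The trade-off: sampling does a little extra reading, but the storage reader no longer needs a Seek feature at all.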
Removed the `seek*` interface in PAX.

PAX no longer requires seek methods for analyze/sampling, but use `ReadTuple`.
This commit adds statistics info for micro partitions in the auxiliary table.
The stats currently contain {(allnull, hasnull), [minimum, maximum]}.
The stats info will be used like a brin index to skip scanning a whole micro
partition file if possible. More stats info may be added later to make
filtering more efficient.
```
postgres=# vacuum FULL vacumm_test;
VACUUM


(gdb) p src_path
$3 = "base/13261/122884_pax/7f76cdaa-07d0-49f1-9e1b-ce1ccfb78712"
(gdb) p dst_path
$4 = "base/13261/122886_pax/7f76cdaa-07d0-49f1-9e1b-ce1ccfb78712"
(gdb) n
```
The filter in PAX can do the sparse filtering by min/max statistics.

1. Support micro-partition-level filter
2. Cleanup the code for iterator and pax_aux_table
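The min/max sparse-filter check can be sketched as below; the struct layout and names are illustrative, not the actual PAX stats format:

```cpp
#include <cassert>

// Sketch of per-micro-partition stats: {(allnull, hasnull), [min, max]}.
struct BlockStats {
  bool all_null;
  bool has_null;
  int min;
  int max;
};

// Returns true if a scan for `value == x` can skip this block entirely:
// either every value is null, or x lies outside the stored [min, max] range.
bool CanSkipEquals(const BlockStats &s, int x) {
  if (s.all_null) return true;     // no non-null value can match
  return x < s.min || x > s.max;   // predicate range cannot overlap stats
}
```

Skipping is decided per micro-partition file, so one cheap comparison against the auxiliary table can avoid reading an entire file.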
Some PAX gtest cases use a fixed path under /tmp, which may cause
file permission issues.

The current change switches it to a relative path (relative to the execution path).

In addition, some test cases did not delete their test files during the teardown stage,
which caused problems when the tests were executed repeatedly.
- Add GetRangeBuffer/GetRangeNonNullRows in pax column.
  - Also added tests for these interfaces
- Changed SplitTupleNumbers to split tuple numbers using 16384 * 10
- Changed some interfaces to support vec
In the vectorized executor, we need to return a batch of rows (for a single column).

In PAX, data needs to be transformed into a record batch, because the organization of
column data is different from a record batch:

- Fixed-length columns: PAX does not pad null datums to full length, but a record batch
  requires null padding by length.
- Non-fixed-length columns: PAX stores the datum header and uses a length stream;
  a record batch requires removing the header and using an offset array.
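The two transforms can be sketched as below. Names are illustrative and the datum type is simplified to `int`; the real code works on raw column buffers:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-length columns: PAX stores only non-null datums plus a null bitmap;
// a record batch needs every slot padded out to full width.
std::vector<int> PadNulls(const std::vector<int> &packed,
                          const std::vector<bool> &is_null, int pad = 0) {
  std::vector<int> out;
  size_t src = 0;
  for (bool null : is_null)
    out.push_back(null ? pad : packed[src++]);
  return out;
}

// Non-fixed-length columns: PAX keeps a length stream; a record batch wants
// a prefix-sum offset array with n + 1 entries.
std::vector<size_t> LengthsToOffsets(const std::vector<size_t> &lengths) {
  std::vector<size_t> offsets{0};
  for (size_t len : lengths)
    offsets.push_back(offsets.back() + len);
  return offsets;
}
```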
Uncaught C++ exceptions must not propagate from PAX into CBDB.

- All AM functions should perform a common catch, or the stack info will be lost
- The ereport inside CBDB_CATCH_COMM made CBDB_FINALLY not work
- Moved the CBDB_CATCH_COMM logic into CBDB_CATCH_DEFAULT
Unlike `cbdb::pfree`, the `delete` operator checks whether the pointer is null.

So we don't need to check whether the current pointer is null when calling `delete`.
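This relies on a standard C++ guarantee: `delete` on a null pointer is a no-op. A tiny demonstration with an instrumented destructor (names hypothetical):

```cpp
#include <cassert>

// Counts destructor calls so the no-op behavior is observable.
static int destroyed = 0;
struct Tracked {
  ~Tracked() { ++destroyed; }
};

void FreeTracked(Tracked *p) {
  delete p;  // safe even when p == nullptr; no explicit null check required
}
```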
The target stats_generate_protobuf was guarded by if (BUILD_GTEST AND NOT BUILD_PAX_FORMAT).

If pax is built without GTEST, the target stats_generate_protobuf will be missing.
OrcIteratorReader is no longer applicable after multiple version iterations.

Iterators exist in the table layer, so there is no need to use iterators in the micro-partition layer.
The vectorized condition in `MicroPartitionReader` complicates the logic.

- Removed the function `ReadVecTuple`, which was specialized logic
- Added a new MicroPartitionReader named `PaxVecReader` to adapt reads for the vectorized version
@jiaqizho jiaqizho force-pushed the pax-split-380-commit-1 branch from 46d1fa9 to bba60fd Compare April 14, 2025 11:56
@jiaqizho
Contributor Author

As situations may arise where we encounter PRs with large numbers of commits, we should first discuss the preferred approach on the dev list. I am marking this with "Request changes" to allow us to consider an alternate approach on the dev list.

Per the discussion on the dev list, I haven't seen any response so far, so we decided to merge PAX via the split PRs.

@jiaqizho jiaqizho merged commit 22aeaed into apache:main Apr 14, 2025
22 checks passed

7 participants