Feature: introduce a high-performance hybrid row-columnar storage engine (2/4)#1042

Merged
jiaqizho merged 97 commits intoapache:mainfrom
jiaqizho:pax-split-380-commit-2
Apr 14, 2025

Conversation

@jiaqizho
Contributor

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


@jiaqizho jiaqizho force-pushed the pax-split-380-commit-2 branch from 0d1519f to 285b3c9 Compare April 10, 2025 12:16
gongxun0928 and others added 27 commits April 14, 2025 21:20
`micro_partition_stats` should be included in paxformat.

Generate micro_partition_stats.pb.h and micro_partition_stats.pb.cc when building libpaxformat.so.
…ablespace case.

Problem: The PAX file path directory structure is not consistent in the set-new-tablespace case. Currently, if the source file directory is empty, both the data copy and the creation of the destination directory are skipped. The problem surfaces the next time a SET TABLESPACE DDL is executed: if the source directory is empty, the assert in ListDirectory fires and raises an exception.

Fix: Create the destination PAX directory even if the source directory is empty, which keeps the source/destination directory structure consistent across SET TABLESPACE DDL executions.

```
std::vector<std::string> LocalFileSystem::ListDirectory(
    const std::string &path) const {
  ...

  Assert(filepath != NULL && filepath[0] != '\0');

  dir = opendir(filepath);
  CBDB_CHECK(dir, cbdb::CException::ExType::kExTypeFileOperationError);

  ...
}

```

```
,sx1,"ERROR","XX000","ERROR: RelationCopyData (pax_access_handle.cc:194)",,,,,,"ALTER TABLE ALL IN TABLESPACE regress_tblspace_renamed SET TABLESPACE pg_default;",0,,"pax_access_handle.cc",194,"Stack trace:
1    0x7f6aa6f1f8e8 libpostgres.so errstart + 0x208
2    0x7f6a998f2fc5 pax.so _ZN3pax17CCPaxAccessMethod16RelationCopyDataEP12RelationDataPK11RelFileNode + 0x9d
3    0x7f6aa6b2c889 libpostgres.so <symbol not found> + 0xa6b2c889
4    0x7f6aa6b3d3ca libpostgres.so <symbol not found> + 0xa6b3d3ca
5    0x7f6aa6b3e76f libpostgres.so AlterTableMoveAll + 0x2bf
6    0x7f6aa6dbfb39 libpostgres.so <symbol not found> + 0xa6dbfb39
7    0x7f6aa6dbe4e9 libpostgres.so standard_ProcessUtility + 0x169
8    0x7f6aa6dbcb13 libpostgres.so <symbol not found> + 0xa6dbcb13
9    0x7f6aa6dbcc64 libpostgres.so <symbol not found> + 0xa6dbcc64
10   0x7f6aa6dbd3cb libpostgres.so PortalRun + 0x2bb
11   0x7f6aa6db75c4 libpostgres.so <symbol not found> + 0xa6db75c4
12   0x7f6aa6dbac9a libpostgres.so PostgresMain + 0x1fda
13   0x7f6aa6d031d3 libpostgres.so <symbol not found> + 0xa6d031d3
14   0x7f6aa6d04305 libpostgres.so PostmasterMain + 0xe45
15   0x4017c0 postgres main (main.c:198)
16   0x7f6aa5fdb555 libc.so.6 __libc_start_main + 0xf5
17   0x4019c4 postgres <symbol not found> + 0x4019c4
```

Test:

```
postgres=# \!rm -rf /tmp/t_pax73
postgres=# \!mkdir /tmp/t_pax73
postgres=# drop tablespace test73;
ERROR:  tablespace "test73" does not exist
postgres=# create tablespace test73 location '/tmp/t_pax73';
CREATE TABLESPACE
postgres=# create table t_pax73(a int, b int) using pax;
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'a' as the Cloudberry Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
postgres=# insert into t_pax73 select i ,i from generate_series(1,1000) i;
INSERT 0 1000
postgres=# alter table t_pax73 set tablespace test73;
ALTER TABLE
postgres=# insert into t_pax73 select i ,i from generate_series(1,1000) i;
INSERT 0 1000
postgres=# select count(1) from t_pax73;
 count
-------
  2000
(1 row)

postgres=# alter table t_pax73 set tablespace pg_default;
ALTER TABLE
postgres=# insert into t_pax73 select i ,i from generate_series(1,1000) i;
INSERT 0 1000
postgres=# alter table t_pax73 set tablespace test73;
ALTER TABLE
postgres=# select count(1) from t_pax73;
 count
-------
  3000
(1 row)
```
In previous commits, PAX could already create encoded columns.

The current commit allows users to specify the ENCODING options
of a column using the ENCODING syntax.

Example:
```
CREATE TABLE t1 (
    c1 int ENCODING (compresstype=zstd),
    c2 char ENCODING (compresstype=zlib),
    c3 char) using pax;
```

More cases can be found in `src/test/regress/sql/am_encoding.sql`.
In previous commits, PAX could already create encoded columns
and specify the ENCODING options of a column.

The current commit allows users to specify the encoding type
in reloptions. The column ENCODING options will inherit from the
reloptions.

example:
```
CREATE TABLE t1 (c1 int, c2 char, c3 char) using pax with (COMPRESSTYPE=zstd, compresslevel=5);
```

More cases can be found in `src/test/regress/sql/am_encoding.sql`.
`ERROR:  operator 1209 is not a member of opfamily 4056 (lsyscache.c:164)`

Caused by `LIKE` clauses.

Example:
```
select * from table where text_type_column like 'Unknown%';
```
The data file may be empty in our current implementation. In this case,
the micro partition stats are still initialized, but the typid is not yet set.
The typid assert should only fire once the typid has been set by inserting
the first tuple.

We relax this constraint to allow empty data files now.
The previous TAM callback scan_analyze_next_tuple was not implemented correctly
with respect to appendonly tables. This patch fixes the behavior and implements
the callback semantically identically to appendonly tables.

The snapshot may be NULL in a scan, so remove the Assert from MicroPartitionIterator.
There were some logical errors in RLE encoding:
  - several types in `RLE` share the context and need to be reset uniformly.
  - fixed the `ENABLE_DEBUG` name
  - removed an unreasonable assert
If the user often reads the same columns of the same table, the storage engine
no longer needs to fetch the data from disk when there is a cache.

- Added a cache interface
- Added a Plasma implementation

Use Plasma to cache the PAX columns:
- read from the cache if the column is already cached
- write to the cache when the column is read from disk

Also added a GUC to enable/disable the cache.
ERROR:  cannot unpin a segment that is not pinned (dsm.c:983)  (seg2 slice17 127.0.1.1:7004 pid=7331) (dsm.c:983)

In the scan path, if there is no `delete` operator in the target list, the block bitmap should not be built.

GetBlockNumber is not a process-safe function; it should not be called unless the caller is in an update or delete operator.
In PAX, the bitmap was implemented as a byte array.

Changing it to a bit array reduces memory usage and improves performance.
The current commit encapsulates `elog`.

Logging needs to be reimplemented to be more robust.
When reading in a PAX file, multiple IOs were needed to read the file's meta info:

  - Read the length of the postscript
  - Read the postscript
  - Read the file footer
  - Read the meta if it exists

In fact, we don't need multiple IOs to read the tail structure.

  - Support a single 32K read to load post_script + footer + meta
  - Split writer/reader into different files
PAX changed from multiple stripes (groups) to a single stripe (group).

But when PAX supports sparse filtering, dividing into multiple groups
makes the min/max statistical distribution within a single group more even.
In sparse filtering, even if a whole file is not filtered out, individual
groups may still be. So we decided to add multiple groups back.

- support multiple groups in the micro-partition writer and reader
- refactor reader code
Create a new auxiliary table `pg_pax_fastsequence` for PAX
file number allocation for every single PAX table.
`pg_pax_fastsequence_index` is created on column objid as
well, which improves lookup efficiency for the PAX file
fastsequence id.
The vectorized execution engine does not require a real CTID; CTID is only used to distinguish different rows.

In the current commit, PAX supports returning a CTID, but the returned CTID is only unique within the SEQ SCAN operator.
…r ..."

hashdata/arrow changed the core data structure ArrowRecordBatch in
47dd3a96f8ebcb87 from
```c
struct ArrowRecordBatch {
  struct ArrowArray *batch;
  struct ArrowSchema *schema;
};
```
to
```c
struct ArrowRecordBatch {
  struct ArrowArray batch;
  struct ArrowSchema schema;
};
```

This commit adapts pax to the above change.
The min-max filter assumed that the left type is the same as the right type.
That's not true in some cases. For example:
where d_date between '1999-4-01' and
  (cast('1999-4-01' as date) + '60 days'::interval)

The left type is date, while the right type in the second comparison is
timestamp. The comparison functions are not the same.

NOTE: The right type is kept in ScanKeyData.sk_subtype.
Export comm/bitmap.h. For the header files in the comm directory, keep the exports minimal; export all the header files in the storage directory.
`type_align` in `pg_type` should be honored in two ways:
- address alignment: the datum address returned must be aligned to type_align
- datum padding: the datum must be padded to type_align

A non-fixed-length column without encoding would not follow address alignment in some cases.
So we need padding before the data part; then, after reading from disk, the address is always aligned.
The current commit is a supplement to "Feature: micro-partition support multi groups"

- VEC reader supports reading by group
- Support the new bitmap implementation in the vec adapter
Refactor the signature of scan_begin_extractcolumns to contain the execution
context, so that the scan node is able to filter data by evaluating
expressions at a low level. A low-level filter may cost less than extracting
all columns of a row and filtering it out in the upper execution node.
PAX already supports gmock (google mock), but gmock cannot mock
private functions or replace C function pointers.

In PAX's unit tests, some CBDB functions always need to be called, but it is
difficult to make such CBDB functions return the values required by a test case.

So PAX introduces a new library named `cpp-stub`. cpp-stub can replace some global
C functions or private functions with PAX implementations, making test cases easier to write.
`test_main` still needs to link `libarrow.so`.
Add two reloptions to support partitioning: partition_by
and partition_ranges. Their syntax is
partition_by = "..."

The raw string is stored in pg_class.reloptions like
any other option, but we also store a transformed
structure (data type pg_node_tree) in pg_pax_table.partitionspec.
Constant values are not directly saved in PartitionBoundSpec.
For range partitions, lowerdatums and upperdatums save
lists of PartitionRangeDatums, not Const. The transform
normally happens after the parser, but we transform directly here.

Now, the partitionspec in pg_pax_tables is exactly the same
as a normal pg_class.relpartbound for partitioned tables.
wuhao and others added 27 commits April 14, 2025 21:20
Pax uses standard new/delete to manage memory for C++ objects, but
implements the global operator functions for new/delete. This commit
replaces all new/delete with template functions for a later refactor.
We unify how C++ objects are managed in the new template functions.
This has two major advantages:
1. The memory management may change in the future; we can then change
    it all in one place.
2. We no longer use the global operator functions.
1. delete the zstd submodule and use the same zstd dependency as cloudberrydb
2. update cpp-stub (removing the submodules inside it)
3. merge dependencies into the same cmake file and check dependencies in advance
When pax writes data, two levels of statistics are generated:
file-level statistics and group-level statistics.

Previously, WriteTuple updated both levels. In fact, we only need
to update the group-level statistics; the file-level statistics can
then be produced by merging them.
The current lighting pipeline uses CBDB_BUILD_TYPE to define the build type.
CTupleSlot was originally used to encapsulate some operations on TupleTableSlot.

In fact, it is an extra object: PG already has enough functions to
modify TupleTableSlot.

In the current version, this object not only costs additional memory but also
makes the interface more complex, so the current change removes it.
Remove the BUILD_PAX_FORMAT option and compile paxformat.so and pax.so at the same
time by default. When compiling paxformat.so, add -DBUILD_PAX_FORMAT.
BuildPaxFilePath is not a thread-safe function, and it is called multiple times in the write/read path.

Change to getting the relation path before calling BuildPaxFilePath; when we need to build
the file path, we then don't need to rebuild the relation path every time.
The original namespace of the generated protobuf classes conflicted
with official ORC, which could make PAX code use the wrong class.
We use a different namespace to avoid this.
${prefix} is the database install path chosen at configure time. We
install libpax.so, libpaxformat.so, and the headers to ${prefix} so that
other extensions can link against them.
The file name was previously generated from a uuid, which means
the data files are unordered and the ctid is hard to define.
The ctid used by DELETE/UPDATE was built temporarily in dynamic
shared memory.
However, manipulating dynamic shared memory is unfriendly to
parallel scan, because we use the pg functions to operate on it.
We will no longer maintain the old behavior of using a uuid as
the file name; instead, the file name becomes the block number,
which forms part of the ctid.
PAX always calculates min/max values in the micro partitions,
so we can apply these values to filter whole micro partitions.
Previously there was no way to bypass the filtering process.

However, there are cases where we need to disable the filtering,
e.g. when investigating bugs.

The GUC is on by default; it must be turned off manually.
We should use the macro defined by the vectorization extension rather than
a hard-coded flag.
The member `IndexFetchTableData base_` MUST be the first field of class
PaxIndexScanDesc. The class object MUST be able to convert to the postgres
struct IndexFetchTableData. On the other hand, the class PaxIndexScanDesc
is not allowed to have any virtual function.
`make -j` uses an unlimited number of jobs, which may exhaust memory.
If the current footer length is too long, a stack overflow may occur.

1. use scoped_ptr to release the buffer and avoid a memory leak
2. fix a compile error
pax_itemptr.cc is unused in libpaxformat.so but caused an unexpected compile error.
The GUC name should not contain a dot `.`, because setting a non-existent
GUC name with a dot doesn't raise an error. That is confusing if the user
misspells the GUC name.

```
gpadmin=# set pax.abcdefg = on;
SET
gpadmin=# set pax_abcdefg = on;
ERROR:  unrecognized configuration parameter "pax_abcdefg"
```
This commit adds several basic regression tests for PAX.
When the ICW test runs, these basic tests run as well, so
basic DDL and DML queries are covered.
When running the configure script, we can pass `--enable-pax` to enable PAX
support, or `--disable-pax` to disable it. The default behavior
is to disable PAX support.
After pax implements some of the PG min/max operators, we no longer allow calling detoast inside pax operators.
We need to perform the detoast operation before storing the min/max value or a non-fixed-length datum.
The GUC `pax_scan_reuse_buffer_size` was marked as must-sync from
QD to QE. This commit marks the GUC as not needing sync, for two
reasons:
1. The GUC does not need to be forcibly synced to the QE. A normal `SET`
    will also set the GUC on the QE, and it is perfectly valid for the
    GUC values to differ between QD and QEs.
2. The previous behavior required that PAX be loaded before GUCs are
    set from the QD on the QE. If the PAX module is not loaded, syncing
    `pax_scan_reuse_buffer_size` before loading pax.so causes an
    undefined-GUC error.

```
FATAL:  unrecognized configuration parameter "pax_scan_reuse_buffer_size"
```
PAX needs to define StdRdOptions instead of just vl_len,
because many places in CBDB assume that a relation's options can be cast to StdRdOptions.
`DataBuffer` is the basic buffer manager of pax.

Making frequently called methods inline reduces some call overhead.
When cbdb runs `alter table` on a PAX table, `PaxObjectAccessHook` is called.

sub_id will then not be 0, because `attrnum` is not 0.
The compress level in the group footer was always 0, because pax did not set it.
We should record this field even if it is not used during reading.
Without recording the compress level, a written file is unaware of the compress
level of the group.
In the current implementation, the length of a single tuple field will not exceed 2GB,
so a single element of the lengths stream can use int32 to represent the length, saving
4 bytes per variable-length field. In a large wide-table test with 500
variable-length columns and 1 million tuples, 2GB of storage space was saved.
…lemented sk_strategy

If the current operator is not supported as a pax operator, the all_null and has_null
flags were lost in the file-level statistics.
Also, pax should not Assert(false) when it accepts an unimplemented sk_strategy.

The current change also implements the bpchar operator.
@jiaqizho jiaqizho force-pushed the pax-split-380-commit-2 branch from 285b3c9 to 3396ce7 Compare April 14, 2025 13:20
@jiaqizho
Contributor Author

2/4 part of #1002.

Merging a pull request using the "Rebase and merge" option is limited to 100 commits. PR 1002 contains 380 commits.

@jiaqizho jiaqizho merged commit f967400 into apache:main Apr 14, 2025
22 checks passed