
Feature: introduce a high-performance hybrid row-columnar storage engine (3/4)#1043

Merged
jiaqizho merged 98 commits into apache:main from jiaqizho:pax-split-380-commit-3
Apr 14, 2025

Conversation

@jiaqizho
Contributor

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


@jiaqizho jiaqizho force-pushed the pax-split-380-commit-3 branch from dcac80a to 8fd5438 on April 10, 2025 12:17
jiaqizho and others added 27 commits April 14, 2025 22:25
If the current TupleTableSlot is reused, PAX will hit Assert(false) in ExecStoreVirtualTuple.
When PAX uses TablePartitionWriter to insert data, the file merging operation is performed in Close().

If the current table already has indexes built, PAX can only use delete-update to update the indexes.

If the current table is not indexed, PAX assumes that the CTID in the TupleTableSlot (set via SetBlockNumber + SetTupleOffset) has no meaning, which means we can directly call writer->MergeTo to speed up the merge process.
Add back PAX unittests into CI
There is a big difference between numeric as defined in Arrow and numeric as defined in PG. Arrow has only two decimal types, decimal128 and decimal256, which fix the number of numeric digits.

To be compatible with the PG format and make vectorized processing faster, PAX stores numerics in the adapted 128-bit format defined by Arrow.

Users can use the reloption numeric_vec_storage to enable this storage method.
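The 128-bit representation can be sketched as an unscaled integer plus a fixed decimal scale, as in Arrow's decimal128 layout. This is an illustrative sketch only; the scale value and function names are hypothetical, not PAX's actual code:

```python
from decimal import Decimal

SCALE = 4  # hypothetical fixed scale chosen for the column

def to_decimal128(value: Decimal, scale: int = SCALE) -> int:
    """Encode a Decimal as an Arrow-style decimal128: a 128-bit integer plus a fixed scale."""
    unscaled = int(value.scaleb(scale))          # shift the decimal point right by `scale`
    assert -(1 << 127) <= unscaled < (1 << 127)  # must fit in 128 bits
    return unscaled

def from_decimal128(unscaled: int, scale: int = SCALE) -> Decimal:
    """Decode the unscaled 128-bit integer back into a Decimal."""
    return Decimal(unscaled).scaleb(-scale)

print(to_decimal128(Decimal("12.3456")))  # 123456
print(from_decimal128(123456))            # 12.3456
```

A fixed scale is what makes the layout vectorization-friendly: every value in the column is a plain 128-bit integer, so comparisons and sums need no per-value exponent handling.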
When an exception happens in CreateMicroPartitionWriter, the MicroPartitionWriter in TableWriter may be left as nullptr.

If the caller then calls TableWriter->Close(), it will hit a nullptr error.
…mestamptz types

In the mapping from PG types to Arrow types, the implementation of ConvSchemaAndDataToVec in the pax extension needs to be consistent with the implementation of the PGTypeToArrowID function in the vectorization extension. Arrow::timestamp is used to represent the timestamp/timestamptz/time types in PG.
When the access method function swap_relation_files is called, PAX should swap the fast sequence.
The PAX aux table uses the oid (7064) as its namespace, which is one of the system table namespaces.

After a PAX table is created, the aux table's tablespace is set to the invalid oid (0), no matter whether the current tablespace is pg_default or not. This is because the tablespace of the PAX aux table is unchangeable (system table permission).

Whenever a PAX aux table is created, its index is also created. But the index of the aux table is set to the same tablespace as the PAX table (the current tablespace), so the user cannot successfully move the index to another tablespace.
We use the build directory generated by make to avoid compiling again.
…'s less than 64

Check the value of gp_interconnect_queue_depth and warn if it is less than 64. Motion performance is poor when the gp_interconnect_queue_depth value is small.
When executing SQL containing SPI logic, ReleaseCurrentSubTransaction is called at the end of SPI. At this time FdHandleAbortCallback is called, and we should not release the parent's resources, so we need to check the owner of the resource.
Once PAX clusters an index in a transaction, the function CPaxCopyPaxBlockEntry may cause the tuple in the old aux table to change. So we should copy the tuple rather than directly use the old one.

Also, we should update the fast sequence when PAX clusters indexes.
In PAX, there are two file-splitting rules:

  1. Split by number of tuples
  2. Split by file size

But every time a stripe in the file is written, the physical size of the current file is reset, which breaks the file-size rule.
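The two rules can be sketched as follows; the class and threshold names are hypothetical, and the comment marks where the reported bug (resetting the physical size per stripe) would break the second rule:

```python
class FileSplitter:
    """Illustrative sketch of PAX's two file-splitting rules (names hypothetical)."""

    def __init__(self, max_tuples: int, max_file_size: int):
        self.max_tuples = max_tuples
        self.max_file_size = max_file_size
        self.tuples = 0
        self.file_size = 0  # must accumulate across stripes, not be reset per stripe

    def on_stripe_written(self, stripe_tuples: int, stripe_bytes: int) -> bool:
        """Return True when the current file should be closed and a new one opened."""
        self.tuples += stripe_tuples
        self.file_size += stripe_bytes  # the bug was resetting this on every stripe
        return (self.tuples >= self.max_tuples
                or self.file_size >= self.max_file_size)

s = FileSplitter(max_tuples=1000, max_file_size=8 << 20)
print(s.on_stripe_written(100, 4 << 20))  # False: neither limit reached yet
print(s.on_stripe_written(100, 5 << 20))  # True: accumulated 9 MiB >= 8 MiB
```

If the size were reset after each stripe, only the tuple-count rule could ever fire, and files could grow far past the intended size limit.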
There are some statistical functions for tables in PG, such as the number of tuples inserted, the number of tuples read, etc. This type of statistics also needs to be updated in PAX.

Note that these PG statistics do not affect the results of ANALYZE.
The null test should be checked before inspecting column-specific attributes. `attno == 0` is the special case for the null test: it checks whether all columns are all null or all not null.
Special handling for C locale comparison.
In TPCH or TPCDS, if PAX uses the default file-splitting configuration, the data files become too small, which hurts read performance.

In other cases, the current file-splitting configuration may not be a problem.

The current change turns the file-splitting configuration into a GUC, which allows users to configure it differently for different cases, and the default value of the file-splitting configuration is changed to a larger value.

Also, PAX will no longer set a default encoding for columns.
Fix incorrect micro partition file skip logic: if scan_key->sk_collation is not equal to attr->attcollation, do not skip the micro partition file.
The CustomObjectClass in PAX did not implement the callbacks object_type_desc and object_identity_parts, which are used in trigger calls.

If an event trigger is registered, CBDB cannot drop the PAX table, because the event trigger tries to get the object description when the table is dropped. Since PAX did not implement the callback functions, an error was raised during this process.
Under CBDB execution, for the vec_numeric type, we need to perform format conversion.

The newly allocated numeric was not freed.
Add an array of projection column indexes to quickly skip unnecessary columns when reading projected columns. For tables with a large number of columns, this modification brings good performance improvements when reading a small number of columns.
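The idea can be sketched as follows (names are hypothetical, not the actual PAX API): compute the projected column indexes once, so the per-row path touches only the needed columns instead of testing every column:

```python
def build_proj_index(projection: list) -> list:
    """Precompute, once per scan, the indexes of the projected columns."""
    return [i for i, keep in enumerate(projection) if keep]

def read_row(row: list, proj_index: list) -> list:
    """Per-row read touches only the precomputed indexes, not all columns."""
    return [row[i] for i in proj_index]

# Wide table where only 2 of 5 columns are projected:
projection = [True, False, False, True, False]
idx = build_proj_index(projection)
print(idx)                                       # [0, 3]
print(read_row(["a", "b", "c", "d", "e"], idx))  # ['a', 'd']
```

For a table with hundreds of columns and a query projecting a handful, the per-row loop shrinks from O(total columns) to O(projected columns).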
1. Tuples in storage_am may be marked for deletion: for vectorized queries, we need to filter out invisible tuple data based on the visibility map.
2. The BPCHAR type no longer triggers an assert and is processed like varchar.
Currently, we use the AM handler's oid to judge whether the AM supports vectorization, and generate the vectorized plan if it does.

For new AMs defined in plugins, we would need to access the oid through the plugin's interface, which is inconvenient because of the dependency. We use the AM callback 'scan_flags' instead of the AM oid, and each new AM in a plugin only needs to return the flags representing the supported features.
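A minimal sketch of the flags approach; the callback name 'scan_flags' comes from the commit, but the flag constants and function names below are hypothetical:

```python
# Hypothetical feature flags an AM could advertise via its scan_flags callback.
SCAN_SUPPORT_VECTORIZATION = 1 << 0
SCAN_SUPPORT_COLUMN_ORIENTED = 1 << 1

def pax_scan_flags() -> int:
    """A plugin AM returns the features it supports, with no oid lookup needed."""
    return SCAN_SUPPORT_VECTORIZATION | SCAN_SUPPORT_COLUMN_ORIENTED

def can_vectorize(flags: int) -> bool:
    """The planner tests the flag instead of matching a known AM oid."""
    return bool(flags & SCAN_SUPPORT_VECTORIZATION)

print(can_vectorize(pax_scan_flags()))  # True
print(can_vectorize(0))                 # False: AM advertises nothing
```

The design benefit is that the planner never needs to know which AMs exist; any plugin that returns the right flag gets a vectorized plan.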
PAX should be preloaded, so the GUCs can be set in the QD before QE processes are created.
PAX has a very different storage model and IO model from the heap table. The gist/brin indexes can't work for PAX, and there may be more unsupported index types.

We only guarantee that PAX supports btree indexes. Disabling creation of other index types on PAX tables avoids wrong results from unsupported indexes.
Before storage_am used the row filter reader, PAX did not need to build the CTID in `update` SQL, because storage_am fetched all tuples and built the CTID itself.

But after storage_am switched to the row filter reader, it no longer fetches all tuples in `update` SQL. Then, without PAX setting the CTID, storage_am cannot generate the CTID by itself.
Must deal with the projection info if the current projection indexes have not been built.
jiaqizho and others added 27 commits April 14, 2025 22:26
In GetTuple, if multiple tuples are fetched, PAX calls CountNulls multiple times. CountNulls loops over the values and accumulates the number of NULLs.

With the current change, PAX pre-calculates the number of null values at each position, so only one pass is needed when we call GetTuple.
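The pre-calculation is essentially a prefix sum over the null bitmap; a minimal sketch with hypothetical names, not the actual PAX code:

```python
def build_null_prefix(null_bitmap: list) -> list:
    """One pass: prefix[i] = number of NULLs among the first i entries."""
    prefix, total = [0], 0
    for is_null in null_bitmap:
        total += 1 if is_null else 0
        prefix.append(total)
    return prefix

def nulls_before(prefix: list, pos: int) -> int:
    """O(1) lookup instead of re-scanning the bitmap for every fetched tuple."""
    return prefix[pos]

bitmap = [False, True, True, False, True]
prefix = build_null_prefix(bitmap)
print(nulls_before(prefix, 3))  # 2 NULLs among the first 3 entries
print(nulls_before(prefix, 5))  # 3 NULLs in total
```

With the prefix array built once per block, fetching N tuples costs N constant-time lookups rather than N partial re-scans of the bitmap.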
When a length-0, all-null bpchar column occurs, we previously filled in a nullptr.

This caused the vectorized execution engine to not handle it correctly.
1. refactor filesystem interface
2. add dfs_tablespace support
The vec adapter has become more bloated as the logic has grown.

It mainly contains three parts of logic:
1. Convert PAX columns (in memory) to a RecordBatch
2. Convert the buffer in a PAX column (PORC format) to an arrow::array
3. Convert the buffer in a PAX column (PORC_VEC format) to an arrow::array

With the current changes, these parts are split into different files and are independent of each other.
PAX previously used Plasma to cache column-projected data.
However, this did not work well, mainly for the following reasons:

  1. Plasma is not an efficient solution.
  2. Different data filtering schemes cause cache invalidation.
  3. The page cache already caches data on the read path.

Coupled with the lack of maintenance across multiple versions, PAX no longer supports the Plasma caching solution.
This commit fixes several issues with PAX visimap handling and adds test cases for all kinds of column types, for both porc and porc_vec:

  1. Fix scanning tuples for storage_format=porc_vec
  2. Add test cases for all column types

Co-authored-by: jiaqizho <zhoujiaqi@hashdata.cn>
These changes enable PAX to support ICW with vectorization. The current test is modified from contrib/vectorization/src/test/regress/parallel_schedule_aocs. The reason greenplum_schedule/parallel_schedule is not used in PAX is that those schedules in the vectorization directory still do not run with vectorization.

The current testing changes include the following parts:
1. Plan changes: AOCS was run in the original test, and AOCS does not perform index scan/index-only scan by default, which causes most of the plan diffs.
2. Renamed sql/out files: in the previous test set, the suffix was always _aocs.sql/_aocs.out, and some files named 1_aocs.out are invalid (should be aocs_1.out). The current changes restore all file names.
3. Adapt the test to the pax-test:* CI job.
…ot -1

The fix is for the case where typlen is not -1 but typbyval is not true, for example the address type.
The exceptions in PAX were simple, relying only on the exception type to convey information.

But when debugging, we often need some accompanying contextual information to quickly check the current error.

Therefore, with the current changes, the exception is strengthened, and all the contextual information we need is added to it.
Enable the storage format porc_vec ICW tests.
The current test cases run after the porc format tests.
In the previous PAX toast design, we uniformly named the toast files generated by PAX <blockname>.toast. This allowed PAX to check whether a toast file exists via toast_file->exist().

Also, the implementation of PAX toast was unaware of storage_am/pg.

But after PAX was connected to gopher, the exist() interface became unusable: if the current file does not exist, gopher does throw -> try...catch -> return false, which is very inefficient.

This forces PAX to record a toast_exist field (in the aux table) to avoid calling the exist() interface.
The current change is mainly support for `storage_am`. For now, PAX can use the COUNT statistics to bypass `count(*)`.

In the future, if PAX supports `custom scan`, we can also bypass part of the `count/sum` SQL.
They are copied from CBDB, but each time we add a catalog table, we would have to modify all of them.

All catalogs have been checked in CBDB, and these checks are unnecessary; just ignore them to keep things simple.

Authored-by: Zhang Mingli avamingli@gmail.com
Fix some issues in the current changes:

  1. existexttoast was not queried in the aux table.
  2. The sum stats may be wrong after merging groups.

Also, the current changes re-enable the SQL tests defined in the pax-tests target.
After PAX supports `min/max/count/sum` statistics per group/file, every update/delete in a single block invalidates part of the file-level statistics (count and sum).

The current change updates the file-level statistics after an update/delete happens.

Note: the group-level statistics will never be updated.
After PAX supports the sum/count stats, the pb stats combine function (which is provided to storage_am) must also be updated. The current change splits the MicroPartitionStatisticsInfoCombine function into these parts:

  - The PrepareStatisticsInfoCombine function checks that the pb stats struct is valid.
  - The CommStatisticsInfoCombine function combines the required fields: count/hasnull/allnull.
  - The MinMaxStatisticsInfoCombine function combines the min/max.
  - The SumStatisticsInfoCombine function combines the sum.
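The combine steps can be sketched as follows; the dict fields are hypothetical stand-ins for the pb stats structs, not the actual protobuf schema:

```python
def comm_combine(a: dict, b: dict) -> dict:
    """Combine the required fields: counts add, hasnull ORs, allnull ANDs."""
    return {
        "count": a["count"] + b["count"],
        "hasnull": a["hasnull"] or b["hasnull"],
        "allnull": a["allnull"] and b["allnull"],
    }

def minmax_combine(a: dict, b: dict) -> dict:
    """The combined min/max is the extremum over both groups."""
    return {"min": min(a["min"], b["min"]), "max": max(a["max"], b["max"])}

def sum_combine(a: dict, b: dict) -> dict:
    """Sums simply add across groups."""
    return {"sum": a["sum"] + b["sum"]}

g1 = {"count": 10, "hasnull": False, "allnull": False, "min": 1, "max": 7, "sum": 40}
g2 = {"count": 5, "hasnull": True, "allnull": False, "min": 3, "max": 9, "sum": 25}
combined = {**comm_combine(g1, g2), **minmax_combine(g1, g2), **sum_combine(g1, g2)}
print(combined)  # count 15, hasnull True, allnull False, min 1, max 9, sum 65
```

Splitting the combine into these independent steps mirrors the refactor: each statistic kind can be validated and merged on its own.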
After PAX supports toast storage, pax_dump also needs the toast file to be specified, to ensure that we can correctly parse the data part.

The current change does not support parsing toast datums, but supports supplying toast files in order to open PAX files that have toast.
The remote file can't use kReadWriteMode.

Added a write_only flag to make sure the current file is write-only.
In some customer environments, we may not be allowed to get/access the customer's data files, or even use the shell.

Therefore, the current changes also support using a UDF to dump user data. At the same time, the UDF can connect to object storage to support storage_am debugging, which is not supported in pax_dump.
PAX does not support brin/gist/spgist indexes. These index types assume that the distribution of data is managed by PAGE, which causes the index itself to not meet expectations.

The current changes no longer allow PAX to create these indexes.
The pax table supports the cluster syntax based on the btree index
and behaves the same as the aocs table.
Support rewriting multiple files in order according to the z-order curve.

Because the cluster implementation of Postgres depends on an index, we support two implementations in the PAX AM: index cluster and column-based z-order cluster. Only one of them can be executed at a time.

The default sorting of the z-order cluster is in ascending order of z-value.

```
-- zorder cluster
create table t1(c1 int, c2 int) using pax with(cluster_columns='c1');
insert into t1 select i,i from generate_series(10,1,-1) i;
table t1;
cluster t1;
table t1;
```
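The z-value used for sorting is a bit-interleaving (Morton code) of the cluster columns. A minimal two-column sketch, illustrative only and not the actual PAX implementation:

```python
def z_value(x: int, y: int, bits: int = 32) -> int:
    """Interleave the bits of two non-negative ints: x takes the even
    bit positions, y the odd ones, giving a point on the z-order curve."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

rows = [(3, 5), (0, 0), (2, 2), (1, 7)]
# Cluster = rewrite the rows in ascending z-value order:
clustered = sorted(rows, key=lambda r: z_value(r[0], r[1]))
print(clustered)  # [(0, 0), (2, 2), (3, 5), (1, 7)]
```

Sorting by the interleaved value keeps rows that are close in *both* columns physically close in the file, which is what lets min/max statistics prune files for filters on either cluster column.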
This would cause a potential memory leak.
The operators for varchar do not exist in pg_operator.dat, but varchar has the same operators as text. This is because the oper() function picks the type that can be cast to.

The current change supports the varchar min/max operators in PAX.
Unlike the `delete` keyword in C++, `cbdb::pfree` does not allow nullptr to be passed in.

Therefore, the current commit checks whether the pointer is nullptr before calling `cbdb::pfree`.
The type of ptblockname is changed to int, because table block file naming is implemented via a self-incrementing id, and using an int index for queries is more efficient.
@jiaqizho jiaqizho force-pushed the pax-split-380-commit-3 branch from 8fd5438 to a159ae5 on April 14, 2025 14:26
@jiaqizho jiaqizho merged commit 90a96f5 into apache:main Apr 14, 2025
22 checks passed
@jiaqizho
Contributor Author

3/4 part of #1002.

Merging a pull request using the "Rebase and merge" option is limited to 100 commits. PR 1002 contains 380 commits.
