Feature: introduce a high-performance hybrid row-columnar storage engine (3/4) #1043
Merged
jiaqizho merged 98 commits into apache:main on Apr 14, 2025
Conversation
dcac80a to 8fd5438
tuhaihe approved these changes on Apr 14, 2025
my-ship-it approved these changes on Apr 14, 2025
If the current TupleTableSlot is reused, PAX will hit Assert(false) in ExecStoreVirtualTuple.
When PAX uses TablePartitionWriter to insert data, the file merging operation is performed in Close(). If the current table already has indexes built, PAX can only use delete-update to maintain the indexes. If the table has no index, PAX assumes that the CTID in the TupleTableSlot (set via SetBlockNumber + SetTupleOffset) has no meaning, which means we can call writer->MergeTo directly to speed up the merge process.
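The Close()-time decision above can be sketched as a simple branch. This is a hedged illustration only; `MergeStrategy` and `ChooseMergeStrategy` are hypothetical names, not the real PAX API.

```cpp
#include <cassert>

// Illustrative sketch of the merge decision described above.
enum class MergeStrategy {
  kDirectMerge,   // no index: CTIDs carry no meaning, call writer->MergeTo
  kDeleteUpdate,  // indexed: index entries must be maintained via delete-update
};

MergeStrategy ChooseMergeStrategy(bool table_has_index) {
  // With an index present the CTIDs recorded in the index must stay valid,
  // so the faster direct merge is only safe for unindexed tables.
  return table_has_index ? MergeStrategy::kDeleteUpdate
                         : MergeStrategy::kDirectMerge;
}
```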
Add back PAX unit tests to CI
There is a big difference between numeric as defined in Arrow and numeric as defined in PG. Arrow has only two decimal types, decimal128 and decimal256, which fix the number of digits. To stay compatible with the PG format and make vectorized processing faster, PAX stores numerics in the adapted 128-bit format defined by Arrow. Users can enable this storage method with the reloption numeric_vec_storage.
When an exception happens in CreateMicroPartitionWriter, the MicroPartitionWriter inside TableWriter may be left as nullptr. If the caller then calls TableWriter->Close(), it will hit a nullptr error.
…mestamptz types. In the mapping from PG types to Arrow types, the implementation of ConvSchemaAndDataToVec in the PAX extension needs to be consistent with the implementation of the PGTypeToArrowID function in the vectorization extension. arrow::timestamp is used to represent the PG timestamp/timestamptz/time types.
When the access method function swap_relation_files is called, PAX should swap the fast sequence as well.
The PAX aux table uses OID 7064 as its namespace, which is one of the system table namespaces. After a PAX table is created, the aux table's tablespace is set to the invalid OID 0, regardless of whether the current tablespace is pg_default, because the tablespace of the aux table is unchangeable (system table permission). Whenever a PAX aux table is created, its index is created too, but the index was set to the same tablespace as the PAX table (the current tablespace), so the user could not successfully move the index to another tablespace.
We use the build directory generated by make to avoid compiling again.
…'s less than 64. Check the value of gp_interconnect_queue_depth and warn if it's less than 64; motion performance is poor when the gp_interconnect_queue_depth value is small.
When executing SQL containing SPI logic, ReleaseCurrentSubTransaction is called at the end of SPI. At that point FdHandleAbortCallback is invoked, and we should not release the parent's resources, so we need to check the owner of the resource.
Once PAX clusters an index in a transaction, the function CPaxCopyPaxBlockEntry may cause the tuple in the old aux table to change, so we should make a copy rather than use the old tuple directly. We should also update the fast sequence when PAX clusters indexes.
In PAX there are two file-splitting rules: 1. split by number of tuples; 2. split by file size. But every time a stripe in the file was written, the physical size of the current file was reset.
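The two rules can be sketched as below. This is an illustrative sketch, not PAX's real writer code; `FileWriteState` and `ShouldSplitFile` are hypothetical names, and the fix described above amounts to `file_bytes` accumulating per file rather than being reset per stripe.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative state for one open data file.
struct FileWriteState {
  uint64_t tuples = 0;      // tuples written to the current file
  uint64_t file_bytes = 0;  // physical size of the whole file, across stripes
};

// Split when either rule fires. The bug: file_bytes was reset after each
// stripe, so the size rule effectively never triggered.
bool ShouldSplitFile(const FileWriteState& s, uint64_t max_tuples,
                     uint64_t max_bytes) {
  return s.tuples >= max_tuples || s.file_bytes >= max_bytes;
}
```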
PG keeps statistics for tables, such as the number of tuples inserted, the number of tuples read, etc. These statistics also need to be updated in PAX. Note that the current PG statistics do not affect the results of ANALYZE.
A null test should be checked before inspecting column-specific attributes. `attno == 0` is the special whole-row case: it checks whether all columns are all null or all not null.
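The whole-row case can be sketched with two helpers over a per-column null bitmap. These helper names are illustrative, not PAX's actual functions.

```cpp
#include <cassert>
#include <vector>

// attno == 0: the null test spans every column, not a single column's bitmap.
bool AllNull(const std::vector<bool>& nulls) {
  for (bool n : nulls)
    if (!n) return false;  // one non-null column fails the ALL NULL test
  return true;
}

bool AllNotNull(const std::vector<bool>& nulls) {
  for (bool n : nulls)
    if (n) return false;   // one null column fails the ALL NOT NULL test
  return true;
}
```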
Special handling for C-locale comparison.
In TPC-H or TPC-DS, the default file-splitting configuration makes data files too small and hurts read performance, although in other cases the current configuration may be fine. This change turns the file-splitting configuration into a GUC, so users can configure it per workload, and raises the default value. PAX also no longer sets a default encoding for columns.
Fix incorrect micro-partition file skip logic: if scan_key->sk_collation is not equal to attr->attcollation, do not skip the micro-partition file.
The CustomObjectClass in PAX did not implement the callbacks object_type_desc and object_identity_parts, which are used in trigger calls. If an event trigger is registered, CBDB cannot drop a PAX table, because the event trigger tries to get the object description when the table is dropped; since PAX did not implement the callbacks, an error is raised during this process.
Under CBDB execution, for the vec_numeric type we need to perform format conversion; the newly allocated numeric was not freed.
Add an array of projection column indexes to quickly skip unneeded columns when reading projected columns. In tables with a large number of columns, this brings a good performance improvement when reading only a few of them.
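The idea can be sketched as turning the per-column projection bitmap into a dense index array once, so the read path loops over only the projected columns instead of testing every column of a wide table. `BuildProjectionIndex` is an illustrative name, not the real PAX function.

```cpp
#include <cassert>
#include <vector>

// One-time conversion: bitmap of projected columns -> dense index array.
std::vector<int> BuildProjectionIndex(const std::vector<bool>& projected) {
  std::vector<int> index;
  for (int i = 0; i < static_cast<int>(projected.size()); ++i)
    if (projected[i]) index.push_back(i);
  return index;
}
```

The read path then iterates `index` (a few entries) per row group, rather than all columns.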
1. Tuples in storage_am are marked for deletion; for vectorized queries we need to filter out invisible tuple data based on the visibility map. 2. The BPCHAR type no longer asserts and is processed like varchar.
Currently we use the AM handler's OID to judge whether the AM supports vectorization, and generate a vectorized plan if it does. For new AMs defined in plugins, we would need to access the OID through the plugin's interface, which is inconvenient because of the dependency. We use the AM callback 'scan_flags' instead, and each new AM in a plugin only needs to return flags representing the features it supports.
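A minimal sketch of the callback-based capability check, assuming an illustrative flag value and struct layout (not CBDB's actual `scan_flags` definition):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical feature flags an AM can report about itself.
enum ScanFlags : uint32_t { SCAN_SUPPORT_VECTORIZATION = 1u << 0 };

// Hypothetical stand-in for the AM routine struct carrying the callback.
struct TableAmRoutineSketch {
  uint32_t (*scan_flags)();
};

// No OID comparison against a hard-coded handler: the AM answers for itself.
bool AmSupportsVectorization(const TableAmRoutineSketch& am) {
  return am.scan_flags != nullptr &&
         (am.scan_flags() & SCAN_SUPPORT_VECTORIZATION) != 0;
}
```

This keeps the planner decoupled from any particular plugin: a new AM just returns the right flags from its own callback.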
PAX should be preloaded, so the GUCs can be set in the QD before creating QE processes.
PAX has a very different storage model and IO model from the heap table. GiST/BRIN indexes can't work for PAX, and more index types may be unsupported. We only guarantee that PAX supports btree indexes; disabling creation of other index types on PAX tables avoids wrong results from unsupported indexes.
Before storage_am used the row filter reader, PAX did not have to build the CTID in UPDATE SQL, because storage_am fetched all the tuples and built the CTID itself. After switching to the row filter reader, storage_am no longer fetches all the tuples during UPDATE, so unless PAX sets the CTID, storage_am cannot generate it by itself.
We must deal with the projection info if the projection indexes have not been built.
In GetTuple, if multiple tuples are fetched, PAX calls CountNulls multiple times; CountNulls iterates and accumulates the number of NULLs each time. With this change, PAX pre-computes the number of null values at each position, so only one pass is needed when GetTuple is called.
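The pre-computation above is essentially a prefix sum over the null bitmap: one pass builds the array, after which the null count of any tuple range costs O(1). Function names here are illustrative, not PAX's real API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One pass: prefix[i] holds the number of nulls among positions [0, i).
std::vector<uint32_t> BuildNullPrefix(const std::vector<bool>& null_bitmap) {
  std::vector<uint32_t> prefix(null_bitmap.size() + 1, 0);
  for (std::size_t i = 0; i < null_bitmap.size(); ++i)
    prefix[i + 1] = prefix[i] + (null_bitmap[i] ? 1u : 0u);
  return prefix;
}

// O(1) count of nulls in positions [begin, end), replacing a per-call foreach.
uint32_t CountNulls(const std::vector<uint32_t>& prefix, std::size_t begin,
                    std::size_t end) {
  return prefix[end] - prefix[begin];
}
```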
When a zero-length, all-null bpchar occurs, we used to fill in a nullptr, which the vectorized execution engine could not handle correctly.
1. Refactor the filesystem interface. 2. Add dfs_tablespace support.
The vec adapter has grown bloated as its logic accumulated. It mainly contains three parts: 1. converting PAX columns (in memory) to a RecordBatch; 2. converting the buffers in a PAX column (PORC format) to arrow::Array; 3. converting the buffers in a PAX column (PORC_VEC format) to arrow::Array. This change splits these parts into separate files, independent of each other.
PAX previously used Plasma to cache column-projected data, but it did not work well, mainly for the following reasons: 1. Plasma is not an efficient solution. 2. Different data filtering schemes cause cache invalidation. 3. The page cache already caches data on the read path. Coupled with the lack of maintenance across multiple versions, PAX no longer supports the Plasma caching solution.
This commit fixes several issues with PAX visimap handling and adds test cases covering all column types for both porc and porc_vec: 1. scanning tuples for storage_format=porc_vec; 2. test cases for all column types. Co-authored-by: jiaqizho <zhoujiaqi@hashdata.cn>
This change enables PAX to run the ICW on vectorization. The test is modified from contrib/vectorization/src/test/regress/parallel_schedule_aocs; greenplum_schedule/parallel_schedule is not used because those schedules in the vectorization directory still do not run with vectorization. The testing changes include: 1. Plan changes: the original test ran on AOCS, which does not perform index scan / index-only scan by default, causing most of the plan diffs. 2. Renamed sql/out files: the previous test set always used the _aocs.sql/_aocs.out suffix, and some files named 1_aocs.out were invalid (should be aocs_1.out); this change restores all file names. 3. Adapt the test to the pax-test:* CI job.
…ot -1. The fix is for types whose typlen is not -1 but whose typbyval is false, for example the address type.
Exceptions in PAX were simple, relying only on the exception type to convey information. But when debugging, we often need accompanying contextual information to quickly diagnose the error. This change therefore strengthens exceptions and attaches all the contextual information we need.
Enable the ICW tests for the porc_vec storage format. These test cases run after the porc format tests.
In the previous PAX toast design, the toast files generated by PAX were uniformly named <blockname>.toast. This let PAX check whether a toast file exists via toast_file->exist(), and the PAX toast implementation stayed unaware of storage_am/PG. But after PAX was connected to gopher, the exist() interface became unusable: if the file does not exist, gopher does throw -> try...catch -> return false, which is very inefficient. This forces PAX to record a toast_exist field in the aux table to avoid calling exist().
This change mainly adds support for storage_am. For now, PAX can use COUNT to bypass count(*). In the future, if PAX supports custom scan, we can also bypass part of the count/sum SQL.
They are copied from CBDB, but each time we add a catalog table we would have to modify all of them. All catalogs have already been checked in CBDB, so these checks are unnecessary; just ignore them to keep things simple. Authored-by: Zhang Mingli <avamingli@gmail.com>
Fix some issues in the current changes: 1. existexttoast was not queried in the aux table. 2. The sum stats may be wrong after merging groups. Also re-enable the SQL tests defined in the pax-tests target.
After PAX supports min/max/count/sum statistics per group/file, every update/delete in a single block invalidates part of the file-level statistics (count and sum). This change updates the file-level statistics after an update/delete happens. Note that the group-level statistics are never updated.
After PAX supports the sum/count stats, the pb stats combine function (provided to storage_am) also has to be updated.
This change splits the MicroPartitionStatisticsInfoCombine function into these parts:
- PrepareStatisticsInfoCombine checks that the pb stats struct is valid.
- CommStatisticsInfoCombine combines the required fields: count/hasnull/allnull.
- MinMaxStatisticsInfoCombine combines the min/max.
- SumStatisticsInfoCombine combines the sum.
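The combine semantics can be sketched as below. The struct and function names are illustrative stand-ins for the real pb stats messages, not PAX's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical per-column statistics record.
struct ColumnStatsSketch {
  uint64_t count;
  bool has_null;
  bool all_null;
  int64_t min, max, sum;
};

// CommStatisticsInfoCombine analogue: count adds, hasnull ORs, allnull ANDs.
void CombineCommon(ColumnStatsSketch* into, const ColumnStatsSketch& other) {
  into->count += other.count;
  into->has_null = into->has_null || other.has_null;
  into->all_null = into->all_null && other.all_null;
}

// MinMaxStatisticsInfoCombine analogue.
void CombineMinMax(ColumnStatsSketch* into, const ColumnStatsSketch& other) {
  into->min = std::min(into->min, other.min);
  into->max = std::max(into->max, other.max);
}

// SumStatisticsInfoCombine analogue.
void CombineSum(ColumnStatsSketch* into, const ColumnStatsSketch& other) {
  into->sum += other.sum;
}
```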
After PAX supports toast storage, pax_dump also needs to be given the toast file so we can correctly parse the data part. This change does not support parsing toast datums, but it supports supplying toast files to open PAX files that have toast.
Remote files can't use kReadWriteMode. Added a write_only flag to ensure the current file is write-only.
In some customer environments we may not be allowed to get or access the customer's data files, or even use a shell. Therefore this change also supports using a UDF to dump user data. The UDF can also connect to object storage to support storage_am debugging, which pax_dump does not support.
PAX does not support BRIN/GiST/SP-GiST indexes, because these indexes assume that data placement is managed by pages, which would keep the index from behaving as expected. This change no longer allows PAX to create these indexes.
The PAX table supports the CLUSTER syntax based on a btree index and behaves the same as the AOCS table.
Support rewriting multiple files in order along the z-order curve. Because the Postgres cluster implementation depends on an index, we support two implementations in the PAX AM: index cluster and column-based z-order cluster. Only one of them can be used at a time. The default z-order cluster sort is ascending z-value.
```
-- zorder cluster
create table t1(c1 int, c2 int) using pax with(cluster_columns='c1');
insert into t1 select i,i from generate_series(10,1,-1) i;
table t1;
cluster t1;
table t1;
```
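The z-value at the heart of z-order clustering can be sketched for two unsigned 32-bit columns by bit interleaving; sorting rows by this value yields the z-order curve. This is a simplified illustration — real cluster_columns handling (signed types, varying widths, more columns) is more involved, and `ZValue2` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Interleave the bits of a and b: bit i of a lands at position 2i,
// bit i of b at position 2i+1, producing a 64-bit z-value.
uint64_t ZValue2(uint32_t a, uint32_t b) {
  uint64_t z = 0;
  for (int i = 0; i < 32; ++i) {
    z |= static_cast<uint64_t>((a >> i) & 1u) << (2 * i);
    z |= static_cast<uint64_t>((b >> i) & 1u) << (2 * i + 1);
  }
  return z;
}
```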
Which would cause a potential memory leak.
The varchar operators do not exist in pg_operator.dat; varchar shares its operators with text, because types that can be cast are picked in the oper() function. This change supports the varchar min/max operators in PAX.
Unlike the C++ `delete` keyword, `cbdb::pfree` does not allow nullptr to be passed in. Therefore this commit checks for nullptr before calling `cbdb::pfree`.
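The guard can be sketched as below. `SafePfree` is a hypothetical wrapper, and `std::free` stands in for `cbdb::pfree` (which wraps PostgreSQL's pfree) so the sketch is self-contained; the return value just reports whether anything was freed.

```cpp
#include <cassert>
#include <cstdlib>

// pfree(nullptr) errors out, unlike C++ `delete nullptr` which is a no-op,
// so check before calling. Real code would call cbdb::pfree(ptr) here.
bool SafePfree(void* ptr) {
  if (ptr == nullptr) return false;
  std::free(ptr);
  return true;
}
```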
The type of ptblockname is changed to int, because table block files are named with a self-incrementing id, and querying by an int index is more efficient.
8fd5438 to a159ae5
Contributor (Author):
3/4 part of #1002. Merging a pull request using the "Rebase and merge" option is limited to 100 commits. PR 1002 contains 380 commits.
Fixes #ISSUE_Number

Test Plan:
- make installcheck
- make -C src/test installcheck-cbdb-parallel