[fix](NestedColumnPruning) mixed array metadata-only access returns wrong nested cardinality#64535

Open
englefly wants to merge 5 commits into apache:master from englefly:offset-strip

@englefly

Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

englefly and others added 3 commits June 15, 2026 23:21
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Consolidate duplicated nested column metadata-only access handling for string, array, and map columns. The refactor routes NULL/OFFSET-only access through one branch and centralizes covered metadata suffix cleanup in a helper that documents each strip case and its implementation path. OFFSET cleanup is split by coverage type so data-path coverage and deeper-OFFSET coverage have separate maintenance boundaries. It also adds a regression case for OFFSET plus NULL on the same prefix and removes an extra blank line in the related test file so FE checkstyle passes.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.PruneNestedColumnTest
    - cd fe && mvn checkstyle:check -pl fe-core
- Behavior changed: No
- Does this need documentation: No

refactor accessTree.hasStringOffsetOnlyAccess() and other type merge
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: length(map_col['key']) was normalized from [map_col, *, OFFSET] to [map_col, KEYS] and [map_col, VALUES, OFFSET], but the logic only handled root MAP columns. For nested maps such as length(element_at(s, 'm')['key']), FE kept [s, m, *, OFFSET], which BE cannot interpret as split key/value access. This change recursively detects map element lookups whose value side is offset-only and normalizes both root and nested map paths before the common metadata strip logic runs.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - cd fe && mvn checkstyle:check -pl fe-core
    - ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.PruneNestedColumnTest#testNestedMapElementLengthKeepsValueOffsetPath
    - ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.PruneNestedColumnTest
- Behavior changed: No
- Does this need documentation: No
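
The normalization described in the commit message above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `MapValueOffsetSketch`, the method `normalize`, and the `prefixIsMap` flag are hypothetical names, not the Doris implementation, and the nested output shape ([s, m, KEYS] plus [s, m, VALUES, OFFSET]) is assumed by analogy with the root case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names are illustrative and not taken from Doris.
final class MapValueOffsetSketch {
    static final String STAR = "*";
    static final String KEYS = "KEYS";
    static final String VALUES = "VALUES";
    static final String OFFSET = "OFFSET";

    /**
     * Rewrites [prefix..., *, OFFSET] into [prefix..., KEYS] and
     * [prefix..., VALUES, OFFSET]. Any other path is returned unchanged.
     * prefixIsMap is an assumed input: the caller must already know that the
     * column at prefix is a MAP, because "*" is also used for array elements.
     */
    static List<List<String>> normalize(List<String> path, boolean prefixIsMap) {
        int n = path.size();
        boolean valueOffsetOnly = prefixIsMap
                && n >= 3
                && STAR.equals(path.get(n - 2))
                && OFFSET.equals(path.get(n - 1));
        if (!valueOffsetOnly) {
            return List.of(path);
        }
        // Works for root maps (prefix of length 1) and nested maps alike.
        List<String> prefix = path.subList(0, n - 2);
        List<String> keysPath = new ArrayList<>(prefix);
        keysPath.add(KEYS);
        List<String> valuesPath = new ArrayList<>(prefix);
        valuesPath.add(VALUES);
        valuesPath.add(OFFSET);
        return List.of(keysPath, valuesPath);
    }

    public static void main(String[] args) {
        System.out.println(normalize(List.of("map_col", "*", "OFFSET"), true));
        System.out.println(normalize(List.of("s", "m", "*", "OFFSET"), true));
    }
}
```

Because the rewrite only looks at the last two steps of the path, the same code covers root and nested maps; deciding whether the prefix really is a MAP is left to the caller in this sketch.
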
@englefly changed the title from "fix-26175 and refacotry" to "[fix](NestedColumnPruning) mixed array metadata-only access returns wrong nested cardinality" on Jun 16, 2026
@englefly

Contributor Author

run buildall

@englefly marked this pull request as ready for review June 16, 2026 04:47
@englefly

Contributor Author

/review

@github-actions (bot) left a comment

Contributor

Requesting changes for one correctness issue in nested metadata access-path cleanup.

Critical checkpoint conclusions:

  • Goal/test: The PR aims to normalize and strip redundant nested NULL/OFFSET paths. The current tests cover deeper OFFSET over shallower OFFSET, same-prefix OFFSET over NULL, and deeper NULL over shallower NULL, but not the mixed shallower OFFSET plus deeper NULL case that still breaks BE reader mode selection.
  • Scope/focus: The change is localized to FE/Nereids nested-column pruning and related unit tests; the issue is in the new common strip pipeline.
  • Concurrency/lifecycle/config/compatibility/transactions/writes/observability: Not applicable; this is planner metadata passed to scan descriptors.
  • Parallel paths: The cleanup now handles data coverage and deeper-OFFSET coverage, but the equivalent deeper-NULL coverage for a shallower OFFSET path is missing.
  • Testing: git diff --check passed. I did not run FE unit tests because thirdparty/installed/bin/protoc is absent in this checkout, per FE build instructions.
  • User focus: No additional user-provided review focus was supplied.
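
To make the missing case in the checkpoints above concrete, here is a minimal sketch of a strip pass in which a shallower OFFSET path is dropped whenever another metadata path (OFFSET or NULL) sits under a strictly deeper prefix. The class `CoveredOffsetStripSketch` and its methods are hypothetical, and the example paths [arr, OFFSET] and [arr, *, NULL] are an assumed illustration of "shallower OFFSET plus deeper NULL", not paths quoted from the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the Doris helper.
final class CoveredOffsetStripSketch {
    static boolean isMeta(String step) {
        return "OFFSET".equals(step) || "NULL".equals(step);
    }

    /** True when prefix is a proper leading segment of longer. */
    static boolean hasStrictPrefix(List<String> longer, List<String> prefix) {
        return longer.size() > prefix.size()
                && longer.subList(0, prefix.size()).equals(prefix);
    }

    /** Drops every [prefix, OFFSET] path that has a deeper NULL/OFFSET path below prefix. */
    static List<List<String>> stripCoveredOffsets(List<List<String>> paths) {
        List<List<String>> kept = new ArrayList<>();
        for (List<String> p : paths) {
            boolean covered = false;
            if (!p.isEmpty() && "OFFSET".equals(p.get(p.size() - 1))) {
                List<String> targetPrefix = p.subList(0, p.size() - 1);
                for (List<String> other : paths) {
                    if (other == p || other.isEmpty() || !isMeta(other.get(other.size() - 1))) {
                        continue;
                    }
                    List<String> otherPrefix = other.subList(0, other.size() - 1);
                    if (hasStrictPrefix(otherPrefix, targetPrefix)) {
                        covered = true;
                        break;
                    }
                }
            }
            if (!covered) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(stripCoveredOffsets(List.of(
                List.of("arr", "OFFSET"),
                List.of("arr", "*", "NULL"))));
    }
}
```

In this sketch the OFFSET path is removed so that a reader would not be switched into an offset-only mode while a child path still needs to be read; same-prefix pairs such as [arr, OFFSET] and [arr, NULL] are left alone here because the prefix is not strictly deeper.
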

@hello-stephen

Contributor
TPC-H: Total hot run time: 28702 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit df569935e53a729fa99ea1ded241b3666a0fb7ff, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17644	4024	4017	4017
q2	
q3	10777	1350	789	789
q4	4691	464	342	342
q5	7553	849	571	571
q6	179	169	142	142
q7	767	841	632	632
q8	9450	1590	1599	1590
q9	6231	4513	4498	4498
q10	6785	1813	1528	1528
q11	441	274	245	245
q12	635	426	284	284
q13	18208	3286	2739	2739
q14	262	258	244	244
q15	
q16	813	774	712	712
q17	940	977	916	916
q18	7022	5642	5479	5479
q19	1847	1373	1047	1047
q20	519	392	265	265
q21	5902	2589	2356	2356
q22	425	356	306	306
Total cold run time: 101091 ms
Total hot run time: 28702 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4291	4191	4211	4191
q2	
q3	4483	4946	4317	4317
q4	2071	2195	1351	1351
q5	4477	4271	4262	4262
q6	223	175	128	128
q7	1700	1639	1711	1639
q8	2559	2129	2130	2129
q9	7908	7993	7797	7797
q10	4764	4791	4284	4284
q11	558	428	421	421
q12	767	754	543	543
q13	3354	3612	2990	2990
q14	282	292	277	277
q15	
q16	723	751	684	684
q17	1369	1335	1289	1289
q18	8041	7221	7338	7221
q19	1145	1075	1081	1075
q20	2201	2212	1938	1938
q21	5244	4585	4455	4455
q22	530	471	433	433
Total cold run time: 56690 ms
Total hot run time: 51424 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 169229 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit df569935e53a729fa99ea1ded241b3666a0fb7ff, data reload: false

query5	4318	648	495	495
query6	446	195	188	188
query7	4847	570	315	315
query8	365	209	203	203
query9	8773	3985	4006	3985
query10	433	311	249	249
query11	5927	2349	2182	2182
query12	150	99	97	97
query13	1252	607	431	431
query14	6299	5421	5011	5011
query14_1	4339	4345	4355	4345
query15	211	191	172	172
query16	982	402	413	402
query17	929	682	577	577
query18	2428	471	336	336
query19	194	181	137	137
query20	107	110	105	105
query21	214	148	114	114
query22	13591	13565	13497	13497
query23	17142	16576	16218	16218
query23_1	16252	16351	16396	16351
query24	7535	1733	1297	1297
query24_1	1314	1290	1298	1290
query25	578	470	406	406
query26	1316	313	173	173
query27	2667	523	345	345
query28	4491	2064	2046	2046
query29	1092	634	498	498
query30	309	238	194	194
query31	1115	1083	966	966
query32	109	61	65	61
query33	521	329	254	254
query34	1224	1114	666	666
query35	741	786	687	687
query36	1382	1392	1184	1184
query37	159	109	94	94
query38	3202	3175	3039	3039
query39	926	923	892	892
query39_1	869	892	872	872
query40	221	129	103	103
query41	68	65	67	65
query42	94	97	95	95
query43	313	322	274	274
query44	
query45	202	189	178	178
query46	1066	1249	756	756
query47	2409	2361	2370	2361
query48	405	409	300	300
query49	637	487	366	366
query50	993	364	261	261
query51	4394	4324	4311	4311
query52	89	91	86	86
query53	252	276	200	200
query54	303	239	229	229
query55	81	78	70	70
query56	286	225	206	206
query57	1447	1406	1313	1313
query58	241	206	201	201
query59	1592	1682	1446	1446
query60	282	231	231	231
query61	151	147	146	146
query62	716	660	589	589
query63	232	189	190	189
query64	2561	744	602	602
query65	
query66	1785	452	336	336
query67	29598	29765	29517	29517
query68	
query69	422	304	257	257
query70	944	976	982	976
query71	301	215	211	211
query72	3054	2624	2325	2325
query73	837	732	400	400
query74	5150	4899	4756	4756
query75	2650	2561	2223	2223
query76	2305	1152	751	751
query77	344	368	289	289
query78	12436	12346	11951	11951
query79	1453	1065	705	705
query80	929	468	386	386
query81	515	282	243	243
query82	579	152	119	119
query83	352	291	245	245
query84	
query85	950	507	426	426
query86	423	293	287	287
query87	3420	3346	3262	3262
query88	3605	2696	2731	2696
query89	460	383	323	323
query90	1743	176	171	171
query91	173	166	131	131
query92	62	59	54	54
query93	1525	1500	888	888
query94	647	336	293	293
query95	687	381	427	381
query96	1004	800	350	350
query97	2689	2680	2533	2533
query98	210	209	206	206
query99	1171	1165	1025	1025
Total cold run time: 250448 ms
Total hot run time: 169229 ms

@hello-stephen

Contributor

FE UT Coverage Report

Increment line coverage 81.88% (122/149) 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 79.27% (130/164) 🎉
Increment coverage report
Complete coverage report

@englefly

Contributor Author

run buildall

@englefly

Contributor Author

/review

@github-actions (bot) left a comment

Contributor

Requesting changes for one remaining correctness issue in nested map-value metadata cleanup.

Critical checkpoint conclusions:

  • Goal/test: The PR aims to normalize and strip redundant nested NULL/OFFSET access paths so BE readers do not enter metadata-only modes while deeper child data is still needed. The root array cases are now addressed, but the map-value case below is still not covered by the implementation or tests.
  • Scope/focus: The changes are localized to FE/Nereids access-path collection/cleanup and focused unit tests, but the affected helper is common to all nested metadata paths.
  • Concurrency/lifecycle/config/compatibility/transactions/writes/observability: Not applicable; this is planner metadata passed to scan descriptors, with no new runtime shared state, config, persistence, or protocol fields.
  • Parallel paths: Root array/struct same-prefix and deeper-prefix cleanup paths are handled, but the equivalent map value-side path after * -> VALUES normalization is missing.
  • Testing: I did not run tests locally in this review. Existing PR context reports FE unit/checkstyle runs, but the failing map-value mixed-meta scenario is not represented.
  • User focus: No additional user-provided review focus was supplied.

    }
    List<String> otherPrefix = other.subList(0, other.size() - 1);
    if (hasStrictPrefix(otherPrefix, targetPrefix)) {
        toRemove.add(p);

Contributor

This still leaves the same unsafe BE reader-mode combination after map value normalization, but in a different path shape from the earlier root-array thread.

Reduced plan:

Project(cardinality(element_at(m, 'a')), is_null(element_at(element_at(m, 'a'), 1)))
  OlapScan(m: MAP<STRING, ARRAY<INT>>)

AccessPathExpressionCollector emits [m, *, OFFSET] for cardinality(element_at(m,'a')) and [m, *, *, NULL] for the element null check. normalizeMapValueMetaOnlyAccessPaths rewrites only the first path to [m, KEYS] + [m, VALUES, OFFSET]; the deeper null path remains [m, *, *, NULL]. At this line the comparison is a plain lexical prefix check, so [m, *, *] is not treated as a value-side child of [m, VALUES], and [m, VALUES, OFFSET] survives.

BE then splits [m, *, *, NULL] to the value array as [value, *, NULL] and also sends [value, OFFSET]. ArrayFileColumnIterator::set_access_paths sees the value-array OFFSET, switches that iterator to OFFSET_ONLY, and skips the item iterator, so the deeper NULL path is never applied. The IS NULL result can be evaluated from default-filled item/null columns.

Please make the deeper-meta coverage map-aware here as well, e.g. reuse the compareOffsetPrefixCoverage semantics for * vs VALUES while preserving the supplemental [m, KEYS] path, or normalize deeper map-star meta paths to VALUES before this comparison. A unit/regression case with cardinality(map_arr_col['a']) plus element_at(map_arr_col['a'], 1) is null would catch this.
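
One way to picture the map-aware comparison requested above is the sketch below, which contrasts a plain lexical prefix check with a variant that treats `*` and `VALUES` as the same step. All names here (`MapAwarePrefixSketch`, `hasStrictPrefixLexical`, `hasStrictPrefixMapAware`, `sameStep`) are hypothetical; this is not `compareOffsetPrefixCoverage` or any existing Doris helper, and equating `*` with `VALUES` at a given position assumes that `VALUES` only ever appears under a MAP prefix.

```java
import java.util.List;

// Hypothetical sketch; not the Doris implementation.
final class MapAwarePrefixSketch {
    /** Plain lexical check: prefix must equal the leading steps of longer. */
    static boolean hasStrictPrefixLexical(List<String> longer, List<String> prefix) {
        return longer.size() > prefix.size()
                && longer.subList(0, prefix.size()).equals(prefix);
    }

    /** Treats "*" and "VALUES" as the same step (assumed map value-side access). */
    static boolean sameStep(String a, String b) {
        return a.equals(b)
                || ("*".equals(a) && "VALUES".equals(b))
                || ("VALUES".equals(a) && "*".equals(b));
    }

    /** Same as the lexical check, but compares steps with sameStep. */
    static boolean hasStrictPrefixMapAware(List<String> longer, List<String> prefix) {
        if (longer.size() <= prefix.size()) {
            return false;
        }
        for (int i = 0; i < prefix.size(); i++) {
            if (!sameStep(longer.get(i), prefix.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> otherPrefix = List.of("m", "*", "*");   // from [m, *, *, NULL]
        List<String> targetPrefix = List.of("m", "VALUES");  // from [m, VALUES, OFFSET]
        System.out.println(hasStrictPrefixLexical(otherPrefix, targetPrefix));  // false
        System.out.println(hasStrictPrefixMapAware(otherPrefix, targetPrefix)); // true
    }
}
```

With the map-aware variant, [m, VALUES, OFFSET] would be recognized as covered by [m, *, *, NULL], while [m, KEYS] is unaffected because `KEYS` never matches `*` or `VALUES` in this sketch.
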

@hello-stephen

Contributor
TPC-H: Total hot run time: 29187 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 06c339b9dcbae56bdc70cbba4a7527e71f702542, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17733	4136	4096	4096
q2	2029	313	186	186
q3	10284	1428	838	838
q4	4684	494	341	341
q5	7495	855	579	579
q6	192	180	141	141
q7	771	861	625	625
q8	9610	1670	1572	1572
q9	6476	4514	4515	4514
q10	6838	1841	1510	1510
q11	442	285	243	243
q12	627	469	289	289
q13	18170	3504	2749	2749
q14	272	260	257	257
q15	
q16	781	781	717	717
q17	1122	881	985	881
q18	6989	5717	5577	5577
q19	1436	1323	1092	1092
q20	489	394	267	267
q21	5861	2671	2413	2413
q22	437	368	300	300
Total cold run time: 102738 ms
Total hot run time: 29187 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4522	4434	4468	4434
q2	341	380	234	234
q3	4660	5031	4442	4442
q4	2192	2249	1379	1379
q5	4661	4449	4465	4449
q6	246	191	133	133
q7	2347	1961	1612	1612
q8	2716	2321	2278	2278
q9	8249	8049	8113	8049
q10	4884	4850	4583	4583
q11	656	427	422	422
q12	784	785	560	560
q13	3273	3745	2957	2957
q14	299	292	282	282
q15	
q16	732	734	653	653
q17	1389	1401	1362	1362
q18	8108	7502	7128	7128
q19	1089	1102	1116	1102
q20	2244	2227	1955	1955
q21	5514	4739	4608	4608
q22	551	463	406	406
Total cold run time: 59457 ms
Total hot run time: 53028 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 174721 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 06c339b9dcbae56bdc70cbba4a7527e71f702542, data reload: false

query5	4307	616	468	468
query6	439	188	175	175
query7	4896	560	295	295
query8	360	218	198	198
query9	8739	4064	4054	4054
query10	438	297	263	263
query11	5936	2348	2133	2133
query12	156	100	97	97
query13	1344	632	424	424
query14	6386	5395	5076	5076
query14_1	4461	4454	4435	4435
query15	202	196	175	175
query16	973	447	409	409
query17	1101	654	554	554
query18	2416	474	333	333
query19	192	178	134	134
query20	108	105	102	102
query21	207	138	116	116
query22	13683	13650	13338	13338
query23	17359	16579	16169	16169
query23_1	16317	16193	16251	16193
query24	7734	1704	1311	1311
query24_1	1325	1314	1312	1312
query25	535	435	353	353
query26	1313	335	179	179
query27	3155	516	334	334
query28	4545	2112	2072	2072
query29	1098	621	492	492
query30	326	241	199	199
query31	1134	1092	994	994
query32	105	67	61	61
query33	534	331	264	264
query34	1203	1184	691	691
query35	770	802	685	685
query36	1412	1411	1273	1273
query37	166	118	94	94
query38	3291	3219	3148	3148
query39	980	978	941	941
query39_1	904	953	940	940
query40	234	128	107	107
query41	69	69	70	69
query42	105	97	94	94
query43	322	327	290	290
query44	1468	779	783	779
query45	203	192	180	180
query46	1100	1214	774	774
query47	2346	2378	2207	2207
query48	439	431	303	303
query49	660	468	372	372
query50	955	358	261	261
query51	4343	4340	4209	4209
query52	89	90	78	78
query53	253	265	192	192
query54	278	225	208	208
query55	79	77	74	74
query56	248	232	233	232
query57	1449	1417	1311	1311
query58	252	224	214	214
query59	1637	1691	1434	1434
query60	302	261	239	239
query61	179	173	174	173
query62	697	650	589	589
query63	236	195	197	195
query64	2635	741	588	588
query65	4898	4834	4783	4783
query66	1779	446	332	332
query67	29827	29834	28896	28896
query68	3053	1616	930	930
query69	408	301	269	269
query70	1059	967	931	931
query71	297	236	207	207
query72	2999	2592	2322	2322
query73	828	774	421	421
query74	5184	5002	4746	4746
query75	2642	2595	2219	2219
query76	2305	1222	765	765
query77	355	380	280	280
query78	12424	12558	11913	11913
query79	1415	1154	700	700
query80	834	459	388	388
query81	495	278	240	240
query82	572	159	123	123
query83	319	278	246	246
query84	263	149	118	118
query85	871	528	412	412
query86	458	310	295	295
query87	3439	3353	3208	3208
query88	3749	2807	2760	2760
query89	428	388	330	330
query90	1902	189	182	182
query91	170	173	130	130
query92	65	58	54	54
query93	1561	1528	900	900
query94	614	350	317	317
query95	696	461	340	340
query96	1100	865	342	342
query97	2731	2673	2557	2557
query98	217	206	203	203
query99	1173	1166	1036	1036
Total cold run time: 262699 ms
Total hot run time: 174721 ms
