[common] Introduce N-gram file index for query#7927
Conversation
JingsongLi
left a comment
There was a problem hiding this comment.
Review
The concept is simple and effective — store all n-grams from a file's string column as a HashSet, then at query time check if the query pattern's n-grams exist in the set. If any n-gram from the query is missing, the file cannot contain a match → SKIP.
Critical Issues
1. The index size grows unboundedly with data cardinality — no upper bound.
The n-gram set is a HashSet<String>. For a 2-gram index over diverse string data, the set is bounded by the alphabet squared (e.g., ~700 unique 2-grams for ASCII lowercase+digits+common chars). But for larger gram sizes or Unicode data, the set can grow unboundedly. A file with 1M rows of UUIDs will produce a massive index.
The PR description claims "680 bytes for 100K records" — that's because the benchmark uses a small alphabet (6 prefixes). Real-world data (UUIDs, URLs, free text) will produce much larger indexes. There's no size cap or fallback to a bloom filter when the set exceeds a threshold.
Consider either: (a) add a max-size config that degrades to REMAIN when exceeded, or (b) use a bloom filter when the n-gram count exceeds a threshold (the PR states "no bloom filter to avoid false positives" — but an unbounded HashSet serialized as strings can be worse than a bounded bloom filter in practice).
2. visitLike pattern parsing is incorrect.
public FileIndexResult visitLike(FieldRef fieldRef, Object literal) {
String pattern = literalToString(literal);
String[] parts = pattern.split("%");
String longestPart = "";
for (String part : parts) {
if (part.length() > longestPart.length()) {
longestPart = part;
}
}
return checkPattern(longestPart);
}Problems:
split("%")doesn't handle LIKE escape characters (_wildcard,\%escaped percent)- For pattern
%hello%world%, splitting gives["", "hello", "world"]. It picks "hello" (5 chars) or "world" (5 chars) — only checks ONE part. But correct logic should check ALL non-wildcard parts: if ANY part's n-grams are missing, we can SKIP - For pattern
hello%→ splits to["hello"]— works. But for%alone → splits to["", ""]→ longestPart is""→ REMAIN. OK but fragile.
Should check all parts, not just the longest:
for (String part : parts) {
FileIndexResult result = checkPattern(part);
if (result == SKIP) return SKIP;
}
return REMAIN;3. visitEqual semantics are wrong for equality.
public FileIndexResult visitEqual(FieldRef fieldRef, Object literal) {
return checkPattern(literalToString(literal));
}If the file contains "hello" and "world", the n-gram set is {he,el,ll,lo,wo,or,rl,ld}. Query visitEqual("helo") would check n-grams {he,el,lo} — all present in the set! So it returns REMAIN, but "helo" is NOT in the file. This is expected (false positive), but the PR description says "avoid the issue of false positives" — it doesn't, it just reduces them compared to bloom filters.
More importantly, for visitEqual on strings, a bloom filter on the full string value (not n-grams) would be more effective since it's an exact-match check. The n-gram approach is specifically designed for substring/prefix/suffix — visitEqual should probably just delegate to the base class (return REMAIN) or use a separate bloom filter.
4. writeShort(tokenBytes.length) — token length limited to 65535 bytes.
With gramSize=2 this is safe. But if someone sets gramSize=100000 (no validation), a single n-gram could exceed the short limit. Add validation that gramSize is reasonable (e.g., 2-10).
5. Strings shorter than gramSize are stored as-is in the n-gram set.
private void addNgrams(String value) {
if (value.length() < gramSize) {
ngramSet.add(value); // stored whole
} else { ... }
}But checkPattern requires pattern.length() >= gramSize to not early-return REMAIN:
if (pattern == null || pattern.isEmpty() || pattern.length() < gramSize) {
return REMAIN;
}So these short values can never be matched by the index — they're stored but never queried. This wastes space. Either don't store them, or handle short patterns differently (direct set lookup for short patterns).
Minor Issues
NgramFileIndexFactory.create()ignoresDataType— doesn't pass it to the index. Fine for now but inconsistent with other factories.NgramFileIndexSimpleTest.testNgramGenerationtests the expected set directly but doesn't actually verify the serialized bytes contain these n-grams (just checks a local HashSet).- Benchmark tests should not be
@Testmethods that run in CI — they produce console output and have loose assertions (isLessThan(5000)ms). Move to a separate benchmark class or exclude from CI. "UTF-8"string should useStandardCharsets.UTF_8to avoid the checkedUnsupportedEncodingException.
Summary
The core idea works for startsWith/endsWith/contains on limited-alphabet data. Main issues: (1) unbounded index size for diverse data, (2) visitLike should check all parts not just the longest, (3) visitEqual gives a false sense of effectiveness.
Purpose
Currently Paimon not support N-gram file index, so there is room for improvement in scenarios involving prefix and suffix queries.
Let me briefly explain the principles and workflow of the n-gram file index within this PR:
benchmark test result:
The current solution does not employ a Bloom filter, primarily to avoid the issue of false positives.
Tests
NgramFileIndexSimpleTest.java
NgramFileIndexTest.java