[common] Introduce N-gram file index for query by xuzifu666 · Pull Request #7927 · apache/paimon

xuzifu666 · 2026-05-21T14:22:28Z

Purpose

Paimon currently does not support an N-gram file index, which leaves room for improvement in scenarios involving prefix and suffix queries.
The principles and workflow of the N-gram file index in this PR are outlined below:

   ┌─────────────────────────────────────────────────────────────────────────────────┐
   │  1. Overall Architecture (Integration with Paimon FileIndexer Framework)        │
   └─────────────────────────────────────────────────────────────────────────────────┘

                               FileIndexer Interface
                                       │
                       ┌───────────────┼───────────────┐
                       │               │               │
               BloomFilter         Bitmap          N-gram ⭐
                  Index             Index           Index
                (equality)        (equality)    (prefix/suffix)

                       N-gram File Index
                              │
           ┌──────────────────┼──────────────────┐
           │                  │                  │
       Writer           Factory            Reader
      (Build)        (SPI Creation)       (Query Filter)
           │                  │                  │
           ▼                  ▼                  ▼

      Write Data  →   NgramFileIndex    →   Query Filter
      Generate N-gram  (Core Impl)       Apply Predicates
      Store HashSet    gram_size param    REMAIN/SKIP


   ┌─────────────────────────────────────────────────────────────────────────────────┐
   │  2. Index Build Process (Writing Phase)                                         │
   └─────────────────────────────────────────────────────────────────────────────────┘

              Input Rows
                   │
                   ▼
       ┌──────────────────────┐
       │ write("hello")       │
       │ write("world")       │
       └────────────┬─────────┘
                    │
                    ▼
       ┌──────────────────────────────────────┐
       │ 1. BinaryString → String conversion  │
       │ 2. Extract N-grams                   │
       │    "hello" → {he, el, ll, lo}        │
       │    "world" → {wo, or, rl, ld}        │
       │ 3. Add to HashSet                    │
       └────────────┬─────────────────────────┘
                    │
                    ▼
       ┌──────────────────────────────────────┐
       │ Final N-gram Set:                    │
       │ {he, el, ll, lo, wo, or, rl, ld}     │
       │                                      │
       │ Size: 680 bytes (100K records)       │
       │ Compression ratio: 0.03%             │
       └────────────┬─────────────────────────┘
                    │
                    ▼
       ┌──────────────────────────────────────┐
       │ Serialization Format:                │
       │ [4B gramSize][4B setSize]            │
       │ [2B len1][N bytes token1]            │
       │ [2B len2][N bytes token2]            │
       │ ...                                  │
       └────────────┬─────────────────────────┘
                    │
                    ▼
              Index Bytes
           (Written to file)
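
A minimal Java sketch of the build phase above, assuming 2-character grams and the serialization layout shown in the diagram. The names NgramWriterSketch, write, ngrams and serialize are hypothetical and not taken from the PR. This sketch skips values shorter than gramSize, which may differ from what the PR does.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch; names are illustrative, not the PR's implementation.
final class NgramWriterSketch {
    private final int gramSize;
    private final Set<String> ngrams = new LinkedHashSet<>();

    NgramWriterSketch(int gramSize) {
        this.gramSize = gramSize;
    }

    // Adds every sliding window of gramSize characters to the set.
    // Values shorter than gramSize contribute nothing (assumption).
    void write(String value) {
        for (int i = 0; i + gramSize <= value.length(); i++) {
            ngrams.add(value.substring(i, i + gramSize));
        }
    }

    Set<String> ngrams() {
        return ngrams;
    }

    // Layout: [4B gramSize][4B setSize], then [2B len][token bytes] per token.
    byte[] serialize() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(gramSize);
            out.writeInt(ngrams.size());
            for (String token : ngrams) {
                byte[] tokenBytes = token.getBytes(StandardCharsets.UTF_8);
                out.writeShort(tokenBytes.length);
                out.write(tokenBytes);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

With "hello" and "world" as input, the set holds eight 2-grams, so the serialized form is 8 header bytes plus 8 tokens of 4 bytes each.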


   ┌─────────────────────────────────────────────────────────────────────────────────┐
   │  3. Query Execution Flow (Reading & Filtering Phase)                            │
   └─────────────────────────────────────────────────────────────────────────────────┘

       SQL Query
         │
         ├─ LIKE 'he%'
         ├─ LIKE '%lo'
         ├─ LIKE '%ll%'
         └─ = 'hello'

            ▼
       ┌────────────────────────────────┐
       │ Predicate Optimization         │
       │ LIKE 'prefix%'                 │
       │   → StartsWith("prefix")       │
       │ LIKE '%suffix'                 │
       │   → EndsWith("suffix")         │
       │ LIKE '%middle%'                │
       │   → Contains("middle")         │
       └────────────┬───────────────────┘
                    │
                    ▼
       ┌──────────────────────────────────────┐
       │ FileIndexPredicate.evaluate()        │
       │ Iterate over each data file          │
       └────────────┬─────────────────────────┘
                    │
                    ▼
       ┌──────────────────────────────────────┐
       │ visitStartsWith(fieldRef, "he")      │
       │                                      │
       │ 1. Get query pattern: "he"           │
       │ 2. Generate N-grams: {he}            │
       │ 3. Check each against index set      │
       │                                      │
       │ Check "he" ∈ {he,el,ll,lo,...}?      │
       └────────────┬─────────────────────────┘
                    │
           ┌────────┴────────┐
           │                 │
           ▼                 ▼
          YES               NO
           │                 │
           ▼                 ▼
       ┌──────────────┐  ┌──────────────┐
       │ REMAIN       │  │ SKIP         │
       │ File might   │  │ File cannot  │
       │ contain data │  │ contain data │
       │ (scan rows)  │  │ (skip file)  │
       └──────────────┘  └──────────────┘
           │                 │
           └────────┬────────┘
                    │
                    ▼
           Merge results & row-level scan
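
The predicate optimization step in this flow can be sketched in Java as a small classifier. This is only an illustration under stated assumptions: the class LikeRewriteSketch and the Kind enum are hypothetical, and how Paimon actually performs the rewrite is not shown in this PR. Patterns with inner wildcards or escapes are reported as OTHER instead of being rewritten.

```java
// Hypothetical sketch of mapping simple LIKE patterns to predicate kinds.
final class LikeRewriteSketch {
    enum Kind { EQUALS, STARTS_WITH, ENDS_WITH, CONTAINS, OTHER }

    static Kind classify(String pattern) {
        boolean leading = pattern.startsWith("%");
        // A lone "%" counts as leading only, so the core below is empty.
        boolean trailing = pattern.length() > 1 && pattern.endsWith("%");
        String core = pattern.substring(
                leading ? 1 : 0, pattern.length() - (trailing ? 1 : 0));
        // Inner wildcards or escapes need a full LIKE evaluation.
        if (core.isEmpty() || core.contains("%")
                || core.contains("_") || core.contains("\\")) {
            return Kind.OTHER;
        }
        if (leading && trailing) {
            return Kind.CONTAINS;
        }
        if (leading) {
            return Kind.ENDS_WITH;
        }
        if (trailing) {
            return Kind.STARTS_WITH;
        }
        return Kind.EQUALS;
    }
}
```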


   ┌─────────────────────────────────────────────────────────────────────────────────┐
   │  4. Filter Decision Logic (Decision Tree)                                       │
   └─────────────────────────────────────────────────────────────────────────────────┘

       Query pattern: pattern
               │
               ▼
       ┌──────────────────────────────┐
       │ pattern == null?             │ YES ──► REMAIN (conservative)
       │ pattern.isEmpty()?           │ YES ──► REMAIN (conservative)
       │ pattern.length < gramSize?   │ YES ──► REMAIN (cannot judge)
       └──────────────┬───────────────┘
                      │ NO
                      ▼
       ┌──────────────────────────────┐
       │ FOR i = 0 TO pattern.length-g│
       │     ngram = pattern[i:i+g]   │
       │     IF ngram ∉ ngramSet:     │
       │         RETURN SKIP          │ Early exit (99% case)
       │ RETURN REMAIN                │
       └──────────────────────────────┘
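
The decision tree above can be written as a short Java method. This is a sketch, not the PR's code: NgramFilterSketch and its Result enum are hypothetical stand-ins, and the set and gram size are passed in explicitly to keep the example self-contained. The loop stops at the last full window so no gram shorter than gramSize is ever checked.

```java
import java.util.Set;

// Hypothetical sketch of the REMAIN/SKIP decision; names are illustrative.
final class NgramFilterSketch {
    enum Result { REMAIN, SKIP }

    static Result checkPattern(String pattern, Set<String> ngramSet, int gramSize) {
        // Too little information to judge: keep the file (conservative).
        if (pattern == null || pattern.isEmpty() || pattern.length() < gramSize) {
            return Result.REMAIN;
        }
        // Every full window of the pattern must be present in the index.
        for (int i = 0; i + gramSize <= pattern.length(); i++) {
            if (!ngramSet.contains(pattern.substring(i, i + gramSize))) {
                return Result.SKIP; // one missing gram rules the file out
            }
        }
        return Result.REMAIN;
    }
}
```

REMAIN only means the file might match. A pattern whose grams all occur in the set, but never next to each other in a single value, still returns REMAIN.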


   ┌─────────────────────────────────────────────────────────────────────────────────┐
   │  5. Data Flow Diagram (From Data to Filter Result)                              │
   └─────────────────────────────────────────────────────────────────────────────────┘

       Input Data (100K rows)
               │
               ▼
       ┌──────────────────────────┐
       │ NgramFileIndex.Writer    │ (38 ms)
       │ Build index              │ 2,631 rows/ms
       └──────────────┬───────────┘
                      │
                      ▼
       ┌──────────────────────────┐
       │ Index Bytes (680 bytes)  │
       │ {N-gram set serialized}  │
       └──────────────┬───────────┘
                      │
           ┌──────────┴──────────┐
           │                     │
           ▼                     ▼
       File 1              File 1000
       Index segment       Index segment
           │                     │
           ▼ (25 µs)            ▼ (25 µs)
       ┌─────────────┐    ┌─────────────┐
       │ visitXxx()  │    │ visitXxx()  │
       │ REMAIN/SKIP │    │ REMAIN/SKIP │
       └─────────────┘    └─────────────┘
           │                     │
           └──────────┬──────────┘
                      │
                      ▼
       ┌──────────────────────────────┐
       │ File-level filter result     │
       │ - REMAIN: 100 files          │
       │ - SKIP: 900 files            │
       │ Skipped 900/1000 files       │
       └──────────────┬───────────────┘
                      │
                      ▼
       ┌──────────────────────────────┐
       │ Row-level scan (REMAIN only) │
       │ 100 files × 100K rows        │
       │ = 10M rows (vs 100M without) │
       │ Reduced 90%                  │
       └──────────────────────────────┘

Benchmark test results:

   ┌────────────────────────────────────────────────────────────────────────────────┐
   │ REAL-WORLD PERFORMANCE GAINS (Scenario: Query 1,000 files, 100K rows each)   │
   ├────────────────────────────────────────────────────────────────────────────────┤
   │                                                                                │
   │  No Index                                 With N-gram Index                   │
   │  ─────────────────────────────────────    ─────────────────────────────────   │
   │  • Scan: 1,000 files × 100K rows         • Index: 1,000 × 25µs = 25ms       │
   │  • Total: 100 million rows               • Scan: 100 files × 100K rows      │
   │  • Latency: ~100 ms                      • Total: 10 million rows           │
   │  • Files Scanned: 100%                   • Latency: ~26 ms                 │
   │                                          • Files Scanned: 10%               │
   │                                                                                │
   │  IMPROVEMENT: 74% faster | 90% fewer rows scanned | 99% I/O reduction       │
   │                                                                                │
   └────────────────────────────────────────────────────────────────────────────────┘

The current solution does not employ a Bloom filter, primarily to avoid the issue of false positives.

Tests

NgramFileIndexSimpleTest.java
NgramFileIndexTest.java

JingsongLi

Review

The concept is simple and effective — store all n-grams from a file's string column as a HashSet, then at query time check if the query pattern's n-grams exist in the set. If any n-gram from the query is missing, the file cannot contain a match → SKIP.

Critical Issues

1. The index size grows unboundedly with data cardinality — no upper bound.

The n-gram set is a HashSet<String>. For a 2-gram index over diverse string data, the set is bounded by the alphabet squared (e.g., ~700 unique 2-grams for ASCII lowercase+digits+common chars). But for larger gram sizes or Unicode data, the set can grow unboundedly. A file with 1M rows of UUIDs will produce a massive index.

The PR description claims "680 bytes for 100K records" — that's because the benchmark uses a small alphabet (6 prefixes). Real-world data (UUIDs, URLs, free text) will produce much larger indexes. There's no size cap or fallback to a bloom filter when the set exceeds a threshold.

Consider either: (a) add a max-size config that degrades to REMAIN when exceeded, or (b) use a bloom filter when the n-gram count exceeds a threshold (the PR states "no bloom filter to avoid false positives" — but an unbounded HashSet serialized as strings can be worse than a bounded bloom filter in practice).
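
Option (a) could look roughly like the following Java sketch. The maxEntries parameter and the BoundedNgramSetSketch class are hypothetical, not an existing Paimon option or class. Once the cap is exceeded, the set is dropped and every lookup answers "might contain", which corresponds to REMAIN.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a size-capped n-gram set that degrades to REMAIN.
final class BoundedNgramSetSketch {
    private final int gramSize;
    private final int maxEntries; // assumed config value, not a real option
    private final Set<String> ngrams = new HashSet<>();
    private boolean overflowed = false;

    BoundedNgramSetSketch(int gramSize, int maxEntries) {
        this.gramSize = gramSize;
        this.maxEntries = maxEntries;
    }

    void add(String value) {
        if (overflowed) {
            return; // index already abandoned for this file
        }
        for (int i = 0; i + gramSize <= value.length(); i++) {
            ngrams.add(value.substring(i, i + gramSize));
            if (ngrams.size() > maxEntries) {
                overflowed = true;
                ngrams.clear(); // free memory; nothing useful to serialize
                return;
            }
        }
    }

    boolean overflowed() {
        return overflowed;
    }

    // true means the file must be kept (REMAIN); false means it can be skipped.
    boolean mightContain(String pattern) {
        if (overflowed || pattern == null || pattern.length() < gramSize) {
            return true;
        }
        for (int i = 0; i + gramSize <= pattern.length(); i++) {
            if (!ngrams.contains(pattern.substring(i, i + gramSize))) {
                return false;
            }
        }
        return true;
    }
}
```

The trade-off is that a file with diverse data loses all filtering, but the index size stays bounded and results stay correct.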

2. visitLike pattern parsing is incorrect.

public FileIndexResult visitLike(FieldRef fieldRef, Object literal) {
    String pattern = literalToString(literal);
    String[] parts = pattern.split("%");
    String longestPart = "";
    for (String part : parts) {
        if (part.length() > longestPart.length()) {
            longestPart = part;
        }
    }
    return checkPattern(longestPart);
}

Problems:

split("%") doesn't handle the _ single-character wildcard or escaped percent signs (\%).
For pattern %hello%world%, splitting gives ["", "hello", "world"]. Both "hello" and "world" have 5 chars, and the strict > comparison keeps the first, so only "hello" is checked. The correct logic should check ALL non-wildcard parts: if ANY part's n-grams are missing, we can SKIP.
For pattern hello%, the split gives ["hello"], which works. For % alone, Java's split drops trailing empty strings and returns an empty array, so longestPart stays "" and the result is REMAIN. OK but fragile.

Should check all parts, not just the longest:

for (String part : parts) {
    FileIndexResult result = checkPattern(part);
    if (result == SKIP) return SKIP;
}
return REMAIN;
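
To also cover the _ wildcard and escaped characters, the parts could be produced by a small scanner instead of split("%"). The Java sketch below is an assumption-laden illustration: LikeSegmentsSketch is a hypothetical helper, and backslash is assumed to be the escape character, following the \% example above. Each returned segment is a literal run that any matching value must contain contiguously, so each can be passed to checkPattern.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: extracts literal runs from a LIKE pattern.
final class LikeSegmentsSketch {
    static List<String> literalSegments(String pattern) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\' && i + 1 < pattern.length()) {
                // Escaped character is taken literally (assumed escape char).
                current.append(pattern.charAt(++i));
            } else if (c == '%' || c == '_') {
                // A wildcard ends the current literal run.
                if (current.length() > 0) {
                    segments.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }
}
```

Segments shorter than gramSize simply fall through to REMAIN in checkPattern, so a pattern such as % yields no segments and the file is kept.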

3. visitEqual semantics are wrong for equality.

public FileIndexResult visitEqual(FieldRef fieldRef, Object literal) {
    return checkPattern(literalToString(literal));
}

If the file contains "hello" and "world", the n-gram set is {he,el,ll,lo,wo,or,rl,ld}. Query visitEqual("helo") would check n-grams {he,el,lo} — all present in the set! So it returns REMAIN, but "helo" is NOT in the file. This is expected (false positive), but the PR description says "avoid the issue of false positives" — it doesn't, it just reduces them compared to bloom filters.

More importantly, for visitEqual on strings, a bloom filter on the full string value (not n-grams) would be more effective since it's an exact-match check. The n-gram approach is specifically designed for substring/prefix/suffix — visitEqual should probably just delegate to the base class (return REMAIN) or use a separate bloom filter.

4. writeShort(tokenBytes.length) — token length limited to 65535 bytes.

With gramSize=2 this is safe. But if someone sets gramSize=100000 (no validation), a single n-gram could exceed the short limit. Add validation that gramSize is reasonable (e.g., 2-10).
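
A sketch of such a validation in Java, assuming the 2 to 10 range suggested above as bounds. The class name, constants and method names are hypothetical. The second check guards the 2-byte length prefix directly, since a token's UTF-8 byte length can exceed its character count.

```java
// Hypothetical validation helpers; bounds follow the "2-10" suggestion.
final class GramSizeValidationSketch {
    static final int MIN_GRAM_SIZE = 2;
    static final int MAX_GRAM_SIZE = 10;

    static int validate(int gramSize) {
        if (gramSize < MIN_GRAM_SIZE || gramSize > MAX_GRAM_SIZE) {
            throw new IllegalArgumentException(
                    "gramSize must be between " + MIN_GRAM_SIZE + " and "
                            + MAX_GRAM_SIZE + ", got " + gramSize);
        }
        return gramSize;
    }

    // Ensures the encoded token fits the unsigned 2-byte length prefix.
    static int checkedTokenLength(byte[] tokenBytes) {
        if (tokenBytes.length > 0xFFFF) {
            throw new IllegalArgumentException(
                    "token too long for 2-byte length: " + tokenBytes.length);
        }
        return tokenBytes.length;
    }
}
```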

5. Strings shorter than gramSize are stored as-is in the n-gram set.

private void addNgrams(String value) {
    if (value.length() < gramSize) {
        ngramSet.add(value);  // stored whole
    } else { ... }
}

But checkPattern requires pattern.length() >= gramSize to not early-return REMAIN:

if (pattern == null || pattern.isEmpty() || pattern.length() < gramSize) {
    return REMAIN;
}

So these short values can never be matched by the index — they're stored but never queried. This wastes space. Either don't store them, or handle short patterns differently (direct set lookup for short patterns).

Minor Issues

NgramFileIndexFactory.create() ignores DataType — doesn't pass it to the index. Fine for now but inconsistent with other factories.
NgramFileIndexSimpleTest.testNgramGeneration tests the expected set directly but doesn't actually verify the serialized bytes contain these n-grams (just checks a local HashSet).
Benchmark tests should not be @Test methods that run in CI — they produce console output and have loose assertions (isLessThan(5000) ms). Move to a separate benchmark class or exclude from CI.
"UTF-8" string should use StandardCharsets.UTF_8 to avoid the checked UnsupportedEncodingException.
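
For the last point, a minimal Java sketch of the charset-based overloads. CharsetSketch is a hypothetical holder class. String.getBytes(Charset) and the String(byte[], Charset) constructor declare no checked exception, unlike the overloads that take a charset name.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing the Charset overloads.
final class CharsetSketch {
    static byte[] encode(String token) {
        // No try/catch needed: this overload cannot fail on the charset name.
        return token.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```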

Summary

The core idea works for startsWith/endsWith/contains on limited-alphabet data. Main issues: (1) unbounded index size for diverse data, (2) visitLike should check all parts not just the longest, (3) visitEqual gives a false sense of effectiveness.

xuzifu666 added 2 commits May 20, 2026 23:24

[common] Introduce N-gram file index for string prefix/suffix/contains queries

8fe1c54

add benchmark test

ccbd6c3

JingsongLi reviewed May 23, 2026

xuzifu666 wants to merge 2 commits into apache:master from xuzifu666:n_gram_index
