[Proposal] Iceberg subsystem for datalake_fdw — design proposal #1683
If I am not mistaken, there is no way to … There was a discussion about Multi-Catalog support for Cloudberry (maybe there are some useful thoughts there).
Hi! Thank you very much for sharing your thoughts and ideas! We here in Moscow have struggled with the same issue. I must say that everyone is crazy about lakehouse architecture, especially since no one fully understands what it actually means. Anyway, I have formulated these wishes for myself as "using various databases to work with well-structured transactional data". I really appreciate the effort and am willing to participate in the development process. That's why it is very important for me to understand why we are doing this, which parts are important, what should be done first, and what can be postponed to later stages. I'd like to focus on:
The best SELECT performance. Where is Cloudberry's place in the lakehouse world? Everyone knows about Trino and Apache Doris / StarRocks, and most probably the kernel of a future lakehouse system will be one of them, not Cloudberry. We could try to catch up with them and achieve feature parity, doing the same as these products, not worse for a start, and preferably better. That is realistic, but it takes a lot of effort and time, and may still not succeed. If we accept that other databases do this job better and are more likely to be used for it, we can set priorities and start by doing something better than everyone else. In MPP we are used to everything being properly distributed, and to one of the best cost-based optimizers producing the execution plan. Let's:
Native Polaris integration. Why place the metadata catalog outside the Cloudberry cluster? Let's make it a first-class citizen. One could configure the Apache Cloudberry cluster with the Polaris catalog. Cloudberry can store data, and it can also store the Polaris catalog data itself. And so Cloudberry is once again the central element of the lakehouse.
Proposers
@MisterRaindrop
Proposal Status
Under Discussion
Abstract
1. Abstract
Cloudberry does not have a complete set of plug-in tools for accessing various data sources.
I plan to design a data lake approach to access these data sources, and evolve Cloudberry toward a data lake–enabled architecture.
datalake_fdw extends Cloudberry with two complementary ways of accessing data-lake storage; the second of them is CREATE ICEBERG TABLE inside CB, which creates and manages Apache Iceberg tables with full SELECT / INSERT / UPDATE / DELETE / VACUUM, Schema Evolution, and snapshot-based Read Committed isolation. This document focuses on that second part — the design, the key decisions, and the open questions — and is meant for community review.
Motivation
2. Motivation & Goals
2.1 Why we need this
As an MPP data warehouse, Cloudberry has long lacked a transactional read / write entry point for data-lake formats, Iceberg in particular:
The Iceberg subsystem aims to introduce Iceberg tables as first-class "lake tables" in CB without breaking PostgreSQL / Cloudberry transactional semantics:
full SQL entry points (CREATE ICEBERG TABLE ..., INSERT, UPDATE, DELETE, VACUUM); SAVEPOINT is supported.
2.2 Goals
The first release of this design aims to deliver:
2.3 Non-goals (outside the first release)
Implementation
3. Overall Architecture
The proposed design has four layers, split into a metadata path and a data path:
datalake_agent is a Java jar; it is launched and supervised by the PG bgworker datalake_proxy at postmaster startup (see §5.4).
4. The Core Abstraction: Catalog × Volume × Table
The design splits an Iceberg table into three independently configurable, freely composable pieces:
The Catalog holds the metadata.json location and the schema-evolution history.
A Volume can be shared by multiple tables (different paths under the same bucket); a Catalog can reference multiple Volumes (different tables on different storage). Polaris is a special case — the storage configuration is dispatched by the Polaris service, so a user-side Volume is optional.
Why Catalog and Volume are separated
In real deployments they are orthogonal:
Making Catalog and Volume two separate FDWs, each with its own Server / UserMapping, lets us cover every combination without inventing a new FDW for each.
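A hypothetical configuration sketch of that composition (server names, option names, and the CREATE ICEBERG TABLE syntax are illustrative only — none of them are fixed by this proposal):

```sql
-- Catalog FDW server: where the table metadata lives (options are assumptions).
CREATE SERVER polaris_cat FOREIGN DATA WRAPPER iceberg_catalog_fdw
    OPTIONS (type 'polaris', uri 'https://polaris.example.com/api/catalog');

-- Volume FDW server: where the data files live (options are assumptions).
CREATE SERVER s3_vol FOREIGN DATA WRAPPER iceberg_volume_fdw
    OPTIONS (type 's3', endpoint 'https://s3.example.com', bucket 'lakehouse');

-- Credentials attach to each Server independently via USER MAPPING.
CREATE USER MAPPING FOR analyst SERVER polaris_cat
    OPTIONS (client_id '...', client_secret '...');
CREATE USER MAPPING FOR analyst SERVER s3_vol
    OPTIONS (accesskey '...', secretkey '...');

-- One Iceberg table composes one Catalog with one Volume (syntax sketch only).
CREATE ICEBERG TABLE sales (id bigint, amount numeric, ts timestamp)
    WITH (catalog 'polaris_cat', volume 's3_vol');
```

Swapping the Catalog (Polaris → Hive → Builtin) or the Volume (S3 → HDFS) then only means pointing the table at a different Server, not a different FDW.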
Builtin Catalog
For users with no external Catalog (Polaris / Hive) available, the design offers a Builtin option: the
metadata.json location is stored directly in a CB system table. Data files still live on the Volume, and other engines can open the table through Iceberg's HadoopCatalog / FileIO using that path.
Why we need it: it removes the hard dependency on a Catalog service and lowers the barrier to entry. It also gives a zero-dependency option for the "CB is the only writer" single-writer scenario.
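Under the same illustrative sketch as above, the Builtin option would simply drop the external Catalog service (names and options are again assumptions, not proposal-defined syntax):

```sql
-- Builtin catalog: metadata.json location kept in a CB system table; only a Volume is needed.
CREATE SERVER builtin_cat FOREIGN DATA WRAPPER iceberg_catalog_fdw OPTIONS (type 'builtin');

CREATE ICEBERG TABLE events (id bigint, payload jsonb)
    WITH (catalog 'builtin_cat', volume 's3_vol');
```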
5. Components & Design Decisions
The following lists the design choice — and the reason behind it — for each key component.
5.1 Iceberg Table AM: why not a pure FDW
The most direct approach would be to keep using FDW, but two hard limitations get in the way:
Table AM (TableAmRoutine), introduced in PG 12, is a first-class storage abstraction: from the SQL side an Iceberg table looks like an ordinary table, and UPDATE / DELETE / ctid semantics, transactional callbacks, and ANALYZE all come for free from the kernel.
The proposed approach is therefore: register Iceberg tables as a dedicated Table AM, and have the AM delegate data I/O to the Volume FDW internally (reusing the existing S3 / HDFS read / write code). We get the SQL consistency of tableam and avoid reimplementing the storage layer.
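As a minimal sketch of what the Table AM surface buys us (the handler name and table definition below are assumptions for illustration; the proposal's actual user syntax is CREATE ICEBERG TABLE):

```sql
-- CREATE ACCESS METHOD ... TYPE TABLE is the stock PostgreSQL mechanism for a table AM.
-- 'iceberg_am_handler' is a hypothetical handler function name.
CREATE ACCESS METHOD iceberg TYPE TABLE HANDLER iceberg_am_handler;

-- A table using the AM behaves like an ordinary heap table at the SQL level:
CREATE TABLE lake_demo (id bigint, payload text) USING iceberg;
UPDATE lake_demo SET payload = 'x' WHERE id = 1;  -- ctid and trigger semantics come from the kernel
```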
The core code will live in
src/am_iceberg/. The AM handler itself is very thin; the main logic is planned to be organized as follows:
- pg_iceberg_ddl.c — OAT_POST_CREATE / OAT_DROP hook; creates/drops Iceberg tables via the Catalog on DDL;
- pg_iceberg_catalog.c — unified wrapper for all Catalog calls;
- pg_iceberg_metadata.c — manages the iceberg.pg_iceberg_metadata system table;
- pg_iceberg_metadata_tracker.c — transaction-scoped metadata tracker (see §5.6);
- pg_iceberg_rewrite_plan.c — QD ↔ QE JSON contract for VACUUM compaction.
5.2 Catalog FDW: abstracting three backends
iceberg_catalog_fdwabstracts metadata operations into a set ofIcebergCatalogOperations (create_table / load_table / drop_table / append / update / delete / get_fragment / get_statistics / plan_file_groups / commit_* and so on).The Server's
typeoption decides the backend:typepolarishiveUpwards, the AM only sees
pg_iceberg_*_with_catalog()functions. Downwards,agent_clitalks to the agent over RPC (Builtin is the exception — it short-circuits to the CB system table).Why FDW instead of a plain C function: it lets us reuse PG's
CREATE SERVER / USER MAPPINGfor credentials and permissions, and unifies the configuration entry point across multiple Catalog types.5.3 Volume FDW: the data-file I/O abstraction
iceberg_volume_fdwis planned to handle the actual read / write of data files and delete files (manifest / metadata json are managed by the Catalog side). It implements the full FDW interface:GetForeignRelSize / GetForeignPaths / BeginForeignScan / BeginForeignModify / ....Its responsibilities:
fdw_privateat plan time);segindex;The Server's
typeoption decides storage:s3/s3b(OSS / MinIO / OBS / …) /hdfs.5.4 datalake_agent: why a separate Java service
This is the single most important design trade-off.
Iceberg's metadata semantics are complex: manifest lists, snapshot logs, partition-spec evolution, schema field-id mapping, optimistic CAS commit, and so on. The community's most invested, most mature implementation is
iceberg-java. Reimplementing all of this on the C / C++ side would cost us:
Therefore the design delegates all metadata operations to a dedicated
datalake_agent (Java Spring Boot, wrapping iceberg-java + hive-jdbc + hadoop-client). The interface is planned to cover:
- /iceberg/tables — create / load / drop;
- /fragments — plan files (with predicate pushdown);
- /modify — incremental snapshot generation;
- /commit — CAS commit;
- /plan-rewrite + /commit-rewrite — VACUUM.
Upside:
Cost: one extra network hop — but only on the metadata path; data I/O still goes straight from C++ to storage, so throughput is unaffected.
Process lifecycle: managed by the
datalake_proxy bgworker
To tie the agent's lifecycle to the CB cluster and spare users from babysitting a Java process, the design introduces
datalake_proxy (contrib/datalake_proxy/), a PG background worker (bgworker):
- datalake_proxy is registered in shared_preload_libraries and starts with the postmaster;
- in _PG_init, a bgworker is registered that forks a child process to run the agent jar on startup;
- if the agent process exits unexpectedly, datalake_proxy restarts it;
- the GUC datalake_proxy.register_datalake_proxy toggles the feature;
- datalake_proxy.dlagent_memory_limit (default 2 GB) caps the agent's JVM heap;
- on shutdown, a stop signal is sent from datalake_proxy to the agent for a clean shutdown.
From the user's perspective this means "CB is up → Iceberg is available" — no extra deployment, no extra supervisor.
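In practice that amounts to a handful of settings. The GUC names below come from this proposal; the value syntax is illustrative, the preload change requires a postmaster restart, and the datalake_proxy.* parameters are only recognized once the library is loaded:

```sql
-- Illustrative configuration sketch; not a definitive procedure.
ALTER SYSTEM SET shared_preload_libraries = 'datalake_proxy';  -- restart required
ALTER SYSTEM SET datalake_proxy.register_datalake_proxy = on;
ALTER SYSTEM SET datalake_proxy.dlagent_memory_limit = '2GB';  -- caps the agent's JVM heap
SELECT pg_reload_conf();
```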
RPC protocol: gRPC
JSON / REST has two pain points at scale — large fragment lists cost CPU to encode / decode, and plan-file results are slow to deserialize when they get big. The plan is to expose the same interface over protobuf + gRPC:
streaming responses (get_fragments can be server-streamed) reduce QD memory pressure.
5.5 Provider layer: the data plane
src/provider/iceberg/ is planned to be a C++ implementation covering Iceberg's data plane: reading and writing data files and position-delete files (keyed by file_path string, pos long), and turning a FileScanTask into a row reader.
Why Provider does not go through the agent: data I/O is the system's throughput bottleneck. Only by having each segment read / write storage independently and in parallel can we sustain MPP-scale writes. Meanwhile, mature C++ libraries already exist for Parquet (arrow-cpp / orc) — reusing them is far more efficient than routing through an agent.
5.6 Metadata Tracker: the heart of transactional semantics
The problem: Iceberg uses optimistic CAS (via the metadata.json version chain) for concurrency, while PG uses MVCC. How do we fit Iceberg's snapshot semantics inside a PG transaction?
The design: a transaction-scoped
Metadata Tracker. Its shape is inspired by Rust iceberg-rs'sMetadataLocationTrackerand pg_lake'sIcebergSnapshotBuilder.Under this design, modifications to an Iceberg table within a transaction flow as follows:
Three rebase trigger points:
The resulting semantics:
SAVEPOINT support relies on a level_history stack, recording the metadata and file counts before each nested-transaction modification.
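For example, the intended user-visible behaviour looks like the following (the table and its schema are assumptions; the comments restate the §5.6 semantics, not a finished implementation):

```sql
-- Illustrative transaction against an assumed Iceberg table lake.orders(id, status).
BEGIN;
INSERT INTO lake.orders VALUES (1, 'pending');
SAVEPOINT s1;
UPDATE lake.orders SET status = 'shipped' WHERE id = 1;
ROLLBACK TO SAVEPOINT s1;  -- discards only the files/metadata recorded after s1 (level_history)
COMMIT;                    -- a single CAS commit publishes the surviving snapshot
```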
5.7 Deletion Queue: why asynchronous cleanup
DROPping an Iceberg table, replacing old files during VACUUM, orphans left behind by a rolled-back transaction — all of these need deletions against object storage.
Why not delete synchronously: a single Iceberg table can reference tens of thousands to millions of files. Synchronous deletion inside the transaction would make DDL block for a long time, and a mid-way failure would leave the system in a "metadata gone, files stranded" inconsistent state.
The design: an iceberg.pg_iceberg_deletion_queue system table plus a background task.
- Dropped tables enqueue their obsolete metadata locations (DELETION_TYPE_METADATA);
- replaced or orphaned data / delete files are enqueued as DELETION_TYPE_FILE;
- failed deletions get retry_count++ and are retried later, giving idempotency.
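An illustrative shape for the queue (column names taken from §12.2; the types and default are assumptions, not the planned definition):

```sql
-- Sketch only; the actual definition is up to the implementation.
CREATE TABLE iceberg.pg_iceberg_deletion_queue (
    path          text NOT NULL,      -- object-storage path to delete
    table_name    text,
    orphaned_at   timestamptz,
    retry_count   int DEFAULT 0,
    deletion_type smallint            -- 0 = FILE, 1 = METADATA
);

-- e.g. a quick check for entries that keep failing:
SELECT path, deletion_type, retry_count
FROM   iceberg.pg_iceberg_deletion_queue
WHERE  retry_count > 3;
```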
6. End-to-End Flows
Execution paths for each key SQL under this design.
CREATE ICEBERG TABLE
1. The DDL creates the usual catalog entries in pg_class / pg_attribute / pg_lake_table;
2. the OAT_POST_CREATE hook on the QD calls the agent's /iceberg/tables to produce the initial metadata.json;
3. the resulting metadata location is recorded in iceberg.pg_iceberg_metadata.
SELECT
1. At scan start, scan_get_am_private obtains the metadata_location "that this scan should see" (an already-modified table triggers one rebase);
2. the QD calls the agent's /fragments (with pushdown predicates) and receives a List<FileScanTask>;
3. the fragments are dispatched to the QEs, and each QE reads the ones assigned to it by segindex.
INSERT / UPDATE / DELETE
1. Each statement's changes are registered through tracker.apply_updates_with_rebase: the newly written data_files / delete_files are collected;
2. the agent's /modify is called to generate a new intermediate metadata.json;
3. at commit, tracker_commit_all performs the CAS for every modified table.
VACUUM
1. The QD calls the agent's /plan-rewrite and receives a rewrite plan (groups built from min-input-files + target-file-size);
2. the plan is dispatched to the QEs, which rewrite the grouped files (the pg_iceberg_rewrite_plan.c contract from §5.1);
3. the QD calls /commit-rewrite to commit a RewriteFiles snapshot.
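For instance, a compaction pass could be tuned per session before running VACUUM. The GUC names come from §12.1; the table name and the idea of per-session tuning are illustrative assumptions:

```sql
-- Illustrative: tighten compaction thresholds, then vacuum one Iceberg table.
SET datalake.iceberg_vacuum_compact_min_input_files = 10;
SET datalake.iceberg_vacuum_rewrite_target_file_size_mb = 512;
VACUUM lake.orders;  -- QD plans the rewrite, QEs rewrite files, QD commits a RewriteFiles snapshot
```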
DROP
1. The OAT_DROP hook enqueues the metadata_location into the deletion queue;
2. the row in pg_iceberg_metadata is removed;
3. data and metadata files are cleaned up asynchronously by the deletion-queue task (§5.7).
7. MPP Execution Model
The responsibilities are divided as follows under MPP.
7.1 QD vs QE responsibilities
Principle: only the QD talks to the agent. Letting N QEs hit the agent in parallel would both make the agent a bottleneck and introduce concurrent writes to Iceberg snapshot state, which brings its own complexity. The parallel part is the data I/O.
7.2 Fragment dispatch
The QD places
List<FileScanTask> into the plan tree; it is serialized and dispatched to QEs. Each QE picks its fragments round-robin by segindex % segcount. The GUC
datalake.external_table_limit_segment_num can cap the number of segments that participate in a scan — useful when joining with small tables to reduce dispatch overhead.
7.3 Global file-id consistency
UPDATE / DELETEplans may include a Redistribute Motion that ships a row from QE-i to QE-j. QE-j, when it later dereferences the ctid, must still be able to resolve it back to its original file.Under this design, ctids are encoded as
<file_id, row_pos>. To let any QE resolve a ctid from any origin,BeginForeignModifypre-populates a global file-id map using the full fragment list (not just the subset assigned to the current QE).8. Pushdown & Optimization
WHERE clauses are translated through
deparse.cinto the agent's FilterNode tree; the agent then converts that into an IcebergExpression, applying partition pruning + manifest min/max filtering atplanFilestime. Operators planned for pushdown:=, !=, >, <, >=, <=, IS [NOT] NULL, LIKE, IN, AND, OR.The Provider C++ layer then applies row-group filtering + residual predicates + column projection.
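A query like the following would benefit from that pushdown (the table, columns, and predicates are illustrative; the operators used are from the list above):

```sql
-- Both predicates use pushable operators, so they reach the agent's planFiles call;
-- partition pruning and manifest min/max filtering can then skip irrelevant data files.
SELECT order_id, amount
FROM   lake.orders
WHERE  order_date >= DATE '2024-01-01'
  AND  region IN ('EU', 'APAC');
```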
A fragment cache (GUC
datalake.enable_iceberg_fragment_cache, default on) caches metadata_location + filter → plan result within a single backend, avoiding repeated trips to the agent.
9. Concurrency with External Engines
Community Iceberg engines (Spark / Trino / …) may write the same table concurrently. Under this design:
when global != last_base is detected, CB rebases and replans (accumulated files are reapplied on top of the new global).
10. Extensibility
New Catalog type (Nessie / Glue / in-house):
- add the new Catalog construction on the agent side;
- add a type branch on the PG side.
Because all Iceberg semantics live in the agent, the PG-side change is minimal.
New storage backend:
- accept a new type and handle its connection parameters.
New DML shapes (MERGE / UPSERT): mostly planner work; the underlying "write data file + write position-delete" primitives can be reused.
11. Outside the First Release (follow-up work)
Items the first release will not cover and that will be discussed in later iterations:
12. Appendix
12.1 Key GUCs (planned)
| GUC | Default |
| --- | --- |
| iceberg_default_catalog | '' |
| iceberg_default_volume | '' |
| datalake_agent_server_url | |
| datalake.enable_iceberg_fragment_cache | on |
| datalake.iceberg_vacuum_compact_min_input_files | 10 |
| datalake.iceberg_vacuum_rewrite_target_file_size_mb | 512 |
| datalake.iceberg_postion_deletes_threshold | 100000 |
| datalake.external_table_limit_segment_num | 0 |
| datalake.disable_filter_pushdown | off |
| datalake.iceberg_autovacuum | off |
| datalake.iceberg_autovacuum_naptime | 600 |

12.2 New system tables (planned)
- iceberg.pg_iceberg_metadata — current metadata location for each Iceberg table. Columns: relid, metadata_location, previous_metadata_location, is_internal, default_spec_id.
- iceberg.pg_iceberg_deletion_queue — queue of files to be cleaned up. Columns: path, table_name, orphaned_at, retry_count, deletion_type (0 = FILE / 1 = METADATA).
12.3 Planned code layout
Suggested review focus:
- whether the datalake_proxy bgworker process model is the right way to host the Java agent.
Rollout/Adoption Plan
No response
Are you willing to submit a PR?