Skip to content

feat(arrow): resolve S3-compatible schemes in ArrowFileSystemFileIO#703

Open
plusplusjiajia wants to merge 4 commits into
apache:mainfrom
plusplusjiajia:fix-arrow-s3-compatible-scheme
Open

feat(arrow): resolve S3-compatible schemes in ArrowFileSystemFileIO#703
plusplusjiajia wants to merge 4 commits into
apache:mainfrom
plusplusjiajia:fix-arrow-s3-compatible-scheme

Conversation

@plusplusjiajia

Copy link
Copy Markdown
Member

ArrowFileSystemFileIO::ResolvePath passed the full URI to the wrapped FileSystem's PathFromUri, which only accepts that FileSystem's native scheme(s3:// for S3FileSystem). S3-compatible object stores are commonly addressed with other schemes — s3a/s3n, or vendor schemes such as gs:// (GCS) andoss:// (Alibaba OSS) — served by an arrow::fs::S3FileSystem configured with an endpoint_override. Every read and write of such a location failed with:

expected a URI with one of the schemes (s3) but received <scheme>://...

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates iceberg::arrow::ArrowFileSystemFileIO::ResolvePath to better handle S3-compatible (and other “foreign”) URI schemes by resolving them to the underlying Arrow filesystem path, avoiding failures when the wrapped Arrow FileSystem only accepts its native scheme.

Changes:

  • Adjusted ArrowFileSystemFileIO::ResolvePath to fall back from PathFromUri() to a scheme-less path when URI parsing fails.
  • Added a regression test ensuring reads succeed when a non-native scheme is used with the underlying filesystem.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/iceberg/arrow/arrow_io.cc Implements URI scheme fallback behavior in ResolvePath.
src/iceberg/test/arrow_io_test.cc Adds test coverage for resolving a foreign scheme to the underlying path.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/iceberg/arrow/arrow_io.cc Outdated
Comment on lines 476 to 484
if (auto pos = file_location.find("://"); pos != std::string::npos) {
auto path = arrow_fs_->PathFromUri(file_location);
if (path.ok()) {
return path.ValueOrDie();
}
// PathFromUri rejects S3-compatible schemes (s3a/s3n, gs://, oss://);
// fall back to the scheme-less bucket/key.
return file_location.substr(pos + 3);
}

@plusplusjiajia plusplusjiajia Jun 8, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed: the fallback now triggers only on Arrow's scheme-mismatch error; other PathFromUri failures (malformed URI, unsupported authority, …) propagate. ?query/#fragment are stripped. Added tests for the propagate path and for preserving percent-encoded keys.

Comment thread src/iceberg/arrow/arrow_io.cc Outdated
}
// PathFromUri rejects S3-compatible schemes (s3a/s3n, gs://, oss://);
// fall back to the scheme-less bucket/key.
return file_location.substr(pos + 3);

@MisterRaindrop MisterRaindrop Jun 8, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this fallback is too broad. As implemented, any URI scheme rejected by PathFromUri will be stri
pped and interpreted as a path in the wrapped filesystem. For example, with an S3FileSystem, a mismatch ed URI such as file://bucket/prefix would be interpreted as bucket/prefix instead of failing fast.

Can we make the fallback explicit and narrow?

  1. The scheme should be in an allowlist for the wrapped filesystem. For S3FileSystem, I think we can
    support well-known aliases such as s3a and s3n. Vendor schemes such as oss or gs should probably be opt- in or discussed separately, because they may also represent native OSS/GCS locations.
  2. The PathFromUri failure should match the expected scheme-mismatch / unsupported-scheme case. Other errors, such as malformed URI, authority handling errors, or local-path errors, should still be returned to the caller.

In other words, this should be guarded by both a scheme allowlist and the expected error condition,
rather than falling back for every PathFromUri failure.

@plusplusjiajia plusplusjiajia Jun 8, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MisterRaindrop Agreed — the scheme allowlist belongs at the selection layer (like Java's ResolvingFileIO.SCHEME_TO_FILE_IO), so DetectBuiltinFileIO now accepts s3a/s3n. For the error condition: ResolvePath now only falls back on Arrow's scheme-mismatch error — any other PathFromUri failure (malformed URI, unsupported authority, …) is propagated — and it strips ?query/#fragment. oss/gs left for separate discussion, as you suggested.

@plusplusjiajia plusplusjiajia force-pushed the fix-arrow-s3-compatible-scheme branch 4 times, most recently from 0b65bf6 to c6886c7 Compare June 8, 2026 09:12
@plusplusjiajia plusplusjiajia force-pushed the fix-arrow-s3-compatible-scheme branch from c6886c7 to 5c7656b Compare June 8, 2026 09:32
Comment thread src/iceberg/arrow/arrow_io.cc Outdated
}
// Scheme-less bucket/key, dropping any ?query / #fragment.
std::string bucket_key = file_location.substr(pos + 3);
return bucket_key.substr(0, bucket_key.find_first_of("?#"));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One subtle concern: s3:// and s3a:// currently use different parsing semantics. s3:// goes through Arrow's PathFromUri, while s3a/s3n use the fallback substring parser. This means percent-encoded keys may resolve differently.
example:
s3://bucket/a%20b -> Arrow PathFromUri -> bucket/a b
s3a://bucket/a%20b -> fallback -> bucket/a%20b

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MisterRaindrop Good catch — fixed. The fallback now percent-decodes the key via arrow::util::UriUnescape, so s3a/s3n resolve identically to native s3:// (a%20b → a b).

Comment thread src/iceberg/arrow/arrow_io.cc Outdated

// Only fall back for Arrow's scheme-mismatch error; propagate anything else.
const auto& status = path.status();
if (status.message().find("expected a URI with one of the schemes") ==

@MisterRaindrop MisterRaindrop Jun 9, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach is problematic. It relies on Arrow's error messages, which will break if Arrow changes the message text. It does not simultaneously check status.code()

@plusplusjiajia plusplusjiajia Jun 9, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MisterRaindrop Good point — dropped the error-text match. ResolvePath now validates the URI with arrow::util::Uri::Parse (a malformed URI fails there and propagates) and then percent-decodes the scheme-less bucket/key, so it no longer depends on Arrow's message text or status code.

@plusplusjiajia plusplusjiajia force-pushed the fix-arrow-s3-compatible-scheme branch from bc30e66 to 0f32c67 Compare June 9, 2026 02:11
@plusplusjiajia plusplusjiajia force-pushed the fix-arrow-s3-compatible-scheme branch from 0f32c67 to 277d8ba Compare June 9, 2026 02:22
@MisterRaindrop

Copy link
Copy Markdown
Contributor

Thanks for @plusplusjiajia contribution. There are a few minor issues with this PR.
I agree that the allowlist in DetectBuiltinFileIO covers the initial catalog-level FileIO selection from warehouse/io-impl.

My remaining concern is that later metadata/manifest/delete locations may reach the already-created FileIO directly, so
they are not necessarily guarded by DetectBuiltinFileIO. That said, I also agree that ResolvePath is not the right place to implement global scheme routing or rejection.

For the scope of this PR, this looks acceptable to me. I think we should follow up with a separate issue to discuss a higher-level resolving FileIO, similar to Java's ResolvingFileIO, for per-location scheme dispatch/enforcement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants