feat(arrow): resolve S3-compatible schemes in ArrowFileSystemFileIO#703
feat(arrow): resolve S3-compatible schemes in ArrowFileSystemFileIO#703plusplusjiajia wants to merge 4 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates iceberg::arrow::ArrowFileSystemFileIO::ResolvePath to better handle S3-compatible (and other “foreign”) URI schemes by resolving them to the underlying Arrow filesystem path, avoiding failures when the wrapped Arrow FileSystem only accepts its native scheme.
Changes:
- Adjusted
ArrowFileSystemFileIO::ResolvePathto fall back fromPathFromUri()to a scheme-less path when URI parsing fails. - Added a regression test ensuring reads succeed when a non-native scheme is used with the underlying filesystem.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/iceberg/arrow/arrow_io.cc | Implements URI scheme fallback behavior in ResolvePath. |
| src/iceberg/test/arrow_io_test.cc | Adds test coverage for resolving a foreign scheme to the underlying path. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (auto pos = file_location.find("://"); pos != std::string::npos) { | ||
| auto path = arrow_fs_->PathFromUri(file_location); | ||
| if (path.ok()) { | ||
| return path.ValueOrDie(); | ||
| } | ||
| // PathFromUri rejects S3-compatible schemes (s3a/s3n, gs://, oss://); | ||
| // fall back to the scheme-less bucket/key. | ||
| return file_location.substr(pos + 3); | ||
| } |
There was a problem hiding this comment.
Addressed: the fallback now triggers only on Arrow's scheme-mismatch error; other PathFromUri failures (malformed URI, unsupported authority, …) propagate. ?query/#fragment are stripped. Added tests for the propagate path and for preserving percent-encoded keys.
| } | ||
| // PathFromUri rejects S3-compatible schemes (s3a/s3n, gs://, oss://); | ||
| // fall back to the scheme-less bucket/key. | ||
| return file_location.substr(pos + 3); |
There was a problem hiding this comment.
I think this fallback is too broad. As implemented, any URI scheme rejected by PathFromUri will be stri
pped and interpreted as a path in the wrapped filesystem. For example, with an S3FileSystem, a mismatch ed URI such as file://bucket/prefix would be interpreted as bucket/prefix instead of failing fast.
Can we make the fallback explicit and narrow?
- The scheme should be in an allowlist for the wrapped filesystem. For S3FileSystem, I think we can
support well-known aliases such as s3a and s3n. Vendor schemes such as oss or gs should probably be opt- in or discussed separately, because they may also represent native OSS/GCS locations. - The PathFromUri failure should match the expected scheme-mismatch / unsupported-scheme case. Other errors, such as malformed URI, authority handling errors, or local-path errors, should still be returned to the caller.
In other words, this should be guarded by both a scheme allowlist and the expected error condition,
rather than falling back for every PathFromUri failure.
There was a problem hiding this comment.
@MisterRaindrop Agreed — the scheme allowlist belongs at the selection layer (like Java's ResolvingFileIO.SCHEME_TO_FILE_IO), so DetectBuiltinFileIO now accepts s3a/s3n. For the error condition: ResolvePath now only falls back on Arrow's scheme-mismatch error — any other PathFromUri failure (malformed URI, unsupported authority, …) is propagated — and it strips ?query/#fragment. oss/gs left for separate discussion, as you suggested.
0b65bf6 to
c6886c7
Compare
c6886c7 to
5c7656b
Compare
| } | ||
| // Scheme-less bucket/key, dropping any ?query / #fragment. | ||
| std::string bucket_key = file_location.substr(pos + 3); | ||
| return bucket_key.substr(0, bucket_key.find_first_of("?#")); |
There was a problem hiding this comment.
One subtle concern: s3:// and s3a:// currently use different parsing semantics. s3:// goes through Arrow's PathFromUri, while s3a/s3n use the fallback substring parser. This means percent-encoded keys may resolve differently.
example:
s3://bucket/a%20b -> Arrow PathFromUri -> bucket/a b
s3a://bucket/a%20b -> fallback -> bucket/a%20b
There was a problem hiding this comment.
@MisterRaindrop Good catch — fixed. The fallback now percent-decodes the key via arrow::util::UriUnescape, so s3a/s3n resolve identically to native s3:// (a%20b → a b).
|
|
||
| // Only fall back for Arrow's scheme-mismatch error; propagate anything else. | ||
| const auto& status = path.status(); | ||
| if (status.message().find("expected a URI with one of the schemes") == |
There was a problem hiding this comment.
This approach is problematic. It relies on Arrow's error messages, which will break if Arrow changes the message text. It does not simultaneously check status.code()
There was a problem hiding this comment.
@MisterRaindrop Good point — dropped the error-text match. ResolvePath now validates the URI with arrow::util::Uri::Parse (a malformed URI fails there and propagates) and then percent-decodes the scheme-less bucket/key, so it no longer depends on Arrow's message text or status code.
bc30e66 to
0f32c67
Compare
0f32c67 to
277d8ba
Compare
|
Thanks for @plusplusjiajia contribution. There are a few minor issues with this PR. My remaining concern is that later metadata/manifest/delete locations may reach the already-created FileIO directly, so For the scope of this PR, this looks acceptable to me. I think we should follow up with a separate issue to discuss a higher-level resolving FileIO, similar to Java's ResolvingFileIO, for per-location scheme dispatch/enforcement. |
ArrowFileSystemFileIO::ResolvePathpassed the full URI to the wrapped FileSystem'sPathFromUri, which only accepts that FileSystem's native scheme(s3://forS3FileSystem). S3-compatible object stores are commonly addressed with other schemes —s3a/s3n, or vendor schemes such asgs://(GCS) andoss://(Alibaba OSS) — served by anarrow::fs::S3FileSystemconfigured with anendpoint_override. Every read and write of such a location failed with: