Delta Sharing CLI #863

Open
zhu-tom wants to merge 6 commits into main from zhu-tom/ds-cli

Conversation

@zhu-tom
Collaborator

@zhu-tom zhu-tom commented Mar 19, 2026

Summary

  • Adds a zero-dependency Python CLI for interacting with any Delta Sharing server via the open protocol
  • Covers all 10 protocol endpoints: shares, schemas, tables, metadata, queries, Change Data Feed, and temporary credentials
  • Supports the protocol-spec profile file format for instant setup, plus INI-style named profiles for multi-environment use

Motivation

Delta Sharing is an open protocol, but there is no official CLI in the ecosystem. The existing delta-sharing Python package is a library that requires writing
code and importing pandas or Spark. For anyone who wants to quickly explore shared data, debug server behavior, or script against the API, the only option today
is hand-crafting curl commands — with long URLs, manual auth headers, and ndjson responses that break standard JSON tools.

Every major open protocol has a CLI (HTTP has curl, S3 has aws s3, gRPC has grpcurl). Delta Sharing should too.

What's included

CLI (cli.py) — single-file, pure Python 3.6+ stdlib. No external dependencies.

Command                            Protocol endpoint
delta-sharing shares list          GET /shares
delta-sharing shares get           GET /shares/{share}
delta-sharing schemas list         GET /shares/{share}/schemas
delta-sharing tables list          GET /shares/{share}/schemas/{schema}/tables
delta-sharing tables list-all      GET /shares/{share}/all-tables
delta-sharing tables version       GET .../tables/{table}/version
delta-sharing tables metadata      GET .../tables/{table}/metadata
delta-sharing tables query         POST .../tables/{table}/query
delta-sharing tables changes       GET .../tables/{table}/changes
delta-sharing tables credentials   POST .../temporary-table-credentials
delta-sharing configure            Set up named profiles
delta-sharing profiles             List configured profiles

Key features:

  • Profile file import (delta-sharing configure -f profile.json) — hand someone a profile, they're connected in one command
  • Named profiles in ~/.delta-sharing.cfg (INI format) with -P <name> switching
  • Auto-pagination (-a) for all list endpoints
  • Verbose mode (-V) — prints full request/response to stderr with tokens masked, for debugging and bug reports
  • ndjson responses automatically parsed into readable JSON arrays
  • SSL verification skip (-k) for testing environments, configurable per-profile

Tests (test_cli.py) — 75 unit tests covering URL building, config read/write, profile loading, connection resolution, argument parsing for every command and
flag.

Packaging (setup.py): pip install -e . registers the delta-sharing console script.

Example: before and after

Before (curl):

curl -s -k -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "delta-sharing-capabilities: responseformat=delta;readerfeatures=deletionvectors" \
  -d '{"predicateHints":["region = '\''us-east-1'\''"],"limitHint":100,"version":5}' \
  "https://host/api/2.0/delta-sharing/metastores/$ID/shares/myshare/schemas/default/tables/mytable/query" \
  | python3 -c "import sys,json;[print(json.dumps(json.loads(l),indent=2)) for l in sys.stdin if l.strip()]"

After (CLI):

  delta-sharing tables query myshare default mytable \
    -l 100 -v 5 --predicate-hints "region = 'us-east-1'" \
    --response-format delta --reader-features deletionvectors

Test plan

  • python3 -m unittest test_cli -v — 75 tests pass
  • Manual testing against live Delta Sharing server (list shares, schemas, tables, metadata, query, CDF)
  • Profile file import and named profile switching verified
  • Verbose mode and SSL skip verified against staging endpoints

Comment thread cli/test_cli.py
@@ -104,6 +104,14 @@ def test_roundtrip(self):
result = cli._read_cfg()
self.assertEqual(result, sections)

Collaborator Author

Seems somehow this was accidentally added in my server change PR? Please review the full file

./python/dev/lint-python
./python/dev/pytest
python -m pip install -e ./cli
cd cli && python -m unittest discover -v -p 'test*.py'
Collaborator

Why test it through Python rather than a standalone build?

Comment thread cli/README.md
```bash
delta-sharing tables metadata my_share my_schema my_table
delta-sharing tables metadata my_share my_schema my_table \
--response-format delta --reader-features deletionvectors
Collaborator

What if there are multiple reader features?

The value deletionvectors doesn't seem to match the real feature name.

Comment thread cli/README.md
### tables query

```bash
delta-sharing tables query my_share my_schema my_table
Collaborator

Does query support auto resolution?

@littlegrasscao
Collaborator

Code review (posted on behalf of review pass)

Overall

The CLI is a strong addition: stdlib-only, sensible auth resolution order, 600 permissions on the INI file, masked Authorization in verbose mode, timeouts, User-Agent, and solid unit coverage for URLs, config, connection resolution, argparse, and headers.


1. Protocol gaps: query pagination / EndStreamAction

tables query issues a single POST and parses NDJSON. The protocol describes EndStreamAction with nextPageToken for paginated queries and refresh tokens for long scans (see PROTOCOL.md — Delta Sharing Capabilities / EndStreamAction).

Suggestion: Document that the CLI returns only the first HTTP response body, and either implement follow-up requests when endStreamAction contains nextPageToken, or expose flags aligned with includeEndStreamAction / refresh behavior so users know this is not a full protocol client for large/paginated reads.
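
To make the follow-up option concrete, here is a minimal sketch of detecting a continuation token in an already-parsed NDJSON response. It assumes the last line has the shape {"endStreamAction": {"nextPageToken": ...}} as described in PROTOCOL.md; next_page_token is a hypothetical helper, not a function in cli.py, and the sample body is made up.

```python
import json


def next_page_token(actions):
    """Return the nextPageToken from an endStreamAction line, or None.

    `actions` is the list of dicts produced by parsing the NDJSON body.
    Assumption: the token lives at action["endStreamAction"]["nextPageToken"].
    """
    for action in reversed(actions):
        end = action.get("endStreamAction") if isinstance(action, dict) else None
        if isinstance(end, dict):
            # Treat a missing or empty token as "no further pages".
            return end.get("nextPageToken") or None
    return None


# Illustrative response body (made-up values, not real server output).
body = "\n".join([
    '{"protocol": {"minReaderVersion": 1}}',
    '{"file": {"id": "f1"}}',
    '{"endStreamAction": {"nextPageToken": "abc"}}',
])
actions = [json.loads(line) for line in body.split("\n")]
token = next_page_token(actions)
if token:
    print(f"more results available; nextPageToken={token}")
```

Even if the CLI never chases the token itself, printing a hint like this on stderr would tell users the output is partial.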


2. README / --reader-features (follow-up to existing review thread)

PROTOCOL states values are comma-separated inside the capability, e.g. readerfeatures=deletionvectors,columnmapping, and that matching is case-insensitive. The example deletionvectors matches the spec examples; it is not the same spelling as Delta table feature IDs in the Delta repo, which is intentional for this header.

Suggestion: In cli/README.md, add one line showing multiple features, e.g. --reader-features "deletionvectors,columnmapping", and link to the protocol section so readers are not confused by a single-token example.
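
For illustration, a small sketch of how a multi-value --reader-features argument could be folded into the delta-sharing-capabilities header, following the responseformat=...;readerfeatures=a,b layout quoted above. build_capabilities is a hypothetical name and may not match how cli.py builds the header.

```python
def build_capabilities(response_format=None, reader_features=None):
    """Build a delta-sharing-capabilities header value.

    reader_features may be a comma-separated string or a list of names.
    """
    parts = []
    if response_format:
        parts.append(f"responseformat={response_format}")
    if reader_features:
        if isinstance(reader_features, str):
            reader_features = reader_features.split(",")
        # Matching is case-insensitive per the spec, so normalize to lowercase.
        names = [f.strip().lower() for f in reader_features if f.strip()]
        if names:
            parts.append("readerfeatures=" + ",".join(names))
    return ";".join(parts)


print(build_capabilities("delta", "deletionvectors,columnmapping"))
# responseformat=delta;readerfeatures=deletionvectors,columnmapping
```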


3. “Auto resolution” / auto-pagination for tables query

List commands have -a / auto-pagination via JSON nextPageToken. tables query does not.

Suggestion: Clarify in the README that “auto-pagination” applies to list endpoints (shares, schemas, tables, list-all), not to POST .../query. If query-side behavior is added later, spell out whether it would chase nextPageToken from endStreamAction or only surface the token for manual follow-up.


4. Python version mismatch

cli/cli.py module docstring still says “Python 3.6+”; pyproject.toml has requires-python = ">=3.8".

Suggestion: Align the docstring (and any other docs) with 3.8+.


5. Apache license headers

Other Python artifacts in the repo use the standard Apache 2.0 file header (e.g. python/delta_sharing/__init__.py). cli.py and test_cli.py do not.

Suggestion: Add the usual Delta Lake / Apache boilerplate to new Python files for license consistency.


6. Distribution vs. console script naming

pyproject.toml uses name = "delta-sharing-cli" while the entry point is delta-sharing. That avoids PyPI name collision with the existing delta-sharing connector package. The connector does not define console_scripts, so the script name is unlikely to clash today.

Suggestion: In README or packaging notes, briefly state that the installable distribution is delta-sharing-cli but the command is delta-sharing, and that this is intentional next to the delta-sharing library on PyPI.


7. CI choice (“why not a standalone build?”)

The package is stdlib-only; pip install -e ./cli + unittest validates the artifact users actually run without introducing Rust/maturin-style builds.

Note: A standalone binary would be a separate deliverable (e.g. PyInstaller); the current approach matches “zero third-party deps” and keeps CI minimal.


8. Small robustness / UX nits (optional)

  • _parse_ndjson: Uses decode() without errors="replace"; malformed UTF-8 in a response could hard-fail; consider defensive decoding or clearer errors.
  • HTTP error bodies: If the error body parses to a non-dict JSON value, err["httpStatus"] = e.code can throw; rare but possible.
  • Module name cli: Fine for a small package; renaming to a less generic top-level name is only worth it if shadowing becomes an issue.

9. Tests

Coverage for URL building, config, resolution, and argparse is strong. There are still no tests for _request / HTTP layering (would need a local stub or urllib mocking)—acceptable for v1 but worth a follow-up for redirects, chunked responses, or error shapes.
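
As a starting point for that follow-up, a sketch of the urllib-mocking approach using only the standard library. fetch_json is a hypothetical stand-in for _request (whose real signature and return shape may differ), and the URL is a placeholder.

```python
import io
import json
import unittest
import urllib.error
import urllib.request
from email.message import Message
from unittest import mock


def fetch_json(url, token):
    """Hypothetical stand-in for the CLI's _request helper."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status, json.loads(resp.read())
    except urllib.error.HTTPError as e:
        return e.code, e.read().decode("utf-8", errors="replace")


class FakeResponse(io.BytesIO):
    """Just enough of an HTTP response: read(), status, context manager."""
    status = 200


class RequestLayerTest(unittest.TestCase):
    URL = "https://example.invalid/shares"

    def test_success_body_is_parsed(self):
        fake = FakeResponse(b'{"items": []}')
        with mock.patch("urllib.request.urlopen", return_value=fake) as urlopen:
            self.assertEqual(fetch_json(self.URL, "t"), (200, {"items": []}))
        # The Request object handed to urlopen carries the bearer token.
        sent = urlopen.call_args[0][0]
        self.assertEqual(sent.get_header("Authorization"), "Bearer t")

    def test_http_error_body_is_returned(self):
        err = urllib.error.HTTPError(
            self.URL, 403, "Forbidden", Message(), io.BytesIO(b"denied"))
        with mock.patch("urllib.request.urlopen", side_effect=err):
            self.assertEqual(fetch_json(self.URL, "t"), (403, "denied"))
```

Saved in a test module, this runs under the same python -m unittest invocation the PR already uses, with no network access.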


(Automated assistant draft; posted by @littlegrasscao — edit or resolve as you see fit.)

Collaborator

@littlegrasscao littlegrasscao left a comment

Additional line-level review (does not repeat the earlier summary comment on pagination, README reader-features, doc/Python version alignment, license headers, PyPI naming, CI rationale, or generic test gaps).

Comment thread cli/cli.py
return token[:4] + "..." + token[-4:]

_SECTION_RE = re.compile(r"^\[([^\]]+)\]\s*$")
_KV_RE = re.compile(r"^(\w+)\s*=\s*(.*)$")
Collaborator

_KV_RE only accepts keys made of word characters (\w+: letters, digits, and underscore). That rejects INI keys with hyphens, and non-matching lines are silently ignored. Consider documenting the supported key syntax in the README, or relaxing the pattern if you want parity with common INI dialects.

Comment thread cli/cli.py
Delta Sharing CLI — interact with a Delta Sharing server from the command line.

Connection resolution (first applicable branch wins; see resolve_connection):
1. Both --endpoint and --token set (explicit credentials only)
Collaborator

Minor inconsistency: the module docstring describes step 1 as when both --endpoint and --token are set, which matches the code, but readers may not realize that supplying only one of them still falls through to profile/cfg resolution. A clarifying phrase ("both required to skip profile resolution") would help.

Comment thread cli/cli.py
for k, v in e.headers.items():
_debug(f"< {k}: {v}")
_debug(f"< Body: {body_bytes.decode(errors='replace')[:2000]}")
try:
Collaborator

If the error response body is valid JSON but not an object (e.g. a quoted string or an array), err["httpStatus"] = e.code raises TypeError. Consider normalizing to a dict first, e.g. err = err if isinstance(err, dict) else {"message": err}.
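
A sketch of that normalization as a standalone function. normalize_error is a hypothetical name; the httpStatus and message keys follow the snippet above, and the rest of the error shape is an assumption.

```python
import json


def normalize_error(body_bytes, status):
    """Turn any HTTP error body into a dict carrying httpStatus."""
    text = body_bytes.decode("utf-8", errors="replace")
    try:
        err = json.loads(text)
    except ValueError:
        # Not JSON at all (HTML page, plain text): keep the raw text.
        err = {"message": text}
    if not isinstance(err, dict):
        # Valid JSON, but a string, list, or number rather than an object.
        err = {"message": err}
    err["httpStatus"] = status
    return err


print(normalize_error(b'"access denied"', 403))
# {'message': 'access denied', 'httpStatus': 403}
```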

Comment thread cli/cli.py
def _url(base, *parts, **query):
    path = "/".join(urllib.parse.quote(p, safe="") for p in parts)
    url = f"{base.rstrip('/')}/{path}"
    qs = {k: str(v) for k, v in query.items() if v is not None}
Collaborator

Query values use str(v). For CDF, --include-historical-metadata stays a string from argparse today (good); if this ever became a boolean you would emit True/False in the query string instead of true/false. Minor footgun to keep in mind.

Comment thread cli/cli.py
def _parse_ndjson(raw_bytes):
    """Parse newline-delimited JSON into a list of objects."""
    lines = raw_bytes.decode().strip().splitlines()
    return [json.loads(line) for line in lines if line.strip()]
Collaborator

decode() uses strict UTF-8. NDJSON with invalid UTF-8 raises UnicodeDecodeError (surfacing as generic Unexpected error unless -V). Consider decode(errors="replace") or a dedicated error message.
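
A sketch of the "dedicated error message" option. parse_ndjson is a hypothetical replacement, not the code in the PR; it raises ValueError so the caller can decide how to report it. It also splits on "\n" only, since str.splitlines() breaks on characters such as U+2028 that may appear unescaped inside JSON strings.

```python
import json


def parse_ndjson(raw_bytes):
    """Parse newline-delimited JSON, raising ValueError with context on bad input."""
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError as e:
        raise ValueError(f"response is not valid UTF-8 (byte offset {e.start})") from e
    records = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # tolerate blank lines and a trailing newline
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as e:
            raise ValueError(f"line {lineno} is not valid JSON: {e.msg}") from e
    return records


print(parse_ndjson(b'{"a": 1}\r\n\n{"b": 2}\n'))
# [{'a': 1}, {'b': 2}]
```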

Comment thread cli/cli.py
_, _, body = _request("GET", url, token, extra_headers=extra_headers)
data = json.loads(body)
all_items.extend(data.get("items", []))
page_token = data.get("nextPageToken")
Collaborator

PROTOCOL allows nextPageToken to be an empty string meaning no more pages. if not page_token matches that. Optional: a one-line comment in code tying this to the spec would help future readers.
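
For reference, the loop with that comment in place might read roughly as follows. This is a sketch: list_all and the fetch_page callback are hypothetical and stand in for the _request call in the diff.

```python
def list_all(fetch_page):
    """Collect items across pages.

    fetch_page(page_token) returns the decoded JSON dict for one page;
    page_token is None for the first request.
    """
    items, page_token = [], None
    while True:
        data = fetch_page(page_token)
        items.extend(data.get("items", []))
        page_token = data.get("nextPageToken")
        # PROTOCOL.md: a missing or empty nextPageToken means no more pages.
        if not page_token:
            return items


# Two fake pages; the last one ends with an empty-string token.
pages = {
    None: {"items": ["share1", "share2"], "nextPageToken": "p2"},
    "p2": {"items": ["share3"], "nextPageToken": ""},
}
print(list_all(pages.__getitem__))
# ['share1', 'share2', 'share3']
```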

Comment thread cli/cli.py
"tables", args.table, "version",
startingTimestamp=args.starting_timestamp)
_, hdrs, _ = _request("GET", url, token, extra_headers=xh)
version = hdrs.get("Delta-Table-Version") or hdrs.get("delta-table-version")
Collaborator

int(version) still throws if the header is present but non-numeric. Consider try/except and a JSON error on stderr for consistency with _die.
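
A sketch of the guarded parse. parse_table_version is a hypothetical helper that raises ValueError, which the caller could route through _die; how _die formats its output is not shown in this thread, so that part is left out.

```python
def parse_table_version(headers):
    """Extract Delta-Table-Version as an int from a header mapping."""
    # Lowercase the names once so either header spelling resolves.
    lowered = {k.lower(): v for k, v in headers.items()}
    raw = lowered.get("delta-table-version")
    if raw is None:
        raise ValueError("response has no Delta-Table-Version header")
    try:
        return int(raw)
    except ValueError:
        raise ValueError(f"Delta-Table-Version is not an integer: {raw!r}") from None


print(parse_table_version({"Delta-Table-Version": "17"}))
# 17
```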

Comment thread cli/cli.py
if args.insecure:
    profile["insecure"] = "true"
else:
    insecure = input("Skip SSL verification (y/N): ").strip().lower()
Collaborator

After configure -f profile.json, you still prompt for SSL skip unless -k. When stdin is not a TTY, input() reads the next line of stdin (often empty). Document that non-interactive flows should pass -k when needed, or skip the prompt when not sys.stdin.isatty() and default to secure.
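
The second option could look something like this. ask_insecure is a hypothetical helper; the stdin and prompt parameters exist only so the behavior can be exercised without a terminal.

```python
import io
import sys


def ask_insecure(flag_set, stdin=None, prompt=input):
    """Decide whether to skip SSL verification.

    Returns True if -k was given, False without prompting when stdin is
    not a TTY (secure default), otherwise asks the user.
    """
    stdin = sys.stdin if stdin is None else stdin
    if flag_set:
        return True
    if not stdin.isatty():
        return False
    answer = prompt("Skip SSL verification (y/N): ").strip().lower()
    return answer in ("y", "yes")


# A piped or empty stdin is not a TTY, so no prompt is shown.
print(ask_insecure(False, stdin=io.StringIO("")))
# False
```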

Comment thread cli/test_cli.py
if os.name == "nt":
    self.skipTest("POSIX file modes")
sections = {"default": {"endpoint": "https://h", "token": "t"}}
cli._write_cfg(sections)
Collaborator

import cli assumes tests run with cli/ on sys.path (CI uses cd cli). Running from repo root without PYTHONPATH can import the wrong package. A short note for contributors would help.

Comment thread cli/README.md
@@ -0,0 +1,309 @@
# delta-sharing CLI

A command-line tool for interacting with [Delta Sharing](https://github.com/delta-io/delta-sharing) servers. Covers all endpoints in the [protocol spec](https://github.com/delta-io/delta-sharing/blob/main/PROTOCOL.md). **Zero third-party dependencies** — Python 3.8+ and the standard library only.
Collaborator

Says the tool covers all protocol endpoints. If optional flows (vendor URL prefixes, full EndStreamAction / refresh continuation, etc.) are intentionally unsupported, one sentence would avoid over-promise.

conda activate test-environment
./python/dev/lint-python
./python/dev/pytest
python -m pip install -e ./cli
Collaborator

CLI tests ride on the full conda + connector job. A tiny standalone setup-python job could isolate CLI-only failures and mirror how many users install the tool.

Comment thread cli/cli.py
val = _mask_token(v) if k.lower() == "authorization" else v
_debug(f"> {k}: {val}")
if data:
_debug(f"> Body: {data.decode()}")
Collaborator

Verbose path uses data.decode() without errors=replace. Non-UTF8 JSON body will raise here while logging; consider matching the lenient decode used for HTTP error bodies.

Comment thread cli/cli.py
url += ("&" if "?" in url else "?") + urllib.parse.urlencode(qs)

_, _, body = _request("GET", url, token, extra_headers=extra_headers)
data = json.loads(body)
Collaborator

json.loads(body) on list responses will raise JSONDecodeError on unexpected HTML/plain-text error pages (e.g. proxy/gateway). Could catch and route through _die with a short hint to use -V or check endpoint URL.
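
Roughly what that guard could look like. parse_list_response is a hypothetical wrapper; it raises ValueError with the hint text and leaves the hand-off to _die to the caller.

```python
import json


def parse_list_response(body, url):
    """Decode a list-endpoint body, raising ValueError with a hint on failure."""
    try:
        data = json.loads(body)
    except ValueError as e:
        # JSONDecodeError and UnicodeDecodeError both subclass ValueError.
        raise ValueError(
            f"response from {url} is not JSON; re-run with -V or check the endpoint URL"
        ) from e
    if not isinstance(data, dict):
        raise ValueError(f"response from {url} is JSON but not an object")
    return data


try:
    parse_list_response(b"<html>502 Bad Gateway</html>", "https://host/shares")
except ValueError as err:
    print(err)
```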

Collaborator

@linzhou-db linzhou-db left a comment

looking good!
