
feat(data-drains): add GCS, Azure Blob, BigQuery, Snowflake, and Datadog destinations#4552

Open
waleedlatif1 wants to merge 3 commits into staging from waleedlatif1/abuja-v1

Conversation

@waleedlatif1
Collaborator

Summary

  • Add 5 new data-drain destinations: Google Cloud Storage, Azure Blob Storage, Google BigQuery, Snowflake, and Datadog Logs
  • Each destination implements test() + openSession()/deliver() with provider-spec-correct auth, retry, byte-accurate size guards, and abort-signal forwarding
  • Snowflake: key-pair JWT (with account-suffix stripping), 202-async polling, identifier quoting, PARSE_JSON bindings, 16 MB VARIANT guard
  • BigQuery: tabledata.insertAll with drainId-prefixed insertId dedup, partial-failure surfacing, 401 token refresh + 5xx/429 retry
  • Datadog: v2 logs intake with gzip, per-entry 1 MB + per-request 5 MB / 1000-entry guards, all sites including ap2
  • GCS: JSON API media uploads with shared retry helper, GCS-spec bucket-name validation (rejects goog/google)
  • Azure Blob: @azure/storage-blob SDK with sovereign-cloud endpointSuffix support
  • Settings UI: form specs for each destination, icons, search/UI polish
  • Docs: full destination sections in enterprise/data-drains.mdx
  • 57/57 destination tests pass; `tsc --noEmit` clean
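A minimal sketch of the drainId-prefixed insertId scheme described above (the row type and helper name are illustrative, not the PR's actual code; the payload shape follows BigQuery's tabledata.insertAll request format):

```typescript
// Hypothetical row type; real entries come from the drain's NDJSON chunks.
interface LogRow {
  seq: number;
  payload: Record<string, unknown>;
}

// Build an insertAll body whose insertId is prefixed with the drain id, so
// BigQuery's best-effort dedup treats retried chunks as the same logical rows.
function buildInsertAllBody(drainId: string, rows: LogRow[]) {
  return {
    kind: 'bigquery#tableDataInsertAllRequest',
    rows: rows.map((row) => ({
      insertId: `${drainId}-${row.seq}`,
      json: row.payload,
    })),
  };
}
```

Because a retried chunk reuses identical insertIds, duplicates landing within BigQuery's dedup window are dropped server-side rather than double-counted.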

Type of Change

  • New feature

Testing

Tested manually. New unit tests cover schema validation, retry paths, byte-accurate size guards, gzip, sovereign-cloud routing, and partial-failure handling per destination.
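The byte-accurate guards matter because a string's `.length` counts UTF-16 code units, not encoded bytes. A sketch of the check (helper name is illustrative; the 1 MB cap mirrors the Datadog per-entry limit mentioned above):

```typescript
const MAX_ENTRY_BYTES = 1024 * 1024; // per-entry cap, e.g. Datadog's 1 MB

// Byte-accurate check: multi-byte UTF-8 characters make byteLength > length,
// so a length-based guard would under-count and let oversized entries through.
function exceedsEntryLimit(entry: string): boolean {
  return Buffer.byteLength(entry, 'utf8') > MAX_ENTRY_BYTES;
}
```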

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel Bot commented May 11, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Skipped | Skipped | May 11, 2026 3:32am |


@cursor

cursor Bot commented May 11, 2026

PR Summary

Medium Risk
Adds multiple new external delivery backends plus API validation and a DB enum migration; risk is mainly around new auth/retry/size-limit logic and correct handling of partial failures/timeouts when exporting production logs.

Overview
Expands Enterprise Data Drains beyond s3/webhook to support gcs, azure_blob, bigquery, snowflake, and datadog destinations end-to-end (UI selection/forms, contract validation, server destination registry, and docs).

Introduces new destination implementations with provider-specific auth and delivery behavior (object uploads for GCS/Azure, streaming inserts for BigQuery, SQL API inserts + async polling for Snowflake, log intake posting + optional gzip for Datadog) plus extensive unit tests covering retries, limits, and error surfacing.

Updates the API dataDrainDestinationBodySchema/response schemas and DESTINATION_TYPES, adds icons/UI labels for the new destination types, and applies a DB migration to extend the data_drain_destination enum accordingly.
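The retry behavior summarized above (5xx/429 retry with backoff) might be classified roughly as follows; this is an illustrative sketch, not the PR's actual helper:

```typescript
// Retry transient statuses only: 429 (rate limiting) and any 5xx.
function shouldRetry(status: number): boolean {
  return status === 429 || (status >= 500 && status < 600);
}

// Capped exponential backoff with full jitter, so concurrent retries
// spread out instead of hammering the provider in lockstep.
function backoffMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * exp);
}
```

401 is deliberately not in the retryable set: per the summary it triggers a token refresh rather than a blind retry.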

Reviewed by Cursor Bugbot for commit f56c6a2.

Comment thread apps/sim/lib/api/contracts/data-drains.ts Outdated
@greptile-apps
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR adds five new data-drain destinations — GCS, Azure Blob, BigQuery, Snowflake, and Datadog — each with full test()/openSession()/deliver() implementations, API contract schemas, UI form specs, DB migration, and unit tests. The additions are generally well-structured and follow patterns established by the existing S3 and webhook destinations.

  • Snowflake & GCS: Both destinations still use bare sleep() from @sim/utils/helpers during retry backoff, which does not honor the abort signal. BigQuery had this same defect and it was explicitly fixed in this PR via sleepUntilAborted — Snowflake (up to 5 s delay) and GCS (up to 30 s delay) need the same treatment.
  • Datadog: parseNdjson splits on '\n' only, unlike BigQuery and Snowflake which use /\r?\n/; CRLF line endings would leave a trailing \r on every line and cause JSON.parse to fail for the entire delivery chunk.
  • Azure Blob & BigQuery: No new issues found; abort-signal forwarding, naming validation, and retry logic are correct.
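An abort-aware sleep along the lines of the sleepUntilAborted fix referenced above (a sketch; the PR's actual helper may differ in signature and error type):

```typescript
// Resolve after ms, or reject as soon as the signal fires — so a cancelled
// drain run never sits out the full backoff delay.
function sleepUntilAborted(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error('aborted'));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(new Error('aborted'));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener('abort', onAbort);
      resolve();
    }, ms);
    signal?.addEventListener('abort', onAbort, { once: true });
  });
}
```

Swapping this in for a bare sleep() in the Snowflake and GCS retry loops would bound the cancellation lag to roughly zero instead of the full backoff window.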

Confidence Score: 4/5

Three of the five new destinations have correctness defects that would leave active drain runs unresponsive to cancellation or break delivery for CRLF bodies; fixes are small and isolated.

Snowflake and GCS retry loops call sleep() without checking the abort signal mid-wait — a cancellation during backoff goes undetected for up to 30 s in GCS and 5 s in Snowflake, despite this exact pattern being fixed for BigQuery in this same PR. Datadog's parseNdjson only splits on '\n', so any body with CRLF line endings will fail JSON parsing for every entry. All three defects are in newly added code on active delivery paths.

gcs.ts (30 s abort lag in fetchWithRetry), snowflake.ts (5 s abort lag in executeStatement and pollStatement), and datadog.ts (CRLF handling in parseNdjson).
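The CRLF defect is a one-token regex fix. A sketch of a corrected parser (the function name comes from the review; the body is illustrative):

```typescript
// Split on LF or CRLF so a trailing '\r' never reaches JSON.parse,
// and skip blank lines so trailing newlines don't produce parse errors.
function parseNdjson(body: string): unknown[] {
  return body
    .split(/\r?\n/)
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
}
```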

Important Files Changed

  • apps/sim/lib/data-drains/destinations/snowflake.ts: Snowflake destination with JWT key-pair auth, 202 async polling, and VARIANT guards — but retry backoff still uses bare sleep() that ignores the abort signal (same issue fixed in bigquery.ts).
  • apps/sim/lib/data-drains/destinations/bigquery.ts: BigQuery streaming insertAll with dedup insertIds, partial-failure surfacing, 401 token refresh, and sleepUntilAborted for abort-aware retry backoff — implementation looks correct.
  • apps/sim/lib/data-drains/destinations/gcs.ts: GCS JSON API uploads with shared retry helper — but fetchWithRetry uses bare sleep() during backoff instead of abort-aware waiting, allowing up to 30 s of unresponsive delay when a drain run is cancelled.
  • apps/sim/lib/data-drains/destinations/datadog.ts: Datadog v2 logs intake with gzip, per-entry/per-request guards, and sleepUntilAborted — but parseNdjson splits only on '\n' (not /\r?\n/), unlike every other destination, causing JSON parse failures on CRLF bodies.
  • apps/sim/lib/data-drains/destinations/azure_blob.ts: Azure Blob destination using the SDK with sovereign-cloud endpoint suffix support, validated naming regexes, and abort signal forwarding — looks correct.
  • apps/sim/lib/api/contracts/data-drains.ts: API contract schemas for all five new destinations added correctly, with matching response schemas; ap2 Datadog site is now included.
  • packages/db/migrations/0205_public_lord_hawal.sql: Adds five new enum values to the data_drain_destination Postgres enum using ADD VALUE ... BEFORE 'webhook' — correct migration pattern for Postgres enums.
  • apps/sim/ee/data-drains/destinations/registry.tsx: UI form specs for all five new destinations including Datadog site Combobox and Textarea for SA JSON — complete and consistent with existing patterns.

Sequence Diagram

sequenceDiagram
    participant Driver as Drain Driver
    participant Dest as Destination (GCS/SF/BQ/DD/Az)
    participant Remote as Remote API

    Driver->>Dest: openSession()
    Dest-->>Driver: "{ deliver, close }"

    loop per chunk
        Driver->>Dest: "deliver({ body, metadata, signal })"
        alt GCS / Azure Blob
            Dest->>Remote: PUT object (NDJSON blob)
            Remote-->>Dest: 2xx
        else BigQuery
            Dest->>Remote: tabledata.insertAll (JSON rows)
            Remote-->>Dest: 200 + optional insertErrors
        else Snowflake
            Dest->>Remote: POST /api/v2/statements
            Remote-->>Dest: 200 (sync) or 202 (async)
            opt 202 async
                loop poll
                    Dest->>Remote: "GET /api/v2/statements/{handle}"
                    Remote-->>Dest: 202 (still running) or 200 (done)
                end
            end
        else Datadog
            Dest->>Remote: POST /api/v2/logs (gzip optional)
            Remote-->>Dest: 202 Accepted
        end
        Dest-->>Driver: "{ locator }"
    end

    Driver->>Dest: close()
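The session contract the diagram implies can be sketched as a small TypeScript interface with an in-memory stand-in (shapes inferred from the diagram; names are illustrative, not the PR's actual types):

```typescript
interface DeliverInput {
  body: string;
  metadata: Record<string, string>;
  signal?: AbortSignal;
}

interface DrainSession {
  deliver(input: DeliverInput): Promise<{ locator: string }>;
  close(): Promise<void>;
}

interface DrainDestination {
  test(): Promise<void>;
  openSession(): Promise<DrainSession>;
}

// In-memory stand-in for a remote destination, handy for driver tests:
// deliver() records the chunk and returns a locator, like the real ones.
function createMemoryDestination(): DrainDestination & { store: string[] } {
  const store: string[] = [];
  return {
    store,
    async test() {},
    async openSession() {
      return {
        async deliver({ body, signal }: DeliverInput) {
          if (signal?.aborted) throw new Error('aborted');
          store.push(body);
          return { locator: `memory://chunk-${store.length - 1}` };
        },
        async close() {},
      };
    },
  };
}
```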

Reviews (2): Last reviewed commit: "fix(data-drains): address PR review comm..."

Comment thread apps/sim/ee/data-drains/destinations/registry.tsx Outdated
Comment thread apps/sim/lib/data-drains/destinations/bigquery.ts Outdated
Comment thread apps/sim/lib/data-drains/destinations/gcs.ts
@waleedlatif1
Collaborator Author

@greptile

@waleedlatif1
Collaborator Author

@cursor review

Comment thread apps/sim/lib/data-drains/destinations/snowflake.ts
Comment thread apps/sim/lib/data-drains/destinations/datadog.ts

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 4 potential issues.



Comment thread apps/sim/lib/data-drains/destinations/gcs.ts Outdated
Comment thread apps/sim/lib/data-drains/destinations/snowflake.ts Outdated
Comment thread apps/sim/lib/data-drains/destinations/snowflake.ts Outdated
Comment thread apps/sim/lib/data-drains/destinations/bigquery.ts Outdated