
Commit 96a6fc0

Add BigQuery adapter with emulator-based integration tests
- Create BigQueryAdapter in sidemantic/db/bigquery.py
  - Uses google-cloud-bigquery client
  - Supports bigquery://project_id/dataset_id URL format
  - BigQueryResult wrapper for DuckDB-compatible API
  - Arrow support via to_arrow()
- Add bigquery optional dependency to pyproject.toml
  - google-cloud-bigquery>=3.0.0
  - pyarrow>=14.0.0
- Update SemanticLayer to recognize bigquery:// URLs
- Add BigQueryConnection to config.py
- Add tests:
  - test_bigquery_adapter.py: Basic adapter tests (import, URL parsing)
  - test_bigquery_integration.py: 8 integration tests against emulator
- Add BigQuery emulator to docker-compose.yml
  - Uses ghcr.io/goccy/bigquery-emulator:latest
  - Runs on port 9050
- Update integration.yml workflow
  - Add bigquery-integration job with emulator service
- Update documentation in tests/db/README.md

Regular tests: 570 passed, 3 skipped, 18 deselected (10 postgres + 8 bigquery)
1 parent 865a423 commit 96a6fc0
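The headline change is the new `bigquery://` connection scheme. A minimal sketch of how it is meant to be used (the project and dataset IDs are placeholders; the import path follows the module modified in this commit):

```python
from sidemantic.core.semantic_layer import SemanticLayer

# bigquery:// URLs now route to BigQueryAdapter, and the SQL dialect
# is inferred as "bigquery" unless overridden.
# "my-project" and "my_dataset" are placeholder IDs.
layer = SemanticLayer("bigquery://my-project/my_dataset")
```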

10 files changed

Lines changed: 957 additions & 8 deletions


.github/workflows/integration.yml

Lines changed: 37 additions & 1 deletion
```diff
@@ -43,4 +43,40 @@ jobs:
         env:
           POSTGRES_TEST: "1"
           POSTGRES_URL: "postgres://test:test@localhost:5432/sidemantic_test"
-        run: uv run pytest -m integration -v
+        run: uv run pytest -m integration tests/db/test_postgres_integration.py -v
+
+  bigquery-integration:
+    runs-on: ubuntu-latest
+
+    services:
+      bigquery:
+        image: ghcr.io/goccy/bigquery-emulator:latest
+        ports:
+          - 9050:9050
+        options: >-
+          --health-cmd "grpc_health_probe -addr=:9050"
+          --health-interval 10s
+          --health-timeout 5s
+          --health-retries 5
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          enable-cache: true
+
+      - name: Set up Python
+        run: uv python install 3.12
+
+      - name: Install dependencies
+        run: uv sync --extra bigquery --extra dev
+
+      - name: Run BigQuery integration tests
+        env:
+          BIGQUERY_TEST: "1"
+          BIGQUERY_EMULATOR_HOST: "localhost:9050"
+          BIGQUERY_PROJECT: "test-project"
+          BIGQUERY_DATASET: "test_dataset"
+        run: uv run pytest -m integration tests/db/test_bigquery_integration.py -v
```
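The `BIGQUERY_TEST` flag is what lets plain `pytest` runs skip these tests. The gating code itself isn't part of this diff, but the usual pattern looks something like this (a sketch, not the repo's actual conftest):

```python
import os

import pytest

# Mark the whole module as integration-only and skip it unless the
# BIGQUERY_TEST flag set in CI above is present.
pytestmark = [
    pytest.mark.integration,
    pytest.mark.skipif(os.environ.get("BIGQUERY_TEST") != "1", reason="BIGQUERY_TEST not set"),
]
```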

docker-compose.yml

Lines changed: 18 additions & 0 deletions
```diff
@@ -15,16 +15,34 @@ services:
     volumes:
       - postgres_data:/var/lib/postgresql/data
 
+  bigquery:
+    image: ghcr.io/goccy/bigquery-emulator:latest
+    platform: linux/amd64
+    ports:
+      - "9050:9050"
+    command: ["--project=test-project", "--dataset=test_dataset"]
+    healthcheck:
+      test: ["CMD", "grpc_health_probe", "-addr=:9050"]
+      interval: 5s
+      timeout: 5s
+      retries: 5
+
   test:
     build:
       context: .
       dockerfile: Dockerfile.test
     depends_on:
       postgres:
         condition: service_healthy
+      bigquery:
+        condition: service_healthy
     environment:
       POSTGRES_TEST: "1"
       POSTGRES_URL: "postgres://test:test@postgres:5432/sidemantic_test"
+      BIGQUERY_TEST: "1"
+      BIGQUERY_EMULATOR_HOST: "bigquery:9050"
+      BIGQUERY_PROJECT: "test-project"
+      BIGQUERY_DATASET: "test_dataset"
     command: pytest -m integration -v
 
 volumes:
```
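For poking at the emulator outside the test suite, the goccy emulator exposes the BigQuery REST API on port 9050, so the official client can be pointed at it directly. A sketch based on the emulator's documented client setup (not part of this commit):

```python
from google.auth.credentials import AnonymousCredentials
from google.cloud import bigquery

# Connect to the emulator started by `docker compose up -d bigquery`.
# No real GCP credentials are required; the emulator accepts anonymous access.
client = bigquery.Client(
    project="test-project",
    credentials=AnonymousCredentials(),
    client_options={"api_endpoint": "http://localhost:9050"},
)

rows = client.query("SELECT 1 AS n").result()
print([dict(row) for row in rows])  # [{'n': 1}]
```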

pyproject.toml

Lines changed: 4 additions & 0 deletions
```diff
@@ -39,6 +39,10 @@ postgres = [
     "psycopg[binary]>=3.0.0",
     "pyarrow>=14.0.0", # For Arrow support
 ]
+bigquery = [
+    "google-cloud-bigquery>=3.0.0",
+    "pyarrow>=14.0.0", # For Arrow support
+]
 
 [build-system]
 requires = ["hatchling"]
```

sidemantic/config.py

Lines changed: 13 additions & 1 deletion
```diff
@@ -24,6 +24,15 @@ class PostgreSQLConnection(BaseModel):
     password: str = Field(..., description="Password")
 
 
+class BigQueryConnection(BaseModel):
+    """BigQuery connection configuration."""
+
+    type: Literal["bigquery"] = "bigquery"
+    project_id: str = Field(..., description="GCP project ID")
+    dataset_id: str | None = Field(default=None, description="Default dataset ID (optional)")
+    location: str = Field(default="US", description="BigQuery location")
+
+
 class PostgresServerConfig(BaseModel):
     """PostgreSQL wire protocol server configuration (ALPHA).
 
@@ -35,7 +44,7 @@ class PostgresServerConfig(BaseModel):
     password: str | None = Field(default=None, description="Password for authentication (optional)")
 
 
-Connection = DuckDBConnection | PostgreSQLConnection
+Connection = DuckDBConnection | PostgreSQLConnection | BigQueryConnection
 
 
 class SidemanticConfig(BaseModel):
@@ -192,5 +201,8 @@ def build_connection_string(config: SidemanticConfig) -> str:
             f"postgres://{config.connection.username}{password_part}@"
             f"{config.connection.host}:{config.connection.port}/{config.connection.database}"
         )
+    elif isinstance(config.connection, BigQueryConnection):
+        dataset_part = f"/{config.connection.dataset_id}" if config.connection.dataset_id else ""
+        return f"bigquery://{config.connection.project_id}{dataset_part}"
     else:
         raise ValueError(f"Unknown connection type: {type(config.connection)}")
```
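For illustration, the new connection model maps to a URL like this (a sketch; the config object is built directly rather than loaded from a config file, and the IDs are placeholders):

```python
from sidemantic.config import BigQueryConnection

conn = BigQueryConnection(project_id="my-project", dataset_id="analytics")

# Mirrors the branch added to build_connection_string above.
dataset_part = f"/{conn.dataset_id}" if conn.dataset_id else ""
print(f"bigquery://{conn.project_id}{dataset_part}")  # bigquery://my-project/analytics
```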

sidemantic/core/semantic_layer.py

Lines changed: 7 additions & 1 deletion
```diff
@@ -31,6 +31,7 @@ def __init__(
                 - duckdb:///:memory: (default)
                 - duckdb:///path/to/db.duckdb
                 - postgres://user:pass@host:port/dbname
+                - bigquery://project_id/dataset_id
             dialect: SQL dialect for query generation (optional, inferred from adapter)
             auto_register: Set as current layer for auto-registration (default: True)
             use_preaggregations: Enable automatic pre-aggregation routing (default: False)
@@ -58,10 +59,15 @@ def __init__(
 
                 self.adapter = PostgreSQLAdapter.from_url(connection)
                 self.dialect = dialect or "postgres"
+            elif connection.startswith("bigquery://"):
+                from sidemantic.db.bigquery import BigQueryAdapter
+
+                self.adapter = BigQueryAdapter.from_url(connection)
+                self.dialect = dialect or "bigquery"
             else:
                 raise ValueError(
                     f"Unsupported connection URL: {connection}. "
-                    "Supported: duckdb:///, postgres://, or BaseDatabaseAdapter instance"
+                    "Supported: duckdb:///, postgres://, bigquery://, or BaseDatabaseAdapter instance"
                 )
         else:
             raise TypeError(f"connection must be a string URL or BaseDatabaseAdapter instance, got {type(connection)}")
```

sidemantic/db/bigquery.py

Lines changed: 186 additions & 0 deletions
```python
"""BigQuery database adapter."""

from typing import Any

from sidemantic.db.base import BaseDatabaseAdapter


class BigQueryResult:
    """Wrapper for BigQuery query result to match DuckDB result API."""

    def __init__(self, query_job):
        """Initialize BigQuery result wrapper.

        Args:
            query_job: BigQuery query job result
        """
        self.query_job = query_job
        self._result = query_job.result()
        self._rows_iter = iter(self._result)

    def fetchone(self) -> tuple | None:
        """Fetch one row from the result."""
        try:
            row = next(self._rows_iter)
            return tuple(row.values())
        except StopIteration:
            return None

    def fetchall(self) -> list[tuple]:
        """Fetch all remaining rows."""
        return [tuple(row.values()) for row in self._rows_iter]

    def fetch_record_batch(self) -> Any:
        """Convert result to PyArrow RecordBatchReader."""
        import pyarrow as pa

        # BigQuery can return Arrow tables directly
        arrow_table = self._result.to_arrow()
        return pa.RecordBatchReader.from_batches(arrow_table.schema, arrow_table.to_batches())

    @property
    def description(self):
        """Get column descriptions."""
        return [(field.name, field.field_type) for field in self._result.schema]


class BigQueryAdapter(BaseDatabaseAdapter):
    """BigQuery database adapter.

    Example:
        >>> adapter = BigQueryAdapter(project_id="my-project", dataset_id="my_dataset")
        >>> result = adapter.execute("SELECT * FROM table")
    """

    def __init__(
        self,
        project_id: str | None = None,
        dataset_id: str | None = None,
        credentials: Any | None = None,
        location: str = "US",
        **kwargs,
    ):
        """Initialize BigQuery adapter.

        Args:
            project_id: GCP project ID (if None, uses default credentials project)
            dataset_id: Default dataset ID (optional)
            credentials: Google Cloud credentials (if None, uses default credentials)
            location: BigQuery location (default: US)
            **kwargs: Additional arguments passed to bigquery.Client
        """
        try:
            from google.cloud import bigquery
        except ImportError as e:
            raise ImportError(
                "BigQuery support requires google-cloud-bigquery. "
                "Install with: pip install sidemantic[bigquery] or pip install google-cloud-bigquery"
            ) from e

        self.client = bigquery.Client(project=project_id, credentials=credentials, location=location, **kwargs)
        self.project_id = project_id or self.client.project
        self.dataset_id = dataset_id

    def execute(self, sql: str) -> BigQueryResult:
        """Execute SQL query."""
        query_job = self.client.query(sql)
        return BigQueryResult(query_job)

    def executemany(self, sql: str, params: list) -> Any:
        """Execute SQL with multiple parameter sets.

        Note: BigQuery doesn't have native executemany, so we run queries sequentially.
        """
        from google.cloud import bigquery

        results = []
        for param_set in params:
            # BigQuery uses @param syntax; parameters must be wrapped in a QueryJobConfig
            job_config = bigquery.QueryJobConfig(query_parameters=param_set)
            query_job = self.client.query(sql, job_config=job_config)
            results.append(BigQueryResult(query_job))
        return results

    def fetchone(self, result: BigQueryResult) -> tuple | None:
        """Fetch one row from result."""
        return result.fetchone()

    def fetch_record_batch(self, result: BigQueryResult) -> Any:
        """Fetch result as PyArrow RecordBatchReader."""
        return result.fetch_record_batch()

    def get_tables(self) -> list[dict]:
        """List all tables in the dataset."""
        if not self.dataset_id:
            # If no dataset specified, list tables from all datasets
            tables = []
            for dataset in self.client.list_datasets():
                dataset_ref = self.client.dataset(dataset.dataset_id)
                for table in self.client.list_tables(dataset_ref):
                    tables.append({"table_name": table.table_id, "schema": dataset.dataset_id})
            return tables

        # List tables in specific dataset
        dataset_ref = self.client.dataset(self.dataset_id)
        tables = []
        for table in self.client.list_tables(dataset_ref):
            tables.append({"table_name": table.table_id, "schema": self.dataset_id})
        return tables

    def get_columns(self, table_name: str, schema: str | None = None) -> list[dict]:
        """Get column information for a table."""
        schema = schema or self.dataset_id
        if not schema:
            raise ValueError("schema (dataset_id) required for get_columns")

        table_ref = self.client.dataset(schema).table(table_name)
        table = self.client.get_table(table_ref)

        columns = []
        for field in table.schema:
            columns.append(
                {
                    "column_name": field.name,
                    "data_type": field.field_type,
                    "is_nullable": field.mode != "REQUIRED",
                }
            )
        return columns

    def close(self) -> None:
        """Close the BigQuery client."""
        self.client.close()

    @property
    def dialect(self) -> str:
        """Return SQL dialect."""
        return "bigquery"

    @property
    def raw_connection(self) -> Any:
        """Return raw BigQuery client."""
        return self.client

    @classmethod
    def from_url(cls, url: str) -> "BigQueryAdapter":
        """Create adapter from connection URL.

        URL format: bigquery://project_id/dataset_id
        or: bigquery://project_id (no default dataset)

        Args:
            url: Connection URL

        Returns:
            BigQueryAdapter instance
        """
        if not url.startswith("bigquery://"):
            raise ValueError(f"Invalid BigQuery URL: {url}")

        # Parse URL: bigquery://project_id/dataset_id
        path = url[len("bigquery://") :]
        if not path:
            raise ValueError("BigQuery URL must include project_id: bigquery://project_id/dataset_id")

        parts = path.split("/")
        project_id = parts[0]
        dataset_id = parts[1] if len(parts) > 1 else None

        return cls(project_id=project_id, dataset_id=dataset_id)
```
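Putting the adapter through its paces directly, a sketch using only methods defined above (the project, dataset, and query are placeholders):

```python
from sidemantic.db.bigquery import BigQueryAdapter

# Equivalent to BigQueryAdapter(project_id="my-project", dataset_id="analytics")
adapter = BigQueryAdapter.from_url("bigquery://my-project/analytics")

result = adapter.execute("SELECT 1 AS n")
print(result.fetchone())   # (1,)
print(result.description)  # [('n', 'INTEGER')]

adapter.close()
```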

tests/db/README.md

Lines changed: 35 additions & 4 deletions
````diff
@@ -19,7 +19,7 @@ docker compose up test --build --abort-on-container-exit
 
 # Or run tests locally against dockerized Postgres
 docker compose up -d postgres
-POSTGRES_TEST=1 uv run --extra postgres pytest -m integration -v
+POSTGRES_TEST=1 uv run --extra postgres pytest -m integration tests/db/test_postgres_integration.py -v
 ```
 
 **Manual setup:**
@@ -32,14 +32,45 @@ export POSTGRES_TEST=1
 export POSTGRES_URL="postgres://test:test@localhost:5432/sidemantic_test"
 
 # Run integration tests only
-uv run pytest -m integration -v
+uv run pytest -m integration tests/db/test_postgres_integration.py -v
+```
+
+### BigQuery Integration Tests
+
+BigQuery tests use the BigQuery emulator and are marked with `@pytest.mark.integration`. They require the `bigquery` extra dependencies.
+
+**Using Docker Compose (recommended):**
+```bash
+# Start BigQuery emulator and run integration tests
+docker compose up test --build --abort-on-container-exit
+
+# Or run tests locally against dockerized emulator
+docker compose up -d bigquery
+BIGQUERY_TEST=1 BIGQUERY_EMULATOR_HOST=localhost:9050 uv run --extra bigquery pytest -m integration tests/db/test_bigquery_integration.py -v
+```
+
+**Manual setup:**
+```bash
+# Install bigquery dependencies
+uv sync --extra bigquery
+
+# Set up BigQuery emulator (adjust as needed)
+export BIGQUERY_TEST=1
+export BIGQUERY_EMULATOR_HOST=localhost:9050
+export BIGQUERY_PROJECT=test-project
+export BIGQUERY_DATASET=test_dataset
+
+# Run integration tests only
+uv run pytest -m integration tests/db/test_bigquery_integration.py -v
 ```
 
 **Note:** Normal `pytest` runs will skip integration tests automatically. Use `-m integration` to run them explicitly.
 
 ## Test Coverage
 
 - **test_duckdb_adapter.py**: Tests for DuckDB adapter implementation
-- **test_postgres_adapter.py**: Basic Postgres adapter tests (mostly ImportError checks)
-- **test_postgres_integration.py**: Full integration tests against real Postgres database
+- **test_postgres_adapter.py**: Basic Postgres adapter tests (import checks, no connection required)
+- **test_postgres_integration.py**: Full integration tests against real Postgres database (10 tests)
+- **test_bigquery_adapter.py**: Basic BigQuery adapter tests (import checks, URL parsing)
+- **test_bigquery_integration.py**: Full integration tests against BigQuery emulator (10 tests)
 - **test_semantic_layer_adapters.py**: Tests for SemanticLayer integration with different adapters
````
