Text attributes data truncation#893

Open
fogelito wants to merge 21 commits into main from upsert-text-truncation

Conversation

@fogelito fogelito commented Jun 17, 2026

Contributor

Summary by CodeRabbit

Release Notes

  • New Features

    • TEXT, MEDIUMTEXT, and LONGTEXT fields now enforce UTF-8 byte-length limits consistently, reducing silent truncation risk.
    • Validation error messages now report the relevant maximum byte capacity when limits are exceeded.
  • Bug Fixes

    • Improved validator behavior so text-like attributes apply byte-capacity checks instead of falling back to more generic validation.
  • Tests

    • Added unit and end-to-end coverage, including UTF-8 multibyte (emoji) cases, for create, validate, and update scenarios.
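As a rough illustration of why byte length and character count diverge for multibyte input, the following minimal sketch compares an ASCII string and an emoji string of the same character count against the 65535-byte TEXT limit quoted in this PR. The variable names are illustrative only and do not come from the PR.

```php
<?php
// "\u{1F4DD}" is the memo emoji used in the PR's tests; it encodes to 4 bytes in UTF-8.
$emoji = "\u{1F4DD}";

// Same number of characters, very different byte sizes.
$ascii = str_repeat('a', 20000);
$multibyte = str_repeat($emoji, 20000);

// TEXT capacity in bytes, as quoted in the walkthrough (illustrative variable name).
$textCeiling = 65535;

var_dump(strlen($emoji));                     // int(4)
var_dump(strlen($ascii) <= $textCeiling);     // bool(true): 20000 bytes
var_dump(strlen($multibyte) <= $textCeiling); // bool(false): 80000 bytes
```

A character-count check would treat both strings as 20000 units long, which is how an oversized multibyte value could slip past validation and be truncated by the column.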

coderabbitai Bot commented Jun 17, 2026

Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

TEXT-family byte limits are centralized in Database, enforced through ByteLength, wired into attribute and schema validation, and covered by unit and end-to-end tests for ASCII and multibyte values.

Changes

Byte-safe TEXT validation

| Layer / File(s) | Summary |
| --- | --- |
| Constants and ByteLength: src/Database/Database.php, src/Database/Validator/ByteLength.php | Database adds maximum byte constants for TEXT, MEDIUMTEXT, and LONGTEXT. ByteLength stores a max byte limit, reports a string description, and validates values with strlen(). |
| Validation and adapter integration: src/Database/Validator/Structure.php, src/Database/Validator/Attribute.php, src/Database/Adapter/MariaDB.php, src/Database/Adapter/SQLite.php | Structure gives VAR_TEXT, VAR_MEDIUMTEXT, and VAR_LONGTEXT dedicated cases using ByteLength and the shared byte caps. Attribute switches to the same constants, and MariaDB and SQLite use them for type and length handling. |
| Unit tests: tests/unit/Validator/StructureTest.php | Adds byte-limit coverage for VAR_TEXT, updates text-family error messages to bytes, and extends mediumtext sizing checks for ASCII and multibyte values. |
| E2E tests: tests/e2e/Adapter/Scopes/DocumentTests.php | Adds create, update, and round-trip tests for VAR_TEXT values at and beyond the 65535-byte limit, including emoji input. |
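The validator described in the table can be sketched roughly as follows. This is a hypothetical stand-in, not the PR's actual src/Database/Validator/ByteLength.php: the class name ByteLengthSketch, its method signatures, and the treatment of a zero limit as "unlimited" are assumptions drawn from the review summaries in this thread.

```php
<?php
// Hypothetical stand-in for the PR's ByteLength validator; the real class may differ.
final class ByteLengthSketch
{
    public function __construct(private int $max)
    {
    }

    public function getDescription(): string
    {
        return 'Value must be a valid string no longer than ' . $this->max . ' bytes';
    }

    public function isValid(mixed $value): bool
    {
        if (!\is_string($value)) {
            return false;
        }

        // Assumption: a limit of 0 means "no limit", per the Greptile summary.
        if ($this->max === 0) {
            return true;
        }

        // strlen() counts bytes, not characters, so multibyte UTF-8 is measured correctly.
        return \strlen($value) <= $this->max;
    }
}

$validator = new ByteLengthSketch(65535);
var_dump($validator->isValid(str_repeat('a', 65535))); // bool(true)
var_dump($validator->isValid(str_repeat('a', 65536))); // bool(false)
var_dump($validator->getDescription());
```

The description string mirrors the wording asserted in the unit tests ("no longer than 65535 bytes"), but the exact message format in the PR may differ.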

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • utopia-php/database#788: Touches Structure.php validation for VAR_TEXT/VAR_MEDIUMTEXT/VAR_LONGTEXT, which this PR extends with byte-based enforcement.

Suggested reviewers

  • abnegate

Poem

🐰 I hop through bytes from leaf to leaf,
TEXT grows strict beneath my brief.
Emoji stack up, but limits hold tight,
65535 bytes keeps the burrow right.
ByteLength guards the carrot stash,
and longtext bounds no longer clash.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.83%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly matches the main change: enforcing data truncation limits for text attributes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/unit/Validator/StructureTest.php (2)

412-413: 💤 Low value

Minor: Multibyte test lacks error message assertion.

For consistency with the ASCII overflow test on line 409, consider asserting the error message for the multibyte test as well.

🔍 Add assertion for consistency
         // Multi-byte content over the limit is rejected the same way.
         $multibyte = \str_repeat('📝', 20000);
         $this->assertEquals(false, $validator->isValid(new Document($base + ['blocks_json' => $multibyte])));
+        $this->assertEquals('Invalid document structure: Attribute "blocks_json" has invalid type. Value must be a valid string and no longer than 16383 chars', $validator->getDescription());
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/Validator/StructureTest.php` around lines 412 - 413, The multibyte
overflow test starting at the line with str_repeat and '📝' is missing an
assertion for the error message, unlike the ASCII overflow test on line 409.
After the assertEquals call that validates isValid returns false for the
multibyte Document, add an additional assertion to check the error message
returned by the validator, following the same pattern as the ASCII overflow test
for consistency.

995-1047: 💤 Low value

Consider adding test case for byte-safe limit edge.

While testTextByteSafeValidation comprehensively tests the 16,383-character byte-safe limit for legacy attributes, testTextValidation could benefit from a similar edge-case test. Currently it tests 16,383 chars (pass) and 65,536 chars (fail due to declared size), but doesn't verify that 16,384 chars would fail due to the byte-safe limit.

🧪 Optional test case to add

Add this test case between the existing assertions to verify byte-safe enforcement:

        $this->assertEquals(false, $validator->isValid(new Document([
            '$collection' => ID::custom('posts'),
            'title' => 'Demo Title',
            'description' => 'Demo description',
            'rating' => 5,
            'price' => 1.99,
            'published' => true,
            'tags' => ['dog', 'cat', 'mouse'],
            'feedback' => 'team@appwrite.io',
            'text_field' => \str_repeat('a', 16384),
            '$createdAt' => '2000-04-01T12:00:00.000+00:00',
            '$updatedAt' => '2000-04-01T12:00:00.000+00:00'
        ])));

        $this->assertEquals('Invalid document structure: Attribute "text_field" has invalid type. Value must be a valid string and no longer than 16383 chars', $validator->getDescription());
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/Validator/StructureTest.php` around lines 995 - 1047, The
testTextValidation method should include an additional test case to verify the
byte-safe character limit edge case. Add a new assertion between the existing
test cases that validates a Document with text_field set to \str_repeat('a',
16384) should fail (assertEquals false), and verify the corresponding error
message from getDescription indicates the byte-safe limit of 16383 characters,
similar to the pattern already established in the method with the 65536
character test case.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/Database/Database.php`:
- Around line 118-121: The MAX_LONGTEXT_BYTES constant in the Database class is
set to a value that exceeds the 32-bit signed integer maximum, causing it to be
converted to a float in 32-bit PHP environments. This breaks the intdiv call in
downstream code that expects integer arguments. Fix this by explicitly casting
MAX_LONGTEXT_BYTES to int using the (int) type cast operator to ensure it
remains an integer on all PHP platforms, including 32-bit environments.
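To make the 32-bit concern concrete, here is a minimal sketch. It assumes MAX_LONGTEXT_BYTES equals 4294967295 (the LONGTEXT capacity in MariaDB), since the constant's value is not quoted in this thread. The helper longtextCeilingAsInt() is hypothetical and shows a clamp rather than the (int) cast suggested above, because PHP documents the result of converting an out-of-range float to int as undefined.

```php
<?php
// Assumption: the LONGTEXT ceiling is 4294967295 bytes (2^32 - 1).
// On 64-bit PHP this literal is an int; on 32-bit builds it exceeds
// PHP_INT_MAX (2147483647) and is parsed as a float.

// Hypothetical helper: always returns an int by clamping on 32-bit builds.
function longtextCeilingAsInt(): int
{
    return PHP_INT_SIZE >= 8 ? 4294967295 : PHP_INT_MAX;
}

$ceiling = longtextCeilingAsInt();

var_dump(is_int($ceiling));    // bool(true) on both 32-bit and 64-bit builds
var_dump(intdiv($ceiling, 4)); // int(1073741823) on 64-bit builds
```

Whether clamping is acceptable depends on how the downstream intdiv result is used; this sketch only demonstrates keeping the value an integer on every platform.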

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 338e4613-aa3d-423d-af38-75776ac9f9df

📥 Commits

Reviewing files that changed from the base of the PR and between cfba533 and cbdc3cf.

📒 Files selected for processing (4)
  • src/Database/Database.php
  • src/Database/Validator/Structure.php
  • tests/e2e/Adapter/Scopes/DocumentTests.php
  • tests/unit/Validator/StructureTest.php

Comment thread src/Database/Database.php Outdated
greptile-apps Bot commented Jun 17, 2026

Contributor

Greptile Summary

This PR adds a new ByteLength validator and wires it into Structure for VAR_TEXT, VAR_MEDIUMTEXT, and VAR_LONGTEXT attributes, replacing the previous character-count (Text) approach with actual byte-length measurement (strlen()). A set of shared byte-ceiling constants (MAX_TEXT_BYTES, MAX_MEDIUMTEXT_BYTES, MAX_LONGTEXT_BYTES) is extracted into Database and used uniformly across Structure, Attribute, MariaDB, and SQLite.

  • ByteLength validator uses PHP's strlen() (byte-accurate for UTF-8) to gate values against both the attribute's declared size and the column-type ceiling; the max = 0 sentinel correctly disables the user-declared limit when no explicit size is set.
  • Dual-validator pattern (ByteLength($size) + ByteLength(MAX_TYPE_BYTES)) handles legacy attributes whose stored $size may exceed the physical column ceiling — the declared-size validator is a no-op for valid attributes and acts as a guard only for legacy rows.
  • Constants refactoring in Database, MariaDB, and SQLite replaces magic numbers with named constants; the SQLite emulation block moves from private class constants to the shared Database constants with no behavior change.
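The dual-validator pattern described in the list above can be sketched as a single function. The helper name withinTextLimits() and its parameters are hypothetical; in the PR the two checks are separate ByteLength instances inside Structure.

```php
<?php
// Hypothetical helper illustrating the two checks; not the PR's actual code.
function withinTextLimits(string $value, int $declaredSize, int $typeCeiling): bool
{
    $bytes = \strlen($value);

    // Declared size: 0 is treated as "no explicit size", so this check is skipped.
    $declaredOk = $declaredSize === 0 || $bytes <= $declaredSize;

    // Column-type ceiling: always enforced.
    $ceilingOk = $bytes <= $typeCeiling;

    return $declaredOk && $ceilingOk;
}

$textCeiling = 65535; // TEXT capacity in bytes, as quoted in this PR

// Legacy attribute whose stored size (100000) exceeds the TEXT ceiling:
// the declared-size check passes, but the ceiling check still rejects the value.
var_dump(withinTextLimits(str_repeat('a', 70000), 100000, $textCeiling)); // bool(false)
var_dump(withinTextLimits(str_repeat('a', 65535), 100000, $textCeiling)); // bool(true)

// Ordinary attribute: the declared size is the tighter bound.
var_dump(withinTextLimits(str_repeat('a', 300), 255, $textCeiling));      // bool(false)
```

Keeping the two limits separate means a legacy declared size larger than the column capacity cannot weaken the ceiling check.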

Confidence Score: 3/5

The byte-based validation fix is correct for MariaDB/MySQL, but Structure.php now enforces MariaDB TEXT column byte ceilings unconditionally — both for adapters that map VAR_TEXT to unbounded storage and for array-typed text attributes stored as JSON rather than as TEXT columns — meaning valid data can be rejected by the generic validator in those configurations.

The new ByteLength validator and the dual-validator pattern are well-implemented and fix the core silent-truncation problem. However, Structure.php wires in hard-coded MariaDB-specific byte caps without checking which adapter is active or whether the attribute value is destined for a TEXT column vs a JSON column. On adapters where VAR_TEXT maps to an unbounded type, or for any text attribute marked as an array, values well within what the backing store can hold will now be rejected. These paths affect the core document write and update flows.

src/Database/Validator/Structure.php — the TEXT-family byte ceilings are applied without adapter context or array-storage awareness

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/Database/Validator/ByteLength.php | New validator using strlen() for byte-accurate measurement; correctly handles the max=0 unlimited sentinel; clean implementation |
| src/Database/Validator/Structure.php | Replaces intdiv character caps with ByteLength byte validators for VAR_TEXT/MEDIUMTEXT/LONGTEXT; the MariaDB-specific column byte ceilings are applied unconditionally in the generic validator regardless of active adapter or whether the attribute is stored in a TEXT vs JSON column (array attributes) |
| src/Database/Database.php | Extracts TEXT-family byte ceiling magic numbers into named public constants; no behavior change |
| src/Database/Validator/Attribute.php | Replaces inline magic numbers with the new Database constants; values are identical, no behavior change |
| src/Database/Adapter/MariaDB.php | Replaces magic-number thresholds in getSQLType() with the new Database constants; values identical to what was there before |
| src/Database/Adapter/SQLite.php | Moves MariaDB byte-ceiling constants from SQLite private scope to Database (shared); uses string concatenation to preserve the original string type of characterMaximumLength |
| tests/e2e/Adapter/Scopes/DocumentTests.php | Three new E2E tests cover oversized byte rejection, valid full-capacity round-trip, and update rejection; tests do not delete the collections they create, which can cause createCollection failures on repeated test runs |
| tests/unit/Validator/StructureTest.php | Adds unit tests for byte-safe rejection, multibyte emoji rejection, valid full-capacity acceptance, and legacy oversized declared-size handling; error message assertions updated to match new ByteLength wording |

Reviews (12): Last reviewed commit: "Merge branch 'main' of github.com:utopia..."

Comment thread src/Database/Validator/Structure.php Outdated
Comment thread src/Database/Validator/Structure.php Outdated
Comment thread src/Database/Validator/Structure.php Outdated
Comment thread src/Database/Validator/Structure.php Outdated
Comment thread tests/unit/Validator/StructureTest.php Outdated
Comment thread src/Database/Adapter/MariaDB.php
Comment thread src/Database/Validator/Structure.php
Comment thread tests/e2e/Adapter/Base.php Outdated

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/unit/Validator/StructureTest.php (1)

441-445: ⚡ Quick win

Assert the exact failure reason in the multibyte overflow test.

This test currently validates only false; asserting the description too will lock the byte-limit contract and catch regressions in failure source.

Suggested test hardening
 $multibyte = \str_repeat('📝', 20000);
 $this->assertEquals(false, $validator->isValid(new Document($base + ['text' => $multibyte])));
+$this->assertEquals(
+    'Invalid document structure: Attribute "text" has invalid type. Value must be a valid string no longer than 65535 bytes',
+    $validator->getDescription()
+);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/Validator/StructureTest.php` around lines 441 - 445, The multibyte
overflow test in the test method currently only asserts that the validator
returns false but does not verify the specific failure reason. Strengthen this
test by adding an additional assertion that checks the failure description or
error message from the validator after the isValid call with the multibyte
document. This will ensure that the specific byte-limit contract is properly
validated and catch any regressions where the failure source changes. Reference
the validator's method to retrieve the error description (commonly methods like
getError, getFailureDescription, or similar) to assert the expected error
message alongside the false result.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@phpunit.xml`:
- Line 10: The stopOnFailure attribute in the phpunit.xml configuration is set
to "true", which causes the test suite to halt immediately upon encountering the
first failure. This masks subsequent failures and reduces diagnostic value in CI
environments. Change stopOnFailure from "true" to "false" to allow the full test
suite to execute and report all failures in a single run, providing better
visibility into the overall test health.

In `@src/Database/Validator/ByteLength.php`:
- Line 53: The file is missing a trailing newline at the end of the file, which
violates PSR-12 standards that require a single blank line at the end of every
file. Add a single newline character after the closing brace at the end of the
ByteLength.php file to satisfy the Pint code style checker and resolve the
single_blank_line_at_eof CI failure.

In `@tests/e2e/Adapter/Base.php`:
- Around line 26-27: Uncomment the two trait usage statements in the Base class
by removing the leading // from the lines containing CollectionTests and
CustomDocumentTypeTests. These commented-out trait usages need to be re-enabled
so that their test methods are included in the test discovery and execution,
restoring the full E2E coverage that these traits provide to the adapter tests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7d8392e1-0183-4343-9aef-8defb270cf4f

📥 Commits

Reviewing files that changed from the base of the PR and between 24cb977 and 3dc94f9.

📒 Files selected for processing (6)
  • phpunit.xml
  • src/Database/Validator/ByteLength.php
  • src/Database/Validator/Structure.php
  • tests/e2e/Adapter/Base.php
  • tests/e2e/Adapter/Scopes/DocumentTests.php
  • tests/unit/Validator/StructureTest.php

Comment thread phpunit.xml Outdated
Comment thread src/Database/Validator/ByteLength.php Outdated
Comment thread tests/e2e/Adapter/Base.php Outdated
@fogelito fogelito changed the title from "Upsert text truncation" to "Text attributes data truncation" on Jun 18, 2026
Comment thread src/Database/Adapter/SQLite.php Outdated
Comment on lines +44 to +46
private const MARIADB_TEXT_BYTES = '' . Database::MAX_TEXT_BYTES;
private const MARIADB_MEDIUMTEXT_BYTES = '' . Database::MAX_MEDIUMTEXT_BYTES;
private const MARIADB_LONGTEXT_BYTES = '' . Database::MAX_LONGTEXT_BYTES;

Member

Does it need to be a string?

        case Database::VAR_LONGTEXT:
            $validators[] = new Text($size, min: 0);
            $validators[] = new ByteLength(Database::MAX_LONGTEXT_BYTES);
            break;

Member

Do we need both?

Comment thread src/Database/Adapter/SQLite.php

@coderabbitai coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (1)
src/Database/Database.php (1)

9784-9789: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Use a stronger digest for cache-key derivation.

md5 is collision-prone; for cache field/key hashing, prefer SHA-256 to reduce collision-based cache poisoning/mix risks.

Suggested patch
-            $schemaHash = \md5(
+            $schemaHash = \hash(
+                'sha256',
                 \json_encode($collection->getAttribute('attributes', []))
                 . \json_encode($collection->getAttribute('indexes', []))
                 . \json_encode($collection->getAttribute('$permissions', []))
                 . \json_encode($collection->getAttribute('documentSecurity', false))
             );
@@
-            \md5(\json_encode($queryPayload) ?: ''),
+            \hash('sha256', \json_encode($queryPayload) ?: ''),

Also applies to: 9792-9797
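For a self-contained view of the suggested swap, the sketch below hashes a hypothetical schema payload with both digests. The $attributes and $indexes arrays are made-up sample data; the real code reads these values from a collection document.

```php
<?php
// Hypothetical sample schema; the real code reads these from the collection.
$attributes = [['key' => 'title', 'type' => 'string', 'size' => 255]];
$indexes = [];

$payload = \json_encode($attributes) . \json_encode($indexes);

// Current approach per the review: 128-bit MD5 digest, 32 hex characters.
$md5Hash = \md5($payload);

// Suggested approach: 256-bit SHA-256 digest, 64 hex characters.
$sha256Hash = \hash('sha256', $payload);

var_dump(\strlen($md5Hash));    // int(32)
var_dump(\strlen($sha256Hash)); // int(64)
```

One practical consequence is that cache keys built from the SHA-256 digest are twice as long, which may matter if the cache backend limits key length.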

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/Database/Database.php` around lines 9784 - 9789, The cache-key derivation
in the schema hash uses a collision-prone digest and should be strengthened.
Update the schema hash logic in Database::setCollection (the block building
$schemaHash from attributes, indexes, $permissions, and documentSecurity) to use
SHA-256 instead of md5, and make the same change in the other schema-hash block
referenced by the review so both cache field derivations use the stronger digest
consistently.

Source: Linters/SAST tools

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d0f0028a-1410-443e-85e5-62fb04a7b23e

📥 Commits

Reviewing files that changed from the base of the PR and between cf959b1 and 22a0dc3.

📒 Files selected for processing (5)
  • src/Database/Adapter/SQLite.php
  • src/Database/Database.php
  • src/Database/Validator/Structure.php
  • tests/e2e/Adapter/Scopes/DocumentTests.php
  • tests/unit/Validator/StructureTest.php
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/e2e/Adapter/Scopes/DocumentTests.php
