Stable Build v4.0.0 [Breaking Changes]

@modesty released this 12 Oct 19:47 (commit c8b372b)

v4.0.0 Release Notes

This release includes critical fixes for text encoding, space preservation, and text positioning, along with improved error handling. It contains breaking changes that require attention when upgrading from v3.x.

🚨 Breaking Changes

Text Encoding Change (Issue #385, PR #410)

What Changed: Text in JSON output is no longer URI-encoded. All text now outputs as UTF-8 directly.

Why: To properly support Chinese, Japanese, Korean, and other multi-byte Unicode characters. The previous URI encoding caused issues with CJK text display and partial character extraction.

Migration Required: If your code expects URI-encoded text, you must update it to handle plain UTF-8 text.

JSON Output Examples

Before v4.0.0 (URI-encoded):

{
  "Pages": [{
    "Texts": [{
      "R": [{
        "T": "Added%20Text%20from%20Acrobat"
      }]
    }]
  }]
}

After v4.0.0 (UTF-8):

{
  "Pages": [{
    "Texts": [{
      "R": [{
        "T": "Added Text from Acrobat"
      }]
    }]
  }]
}

Code Migration

Before v4.0.0:

// Had to decode URI components
const text = decodeURIComponent(textObj.R[0].T);
// Output: "Added Text from Acrobat"

After v4.0.0:

// Direct text access, no decoding needed
const text = textObj.R[0].T;
// Output: "Added Text from Acrobat"
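
The decoding call should be removed rather than left in place: decodeURIComponent throws a URIError on a percent sign that does not start a valid escape sequence, so plain v4.0.0 text such as "100% complete" would fail. Below is a minimal sketch of an accessor for code that must read both old and new output during a transition. The name getRunText and its legacyEncoded flag are hypothetical and not part of the library's API.

```javascript
// Hypothetical helper (not part of pdf2json): returns the text of the first run.
// Pass legacyEncoded = true only for JSON produced by a pre-v4.0.0 release.
function getRunText(textObj, legacyEncoded = false) {
  const raw = textObj.R[0].T;
  return legacyEncoded ? decodeURIComponent(raw) : raw;
}

// Pre-v4.0.0 style output: URI-encoded, so it must be decoded.
const legacyText = getRunText({ R: [{ T: "Added%20Text%20from%20Acrobat" }] }, true);

// v4.0.0 style output: already plain UTF-8, returned unchanged.
const plainText = getRunText({ R: [{ T: "100% complete" }] });

// Decoding plain v4.0.0 text is a bug: a bare "%" is not a valid escape sequence.
let decodeFailed = false;
try {
  decodeURIComponent("100% complete");
} catch (err) {
  decodeFailed = err instanceof URIError;
}

console.log(legacyText, "|", plainText, "|", decodeFailed);
```

Keeping the version decision in one explicit flag avoids guessing from the text itself, since a string like "%20" can be legitimate plain content in v4.0.0 output.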

CJK Character Support

Before v4.0.0:

{
  "T": "%E4%B8%AD%E6%96%87"
}

After v4.0.0:

{
  "T": "中文"
}
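
If JSON files produced by a pre-v4.0.0 release are stored on disk, they can be converted once to the new format instead of being regenerated. The sketch below walks the Pages, Texts, and R arrays shown in the examples above and decodes each T value. The function name decodeLegacyOutput is hypothetical, and the code assumes every T value in the input is still URI-encoded.

```javascript
// Hypothetical one-off converter for JSON saved by pre-v4.0.0 releases.
// It rewrites every Pages[].Texts[].R[].T value in place as plain UTF-8.
function decodeLegacyOutput(pdfData) {
  for (const page of pdfData.Pages ?? []) {
    for (const textObj of page.Texts ?? []) {
      for (const run of textObj.R ?? []) {
        run.T = decodeURIComponent(run.T);
      }
    }
  }
  return pdfData;
}

// Sample legacy data using the CJK example from these notes.
const saved = {
  Pages: [{ Texts: [{ R: [{ T: "%E4%B8%AD%E6%96%87" }] }] }],
};
const converted = decodeLegacyOutput(saved);
console.log(converted.Pages[0].Texts[0].R[0].T); // prints "中文"
```

Run a conversion like this only on legacy data. Applying it to v4.0.0 output can throw or alter text that contains a literal percent sign.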

✨ Features & Enhancements

Accurate Space Preservation (Issues #355, #361, #319, PR #411)

Complete overhaul of space detection and preservation in text extraction (to try it, run the CLI with the -c command line option):

  • Glyph-based width calculation - Uses actual font metrics instead of estimates
  • Proper coordinate system handling - Correctly processes scaled positions with unscaled widths
  • Text scale support - Applies textHScale for compressed/expanded text
  • Dynamic Y-tolerance - Font size-aware vertical positioning (fontSize × 0.15)

Impact: Spaces in extracted text (both content.txt and JSON output) now accurately reflect the original PDF layout. Multi-word phrases, tables, and formatted text preserve proper spacing.
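
The ideas in the list above can be illustrated with a small sketch. It is not the library's implementation: the item fields (x, y, width, fontSize, textHScale), the helper names, and the gap threshold are all assumptions made for illustration, and all values are assumed to be in the same unit. Only the fontSize × 0.15 tolerance comes from these notes.

```javascript
// Illustrative sketch only: field names and GAP_FACTOR are assumptions,
// not pdf2json's internal data model.
const Y_TOLERANCE_FACTOR = 0.15; // from the release notes: fontSize × 0.15
const GAP_FACTOR = 0.25;         // assumed: a gap wider than 25% of fontSize is a space

// Two items share a line when their y positions differ by at most the tolerance.
function onSameLine(a, b) {
  const tolerance = Math.max(a.fontSize, b.fontSize) * Y_TOLERANCE_FACTOR;
  return Math.abs(a.y - b.y) <= tolerance;
}

// Width after horizontal scaling (a missing textHScale is treated as 1).
function scaledWidth(item) {
  return item.width * (item.textHScale ?? 1);
}

// Joins items, already sorted in reading order, into lines of text.
function joinItems(items) {
  let out = "";
  let prev = null;
  for (const item of items) {
    if (prev) {
      if (!onSameLine(prev, item)) {
        out += "\n";
      } else {
        const gap = item.x - (prev.x + scaledWidth(prev));
        if (gap > item.fontSize * GAP_FACTOR) out += " ";
      }
    }
    out += item.text;
    prev = item;
  }
  return out;
}

const items = [
  { text: "Name:", x: 0, y: 100, width: 30, fontSize: 12 },
  { text: "John", x: 34, y: 100.5, width: 24, fontSize: 12 },
  { text: "Doe", x: 62, y: 100, width: 18, fontSize: 12 },
  { text: "SSN:", x: 0, y: 116, width: 24, fontSize: 12 },
];
const joined = joinItems(items);
console.log(joined);
```

Scaling the tolerance with font size means small vertical jitter in large headings does not split a line, while tightly set small text is not merged across rows.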

Example Output Improvement

Before v4.0.0:

Name:JohnDoeSSN:123-45-6789

After v4.0.0:

Name: John Doe    SSN: 123-45-6789

🐛 Bug Fixes

Text Block Coordinate Accuracy (Issue #408, PR #409)

  • Fixed text block coordinate calculations for proper positioning
  • Added comprehensive coordinate tests
  • Ensures accurate x/y values in JSON output

Character Extraction Completeness (Issue #385, PR #410)

  • Fixed missing character extraction for glyphs marked as "disabled"
  • Moved text extraction outside glyph.disabled check
  • All visible characters now properly extracted

CLI Error Handling (Issue #414)

  • Unified error and exception handling for CLI operations
  • Better error messages for invalid input parameters
  • Auto-creates output directory when not specified (removed unnecessary validation)
  • Improved stack trace display
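
For callers that wrap the CLI or write output themselves, the directory auto-creation behavior can be approximated with Node's standard fs module. This is a sketch of one possible approach, not the code used by the CLI; ensureOutputDir and the demo path are hypothetical.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: make sure the output directory exists before writing.
// With { recursive: true }, mkdirSync creates missing parent directories and
// does not throw if the directory already exists.
function ensureOutputDir(dir) {
  fs.mkdirSync(dir, { recursive: true });
  return dir;
}

// Demo path under the system temp directory (an arbitrary choice for this sketch).
const outDir = ensureOutputDir(path.join(os.tmpdir(), "pdf2json-demo", "out"));
const created = fs.statSync(outDir).isDirectory();
console.log(created);
```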

The following related issues should also be fixed by this release, but this has not been verified (test PDFs are needed):

  • #352 : unexpected space
  • #291 : problem with sentences broken into 1 word
  • #272 : unrecognized Text
  • #220 : two TEXTs unexpected joined together in one RUN
  • #212 : content is being randomly split into multiple lines
  • #177 : heading level of text is not captured
  • #156 : extracting table content
  • #94 : parser not handling some spaces between words

📦 Dependencies

  • Maintained zero runtime dependencies (since v3.1.6)
  • Updated development dependencies for build tooling