v4.0.0 Release Notes

v4.0.0 includes critical fixes for text encoding, space preservation, and text positioning, along with improved error handling. This release contains breaking changes that require attention when upgrading from v3.x.

🚨 Breaking Changes

Text Encoding Change (Issue #385, PR #410)

What Changed: Text in JSON output is no longer URI-encoded. All text now outputs as UTF-8 directly.

Why: To properly support Chinese, Japanese, Korean, and other multi-byte Unicode characters. The previous URI encoding caused issues with CJK text display and partial character extraction.

Migration Required: If your code expects URI-encoded text, you must update it to handle plain UTF-8 text.

JSON Output Examples

Before v4.0.0 (URI-encoded):

{
  "Pages": [{
    "Texts": [{
      "R": [{
        "T": "Added%20Text%20from%20Acrobat"
      }]
    }]
  }]
}

After v4.0.0 (UTF-8):

{
  "Pages": [{
    "Texts": [{
      "R": [{
        "T": "Added Text from Acrobat"
      }]
    }]
  }]
}

Code Migration

Before v4.0.0:

// Had to decode URI components
const text = decodeURIComponent(textObj.R[0].T);
// Output: "Added Text from Acrobat"

After v4.0.0:

// Direct text access, no decoding needed
const text = textObj.R[0].T;
// Output: "Added Text from Acrobat"
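
If a codebase has to read JSON produced by both v3.x and v4.0.0 during the transition, one option is to make the decoding step explicit. The sketch below is only an illustration: `collectText` and the `legacyUriEncoded` flag are hypothetical names, not part of the library, and the input shape is the `Pages[].Texts[].R[].T` structure shown in the examples above.

```javascript
// Hypothetical helper: collects every text run from parsed JSON output.
// `legacyUriEncoded` is an assumed flag, set only for JSON produced by v3.x.
function collectText(pdfData, { legacyUriEncoded = false } = {}) {
  const runs = [];
  for (const page of pdfData.Pages ?? []) {
    for (const textObj of page.Texts ?? []) {
      for (const run of textObj.R ?? []) {
        // v3.x stored URI-encoded text; v4.0.0 stores plain UTF-8.
        runs.push(legacyUriEncoded ? decodeURIComponent(run.T) : run.T);
      }
    }
  }
  return runs;
}

const v3Output = { Pages: [{ Texts: [{ R: [{ T: "Added%20Text%20from%20Acrobat" }] }] }] };
const v4Output = { Pages: [{ Texts: [{ R: [{ T: "Added Text from Acrobat" }] }] }] };

console.log(collectText(v3Output, { legacyUriEncoded: true })); // [ 'Added Text from Acrobat' ]
console.log(collectText(v4Output));                             // [ 'Added Text from Acrobat' ]
```

Leaving an unconditional `decodeURIComponent` call in place after upgrading is risky: plain text that contains a literal percent sign, such as "100% done", is not a valid URI-encoded string and makes the call throw a `URIError`.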

CJK Character Support

Before v4.0.0:

{
  "T": "%E4%B8%AD%E6%96%87"
}

After v4.0.0:

{
  "T": "中文"
}
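
The two representations above are the same string in different encodings, which can be checked with standard JavaScript. The snippet also shows one plausible way the old format could go wrong: slicing the encoded form in the middle of a character leaves an incomplete escape sequence that can no longer be decoded. This is an illustration of the general hazard, not a reproduction of the original bug.

```javascript
const encoded = "%E4%B8%AD%E6%96%87"; // v3.x representation of "中文"

console.log(decodeURIComponent(encoded)); // 中文
console.log(encoded.length);              // 18
console.log("中文".length);                // 2

// Cutting the encoded form mid-character leaves an incomplete UTF-8 sequence.
let truncatedFails = false;
try {
  decodeURIComponent(encoded.slice(0, 15)); // "%E4%B8%AD%E6%96"
} catch (err) {
  truncatedFails = err instanceof URIError;
}
console.log(truncatedFails); // true
```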

✨ Features & Enhancements

Accurate Space Preservation (Issues #355, #361, #319, PR #411)

Complete overhaul of space detection and preservation in text extraction (to try it, run the CLI with the -c command line option):

Glyph-based width calculation - Uses actual font metrics instead of estimates
Proper coordinate system handling - Correctly processes scaled positions with unscaled widths
Text scale support - Applies textHScale for compressed/expanded text
Dynamic Y-tolerance - Font size-aware vertical positioning (fontSize × 0.15)

Impact: Spaces in extracted text (both content.txt and JSON output) now accurately reflect the original PDF layout. Multi-word phrases, tables, and formatted text preserve proper spacing.
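
The ideas in the list above can be sketched roughly as follows. This is not the library's implementation: the glyph shape, the `joinGlyphs` name, and the `SPACE_GAP_FACTOR` threshold are assumptions made for illustration. Only the `fontSize × 0.15` Y-tolerance and the use of `textHScale` come from the notes above.

```javascript
// Hypothetical glyph shape: { char, x, y, width, fontSize }.
// x and y are assumed to be in the same units as width once it is scaled.
const Y_TOLERANCE_FACTOR = 0.15; // from the release notes: fontSize × 0.15
const SPACE_GAP_FACTOR = 0.25;   // assumption: gaps wider than 25% of the font size are spaces

function joinGlyphs(glyphs, textHScale = 1) {
  let out = "";
  let prev = null;
  for (const g of glyphs) {
    if (prev) {
      // Glyphs whose baselines differ by less than the tolerance share a line.
      const sameLine = Math.abs(g.y - prev.y) <= prev.fontSize * Y_TOLERANCE_FACTOR;
      if (!sameLine) {
        out += "\n";
      } else {
        // The previous glyph ends at its x plus its horizontally scaled width.
        const prevEnd = prev.x + prev.width * textHScale;
        if (g.x - prevEnd > prev.fontSize * SPACE_GAP_FACTOR) out += " ";
      }
    }
    out += g.char;
    prev = g;
  }
  return out;
}

const glyphs = [
  { char: "J", x: 0,  y: 100,   width: 5, fontSize: 10 },
  { char: "o", x: 5,  y: 100.5, width: 5, fontSize: 10 },
  { char: "D", x: 14, y: 100,   width: 6, fontSize: 10 },
  { char: "S", x: 0,  y: 115,   width: 5, fontSize: 10 },
];
console.log(JSON.stringify(joinGlyphs(glyphs))); // "Jo D\nS"
```

Tying both thresholds to the font size keeps the behavior consistent between small and large text, since a fixed tolerance would either split large headings or merge small print.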

Example Output Improvement

Before v4.0.0:

Name:JohnDoeSSN:123-45-6789

After v4.0.0:

Name: John Doe    SSN: 123-45-6789

🐛 Bug Fixes

Text Block Coordinate Accuracy (Issue #408, PR #409)

Fixed text block coordinate calculations for proper positioning
Added comprehensive coordinate tests
Ensures accurate x/y values in JSON output

Character Extraction Completeness (Issue #385, PR #410)

Fixed missing character extraction for glyphs marked as "disabled"
Moved text extraction outside glyph.disabled check
All visible characters now properly extracted

CLI Error Handling (Issue #414)

Unified error and exception handling for CLI operations
Better error messages for invalid input parameters
Auto-creates output directory when not specified (removed unnecessary validation)
Improved stack trace display
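
Scripts that write results themselves rather than through the CLI can get a similar convenience from Node's standard `fs.mkdirSync` with `recursive: true`. The sketch below is a generic Node example, not the CLI's own code, and `ensureOutputDir` is a hypothetical helper name.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: make sure the output directory exists before writing to it.
function ensureOutputDir(dir) {
  // recursive: true creates missing parent directories and
  // does not throw when the directory already exists.
  fs.mkdirSync(dir, { recursive: true });
  return dir;
}

const outDir = ensureOutputDir(path.join(os.tmpdir(), "pdf2json-demo", "out"));
console.log(fs.existsSync(outDir)); // true
```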

Related issues that should also be fixed (test PDFs needed to confirm)

#352 : unexpected space
#291 : problem with sentences broken into 1 word
#272 : unrecognized Text
#220 : two TEXTs unexpected joined together in one RUN
#212 : content is being randomly split into multiple lines
#177 : heading level of text is not captured
#156 : extracting table content
#94 : parser not handling some spaces between words

📦 Dependencies

Maintained zero runtime dependencies (since v3.1.6)
Updated development dependencies for build tooling


