Stable Build v4.0.0 [Breaking Changes]
v4.0.0 Release Notes
This release includes critical fixes for text encoding, space preservation, and text positioning, along with improved error handling. It contains breaking changes that require attention when upgrading from v3.x.
🚨 Breaking Changes
Text Encoding Change (Issue #385, PR #410)
What Changed: Text in JSON output is no longer URI-encoded. All text now outputs as UTF-8 directly.
Why: To properly support Chinese, Japanese, Korean, and other multi-byte Unicode characters. The previous URI encoding caused issues with CJK text display and partial character extraction.
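To illustrate why URI encoding is awkward for CJK text, here is a small sketch using only standard JavaScript built-ins (it is not code from the library, and `tryDecode` is a hypothetical helper). Each CJK character expands to three percent-escapes, so any consumer that truncates or splits the encoded string by length can cut through a character and leave it undecodable.

```javascript
// Standard built-ins only; nothing here is part of the library's API.
const original = "中文";
const encoded = encodeURIComponent(original); // "%E4%B8%AD%E6%96%87"

// Hypothetical helper: returns null instead of throwing on malformed input.
function tryDecode(s) {
  try {
    return decodeURIComponent(s);
  } catch (err) {
    if (err instanceof URIError) return null;
    throw err;
  }
}

// Dropping the final "%87" leaves an incomplete UTF-8 sequence.
const truncated = encoded.slice(0, 15); // "%E4%B8%AD%E6%96"

console.log(original.length, encoded.length); // 2 18
console.log(tryDecode(encoded));              // "中文"
console.log(tryDecode(truncated));            // null
```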
Migration Required: If your code expects URI-encoded text, you must update it to handle plain UTF-8 text.
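For code that has to read JSON produced by both v3.x and v4.x during a transition, one option is a small helper that decodes only when told the data is legacy. This is a minimal sketch, assuming the `Pages` / `Texts` / `R` / `T` structure used in the JSON examples in these notes. `collectText` and its `legacyEncoded` option are hypothetical names, not part of the library. The flag is explicit because calling `decodeURIComponent` on v4 output would throw a `URIError` on text containing a literal `%`, such as "100%".

```javascript
// Hypothetical helper: collects the text of every run in parsed output.
// Pass { legacyEncoded: true } only for JSON produced by v3.x.
function collectText(pdfData, { legacyEncoded = false } = {}) {
  const result = [];
  for (const page of pdfData.Pages ?? []) {
    for (const textObj of page.Texts ?? []) {
      for (const run of textObj.R ?? []) {
        result.push(legacyEncoded ? decodeURIComponent(run.T) : run.T);
      }
    }
  }
  return result;
}

const v3Output = { Pages: [{ Texts: [{ R: [{ T: "Added%20Text%20from%20Acrobat" }] }] }] };
const v4Output = { Pages: [{ Texts: [{ R: [{ T: "Added Text from Acrobat" }] }] }] };

console.log(collectText(v3Output, { legacyEncoded: true })); // [ 'Added Text from Acrobat' ]
console.log(collectText(v4Output));                          // [ 'Added Text from Acrobat' ]
```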
JSON Output Examples
Before v4.0.0 (URI-encoded):
{
"Pages": [{
"Texts": [{
"R": [{
"T": "Added%20Text%20from%20Acrobat"
}]
}]
}]
}
After v4.0.0 (UTF-8):
{
"Pages": [{
"Texts": [{
"R": [{
"T": "Added Text from Acrobat"
}]
}]
}]
}
Code Migration
Before v4.0.0:
// Had to decode URI components
const text = decodeURIComponent(textObj.R[0].T);
// Output: "Added Text from Acrobat"
After v4.0.0:
// Direct text access, no decoding needed
const text = textObj.R[0].T;
// Output: "Added Text from Acrobat"
CJK Character Support
Before v4.0.0:
{
"T": "%E4%B8%AD%E6%96%87"
}
After v4.0.0:
{
"T": "中文"
}
✨ Features & Enhancements
Accurate Space Preservation (Issues #355, #361, #319, PR #411)
Complete overhaul of space detection and preservation in text extraction (can be tested from the CLI with the -c command line option):
- Glyph-based width calculation - Uses actual font metrics instead of estimates
- Proper coordinate system handling - Correctly processes scaled positions with unscaled widths
- Text scale support - Applies textHScale for compressed/expanded text
- Dynamic Y-tolerance - Font size-aware vertical positioning (fontSize × 0.15)
Impact: Spaces in extracted text (both content.txt and JSON output) now accurately reflect the original PDF layout. Multi-word phrases, tables, and formatted text preserve proper spacing.
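The space-detection ideas listed above can be sketched roughly as follows. This is an illustrative approximation, not the library's implementation: the `joinRuns` function, the run shape (`text`, `x`, `y`, `width`, `fontSize`, `hScale`), and the gap threshold of 0.2 × fontSize are all assumptions made for the example. Only the fontSize × 0.15 Y-tolerance comes from these notes.

```javascript
// Illustrative only: runs are assumed to be in reading order, with
// x, y, and width already expressed in the same coordinate units.
const Y_TOLERANCE_FACTOR = 0.15; // from the release notes
const GAP_FACTOR = 0.2;          // assumed threshold for "this gap is a space"

function joinRuns(runs) {
  let out = "";
  let prev = null;
  for (const run of runs) {
    if (prev) {
      // Runs share a line when their baselines differ by less than the tolerance.
      const sameLine = Math.abs(run.y - prev.y) <= prev.fontSize * Y_TOLERANCE_FACTOR;
      // Horizontal scaling compresses or expands the rendered width.
      const prevEnd = prev.x + prev.width * (prev.hScale ?? 1);
      const gap = run.x - prevEnd;
      if (!sameLine) out += "\n";
      else if (gap > prev.fontSize * GAP_FACTOR) out += " ";
    }
    out += run.text;
    prev = run;
  }
  return out;
}

const runs = [
  { text: "Name:", x: 0,  y: 10,   width: 30, fontSize: 12 },
  { text: "John",  x: 34, y: 10.5, width: 24, fontSize: 12 },
  { text: "Doe",   x: 62, y: 10,   width: 18, fontSize: 12 },
];
console.log(joinRuns(runs)); // "Name: John Doe"
```

Scaling the tolerance and the gap threshold by font size keeps the same logic usable for small footnotes and large headings, which a fixed pixel threshold would not.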
Example Output Improvement
Before v4.0.0:
Name:JohnDoeSSN:123-45-6789
After v4.0.0:
Name: John Doe SSN: 123-45-6789
🐛 Bug Fixes
Text Block Coordinate Accuracy (Issue #408, PR #409)
- Fixed text block coordinate calculations for proper positioning
- Added comprehensive coordinate tests
- Ensures accurate x/y values in JSON output
Character Extraction Completeness (Issue #385, PR #410)
- Fixed missing character extraction for glyphs marked as "disabled"
- Moved text extraction outside glyph.disabled check
- All visible characters now properly extracted
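The shape of this fix can be pictured with a simplified sketch. The glyph objects and function names below are hypothetical stand-ins, not the library's internals. The only detail taken from these notes is that text extraction no longer sits inside the `glyph.disabled` check.

```javascript
// Hypothetical glyph shape: { char: string, disabled: boolean }.

// Before: a disabled glyph was skipped entirely, so its character was lost.
function extractTextBefore(glyphs) {
  let text = "";
  for (const glyph of glyphs) {
    if (glyph.disabled) continue;
    text += glyph.char;
  }
  return text;
}

// After: the character is always collected; the disabled flag only
// affects other per-glyph work (represented here by a counter).
function extractTextAfter(glyphs) {
  let text = "";
  let processed = 0;
  for (const glyph of glyphs) {
    text += glyph.char;
    if (glyph.disabled) continue;
    processed += 1;
  }
  return { text, processed };
}

const glyphs = [
  { char: "P", disabled: false },
  { char: "D", disabled: true },
  { char: "F", disabled: false },
];
console.log(extractTextBefore(glyphs));     // "PF"
console.log(extractTextAfter(glyphs).text); // "PDF"
```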
CLI Error Handling (Issue #414)
- Unified error and exception handling for CLI operations
- Better error messages for invalid input parameters
- Auto-creates output directory when not specified (removed unnecessary validation)
- Improved stack trace display
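One way to get the auto-create directory behavior described above in a Node.js script is `fs.mkdirSync` with `recursive: true`, which creates missing parent directories and does not fail if the directory already exists. This is a sketch under assumptions: `ensureOutputDir` is a hypothetical helper, not a function exported by the library or taken from its CLI source.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: create the output directory if it is missing.
function ensureOutputDir(dir) {
  fs.mkdirSync(dir, { recursive: true }); // no error if it already exists
  return dir;
}

// Usage: a nested directory under a fresh temporary folder.
const base = fs.mkdtempSync(path.join(os.tmpdir(), "pdf-out-"));
const outDir = ensureOutputDir(path.join(base, "nested", "json"));
console.log(fs.existsSync(outDir)); // true
```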
The following related issues should also be fixed by this release (unconfirmed; test PDFs are needed to verify):
- #352 : unexpected space
- #291 : problem with sentences broken into 1 word
- #272 : unrecognized Text
- #220 : two TEXTs unexpected joined together in one RUN
- #212 : content is being randomly split into multiple lines
- #177 : heading level of text is not captured
- #156 : extracting table content
- #94 : parser not handling some spaces between words
📦 Dependencies
- Maintained zero runtime dependencies (since v3.1.6)
- Updated development dependencies for build tooling