feat: Add smart whitespace stripping for CJK text #801

shukebeta · 2025-10-13T07:01:29Z

Summary

This PR adds an intelligent whitespace stripping feature specifically designed for CJK (Chinese, Japanese, Korean) text, which often contains unwanted spaces and line breaks when extracted from PDFs and scanned documents.

Problem Statement

When performing OCR on CJK documents (especially PDFs), the extracted text mixes several kinds of whitespace that need different handling:

Unwanted spaces between CJK characters - OCR engines sometimes insert spaces between characters
Soft line breaks - Documents have forced line breaks due to page width constraints, but these are not semantic breaks
Hard line breaks - Legitimate sentence boundaries that should be preserved
Mixed language text - Documents containing both CJK and Latin text need intelligent handling

Solution

This PR implements a smart whitespace stripping algorithm that:

✅ Removes spaces between CJK characters only (preserves English word spacing)
✅ Removes soft line breaks (after non-punctuation characters)
✅ Preserves hard line breaks (after sentence-ending punctuation: 。！？；：)
✅ Handles paragraph breaks (double newlines → single newline)
✅ Works seamlessly with mixed CJK-Latin text
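The rules above can be sketched roughly as follows. This is a minimal illustration, not the code in transformer.py: the function name `strip_cjk_whitespace`, the exact Unicode ranges, and the choice to join broken Latin lines with a space are all assumptions made for the example.

```python
import re

# Assumed character ranges; the PR's actual ranges may differ.
CJK = (
    "\u3001-\u303f"  # CJK symbols and punctuation (ideographic space excluded)
    "\u3040-\u309f"  # Hiragana
    "\u30a0-\u30ff"  # Katakana
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\uac00-\ud7af"  # Hangul syllables
    "\uff00-\uffef"  # Halfwidth and fullwidth forms
)
SENTENCE_END = "。！？；："  # punctuation listed in the PR description

_CJK_CHAR = re.compile(f"[{CJK}]")
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{CJK}])[ \t]+(?=[{CJK}])")
_LINE_BREAK = re.compile(r"([^\n])(\n+)")


def _replace_break(match: re.Match) -> str:
    prev, newlines = match.group(1), match.group(2)
    following = match.string[match.end():match.end() + 1]
    # Paragraph breaks and breaks after sentence-ending punctuation
    # survive as a single newline.
    if len(newlines) >= 2 or prev in SENTENCE_END:
        return prev + "\n"
    # Soft break next to a CJK character: join the lines directly.
    if _CJK_CHAR.match(prev) or _CJK_CHAR.match(following):
        return prev
    # Assumption: a soft break between two Latin words becomes a space.
    return prev + " "


def strip_cjk_whitespace(text: str) -> str:
    """Hypothetical sketch of the smart stripping rules."""
    text = text.replace("\r\n", "\n")
    # Drop leading/trailing whitespace on every line first.
    text = "\n".join(line.strip() for line in text.split("\n")).strip()
    # Remove spaces only when both neighbours are CJK characters.
    text = _SPACE_BETWEEN_CJK.sub("", text)
    return _LINE_BREAK.sub(_replace_break, text)
```

Lookbehind and lookahead are used for the space rule so that neighbouring characters are not consumed, which lets a single pass remove every space in a run such as `这是 第一 段`.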

Implementation Details

User Interface

Added new setting strip-whitespaces to user preferences
Added "Strip whitespaces (Smart)" option to Post-processing menu
Option available to all users (not language-restricted)
Detailed tooltip explains the smart behavior

Technical Changes

normcap/gui/settings.py: Add strip-whitespaces setting
normcap/gui/menu_button.py: Add menu option with descriptive tooltip
normcap/detection/ocr/transformer.py: Implement smart stripping algorithm using regex-based CJK detection
Pipeline integration: Pass setting through tray → detector → recognize → transformer
Unit tests: Comprehensive test coverage for all edge cases

Algorithm

The algorithm uses Unicode ranges to detect CJK characters and applies different rules:

Spaces: Remove only between CJK chars, keep around Latin words
Line breaks: Check previous character for sentence-ending punctuation
Paragraph breaks: Preserve document structure
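For readers unfamiliar with range-based detection, here is a small sketch of how CJK characters might be recognised by Unicode block. The helper names `detect_scripts` and `contains_cjk` and the specific ranges are illustrative assumptions, not the identifiers used in this PR.

```python
import re

# Assumed ranges, chosen for illustration only.
_HAN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")     # ideographs (Chinese, Japanese kanji)
_KANA = re.compile(r"[\u3040-\u30ff]")                 # Hiragana and Katakana
_HANGUL = re.compile(r"[\u1100-\u11ff\uac00-\ud7a3]")  # Hangul Jamo and syllables
_LATIN = re.compile(r"[A-Za-z]")


def detect_scripts(text: str) -> set[str]:
    """Return the set of script families found in text (hypothetical helper)."""
    scripts = set()
    if _HAN.search(text):
        scripts.add("han")
    if _KANA.search(text):
        scripts.add("kana")
    if _HANGUL.search(text):
        scripts.add("hangul")
    if _LATIN.search(text):
        scripts.add("latin")
    return scripts


def contains_cjk(text: str) -> bool:
    """True if any Han, kana, or Hangul character is present."""
    return bool(detect_scripts(text) - {"latin"})
```

Note that Han ideographs alone cannot tell Chinese from Japanese, since both use them; kana or Hangul are the distinguishing signals in a range-based approach like this one.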

Test Coverage

Added comprehensive unit tests in test_transformer.py:

Language detection (Chinese, Japanese, Korean, mixed)
Soft/hard line break handling
Space removal between CJK characters
Space preservation around English words
Paragraph break handling
Mixed text scenarios

All tests pass ✅
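As an illustration of the scenarios listed above, the cases could be written as a small table. The expected values below are derived from the rules in this description rather than copied from test_transformer.py, and `strip_fn` is a placeholder for whatever function the transformer exposes.

```python
from typing import Callable

# (case name, raw OCR text, expected result), derived from the PR description.
CASES = [
    ("spaces_between_cjk", "这是 第一 段", "这是第一段"),
    ("english_spacing_kept", "这是 English word 混排", "这是 English word 混排"),
    ("soft_break_removed", "第一行\n第二行", "第一行第二行"),
    ("hard_break_kept", "第一句。\n第二句", "第一句。\n第二句"),
    ("paragraph_break_collapsed", "第一段。\n\n第二段", "第一段。\n第二段"),
]


def failing_cases(strip_fn: Callable[[str], str]) -> list[str]:
    """Return the names of the cases that strip_fn does not satisfy."""
    return [name for name, raw, expected in CASES if strip_fn(raw) != expected]
```

Passing the stripping function to `failing_cases` should return an empty list; in a pytest suite the same table could feed `pytest.mark.parametrize`.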

Example Usage

Before (OCR output with artifacts):

这是 第一 段 文字
继续 在 这里
这是第二段。

这是 English word 混排 的 情况

After (with smart stripping enabled):

这是第一段文字继续在这里这是第二段。
这是 English word 混排的情况

Benefits

🎯 Cleans up OCR output for CJK documents by removing spurious spaces and soft line breaks
🌏 Helps Chinese, Japanese, and Korean users
🔤 Works correctly with mixed-language documents
📄 Especially useful for PDF document OCR
⚡ Zero impact when disabled (default off)

Compatibility

No breaking changes
Feature is opt-in (disabled by default)
Works with existing language detection and OCR pipeline
Compatible with all supported platforms

Related Issues

This addresses common complaints about CJK OCR quality and unwanted whitespace in extracted text.

Testing: Tested on Ubuntu 24.04 with Python 3.10, Tesseract 5.3.4, and various Chinese documents.


shukebeta mentioned this pull request Oct 13, 2025

Barcode detection causes false positives with plain numeric text #802

Open

shukebeta force-pushed the feature/smart-whitespace-stripping branch 2 times, most recently from 990128f to 810c2be on October 14, 2025 04:08

Add smart whitespace handling for CJK text

1adf1c3

shukebeta force-pushed the feature/smart-whitespace-stripping branch from 810c2be to 1adf1c3 on October 14, 2025 04:11

davidwei added 3 commits October 14, 2025 17:19

eed-history: clean: simplify tray.py defensive code and remove orphaned comments

277e944

eed-history: clean: remove extra blank lines in transformer.py

5786604

Update document: Rename whitespace stripping function for CJK text and update UI/settings accordingly

c753b1a
