[WIP] Merge Chinese Word Segmentation work #19166
base: master
Conversation
- Add `cppjieba` as a Git submodule under `third_party/cppjieba/` to provide robust Chinese word segmentation capabilities.
- Update `.gitmodules` to point to the official `cppjieba` repository and configure it to track the `master` branch.
- Update `sconscript` to include the paths of `cppjieba` and its dependency `limonp`.
- Modify `copying.txt` to include the `cppjieba` license (MIT) alongside the project's existing license, ensuring proper attribution and compliance.
- Update documentation.
Co-authored-by: Sean Budd <seanbudd123@gmail.com>
- Introduce a `JiebaSingleton` class in `cppjieba.hpp`/`cppjieba.cpp`, with a def file, under `nvdaHelper/cppjieba/`:
  - Inherits from `cppjieba::Jieba` and exposes a thread-safe `getOffsets()` method.
  - Implements Meyers' singleton via `getInstance()` with a private constructor.
  - Deletes the copy constructor, copy assignment, move constructor, and move assignment to enforce a single instance.
- Add a C-style API in the same module:
  - `int initJieba()` to force singleton initialization.
  - `int segmentOffsets(const char* text, int** charOffsets, int* outLen)` to perform segmentation and return character offsets.
  - `void freeOffsets(int* ptr)` to release the allocated offset buffer.
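The commit above describes a C++ Meyers' singleton; as a hedged illustration of the same thread-safety idea, here is a minimal Python analog. The class and method names merely mirror the PR's wrapper and this sketch is not NVDA code.

```python
import threading


class JiebaSingleton:
    """Single shared segmenter instance; construction is locked so that
    concurrent first calls cannot create two instances."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Mirrors the C++ private constructor: direct construction is forbidden.
        raise TypeError("Use JiebaSingleton.get_instance() instead")

    @classmethod
    def get_instance(cls) -> "JiebaSingleton":
        if cls._instance is None:  # fast path once initialized
            with cls._lock:
                if cls._instance is None:  # double-checked locking
                    # object.__new__ bypasses the raising __new__ above.
                    cls._instance = super().__new__(cls)
        return cls._instance
```

In C++ the same guarantee comes for free from a `static` local variable inside `getInstance()`, which the standard requires to be initialized exactly once even under concurrency.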
- Change `submodules` in the "Checkout NVDA" step of the `buildNVDA` job from `true` to `recursive` to ensure cppjieba's own submodule is fetched.
- This also causes sonic's submodule to be fetched, which currently appears unused.
Co-authored-by: Sean Budd <seanbudd123@gmail.com>
merge for Python updated
- Create `WordSegmentationStrategy` as an abstract base class to select a segmentation strategy based on text content, following the Strategy Pattern.
- Implement `ChineseWordSegmentationStrategy` (for Chinese text).
- Implement `UniscribeWordSegmentationStrategy` (the default strategy, for other languages).
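The strategy selection described above can be sketched as follows. The class names follow the PR, but the CJK-detection heuristic, method signatures, and placeholder segmentation bodies are illustrative assumptions, not NVDA code.

```python
from abc import ABC, abstractmethod


def _containsChinese(text: str) -> bool:
    # Heuristic: check the CJK Unified Ideographs basic block only;
    # real code may cover additional ranges.
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


class WordSegmentationStrategy(ABC):
    @abstractmethod
    def segment(self, text: str) -> list[str]: ...


class ChineseWordSegmentationStrategy(WordSegmentationStrategy):
    def segment(self, text: str) -> list[str]:
        # Placeholder: the real implementation delegates to cppjieba.
        return list(text)


class UniscribeWordSegmentationStrategy(WordSegmentationStrategy):
    def segment(self, text: str) -> list[str]:
        # Placeholder: the real implementation calls Uniscribe via NVDAHelper.
        return text.split()


def chooseStrategy(text: str) -> WordSegmentationStrategy:
    """Pick a strategy based on text content, per the Strategy Pattern."""
    if _containsChinese(text):
        return ChineseWordSegmentationStrategy()
    return UniscribeWordSegmentationStrategy()
```

The Strategy Pattern keeps the caller (e.g. a `WordSegmenter` facade) independent of which segmentation backend is used, so new languages can be supported by adding a strategy class.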
add `WordSegmenter` class for word segmentation, integrating segmentation strategies
- Replace `useUniscribe` with `useUniscribeForCharOffset` and `useWordSegmenterForWordOffset` for segmentation extensions.
- Integrate `WordSegmenter` for calculating word offsets.
align `DisplayModelTextInfo`'s flags with those of its superclass
- Redesign two properties of `OffsetsTextInfo` as enums to make them more configurable, inspired by @LeonarddeR.
- Override them in some subclasses to simulate specific behaviors.
…eText

### Link to issue number:
Blocked by #18548, closes #4075, related to #16237, and [a part of OSPP 2025](https://summer-ospp.ac.cn/org/prodetail/25d3e0488?list=org&navpage=org) of NVDA.

### Summary of the issue:

### Description of user facing changes:
Chinese text can be navigated by word via the system caret or review cursor.

### Description of developer facing changes:

### Description of development approach:
- [ ] update `textUtils`
  - [x] add `WordSegment` module
    - [x] create `WordSegmentationStrategy` as an abstract base class to select a segmentation strategy based on text content, following the Strategy Pattern
    - [x] implement `ChineseWordSegmentationStrategy` (for Chinese text)
    - [x] implement `UniscribeWordSegmentationStrategy` (for other languages, as the default strategy)
  - [x] update `textUtils/__init__.py`
    - [x] add `WordSegmenter` class for word segmentation, integrating segmentation strategies
- [x] update `textInfos/offsets.py`
  - [x] replace `useUniscribe` with `useUniscribeForCharOffset` & `useWordSegmenterForWordOffset` for segmentation extensions
  - [x] integrate `WordSegmenter` for calculating word offsets
- [ ] update documentation

### Testing strategy:

### Known issues with pull request:
Word segmentation functionality was integrated in `OffsetsTextInfo`, where word segmentation is based on Uniscribe by default, with Unicode as a fallback.

Subclasses of `OffsetsTextInfo`:

#### Supported
1. `NVDAObjectTextInfo`
2. `InputCompositionTextInfo`
3. `JABTextInfo`
4. `SimpleResultTextInfo`
5. `VirtualBufferTextInfo`: uses a self-hosted function to calculate offsets and invokes its superclass's `_getWordOffsets`

#### Unsupported
1. `DisplayModelTextInfo`: simply disabled
2. `EditTextInfo`: always uses Uniscribe
3. `ScintillaTextInfo`: entirely uses a self-defined notion of 'word', for Scintilla-based editors such as Notepad++ (source/NVDAObjects/window/scintilla.py)
4. `VsTextEditPaneTextInfo`: uses a special COM automation API, for Microsoft Visual Studio and Microsoft SQL Server Management Studio (source/appModules/devenv.py)
5. `TextFrameTextInfo`: uses a self-defined notion of 'word' based on PowerPoint's COM object, for PowerPoint's text frames (source/appModules/powerpnt.py)
6. `LwrTextInfo`: based on words pre-computed during the related process, for structured text using `LineWordResult` (e.g. Windows OCR) (source/contentRecog/__init__.py)

#### Partially supported
Some architectures entirely or preferentially use a native definition of a 'word', so software depending on them may not be able to use the new functionality.
1. `IA2TextTextInfo`: overrides and falls back (source/NVDAObjects/IAccessible/__init__.py)

### Code Review Checklist:
- [ ] Documentation:
  - Change log entry
  - User Documentation
  - Developer / Technical Documentation
  - Context sensitive help for GUI changes
- [ ] Testing:
  - Unit tests
  - System (end to end) tests
  - Manual testing
- [ ] UX of all users considered:
  - Speech
  - Braille
  - Low Vision
  - Different web browsers
  - Localization in other languages / culture than English
- [ ] API is compatible with existing add-ons.
- [ ] Security precautions taken.
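The PR's `wordSegUtils` module performs offset conversion between Python string indices and encoded offsets (the `computedStrToEncodedOffsets` attribute appears in the review below). Assuming the target encoding is UTF-16, as is typical for Windows text APIs, the mapping can be sketched as follows. The function name is hypothetical and this is not NVDA code.

```python
def strToEncodedOffsets(text: str) -> list[int]:
    """Map each Python str index to its starting UTF-16 code-unit offset.
    Characters outside the Basic Multilingual Plane take two UTF-16 units,
    so str offsets and encoded offsets diverge after such a character."""
    offsets = []
    encoded = 0
    for ch in text:
        offsets.append(encoded)
        # Number of UTF-16 code units for this character (2 bytes per unit).
        encoded += len(ch.encode("utf-16-le")) // 2
    return offsets
```

Common Chinese characters sit in the BMP and occupy one UTF-16 unit each, so for purely Chinese text the mapping is the identity; the table only matters once supplementary-plane characters (some emoji, rare ideographs) appear.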
### Link to issue number:
Blocked by #18735, related to #1890 (comment), and [a part of OSPP 2025](https://summer-ospp.ac.cn/org/prodetail/25d3e0488?list=org&navpage=org) of NVDA.

### Summary of the issue:

### Description of user facing changes:
Braille output for Chinese becomes easier to read.

### Description of developer facing changes:

### Description of development approach:

### Testing strategy:

### Known issues with pull request:
The displayed braille and text are not aligned with each other in Braille Viewer. This shouldn't affect reading on a braille display, since only braille is transferred and the cursors of the raw text and the output braille are aligned.

### Code Review Checklist:
- [ ] Documentation:
  - Change log entry
  - User Documentation
  - Developer / Technical Documentation
  - Context sensitive help for GUI changes
- [ ] Testing:
  - Unit tests
  - System (end to end) tests
  - Manual testing
- [ ] UX of all users considered:
  - Speech
  - Braille
  - Low Vision
  - Different web browsers
  - Localization in other languages / culture than English
- [ ] API is compatible with existing add-ons.
- [ ] Security precautions taken.
Pull Request Overview
This PR merges Chinese Word Segmentation functionality into NVDA, enabling Chinese text to be navigated by word using built-in input gestures. The implementation adds support for the cppjieba library for Chinese word segmentation, creates a configurable word segmentation framework, and includes braille output enhancements for Chinese text with word separators.
Key changes:
- Integrated cppjieba library as a git submodule for Chinese word segmentation
- Added word segmentation configuration options in Document Navigation settings
- Implemented WordSegmenter class with multiple segmentation strategies (Uniscribe, Chinese)
Reviewed Changes
Copilot reviewed 28 out of 29 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| user_docs/en/changes.md | Documents new Chinese word navigation features and API changes |
| tests/unit/test_textUtils.py | Adds unit tests for WordSegmenter class |
| source/textUtils/wordSeg/wordSegUtils.py | Implements word segmentation utilities and offset conversion |
| source/textUtils/wordSeg/wordSegStrategy.py | Defines word segmentation strategies and cppjieba integration |
| source/textUtils/wordSeg/`__init__.py` | Provides initialization for word segmentation module |
| source/textUtils/segFlag.py | Defines character and word segmentation flags |
| source/textUtils/`__init__.py` | Adds WordSegmenter class for text segmentation |
| source/textInfos/offsets.py | Updates OffsetsTextInfo to use new segmentation flags |
| source/setup.py | Includes cppjieba dictionaries in distribution |
| source/gui/settingsDialogs.py | Adds word segmentation UI controls to settings |
| source/displayModel.py | Overrides segmentation flags for display model |
| source/core.py | Initializes word segmentation module on startup |
| source/config/featureFlagEnums.py | Defines WordNavigationUnitFlag enumeration |
| source/config/configSpec.py | Adds word segmentation configuration options |
| source/braille.py | Implements Chinese word separator insertion for braille |
| source/NVDAObjects/window/edit.py | Enforces Uniscribe segmentation for edit controls |
| projectDocs/dev/createDevEnvironment.md | Documents cppjieba dependency |
| nvdaHelper/cppjieba/sconscript | Build script for cppjieba library |
| nvdaHelper/cppjieba/cppjieba.hpp | Header file for cppjieba wrapper |
| nvdaHelper/cppjieba/cppjieba.def | Export definitions for cppjieba DLL |
| nvdaHelper/cppjieba/cppjieba.cpp | Implementation of cppjieba wrapper |
| nvdaHelper/archBuild_sconscript | Integrates cppjieba into build process |
| include/readme.md | Documents cppjieba as a new dependency |
| include/cppjieba | Git submodule reference for cppjieba |
| copying.txt | Adds cppjieba MIT license notice |
| .gitmodules | Registers cppjieba as git submodule |
| .gitignore | Excludes generated cppjieba files |
| .github/workflows/testAndPublish.yml | Updates to checkout submodules recursively |
| .gitattributes | Adds .hpp file handling for C++ headers |
```md
* In the Add-on Store, a new action has been added to see the latest changes for the current version of add-ons. (#14041, @josephsl, @nvdaes)
* Chinese text can be navigated by word via build-in input gestures.
  Several GUI elements are added for its configuration in `Document Navigation` panel. (#18735, @CrazySteve0605)
* Braille output for Chinese contains spaces as word separaters. (#18865, @CrazySteve0605)
```
Copilot (AI) commented on Nov 5, 2025:
Corrected spelling of 'separaters' to 'separators'.
Suggested change:

```diff
-* Braille output for Chinese contains spaces as word separaters. (#18865, @CrazySteve0605)
+* Braille output for Chinese contains spaces as word separators. (#18865, @CrazySteve0605)
```
```python
	@classmethod
	@initializerRegistry
	def _initCppJieba(cls, forceInit: bool = False):  # TODO: make cppjieba alternative
```
Copilot (AI) commented on Nov 5, 2025:
[nitpick] The TODO comment 'make cppjieba alternative' is unclear. Consider elaborating on what alternative is needed or creating a tracking issue.
Suggested change:

```diff
-	def _initCppJieba(cls, forceInit: bool = False):  # TODO: make cppjieba alternative
+	def _initCppJieba(cls, forceInit: bool = False):  # TODO: Implement a fallback for cppjieba (e.g., use a pure Python segmentation library like jieba, or create a stub). See issue #1234 for details.
```
```python
		self.wordSegFlag: WordSegFlag = wordSegFlag
		self.strategy: wordSegStrategy.WordSegmentationStrategy = self._chooseStrategy()

	def _chooseStrategy(self) -> wordSegStrategy.WordSegmentationStrategy:  # TODO: optimize
```
Copilot (AI) commented on Nov 5, 2025:
[nitpick] The TODO comment 'optimize' is vague. Consider specifying what optimization is needed or creating a tracking issue with details.
Suggested change:

```diff
-	def _chooseStrategy(self) -> wordSegStrategy.WordSegmentationStrategy:  # TODO: optimize
+	def _chooseStrategy(self) -> wordSegStrategy.WordSegmentationStrategy:  # TODO: Consider optimizing for performance if text is very large. See issue #1234.
```
```python
			case config.featureFlagEnums.WordNavigationUnitFlag.CHINESE:
				return WordSegFlag.CHINESE
			case _:
				log.error(f"Unknown word segmentation standard, {self.__wordSegConf.calculated()!r}")
```
Copilot (AI) commented on Nov 5, 2025:
Using self.__wordSegConf with double underscore causes name mangling; should be self.wordSegConf to match the attribute defined in __init__.
Suggested change:

```diff
-				log.error(f"Unknown word segmentation standard, {self.__wordSegConf.calculated()!r}")
+				log.error(f"Unknown word segmentation standard, {self.wordSegConf.calculated()!r}")
```
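The mangling behavior Copilot describes can be demonstrated in isolation. The class and attribute names below are illustrative, not NVDA code: inside a class body, `self.__name` is rewritten to `self._ClassName__name`, so it does not find an attribute that was assigned without the double underscore.

```python
class Segmenter:
    def __init__(self):
        self.wordSegConf = "unicode"  # public attribute, single name

    def lookup(self):
        # `self.__wordSegConf` is mangled to `self._Segmenter__wordSegConf`,
        # which was never assigned, so this raises AttributeError.
        return self.__wordSegConf


s = Segmenter()
try:
    s.lookup()
    ok = False
except AttributeError:
    ok = True
```

This is why the suggestion changes `self.__wordSegConf` to `self.wordSegConf`: the mangled lookup would fail at runtime the first time the error branch executes.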
```python
				Thread(target=callable_to_call, args=args, kwargs=kwargs, daemon=True).start()
			except Exception as e:
				log.debug("Initializer %s.%s failed: %s", module_name, qualname, e)
		return
```
Copilot (AI) commented on Nov 5, 2025:
The return statement at line 46 exits the function after the first initializer, preventing subsequent initializers from running. Move the return outside the loop or remove it.
Suggested change:

```diff
-		return
```
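The bug pattern flagged above, with its fix, in a minimal standalone form. The function names are illustrative, not NVDA code: a `return` inside the loop body exits after the first iteration, silently skipping the remaining initializers.

```python
def runAllBuggy(initializers):
    results = []
    for init in initializers:
        results.append(init())
        return results  # BUG: exits the function after the first initializer


def runAllFixed(initializers):
    results = []
    for init in initializers:
        results.append(init())
    return results  # return moved outside the loop: all initializers run
```

Such bugs are easy to miss in review because the code still runs without error; only the later initializers never execute.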
```python
		if strStart == 0:
			resultStart = 0
		else:
			resultStart = self.computedStrToEncodedOffsets[strStart]
```
Copilot (AI) commented on Nov 5, 2025:
[nitpick] Consider adding bounds checking before accessing computedStrToEncodedOffsets[strStart] to prevent IndexError if strStart is out of range.
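A standalone sketch of the suggested bounds checking. The function name and error handling are illustrative assumptions, not NVDA code:

```python
def safeOffsetAt(offsets: list[int], strStart: int) -> int:
    """Return the mapped offset for strStart, with explicit bounds checking
    so an out-of-range index fails loudly instead of raising a bare
    IndexError deep inside offset computation."""
    if strStart == 0:
        return 0
    if strStart < 0 or strStart >= len(offsets):
        raise ValueError(
            f"strStart {strStart} out of range for {len(offsets)} offsets"
        )
    return offsets[strStart]
```

Whether to raise, clamp to the last offset, or fall back to `strStart` itself is a design choice the PR would need to make based on how callers recover from bad offsets.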
```python
cppjiebaLib = env.SharedLibrary(target="cppjieba", source=sourceFiles)

if not os.path.exists(outDir.Dir("dicts").get_abspath()) or not os.listdir(outDir.Dir("dicts").get_abspath()):  # insure dicts installation happens only once and avoid a scons' warning
```
Copilot (AI) commented on Nov 5, 2025:
Corrected spelling of 'insure' to 'ensure' in comment.
Suggested change:

```diff
-if not os.path.exists(outDir.Dir("dicts").get_abspath()) or not os.listdir(outDir.Dir("dicts").get_abspath()):  # insure dicts installation happens only once and avoid a scons' warning
+if not os.path.exists(outDir.Dir("dicts").get_abspath()) or not os.listdir(outDir.Dir("dicts").get_abspath()):  # ensure dicts installation happens only once and avoid a scons' warning
```
Will we do any polishing or cleanup work in this PR?

@cary-rowen - yes - this is WIP

What are our main tasks in this PR? Is it just cleaning up the code? Is it possible to include bug fixes?

Yes. I locally finished fixing the remaining issues in #18865 and will make a PR soon, so that this functionality can be integrated more properly.

@cary-rowen @CrazySteve0605 - yes, PRs to this branch that improve the code quality and fix issues would be appreciated. Particularly resolving conflicts.

As we discussed offline, @CrazySteve0605: according to feedback from the community, there are still some cases that need improvement. I think that once the known to-do items for this PR are resolved and merged into master, we can provide follow-up polish PRs. For now, Copilot has provided some comments worth addressing.