CrazySteve0605 and others added 30 commits July 23, 2025 17:30
- Add `cppjieba` as a Git submodule under `third_party/cppjieba/` to provide
  robust Chinese word segmentation capabilities.
- Update `.gitmodules` to point to the official `cppjieba` repository and
  configure it to track the `master` branch.
- Update `sconscript` to include the paths of `cppjieba` and its dependency `limonp`
- Modify `copying.txt` to include the `cppjieba` license (MIT) alongside the
  project’s existing license, ensuring proper attribution and compliance.
- Update documentation
Co-authored-by: Sean Budd <seanbudd123@gmail.com>
Co-authored-by: Sean Budd <seanbudd123@gmail.com>
- Introduce `JiebaSingleton` class in `cppjieba.hpp`/`cppjieba.cpp`, with a `.def` file, under `nvdaHelper/cppjieba/`
  - Inherits from `cppjieba::Jieba` and exposes a thread-safe `getOffsets()` method
  - Implements Meyers’ singleton via `getInstance()` with a private constructor
  - Deletes copy constructor, copy assignment, move constructor, and move assignment to enforce single instance
- Add C-style API in the same module:
  - `int initJieba()` to force singleton initialization
  - `int segmentOffsets(const char* text, int** charOffsets, int* outLen)` to perform segmentation and return character offsets
  - `void freeOffsets(int* ptr)` to release allocated offset buffer
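As an illustration of the contract these C functions imply — with the caveat that whether `charOffsets` holds word start or end offsets is an assumption, and the helpers below are a pure-Python stand-in rather than the cppjieba wrapper itself:

```python
# Illustrative only: a pure-Python stand-in for the segmentOffsets contract,
# assuming the returned offsets are cumulative word-end character offsets.
# Real segmentation happens in the cppjieba C++ wrapper.

def segment_offsets(text: str, words: list[str]) -> list[int]:
    """Given a text and its segmentation, return cumulative end offsets,
    e.g. "我们在学习" split as ["我们", "在", "学习"] -> [2, 3, 5]."""
    offsets = []
    pos = 0
    for word in words:
        assert text[pos:pos + len(word)] == word, "words must tile the text"
        pos += len(word)
        offsets.append(pos)
    return offsets

def offsets_to_ranges(offsets: list[int]) -> list[tuple[int, int]]:
    """Convert end offsets into (start, end) word ranges for navigation."""
    return [(start, end) for start, end in zip([0] + offsets[:-1], offsets)]
```

A caller holding the offsets buffer can thus reconstruct every word range and then release the buffer (the role `freeOffsets` plays on the C side).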
- Change `submodules` in `jobs - buildNVDA - Build NVDA - Checkout NVDA` from `true` to `recursive` to ensure cppjieba's submodules are fetched.
- This also causes sonic's submodule to be fetched, which currently appears unused.
Co-authored-by: Sean Budd <seanbudd123@gmail.com>
- create `WordSegmentationStrategy` as an abstract base class for selecting a segmentation strategy based on text content, following the Strategy pattern
- implement `ChineseWordSegmentationStrategy` (for Chinese text)
- implement `UniscribeWordSegmentationStrategy` (for other languages as default strategy)
add `WordSegmenter` class for word segmentation, integrating segmentation strategies
- replace `useUniscribe` with `useUniscribeForCharOffset` & `useWordSegmenterForWordOffset` for segmentation extensions
- integrate `WordSegmenter` for calculating word offsets
align `DisplayModelTextInfo`'s flags with those of its superclass
- redesign 2 properties of `OffsetsTextInfo` as enums to make them more configurable, inspired by @LeonarddeR
- override them in some subclasses to simulate specific behaviors
michaelDCurran and others added 13 commits October 10, 2025 07:17
…eText

### Link to issue number:
blocked by #18548, closes #4075, related to #16237, and [a part of OSPP 2025](https://summer-ospp.ac.cn/org/prodetail/25d3e0488?list=org&navpage=org) for NVDA.

### Summary of the issue:

### Description of user facing changes:
Chinese text can be navigated by word via system caret or review cursor.
### Description of developer facing changes:

### Description of development approach:
- [ ] update `textUtils`
	- [x] add `WordSegment` module
		- [x] create `WordSegmentationStrategy` as an abstract base class for selecting a segmentation strategy based on text content, following the Strategy pattern
		- [x] implement `ChineseWordSegmentationStrategy` (for Chinese text)
		- [x] implement `UniscribeWordSegmentationStrategy` (for other languages as default strategy)
	- [x] update `textUtils/__init__.py`
		- [x] add `WordSegmenter` class for word segmentation, integrating segmentation strategies
- [x] update `textInfos/offsets.py`
	- [x] replace `useUniscribe` with `useUniscribeForCharOffset` & `useWordSegmenterForWordOffset` for segmentation extensions
	- [x] integrate `WordSegmenter` for calculating word offsets
- [ ] update documentation
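The Strategy-pattern layout in the checklist above can be sketched as follows. Class names follow the PR; the method names, the CJK heuristic, and both placeholder segmentation rules are assumptions for illustration, not NVDA's actual implementation:

```python
from abc import ABC, abstractmethod

class WordSegmentationStrategy(ABC):
    @abstractmethod
    def getWordOffsets(self, text: str, offset: int) -> tuple[int, int]:
        """Return the (start, end) offsets of the word containing `offset`."""

class UniscribeWordSegmentationStrategy(WordSegmentationStrategy):
    # Default strategy; the real one calls into Uniscribe. A naive
    # whitespace rule stands in here purely as a placeholder.
    def getWordOffsets(self, text, offset):
        start = offset
        while start > 0 and not text[start - 1].isspace():
            start -= 1
        end = offset
        while end < len(text) and not text[end].isspace():
            end += 1
        return (start, end)

class ChineseWordSegmentationStrategy(WordSegmentationStrategy):
    # The real strategy delegates to cppjieba; a fixed rule of
    # two-character words stands in here purely as a placeholder.
    def getWordOffsets(self, text, offset):
        start = offset - (offset % 2)
        return (start, min(start + 2, len(text)))

def _containsCjk(text: str) -> bool:
    # Crude heuristic: any character in the CJK Unified Ideographs block.
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

class WordSegmenter:
    """Picks a strategy based on text content, as described above."""
    def __init__(self, text: str):
        self.text = text
        self.strategy = (
            ChineseWordSegmentationStrategy()
            if _containsCjk(text)
            else UniscribeWordSegmentationStrategy()
        )

    def getWordOffsets(self, offset: int) -> tuple[int, int]:
        return self.strategy.getWordOffsets(self.text, offset)
```

The point of the pattern here is that `OffsetsTextInfo` only talks to `WordSegmenter`, so new strategies (or a cppjieba alternative) can be added without touching callers.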

### Testing strategy:

### Known issues with pull request:
Word segmentation functionality is integrated into `OffsetsTextInfo`.
In `OffsetsTextInfo`, word segmentation is based on Uniscribe by default, with Unicode segmentation as a fallback.

Subclasses of `OffsetsTextInfo`:

#### Supported
1. `NVDAObjectTextInfo`
2. `InputCompositionTextInfo`
3. `JABTextInfo`
4. `SimpleResultTextInfo`
5. `VirtualBufferTextInfo`: uses its own function to calculate offsets and invokes its superclass's `_getWordOffsets`

#### Unsupported
1. `DisplayModelTextInfo`: word segmentation is disabled outright
2. `EditTextInfo`: always uses Uniscribe
3. `ScintillaTextInfo`: entirely uses its own definition of a 'word', for Scintilla-based editors such as Notepad++ (`source/NVDAObjects/window/scintilla.py`)
4. `VsTextEditPaneTextInfo`: uses a special COM automation API, for Microsoft Visual Studio and Microsoft SQL Server Management Studio (`source/appModules/devenv.py`)
5. `TextFrameTextInfo`: uses its own definition of a 'word' based on PowerPoint's COM object, for PowerPoint's text frames (`source/appModules/powerpnt.py`)
6. `LwrTextInfo`: based on words pre-computed during the related process, for structured text using `LineWordResult` (e.g. Windows OCR) (`source/contentRecog/__init__.py`)

#### Partially supported
Some architectures entirely or preferentially use a native definition of a 'word'; software depending on them may not be able to use the new functionality.
1. `IA2TextTextInfo`: overrides and falls back (`source/NVDAObjects/IAccessible/__init__.py`)
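The "overrides and falls back" behaviour mentioned above can be sketched like this. Only `_getWordOffsets` is a name from the PR; the surrounding skeleton, the `nativeWordRanges` parameter, and the base-class placeholder are hypothetical:

```python
class OffsetsTextInfoSketch:
    """Stand-in for OffsetsTextInfo: the base implementation that would
    use the new WordSegmenter (a trivial placeholder rule here)."""
    def __init__(self, text: str):
        self.text = text

    def _getWordOffsets(self, offset: int) -> tuple[int, int]:
        # Placeholder for segmenter-based word offsets.
        return (max(0, offset - 1), min(len(self.text), offset + 1))

class IA2LikeTextInfo(OffsetsTextInfoSketch):
    """Prefers the accessibility API's native notion of a word, and falls
    back to the superclass when the native API gives nothing usable."""
    def __init__(self, text, nativeWordRanges=None):
        super().__init__(text)
        self._nativeWordRanges = nativeWordRanges or []

    def _getWordOffsets(self, offset):
        for start, end in self._nativeWordRanges:
            if start <= offset < end:
                return (start, end)
        # Native API had no answer: fall back to the base behaviour.
        return super()._getWordOffsets(offset)
```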

### Code Review Checklist:

<!--
This checklist is a reminder of things commonly forgotten in a new PR.
Authors, please do a self-review of this pull-request.
Check items to confirm you have thought about the relevance of the item.
Where items are missing (eg unit / system tests), please explain in the PR.
To check an item `- [ ]` becomes `- [x]`, note spacing.
You can also check the checkboxes after the PR is created.
A detailed explanation of this checklist is available here:
https://github.com/nvaccess/nvda/blob/master/projectDocs/dev/githubPullRequestTemplateExplanationAndExamples.md#code-review-checklist
-->

- [ ] Documentation:
  - Change log entry
  - User Documentation
  - Developer / Technical Documentation
  - Context sensitive help for GUI changes
- [ ] Testing:
  - Unit tests
  - System (end to end) tests
  - Manual testing
- [ ] UX of all users considered:
  - Speech
  - Braille
  - Low Vision
  - Different web browsers
  - Localization in other languages / culture than English
- [ ] API is compatible with existing add-ons.
- [ ] Security precautions taken.
### Link to issue number:
blocked by #18735, related #1890 (comment), and [a part of OSPP 2025](https://summer-ospp.ac.cn/org/prodetail/25d3e0488?list=org&navpage=org) of NVDA.

### Summary of the issue:

### Description of user facing changes:
Braille output for Chinese becomes easier to read.
### Description of developer facing changes:

### Description of development approach:

### Testing strategy:

### Known issues with pull request:
The displayed braille and text are not aligned with each other in the Braille Viewer. This shouldn't affect reading on a braille display, since only braille is transferred and the cursors of the raw text and the output braille are aligned.
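The alignment concern above comes down to keeping a raw-position to output-position map when separators are inserted. A toy sketch of that bookkeeping (not NVDA's braille code; the function name and its signature are invented for illustration):

```python
def insertWordSeparators(text: str, wordEndOffsets: list[int]) -> tuple[str, list[int]]:
    """Insert a space after each word end (except the last) and return the
    new text plus rawToOut, where rawToOut[i] is where raw character i
    landed, so cursor routing stays aligned despite the extra cells."""
    out = []
    rawToOut = []
    ends = set(wordEndOffsets)
    for i, ch in enumerate(text):
        rawToOut.append(len(out))
        out.append(ch)
        if i + 1 in ends and i + 1 != len(text):
            out.append(" ")  # word separator
    return "".join(out), rawToOut
```

With such a map, a routing key press on an output cell can be translated back to the raw-text cursor even though the two strings differ in length.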
### Code Review Checklist:

<!--
This checklist is a reminder of things commonly forgotten in a new PR.
Authors, please do a self-review of this pull-request.
Check items to confirm you have thought about the relevance of the item.
Where items are missing (eg unit / system tests), please explain in the PR.
To check an item `- [ ]` becomes `- [x]`, note spacing.
You can also check the checkboxes after the PR is created.
A detailed explanation of this checklist is available here:
https://github.com/nvaccess/nvda/blob/master/projectDocs/dev/githubPullRequestTemplateExplanationAndExamples.md#code-review-checklist
-->

- [ ] Documentation:
  - Change log entry
  - User Documentation
  - Developer / Technical Documentation
  - Context sensitive help for GUI changes
- [ ] Testing:
  - Unit tests
  - System (end to end) tests
  - Manual testing
- [ ] UX of all users considered:
  - Speech
  - Braille
  - Low Vision
  - Different web browsers
  - Localization in other languages / culture than English
- [ ] API is compatible with existing add-ons.
- [ ] Security precautions taken.
@seanbudd seanbudd requested a review from a team as a code owner November 5, 2025 01:53
Contributor

Copilot AI left a comment
Pull Request Overview

This PR merges Chinese Word Segmentation functionality into NVDA, enabling Chinese text to be navigated by word using built-in input gestures. The implementation adds support for the cppjieba library for Chinese word segmentation, creates a configurable word segmentation framework, and includes braille output enhancements for Chinese text with word separators.

Key changes:

  • Integrated cppjieba library as a git submodule for Chinese word segmentation
  • Added word segmentation configuration options in Document Navigation settings
  • Implemented WordSegmenter class with multiple segmentation strategies (Uniscribe, Chinese)

Reviewed Changes

Copilot reviewed 28 out of 29 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| `user_docs/en/changes.md` | Documents new Chinese word navigation features and API changes |
| `tests/unit/test_textUtils.py` | Adds unit tests for the WordSegmenter class |
| `source/textUtils/wordSeg/wordSegUtils.py` | Implements word segmentation utilities and offset conversion |
| `source/textUtils/wordSeg/wordSegStrategy.py` | Defines word segmentation strategies and cppjieba integration |
| `source/textUtils/wordSeg/__init__.py` | Provides initialization for the word segmentation module |
| `source/textUtils/segFlag.py` | Defines character and word segmentation flags |
| `source/textUtils/__init__.py` | Adds WordSegmenter class for text segmentation |
| `source/textInfos/offsets.py` | Updates OffsetsTextInfo to use new segmentation flags |
| `source/setup.py` | Includes cppjieba dictionaries in distribution |
| `source/gui/settingsDialogs.py` | Adds word segmentation UI controls to settings |
| `source/displayModel.py` | Overrides segmentation flags for the display model |
| `source/core.py` | Initializes the word segmentation module on startup |
| `source/config/featureFlagEnums.py` | Defines the WordNavigationUnitFlag enumeration |
| `source/config/configSpec.py` | Adds word segmentation configuration options |
| `source/braille.py` | Implements Chinese word separator insertion for braille |
| `source/NVDAObjects/window/edit.py` | Enforces Uniscribe segmentation for edit controls |
| `projectDocs/dev/createDevEnvironment.md` | Documents the cppjieba dependency |
| `nvdaHelper/cppjieba/sconscript` | Build script for the cppjieba library |
| `nvdaHelper/cppjieba/cppjieba.hpp` | Header file for the cppjieba wrapper |
| `nvdaHelper/cppjieba/cppjieba.def` | Export definitions for the cppjieba DLL |
| `nvdaHelper/cppjieba/cppjieba.cpp` | Implementation of the cppjieba wrapper |
| `nvdaHelper/archBuild_sconscript` | Integrates cppjieba into the build process |
| `include/readme.md` | Documents cppjieba as a new dependency |
| `include/cppjieba` | Git submodule reference for cppjieba |
| `copying.txt` | Adds the cppjieba MIT license notice |
| `.gitmodules` | Registers cppjieba as a git submodule |
| `.gitignore` | Excludes generated cppjieba files |
| `.github/workflows/testAndPublish.yml` | Updates to check out submodules recursively |
| `.gitattributes` | Adds `.hpp` file handling for C++ headers |


* In the Add-on Store, a new action has been added to see the latest changes for the current version of add-ons. (#14041, @josephsl, @nvdaes)
* Chinese text can be navigated by word via built-in input gestures.
Several GUI elements are added for its configuration in the `Document Navigation` panel. (#18735, @CrazySteve0605)
* Braille output for Chinese contains spaces as word separaters. (#18865, @CrazySteve0605)
Copilot AI Nov 5, 2025

Corrected spelling of 'separaters' to 'separators'.

Suggested change
* Braille output for Chinese contains spaces as word separaters. (#18865, @CrazySteve0605)
* Braille output for Chinese contains spaces as word separators. (#18865, @CrazySteve0605)


@classmethod
@initializerRegistry
def _initCppJieba(cls, forceInit: bool = False): # TODO: make cppjieba alternative
Copilot AI Nov 5, 2025

[nitpick] The TODO comment 'make cppjieba alternative' is unclear. Consider elaborating on what alternative is needed or creating a tracking issue.

Suggested change
def _initCppJieba(cls, forceInit: bool = False): # TODO: make cppjieba alternative
def _initCppJieba(cls, forceInit: bool = False): # TODO: Implement a fallback for cppjieba (e.g., use a pure Python segmentation library like jieba, or create a stub). See issue #1234 for details.

self.wordSegFlag: WordSegFlag = wordSegFlag
self.strategy: wordSegStrategy.WordSegmentationStrategy = self._chooseStrategy()

def _chooseStrategy(self) -> wordSegStrategy.WordSegmentationStrategy: # TODO: optimize
Copilot AI Nov 5, 2025

[nitpick] The TODO comment 'optimize' is vague. Consider specifying what optimization is needed or creating a tracking issue with details.

Suggested change
def _chooseStrategy(self) -> wordSegStrategy.WordSegmentationStrategy: # TODO: optimize
def _chooseStrategy(self) -> wordSegStrategy.WordSegmentationStrategy: # TODO: Consider optimizing for performance if text is very large. See issue #1234.

case config.featureFlagEnums.WordNavigationUnitFlag.CHINESE:
return WordSegFlag.CHINESE
case _:
log.error(f"Unknown word segmentation standard, {self.__wordSegConf.calculated()!r}")
Copilot AI Nov 5, 2025

Using self.__wordSegConf with double underscore causes name mangling; should be self.wordSegConf to match the attribute defined in __init__.

Suggested change
log.error(f"Unknown word segmentation standard, {self.__wordSegConf.calculated()!r}")
log.error(f"Unknown word segmentation standard, {self.wordSegConf.calculated()!r}")

Thread(target=callable_to_call, args=args, kwargs=kwargs, daemon=True).start()
except Exception as e:
log.debug("Initializer %s.%s failed: %s", module_name, qualname, e)
return
Copilot AI Nov 5, 2025

The return statement at line 46 exits the function after the first initializer, preventing subsequent initializers from running. Move the return outside the loop or remove it.

Suggested change
return

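As the comment above suggests, moving the `return` out of the loop lets every registered initializer run. A minimal corrected shape, where the registry structure and names are assumptions extrapolated from the quoted snippet:

```python
from threading import Thread

def runInitializers(registry):
    """Start every registered initializer on its own daemon thread.
    `registry` is assumed to be an iterable of (callable, args, kwargs)."""
    threads = []
    for callable_to_call, args, kwargs in registry:
        try:
            t = Thread(target=callable_to_call, args=args, kwargs=kwargs, daemon=True)
            t.start()
            threads.append(t)
        except Exception as e:
            # Log and continue: one bad initializer must not block the rest.
            print(f"Initializer {callable_to_call!r} failed: {e}")
    # No early return inside the loop: every initializer gets a chance.
    return threads
```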
if strStart == 0:
resultStart = 0
else:
resultStart = self.computedStrToEncodedOffsets[strStart]
Copilot AI Nov 5, 2025

[nitpick] Consider adding bounds checking before accessing computedStrToEncodedOffsets[strStart] to prevent IndexError if strStart is out of range.

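For context, a toy version of a str-to-UTF-16 offset mapping with the suggested bounds check. The real shape of `computedStrToEncodedOffsets` in `wordSegUtils.py` may differ; this is an assumption for illustration:

```python
def buildStrToUtf16Offsets(text: str) -> list[int]:
    """offsets[i] = position of str index i in UTF-16 code units.
    Characters outside the BMP occupy two code units (a surrogate pair)."""
    offsets = []
    acc = 0
    for ch in text:
        offsets.append(acc)
        acc += len(ch.encode("utf-16-le")) // 2
    offsets.append(acc)  # one-past-the-end, for end-offset lookups
    return offsets

def strToEncoded(offsets: list[int], strStart: int) -> int:
    # Bounds check per the review comment, instead of a bare index.
    if not 0 <= strStart < len(offsets):
        raise IndexError(f"str offset {strStart} out of range")
    return offsets[strStart]
```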

cppjiebaLib = env.SharedLibrary(target="cppjieba", source=sourceFiles)

if not os.path.exists(outDir.Dir("dicts").get_abspath()) or not os.listdir(outDir.Dir("dicts").get_abspath()): # insure dicts installation happens only once and avoid a scons' warning
Copilot AI Nov 5, 2025

Corrected spelling of 'insure' to 'ensure' in comment.

Suggested change
if not os.path.exists(outDir.Dir("dicts").get_abspath()) or not os.listdir(outDir.Dir("dicts").get_abspath()): # insure dicts installation happens only once and avoid a scons' warning
if not os.path.exists(outDir.Dir("dicts").get_abspath()) or not os.listdir(outDir.Dir("dicts").get_abspath()): # ensure dicts installation happens only once and avoid a scons' warning

@cary-rowen
Contributor

Will we do any polishing or cleanup work in this PR?

@seanbudd seanbudd marked this pull request as draft November 5, 2025 07:34
@seanbudd
Member Author

seanbudd commented Nov 5, 2025

@cary-rowen - yes - this is WIP

@seanbudd seanbudd changed the title Merge Chinese Word Segmentation work [WIP] Merge Chinese Word Segmentation work Nov 5, 2025
@seanbudd seanbudd added the conceptApproved Similar 'triaged' for issues, PR accepted in theory, implementation needs review. label Nov 11, 2025
@cary-rowen
Contributor

What are our main tasks in this PR? Is it just cleaning up the code? Is it possible to include bug fixes?
cc @CrazySteve0605
According to OSPP guidelines, is this still considered part of your job responsibilities?

@CrazySteve0605
Contributor

CrazySteve0605 commented Nov 20, 2025

cc @CrazySteve0605
According to OSPP guidelines, is this still considered part of your job responsibilities?

Yes. I've finished fixing the remaining issues from #18865 locally, and will open a PR soon so this functionality can be integrated more properly.

@seanbudd
Member Author

seanbudd commented Nov 21, 2025

@cary-rowen @CrazySteve0605 - yes, PRs to this branch that improve the code quality and fix issues would be appreciated. Particularly resolving conflicts

@cary-rowen
Contributor

As we discussed offline @CrazySteve0605

According to community feedback, there are still some cases that need improvement.

I think that once the known to-do items for this PR are resolved and merged into master, we can provide follow-up polish PRs.

For now, Copilot has provided some comments worth addressing.

@seanbudd seanbudd added the merge-early Merge Early in a developer cycle label Dec 8, 2025