[WIP] Merge Chinese Word Segmentation work #19166
base: master
Conversation
- Add `cppjieba` as a Git submodule under `third_party/cppjieba/` to provide robust Chinese word segmentation capabilities.
- Update `.gitmodules` to point to the official `cppjieba` repository and configure it to track the `master` branch.
- Update `sconscript` to include the paths of `cppjieba` and its dependency `limonp`.
- Modify `copying.txt` to include the `cppjieba` license (MIT) alongside the project's existing license, ensuring proper attribution and compliance.
- Update documentation.
Co-authored-by: Sean Budd <seanbudd123@gmail.com>
- Introduce a `JiebaSingleton` class in `cppjieba.hpp`/`cppjieba.cpp`, with a def file, under `nvdaHelper/cppjieba/`:
  - Inherits from `cppjieba::Jieba` and exposes a thread-safe `getOffsets()` method.
  - Implements Meyers' singleton via `getInstance()` with a private constructor.
  - Deletes the copy constructor, copy assignment, move constructor, and move assignment to enforce a single instance.
- Add a C-style API in the same module:
  - `int initJieba()` to force singleton initialization.
  - `int segmentOffsets(const char* text, int** charOffsets, int* outLen)` to perform segmentation and return character offsets.
  - `void freeOffsets(int* ptr)` to release the allocated offset buffer.
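The commit above describes a C++ Meyers' singleton; as a hedged illustration of the same thread-safety idea, here is a minimal Python analog. The class and method names merely mirror the PR's wrapper and this sketch is not NVDA code.

```python
import threading


class JiebaSingleton:
    """Single shared segmenter instance; construction is locked so that
    concurrent first calls cannot create two instances."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Mirrors the C++ private constructor: direct construction is forbidden.
        raise TypeError("Use JiebaSingleton.get_instance() instead")

    @classmethod
    def get_instance(cls) -> "JiebaSingleton":
        if cls._instance is None:  # fast path once initialized
            with cls._lock:
                if cls._instance is None:  # double-checked locking
                    # object.__new__ bypasses the raising __new__ above.
                    cls._instance = super().__new__(cls)
        return cls._instance
```

In C++ the same guarantee comes for free from a `static` local variable inside `getInstance()`, which the standard requires to be initialized exactly once even under concurrency.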
- Change `submodules` in the "Checkout NVDA" step of the `buildNVDA` job from `true` to `recursive` to ensure cppjieba's own submodule is fetched.
- This also causes sonic's submodule to be fetched, which currently appears unused.
Co-authored-by: Sean Budd <seanbudd123@gmail.com>
merge for Python updated
- Create `WordSegmentationStrategy` as an abstract base class to select a segmentation strategy based on text content, following the Strategy Pattern.
- Implement `ChineseWordSegmentationStrategy` (for Chinese text).
- Implement `UniscribeWordSegmentationStrategy` (the default strategy, for other languages).
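The strategy selection described above can be sketched as follows. The class names follow the PR, but the CJK-detection heuristic, method signatures, and placeholder segmentation bodies are illustrative assumptions, not NVDA code.

```python
from abc import ABC, abstractmethod


def _containsChinese(text: str) -> bool:
    # Heuristic: check the CJK Unified Ideographs basic block only;
    # real code may cover additional ranges.
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


class WordSegmentationStrategy(ABC):
    @abstractmethod
    def segment(self, text: str) -> list[str]: ...


class ChineseWordSegmentationStrategy(WordSegmentationStrategy):
    def segment(self, text: str) -> list[str]:
        # Placeholder: the real implementation delegates to cppjieba.
        return list(text)


class UniscribeWordSegmentationStrategy(WordSegmentationStrategy):
    def segment(self, text: str) -> list[str]:
        # Placeholder: the real implementation calls Uniscribe via NVDAHelper.
        return text.split()


def chooseStrategy(text: str) -> WordSegmentationStrategy:
    """Pick a strategy based on text content, per the Strategy Pattern."""
    if _containsChinese(text):
        return ChineseWordSegmentationStrategy()
    return UniscribeWordSegmentationStrategy()
```

The Strategy Pattern keeps the caller (e.g. a `WordSegmenter` facade) independent of which segmentation backend is used, so new languages can be supported by adding a strategy class.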
add `WordSegmenter` class for word segmentation, integrating segmentation strategies
- Replace `useUniscribe` with `useUniscribeForCharOffset` and `useWordSegmenterForWordOffset` for segmentation extensions.
- Integrate `WordSegmenter` for calculating word offsets.
align `DisplayModelTextInfo`'s flags with those of its superclass
- Redesign two properties of `OffsetsTextInfo` as enums to make them more configurable, inspired by @LeonarddeR.
- Override them in some subclasses to simulate specific behaviors.
…eText

### Link to issue number:
Blocked by #18548, closes #4075, related to #16237, and [a part of OSPP 2025](https://summer-ospp.ac.cn/org/prodetail/25d3e0488?list=org&navpage=org) of NVDA.

### Summary of the issue:

### Description of user facing changes:
Chinese text can be navigated by word via the system caret or review cursor.

### Description of developer facing changes:

### Description of development approach:
- [ ] update `textUtils`
  - [x] add `WordSegment` module
    - [x] create `WordSegmentationStrategy` as an abstract base class to select a segmentation strategy based on text content, following the Strategy Pattern
    - [x] implement `ChineseWordSegmentationStrategy` (for Chinese text)
    - [x] implement `UniscribeWordSegmentationStrategy` (for other languages, as the default strategy)
  - [x] update `textUtils/__init__.py`
    - [x] add `WordSegmenter` class for word segmentation, integrating segmentation strategies
- [x] update `textInfos/offsets.py`
  - [x] replace `useUniscribe` with `useUniscribeForCharOffset` & `useWordSegmenterForWordOffset` for segmentation extensions
  - [x] integrate `WordSegmenter` for calculating word offsets
- [ ] update documentation

### Testing strategy:

### Known issues with pull request:
Word segmentation functionality was integrated in `OffsetsTextInfo`, where word segmentation is based on Uniscribe by default, with Unicode as a fallback.

Subclasses of `OffsetsTextInfo`:

#### Supported
1. `NVDAObjectTextInfo`
2. `InputCompositionTextInfo`
3. `JABTextInfo`
4. `SimpleResultTextInfo`
5. `VirtualBufferTextInfo`: uses a self-hosted function to calculate offsets and invokes its superclass's `_getWordOffsets`

#### Unsupported
1. `DisplayModelTextInfo`: simply disabled
2. `EditTextInfo`: always uses Uniscribe
3. `ScintillaTextInfo`: entirely uses a self-defined notion of 'word', for Scintilla-based editors such as Notepad++ (source/NVDAObjects/window/scintilla.py)
4. `VsTextEditPaneTextInfo`: uses a special COM automation API, for Microsoft Visual Studio and Microsoft SQL Server Management Studio (source/appModules/devenv.py)
5. `TextFrameTextInfo`: uses a self-defined notion of 'word' based on PowerPoint's COM object, for PowerPoint's text frames (source/appModules/powerpnt.py)
6. `LwrTextInfo`: based on words pre-computed during the related process, for structured text using `LineWordResult` (e.g. Windows OCR) (source/contentRecog/__init__.py)

#### Partially supported
Some architectures entirely or preferentially use a native definition of a 'word', so software depending on them may not be able to use the new functionality.
1. `IA2TextTextInfo`: overrides and falls back (source/NVDAObjects/IAccessible/__init__.py)

### Code Review Checklist:
- [ ] Documentation:
  - Change log entry
  - User Documentation
  - Developer / Technical Documentation
  - Context sensitive help for GUI changes
- [ ] Testing:
  - Unit tests
  - System (end to end) tests
  - Manual testing
- [ ] UX of all users considered:
  - Speech
  - Braille
  - Low Vision
  - Different web browsers
  - Localization in other languages / culture than English
- [ ] API is compatible with existing add-ons.
- [ ] Security precautions taken.
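The PR's `wordSegUtils` module performs offset conversion between Python string indices and encoded offsets (the `computedStrToEncodedOffsets` attribute appears in the review below). Assuming the target encoding is UTF-16, as is typical for Windows text APIs, the mapping can be sketched as follows. The function name is hypothetical and this is not NVDA code.

```python
def strToEncodedOffsets(text: str) -> list[int]:
    """Map each Python str index to its starting UTF-16 code-unit offset.
    Characters outside the Basic Multilingual Plane take two UTF-16 units,
    so str offsets and encoded offsets diverge after such a character."""
    offsets = []
    encoded = 0
    for ch in text:
        offsets.append(encoded)
        # Number of UTF-16 code units for this character (2 bytes per unit).
        encoded += len(ch.encode("utf-16-le")) // 2
    return offsets
```

Common Chinese characters sit in the BMP and occupy one UTF-16 unit each, so for purely Chinese text the mapping is the identity; the table only matters once supplementary-plane characters (some emoji, rare ideographs) appear.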
### Link to issue number:
Blocked by #18735, related to #1890 (comment), and [a part of OSPP 2025](https://summer-ospp.ac.cn/org/prodetail/25d3e0488?list=org&navpage=org) of NVDA.

### Summary of the issue:

### Description of user facing changes:
Braille output for Chinese becomes easier to read.

### Description of developer facing changes:

### Description of development approach:

### Testing strategy:

### Known issues with pull request:
The displayed braille and text are not aligned with each other in Braille Viewer. This shouldn't affect reading on a braille display, since only braille is transferred and the cursors of the raw text and the output braille are aligned.

### Code Review Checklist:
- [ ] Documentation:
  - Change log entry
  - User Documentation
  - Developer / Technical Documentation
  - Context sensitive help for GUI changes
- [ ] Testing:
  - Unit tests
  - System (end to end) tests
  - Manual testing
- [ ] UX of all users considered:
  - Speech
  - Braille
  - Low Vision
  - Different web browsers
  - Localization in other languages / culture than English
- [ ] API is compatible with existing add-ons.
- [ ] Security precautions taken.
Pull Request Overview
This PR merges Chinese Word Segmentation functionality into NVDA, enabling Chinese text to be navigated by word using built-in input gestures. The implementation adds support for the cppjieba library for Chinese word segmentation, creates a configurable word segmentation framework, and includes braille output enhancements for Chinese text with word separators.
Key changes:
- Integrated cppjieba library as a git submodule for Chinese word segmentation
- Added word segmentation configuration options in Document Navigation settings
- Implemented WordSegmenter class with multiple segmentation strategies (Uniscribe, Chinese)
Reviewed Changes
Copilot reviewed 28 out of 29 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| user_docs/en/changes.md | Documents new Chinese word navigation features and API changes |
| tests/unit/test_textUtils.py | Adds unit tests for WordSegmenter class |
| source/textUtils/wordSeg/wordSegUtils.py | Implements word segmentation utilities and offset conversion |
| source/textUtils/wordSeg/wordSegStrategy.py | Defines word segmentation strategies and cppjieba integration |
| source/textUtils/wordSeg/`__init__.py` | Provides initialization for word segmentation module |
| source/textUtils/segFlag.py | Defines character and word segmentation flags |
| source/textUtils/`__init__.py` | Adds WordSegmenter class for text segmentation |
| source/textInfos/offsets.py | Updates OffsetsTextInfo to use new segmentation flags |
| source/setup.py | Includes cppjieba dictionaries in distribution |
| source/gui/settingsDialogs.py | Adds word segmentation UI controls to settings |
| source/displayModel.py | Overrides segmentation flags for display model |
| source/core.py | Initializes word segmentation module on startup |
| source/config/featureFlagEnums.py | Defines WordNavigationUnitFlag enumeration |
| source/config/configSpec.py | Adds word segmentation configuration options |
| source/braille.py | Implements Chinese word separator insertion for braille |
| source/NVDAObjects/window/edit.py | Enforces Uniscribe segmentation for edit controls |
| projectDocs/dev/createDevEnvironment.md | Documents cppjieba dependency |
| nvdaHelper/cppjieba/sconscript | Build script for cppjieba library |
| nvdaHelper/cppjieba/cppjieba.hpp | Header file for cppjieba wrapper |
| nvdaHelper/cppjieba/cppjieba.def | Export definitions for cppjieba DLL |
| nvdaHelper/cppjieba/cppjieba.cpp | Implementation of cppjieba wrapper |
| nvdaHelper/archBuild_sconscript | Integrates cppjieba into build process |
| include/readme.md | Documents cppjieba as a new dependency |
| include/cppjieba | Git submodule reference for cppjieba |
| copying.txt | Adds cppjieba MIT license notice |
| .gitmodules | Registers cppjieba as git submodule |
| .gitignore | Excludes generated cppjieba files |
| .github/workflows/testAndPublish.yml | Updates to checkout submodules recursively |
| .gitattributes | Adds .hpp file handling for C++ headers |
```md
* In the Add-on Store, a new action has been added to see the latest changes for the current version of add-ons. (#14041, @josephsl, @nvdaes)
* Chinese text can be navigated by word via build-in input gestures.
  Several GUI elements are added for its configuration in `Document Navigation` panel. (#18735, @CrazySteve0605)
* Braille output for Chinese contains spaces as word separaters. (#18865, @CrazySteve0605)
```
Copilot (AI) commented on Nov 5, 2025:
Corrected spelling of 'separaters' to 'separators'.
Suggested change:

```diff
-* Braille output for Chinese contains spaces as word separaters. (#18865, @CrazySteve0605)
+* Braille output for Chinese contains spaces as word separators. (#18865, @CrazySteve0605)
```
```python
	@classmethod
	@initializerRegistry
	def _initCppJieba(cls, forceInit: bool = False):  # TODO: make cppjieba alternative
```
Copilot (AI) commented on Nov 5, 2025:
[nitpick] The TODO comment 'make cppjieba alternative' is unclear. Consider elaborating on what alternative is needed or creating a tracking issue.
Suggested change:

```diff
-	def _initCppJieba(cls, forceInit: bool = False):  # TODO: make cppjieba alternative
+	def _initCppJieba(cls, forceInit: bool = False):  # TODO: Implement a fallback for cppjieba (e.g., use a pure Python segmentation library like jieba, or create a stub). See issue #1234 for details.
```
```python
		self.wordSegFlag: WordSegFlag = wordSegFlag
		self.strategy: wordSegStrategy.WordSegmentationStrategy = self._chooseStrategy()

	def _chooseStrategy(self) -> wordSegStrategy.WordSegmentationStrategy:  # TODO: optimize
```
Copilot (AI) commented on Nov 5, 2025:
[nitpick] The TODO comment 'optimize' is vague. Consider specifying what optimization is needed or creating a tracking issue with details.
Suggested change:

```diff
-	def _chooseStrategy(self) -> wordSegStrategy.WordSegmentationStrategy:  # TODO: optimize
+	def _chooseStrategy(self) -> wordSegStrategy.WordSegmentationStrategy:  # TODO: Consider optimizing for performance if text is very large. See issue #1234.
```
```python
			case config.featureFlagEnums.WordNavigationUnitFlag.CHINESE:
				return WordSegFlag.CHINESE
			case _:
				log.error(f"Unknown word segmentation standard, {self.__wordSegConf.calculated()!r}")
```
Copilot (AI) commented on Nov 5, 2025:
Using self.__wordSegConf with double underscore causes name mangling; should be self.wordSegConf to match the attribute defined in __init__.
Suggested change:

```diff
-				log.error(f"Unknown word segmentation standard, {self.__wordSegConf.calculated()!r}")
+				log.error(f"Unknown word segmentation standard, {self.wordSegConf.calculated()!r}")
```
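The mangling behavior Copilot describes can be demonstrated in isolation. The class and attribute names below are illustrative, not NVDA code: inside a class body, `self.__name` is rewritten to `self._ClassName__name`, so it does not find an attribute that was assigned without the double underscore.

```python
class Segmenter:
    def __init__(self):
        self.wordSegConf = "unicode"  # public attribute, single name

    def lookup(self):
        # `self.__wordSegConf` is mangled to `self._Segmenter__wordSegConf`,
        # which was never assigned, so this raises AttributeError.
        return self.__wordSegConf


s = Segmenter()
try:
    s.lookup()
    ok = False
except AttributeError:
    ok = True
```

This is why the suggestion changes `self.__wordSegConf` to `self.wordSegConf`: the mangled lookup would fail at runtime the first time the error branch executes.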
```python
				Thread(target=callable_to_call, args=args, kwargs=kwargs, daemon=True).start()
			except Exception as e:
				log.debug("Initializer %s.%s failed: %s", module_name, qualname, e)
		return
```
Copilot (AI) commented on Nov 5, 2025:
The return statement at line 46 exits the function after the first initializer, preventing subsequent initializers from running. Move the return outside the loop or remove it.
Suggested change:

```diff
-		return
```
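The bug pattern flagged above, with its fix, in a minimal standalone form. The function names are illustrative, not NVDA code: a `return` inside the loop body exits after the first iteration, silently skipping the remaining initializers.

```python
def runAllBuggy(initializers):
    results = []
    for init in initializers:
        results.append(init())
        return results  # BUG: exits the function after the first initializer


def runAllFixed(initializers):
    results = []
    for init in initializers:
        results.append(init())
    return results  # return moved outside the loop: all initializers run
```

Such bugs are easy to miss in review because the code still runs without error; only the later initializers never execute.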
```python
		if strStart == 0:
			resultStart = 0
		else:
			resultStart = self.computedStrToEncodedOffsets[strStart]
```
Copilot (AI) commented on Nov 5, 2025:
[nitpick] Consider adding bounds checking before accessing computedStrToEncodedOffsets[strStart] to prevent IndexError if strStart is out of range.
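A standalone sketch of the suggested bounds checking. The function name and error handling are illustrative assumptions, not NVDA code:

```python
def safeOffsetAt(offsets: list[int], strStart: int) -> int:
    """Return the mapped offset for strStart, with explicit bounds checking
    so an out-of-range index fails loudly instead of raising a bare
    IndexError deep inside offset computation."""
    if strStart == 0:
        return 0
    if strStart < 0 or strStart >= len(offsets):
        raise ValueError(
            f"strStart {strStart} out of range for {len(offsets)} offsets"
        )
    return offsets[strStart]
```

Whether to raise, clamp to the last offset, or fall back to `strStart` itself is a design choice the PR would need to make based on how callers recover from bad offsets.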
```python
cppjiebaLib = env.SharedLibrary(target="cppjieba", source=sourceFiles)

if not os.path.exists(outDir.Dir("dicts").get_abspath()) or not os.listdir(outDir.Dir("dicts").get_abspath()):  # insure dicts installation happens only once and avoid a scons' warning
```
Copilot (AI) commented on Nov 5, 2025:
Corrected spelling of 'insure' to 'ensure' in comment.
Suggested change:

```diff
-if not os.path.exists(outDir.Dir("dicts").get_abspath()) or not os.listdir(outDir.Dir("dicts").get_abspath()):  # insure dicts installation happens only once and avoid a scons' warning
+if not os.path.exists(outDir.Dir("dicts").get_abspath()) or not os.listdir(outDir.Dir("dicts").get_abspath()):  # ensure dicts installation happens only once and avoid a scons' warning
```
Will we do any polishing or cleanup work in this PR?

@cary-rowen - yes - this is WIP

What are our main tasks in this PR? Is it just cleaning up the code? Is it possible to include bug fixes?

Yes. I locally finished fixing the remaining issues in #18865 and will make a PR soon, so that this functionality can be integrated more properly.

@cary-rowen @CrazySteve0605 - yes, PRs to this branch that improve the code quality and fix issues would be appreciated. Particularly resolving conflicts.

As we discussed offline, @CrazySteve0605: according to feedback from the community, there are still some cases that need improvement. I think that once the known to-do items for this PR are resolved and merged into master, we can provide follow-up polish PRs. For now, Copilot has provided some comments worth addressing.