
sidmohan0 (Contributor):

- Implemented batch processing in SpacyAnnotator using nlp.pipe() for better performance (see the sketch after this list).
- Added module-level caching for spaCy models.
- Made spaCy and Tesseract calls asynchronous using asyncio.to_thread.
- Made the spaCy batch size configurable via the DATAFOG_SPACY_BATCH_SIZE environment variable.
- Added comprehensive tests for the new functionality.
- Updated the README.md to document the new features and configuration.
- Ensured all code passed tox and pre-commit checks.
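
A minimal sketch of what the batching and caching changes look like, assuming a module-level cache and an env-var-driven batch size. Names such as `_MODEL_CACHE`, `_get_model`, the `en_core_web_lg` default, the fallback batch size of 32, and the entity dict shape are illustrative assumptions, not necessarily the merged DataFog code:

```python
import os
from typing import Dict, List

import spacy
from spacy.language import Language

# Module-level cache so repeated SpacyAnnotator instances reuse loaded pipelines.
# Cache name and structure are assumptions for illustration.
_MODEL_CACHE: Dict[str, Language] = {}


def _get_model(model_name: str = "en_core_web_lg") -> Language:
    """Load a spaCy model once and reuse it on subsequent calls."""
    if model_name not in _MODEL_CACHE:
        _MODEL_CACHE[model_name] = spacy.load(model_name)
    return _MODEL_CACHE[model_name]


class SpacyAnnotator:
    def __init__(self, model_name: str = "en_core_web_lg"):
        self.nlp = _get_model(model_name)
        # Batch size is configurable via DATAFOG_SPACY_BATCH_SIZE;
        # the fallback of 32 is an assumption, not a documented default.
        self.batch_size = int(os.getenv("DATAFOG_SPACY_BATCH_SIZE", "32"))

    def annotate_text(self, texts: List[str]) -> List[List[dict]]:
        """Annotate many texts in one pass using nlp.pipe for batching."""
        results = []
        for doc in self.nlp.pipe(texts, batch_size=self.batch_size):
            results.append(
                [
                    {
                        "text": ent.text,
                        "label": ent.label_,
                        "start": ent.start_char,
                        "end": ent.end_char,
                    }
                    for ent in doc.ents
                ]
            )
        return results
```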

Completes Task 2 (TICKETS 4.2.4-4.2.7)
- Refactors SpacyAnnotator.annotate_text to use nlp.pipe for batching.
- Adds DATAFOG_SPACY_BATCH_SIZE env var for configurable batch size.
- Includes module-level caching for spaCy models.
- Wraps blocking spaCy/Tesseract calls in asyncio.to_thread (see the async sketch after this list).
- Adds tests for batch size configuration (see the test sketch after this list).
- Updates README with new features and configuration.
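
To keep the event loop free, the blocking spaCy and Tesseract work is pushed onto worker threads with asyncio.to_thread. A rough sketch, assuming an annotator like the one above and pytesseract for OCR; the wrapper function names here are illustrative, not the package's actual API:

```python
import asyncio
from typing import List

import pytesseract
from PIL import Image


async def annotate_text_async(annotator, texts: List[str]):
    # nlp.pipe is CPU-bound and blocking, so run it in a worker thread
    # instead of blocking the event loop.
    return await asyncio.to_thread(annotator.annotate_text, texts)


async def extract_text_async(image_path: str) -> str:
    # Tesseract OCR is likewise blocking; to_thread keeps it off the loop.
    return await asyncio.to_thread(
        lambda: pytesseract.image_to_string(Image.open(image_path))
    )
```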
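
And a sketch of the kind of test added for the batch size configuration, using pytest's monkeypatch fixture. It targets the illustrative batch_size attribute and default from the sketch above, not necessarily the attribute names or import path in the merged code:

```python
import pytest

# `SpacyAnnotator` refers to the class sketched above; the real import
# path inside the datafog package may differ.
from annotator_sketch import SpacyAnnotator  # hypothetical module name


def test_batch_size_env_var(monkeypatch):
    # DATAFOG_SPACY_BATCH_SIZE should override the default batch size.
    monkeypatch.setenv("DATAFOG_SPACY_BATCH_SIZE", "8")
    annotator = SpacyAnnotator()
    assert annotator.batch_size == 8


def test_batch_size_default(monkeypatch):
    # With the env var unset, the annotator falls back to its default.
    monkeypatch.delenv("DATAFOG_SPACY_BATCH_SIZE", raising=False)
    annotator = SpacyAnnotator()
    assert annotator.batch_size == 32  # default assumed in the sketch above
```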
@sidmohan0 merged commit c70bce0 into dev on Apr 27, 2025 (5 checks passed).
@sidmohan0 deleted the feat/4.2-faster-spacy branch on April 28, 2025 at 17:01.