athoik commented May 10, 2025

Description:
Replaced specific installation of tesseract-langpack-eng with tesseract-langpack-\* to ensure support for all available OCR languages.
This improves flexibility for multilingual OCR processing without requiring manual additions for each language.

Changes:

    Updated os-packages.txt: replaced tesseract-langpack-eng with tesseract-langpack-\*

    Ensures all Tesseract language packs are installed via wildcard in dnf

Note:
Wildcard is escaped (\*) to prevent shell expansion and allow dnf to interpret it correctly.

Signed-off-by: Athanasios Oikonomou <athoik@gmail.com>
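
The note about escaping can be illustrated with a minimal Python sketch. It assumes the package spec is typed on a POSIX shell command line: `shlex.split` models only the shell's quote removal (it performs no filename expansion), and `fnmatch` stands in for dnf's glob matching on package names. The `available` list is a hypothetical slice of a repository, not output from a real system.

```python
import fnmatch
import shlex

# What a POSIX shell hands to dnf after quote removal: the backslash is
# dropped and the asterisk arrives as a literal character in argv.
argv = shlex.split(r"dnf install -y tesseract-langpack-\*")
pattern = argv[-1]

# Hypothetical slice of the repository's package list (illustrative only).
available = [
    "tesseract",
    "tesseract-langpack-eng",
    "tesseract-langpack-ell",
    "tesseract-langpack-deu",
    "tesseract-devel",
]

# Glob-style matching on package names, standing in for dnf's own matching.
selected = fnmatch.filter(available, pattern)

print(pattern)
print(selected)
```

Without the backslash, a real shell would first try to expand the asterisk against file names in the working directory, so the argument dnf receives could change if a matching file happened to exist there.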
mergify bot commented May 10, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
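
As a rough check of the rule above, the same pattern can be tried in Python against the two titles this pull request has carried. This is a sketch that assumes the `~=` operator behaves like a regular-expression search; `follows_convention` is a hypothetical helper name, not part of Mergify.

```python
import re

# Pattern copied from the merge protection rule.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def follows_convention(title: str) -> bool:
    """Return True when the title starts with a conventional commit prefix."""
    return TITLE_RULE.search(title) is not None


old_title = "Update Tesseract installation to include all language packs"
new_title = "feat: Update Tesseract installation to include all language packs"

print(follows_convention(old_title))
print(follows_convention(new_title))
```

The pattern also accepts an optional scope and breaking-change marker, as in `fix(ocr)!: ...`.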

@athoik changed the title from "Update Tesseract installation to include all language packs" to "feat: Update Tesseract installation to include all language packs" May 10, 2025
@dolfim-ibm self-requested a review May 11, 2025 19:07
@dolfim-ibm
Contributor

@athoik do you know what is the added size for the container image?

@athoik
Author

athoik commented May 11, 2025

> @athoik do you know what is the added size for the container image?

Installing all the OCR languages will add an extra 685 MB of disk space:

The following NEW packages will be installed:
  tesseract-ocr-afr tesseract-ocr-all tesseract-ocr-amh tesseract-ocr-ara tesseract-ocr-asm tesseract-ocr-aze tesseract-ocr-aze-cyrl tesseract-ocr-bel
  tesseract-ocr-ben tesseract-ocr-bod tesseract-ocr-bos tesseract-ocr-bre tesseract-ocr-bul tesseract-ocr-cat tesseract-ocr-ceb tesseract-ocr-ces
  tesseract-ocr-chi-sim tesseract-ocr-chi-sim-vert tesseract-ocr-chi-tra tesseract-ocr-chi-tra-vert tesseract-ocr-chr tesseract-ocr-cos tesseract-ocr-cym
  tesseract-ocr-dan tesseract-ocr-deu tesseract-ocr-div tesseract-ocr-dzo tesseract-ocr-enm tesseract-ocr-epo tesseract-ocr-est tesseract-ocr-eus
  tesseract-ocr-fao tesseract-ocr-fas tesseract-ocr-fil tesseract-ocr-fin tesseract-ocr-fra tesseract-ocr-frk tesseract-ocr-frm tesseract-ocr-fry
  tesseract-ocr-gla tesseract-ocr-gle tesseract-ocr-glg tesseract-ocr-grc tesseract-ocr-guj tesseract-ocr-hat tesseract-ocr-heb tesseract-ocr-hin
  tesseract-ocr-hrv tesseract-ocr-hun tesseract-ocr-hye tesseract-ocr-iku tesseract-ocr-ind tesseract-ocr-isl tesseract-ocr-ita tesseract-ocr-ita-old
  tesseract-ocr-jav tesseract-ocr-jpn tesseract-ocr-jpn-vert tesseract-ocr-kan tesseract-ocr-kat tesseract-ocr-kat-old tesseract-ocr-kaz tesseract-ocr-khm
  tesseract-ocr-kir tesseract-ocr-kmr tesseract-ocr-kor tesseract-ocr-kor-vert tesseract-ocr-lao tesseract-ocr-lat tesseract-ocr-lav tesseract-ocr-lit
  tesseract-ocr-ltz tesseract-ocr-mal tesseract-ocr-mar tesseract-ocr-mkd tesseract-ocr-mlt tesseract-ocr-mon tesseract-ocr-mri tesseract-ocr-msa
  tesseract-ocr-mya tesseract-ocr-nep tesseract-ocr-nld tesseract-ocr-nor tesseract-ocr-oci tesseract-ocr-ori tesseract-ocr-pan tesseract-ocr-pol
  tesseract-ocr-por tesseract-ocr-pus tesseract-ocr-que tesseract-ocr-ron tesseract-ocr-rus tesseract-ocr-san tesseract-ocr-script-arab
  tesseract-ocr-script-armn tesseract-ocr-script-beng tesseract-ocr-script-cans tesseract-ocr-script-cher tesseract-ocr-script-cyrl
  tesseract-ocr-script-deva tesseract-ocr-script-ethi tesseract-ocr-script-frak tesseract-ocr-script-geor tesseract-ocr-script-grek
  tesseract-ocr-script-gujr tesseract-ocr-script-guru tesseract-ocr-script-hang tesseract-ocr-script-hang-vert tesseract-ocr-script-hans
  tesseract-ocr-script-hans-vert tesseract-ocr-script-hant tesseract-ocr-script-hant-vert tesseract-ocr-script-hebr tesseract-ocr-script-jpan
  tesseract-ocr-script-jpan-vert tesseract-ocr-script-khmr tesseract-ocr-script-knda tesseract-ocr-script-laoo tesseract-ocr-script-latn
  tesseract-ocr-script-mlym tesseract-ocr-script-mymr tesseract-ocr-script-orya tesseract-ocr-script-sinh tesseract-ocr-script-syrc
  tesseract-ocr-script-taml tesseract-ocr-script-telu tesseract-ocr-script-thaa tesseract-ocr-script-thai tesseract-ocr-script-tibt
  tesseract-ocr-script-viet tesseract-ocr-sin tesseract-ocr-slk tesseract-ocr-slv tesseract-ocr-snd tesseract-ocr-spa tesseract-ocr-spa-old
  tesseract-ocr-sqi tesseract-ocr-srp tesseract-ocr-srp-latn tesseract-ocr-sun tesseract-ocr-swa tesseract-ocr-swe tesseract-ocr-syr tesseract-ocr-tam
  tesseract-ocr-tat tesseract-ocr-tel tesseract-ocr-tgk tesseract-ocr-tha tesseract-ocr-tir tesseract-ocr-ton tesseract-ocr-tur tesseract-ocr-uig
  tesseract-ocr-ukr tesseract-ocr-urd tesseract-ocr-uzb tesseract-ocr-uzb-cyrl tesseract-ocr-vie tesseract-ocr-yid tesseract-ocr-yor
0 upgraded, 159 newly installed, 0 to remove and 0 not upgraded.
Need to get 281 MB of archives.
After this operation, 685 MB of additional disk space will be used.

If that's a problem, I can add only the Greek language pack, which is the one I am interested in for OCR.
In that case it will add only 2 MB.

@pommedeterresautee

@dolfim-ibm could this be specific to the 2 CPU images?
They are "only" 3 GB and the most likely to be used for Tesseract.
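
To put the numbers from this thread side by side, here is a back-of-the-envelope sketch in Python. It assumes decimal units (1 GB = 1000 MB) and takes the 685 MB, 2 MB, and 3 GB figures as quoted; the install log lists `tesseract-ocr-*` packages, so the totals for `tesseract-langpack-*` under dnf may differ.

```python
# Figures quoted in the thread; decimal unit conversion is an assumption.
ALL_LANGS_MB = 685        # disk space reported for all language packs
GREEK_ONLY_MB = 2         # disk space reported for Greek only
CPU_IMAGE_MB = 3 * 1000   # the "only" 3 GB CPU image


def growth_percent(added_mb: float, base_mb: float) -> float:
    """Relative image growth in percent for a given amount of added data."""
    return 100.0 * added_mb / base_mb


all_growth = growth_percent(ALL_LANGS_MB, CPU_IMAGE_MB)
greek_growth = growth_percent(GREEK_ONLY_MB, CPU_IMAGE_MB)

print(f"all languages: +{all_growth:.1f}%")
print(f"Greek only:    +{greek_growth:.2f}%")
```

Under these assumptions, all language packs grow a 3 GB image by a little under a quarter, while Greek alone is well below one percent.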
