
Unexpected UTF-8 problems #123

@mikegerber

image

This was on a Linux system, and the "A~-" was supposed to be an "Ö".

  • Fix Ã. problem above
  • Fix LookupError: unknown encoding: EUC-TW problem
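Both problems can be reproduced with the standard library alone. The following is a minimal sketch, not the project's code: it assumes the garbling came from UTF-8 bytes being decoded as Windows-1252 (the actual wrong encoding in the report is not known), and `is_known_codec` is a hypothetical helper name.

```python
import codecs

# "Ö" is two bytes in UTF-8: 0xC3 0x96.
original = "Ö"
utf8_bytes = original.encode("utf-8")

# Assumption: the bytes were decoded with the wrong single-byte codec.
# With Windows-1252, 0xC3 becomes "Ã" and 0x96 becomes an en dash.
mojibake = utf8_bytes.decode("cp1252")

# The damage is reversible as long as no byte was dropped.
repaired = mojibake.encode("cp1252").decode("utf-8")


def is_known_codec(name: str) -> bool:
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        # Same exception type as "LookupError: unknown encoding: EUC-TW"
        return False
    return True


print(repr(mojibake), repr(repaired))
print(is_known_codec("utf-8"), is_known_codec("EUC-TW"))
```

A detector may report a charset name such as EUC-TW for which Python ships no codec, so checking the name before decoding avoids the `LookupError`.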

For plain text files it would be best to:

  • Review the CLI

    • cli.py (esp. process_dir)
    • ocrd_cli.py: are any plain text files supported here?
    • cli_line_dirs.py
    • cli_summarize.py?
  • Add a --plain-encoding option so users can specify the encoding manually

  • Fall back to auto-detecting the encoding when the option is not given, while warning about the auto-detection

  • What about the BOM now?

    • Do we have a test that checks that files with a BOM are read correctly?
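The plan above could look roughly like the sketch below. It is an illustration under assumptions, not the existing implementation: `decode_plain_text` and `guess_encoding` are hypothetical names, the `plain_encoding` parameter stands in for the value of the proposed --plain-encoding option, and the tiny stdlib-only guesser is a placeholder for whatever charset detector the project actually uses.

```python
import codecs
import warnings
from typing import Optional


def guess_encoding(data: bytes) -> str:
    """Stand-in for a real charset detector (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and strips the leading BOM
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: last-resort guess; latin-1 can decode any byte string.
        return "latin-1"


def decode_plain_text(data: bytes, plain_encoding: Optional[str] = None) -> str:
    """Decode with the user-supplied encoding, else guess and warn."""
    if plain_encoding is not None:
        return data.decode(plain_encoding)
    guessed = guess_encoding(data)
    warnings.warn(
        f"No encoding given, auto-detected {guessed!r}; "
        "pass the encoding explicitly to avoid wrong guesses",
        stacklevel=2,
    )
    return data.decode(guessed)


with_bom = codecs.BOM_UTF8 + "Ö".encode("utf-8")
print(repr(decode_plain_text(with_bom)))            # guessed, warns
print(repr(decode_plain_text(with_bom, "utf-8")))   # explicit, BOM kept
```

On the BOM question: decoding with plain "utf-8" keeps the BOM as a leading U+FEFF character, while "utf-8-sig" removes it. A BOM test would therefore need to assert that the decoded text does not start with U+FEFF.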

Later

  • Autodetect the encoding over all files
  • Fall back to UTF-8 if the detected charset is far-fetched or unknown, like EUC-TW
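One possible reading of these two points, as a hedged sketch: collect the charset name detected for each file, take the most frequent one, and use UTF-8 when Python has no codec for it. The function name `choose_common_encoding` is hypothetical, and the per-file names are assumed to come from some detector that is not shown here.

```python
import codecs
from collections import Counter
from typing import Iterable


def choose_common_encoding(detected: Iterable[str], default: str = "utf-8") -> str:
    """Pick the most frequently detected charset, or the default if unusable."""
    # Ignore empty results and compare names case-insensitively.
    counts = Counter(name.lower() for name in detected if name)
    if not counts:
        return default
    name, _ = counts.most_common(1)[0]
    try:
        # Normalise to Python's canonical codec name.
        return codecs.lookup(name).name
    except LookupError:
        # Unknown to Python (e.g. EUC-TW): fall back to the default.
        return default


print(choose_common_encoding(["utf-8", "UTF-8", "ascii"]))
print(choose_common_encoding(["EUC-TW", "EUC-TW", "utf-8"]))
```

Voting across files only makes sense if all files in a run share one encoding, which is an assumption of this sketch.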

Labels: bug