Crash, most likely due to non-unicode characters in file name #9

@dmitry-irtegov

Hello!
Thanks for the useful idea!

I tried to use your program on a big archive under a UTF-8 locale, and it crashed with this stack trace:
Traceback (most recent call last):
  File "tarindexer.py", line 123, in <module>
    main()
  File "tarindexer.py", line 118, in main
    indextar(dbtarfile,indexfile)
  File "tarindexer.py", line 66, in indextar
    outfile.write(rec)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 40-47: surrogates not allowed
The file name that most likely triggered the crash is
\317\360\356\341\353\345\354\373\ \341\345\347\356\357\340\361\355\356\361\362\350\ \342\ \310\322.pdf
(as output by ls -b), which indeed does not look like valid UTF-8.
Unfortunately, I cannot send you the archive, mostly because this file and the surrounding files are rather big.
While having this file in the archive is my fault, I think the program should avoid the crash, maybe by printing ls -b-style output instead.
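
To illustrate the ls -b-style fallback, here is a minimal sketch. It assumes the member names come from Python's tarfile module with its default errors="surrogateescape" handling, so each undecodable byte arrives as a lone surrogate in the range U+DC80..U+DCFF. The function name escape_undecodable and the sample bytes are hypothetical and not taken from tarindexer.py. Unlike ls -b, this sketch leaves spaces and literal backslashes untouched.

```python
def escape_undecodable(name: str) -> str:
    """Render surrogate-escaped bytes in `name` as octal escapes, like `ls -b`.

    Hypothetical helper, not part of tarindexer.py. Characters that are
    valid Unicode are passed through unchanged.
    """
    out = []
    for ch in name:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # surrogateescape maps raw byte 0xNN to U+DCNN; recover the byte
            # and print it as a three-digit octal escape.
            out.append("\\%03o" % (code - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Made-up sample bytes that are not valid UTF-8 (not the reporter's file).
    raw = b"\xcf\xf0\xee.pdf"
    name = raw.decode("utf-8", errors="surrogateescape")
    safe = escape_undecodable(name)
    print(safe)
    # The escaped name is plain ASCII here, so encoding it cannot fail.
    safe.encode("utf-8")
```

Passing each name through such a helper before outfile.write(rec) would keep the index file valid UTF-8. A smaller change with a similar effect might be to open the index file with errors="backslashreplace", at the cost of a different escape format.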
