2024-05-11 on GBIF data: character issues #77

@Mesibov

The dataset is clean UTF-8 but contains three kinds of unwanted control character and the unnecessary formatting character "non-breaking space". The records involved are listed in the attached text file with their "id", field name and field entry, with each unwanted character replaced by "{HERE}". The DEL characters are particularly worrying to see.

Output from "gremlins" (https://www.datafix.com.au/cookbook/characters3.html#1):

carriage return (CR, u000d, 0d): none
non-breaking space (NBSP, u00a0, c2 a0): 116 in 19 records
soft hyphen (SHY, u00ad, c2 ad): none
zero-width space (ZWSP, u200b, e2 80 8b): none


Checking now for gremlin control characters, please wait...

data link escape (DLE, u0010, 10): 1 in 1 records
delete (DEL, u007f, 7f): 241 in 241 records
single character introducer (SCI, u009a, c2 9a): 1 in 1 records
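
For anyone who wants to reproduce counts like these without the "gremlins" script, here is a minimal Python sketch. It is not how "gremlins" itself works; the function name `count_gremlins` and the dictionary layout are assumptions made for illustration, and only the seven characters named in the output above are checked.

```python
# Characters named in the gremlins output above, keyed by a readable label.
GREMLINS = {
    "carriage return (CR)": "\u000d",
    "non-breaking space (NBSP)": "\u00a0",
    "soft hyphen (SHY)": "\u00ad",
    "zero-width space (ZWSP)": "\u200b",
    "data link escape (DLE)": "\u0010",
    "delete (DEL)": "\u007f",
    "single character introducer (SCI)": "\u009a",
}


def count_gremlins(records):
    """Return {label: (occurrences, records affected)} for an iterable of strings.

    Hypothetical helper: each item in `records` is one record (one line of text).
    """
    records = list(records)  # allow generators, since we iterate once per character
    result = {}
    for label, char in GREMLINS.items():
        per_record = [rec.count(char) for rec in records]
        result[label] = (sum(per_record), sum(1 for n in per_record if n))
    return result
```

If the records are read from a file, opening it with `open(path, encoding="utf-8", newline="")` keeps Python from translating carriage returns before they can be counted.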

character-issues.txt
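
A listing in the style described above (id, field name, field entry with "{HERE}" marking the character) could be produced along these lines. This is only a sketch under assumptions: the records are taken to be dictionaries of strings with an "id" key, the helper name `flag_fields` is invented, and the character class (C0 and C1 controls other than tab, LF and CR, plus DEL and NBSP) is my guess at a reasonable definition of "unwanted", not necessarily the one used to build the attachment.

```python
import re

# Assumed definition of "unwanted": C0 controls except tab/LF/CR,
# DEL, the C1 controls (u0080-u009f) and the non-breaking space.
UNWANTED = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f\xa0]")


def flag_fields(rows, id_field="id"):
    """Yield (id, field name, marked entry) for each field holding an unwanted character.

    Hypothetical helper: `rows` is an iterable of dicts mapping field names to strings.
    """
    for row in rows:
        for field, value in row.items():
            if field != id_field and UNWANTED.search(value):
                yield row[id_field], field, UNWANTED.sub("{HERE}", value)
```

With `csv.DictReader` supplying the rows, each yielded tuple can be written out tab-separated to give one line per affected field.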
