Files w/ quoted values that have commas throw exception #38

@greghall76

Describe the bug
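The file contains a quoted number, e.g. "2,126,000,000".
The embedded commas throw off the index alignment between the types extracted from the header and those extracted from the data rows, and inference fails with the traceback below.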
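
The misalignment can be illustrated without the library. This is a minimal sketch that assumes rows are split on the separator without honoring quotes; that is my guess at the cause and is not confirmed from the library source. Python's standard `csv` module is shown for comparison.

```python
import csv
import io

# Header and first data row from the file in the reproduction steps.
header = '"id","country","year","sex","age","suicides_no","population","country-year","HDI for year"," gdp_for_year","gdp_per_capita","generation"'
row = '0,"Albania",1987,"male","15-24 years",21,312900,"Albania1987",,"2,156,624,900",796,"Generation X"'

# Assumed failure mode: splitting on every comma, including those inside quotes.
naive_header = header.split(",")
naive_row = row.split(",")

# Quote-aware parsing keeps "2,156,624,900" as a single field.
parsed_header, parsed_row = csv.reader(io.StringIO(header + "\n" + row + "\n"))

print(len(naive_header), len(naive_row))    # column counts differ
print(len(parsed_header), len(parsed_row))  # column counts match
print(parsed_row[9])
```

With the naive split the data row has three more columns than the header, so every type after the quoted number is attributed to the wrong column index.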

File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 397, in run_inference
schemas_result = prl.parallel(records = lines,obj=dtype, d_schema = self.__schema)
File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 165, in parallel
return [p.get() for p in results]
File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 165, in <listcomp>
return [p.get() for p in results]

To Reproduce
Steps to reproduce the behavior:

  1. See example below...
    "id","country","year","sex","age","suicides_no","population","country-year","HDI for year"," gdp_for_year","gdp_per_capita","generation"
    0,"Albania",1987,"male","15-24 years",21,312900,"Albania1987",,"2,156,624,900",796,"Generation X"
    1,"Albania",1987,"male","35-54 years",16,308000,"Albania1987",,"2,156,624,900",796,"Silent"
    2,"Albania",1987,"female","15-24 years",14,289700,"Albania1987",,"2,156,624,900",796,"Generation X"
    3,"Albania",1987,"male","75+ years",1,21800,"Albania1987",,"2,156,624,900",796,"G.I. Generation"
    4,"Albania",1987,"male","25-34 years",9,274300,"Albania1987",,"2,156,624,900",796,"Boomers"
    5,"Albania",1987,"female","75+ years",1,35600,"Albania1987",,"2,156,624,900",796,"G.I. Generation"

  2. See code below...
    from multiprocessing import freeze_support, Process
    from csv_schema_inference import csv_schema_inference

    def main():
        # If the inferred data type is INTEGER and FLOAT also appears in the results, the result will be FLOAT.
        conditions = {"INTEGER": "FLOAT"}
        pathfile = "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/suicide_data.csv"

        csv_infer = csv_schema_inference.CsvSchemaInference(portion=0.9, max_length=100, batch_size=200000, acc=0.8, seed=2, header=True, sep=",", conditions=conditions)
        aprox_schema = csv_infer.run_inference(pathfile)
        csv_infer.pretty(aprox_schema)

    if __name__ == '__main__':
        freeze_support()
        Process(target=main).start()
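
A possible stopgap is to preprocess the file so that no field contains the separator. The sketch below uses only the standard library; `strip_thousands_separators` is a hypothetical helper of mine, not part of csv_schema_inference, and I have not verified the cleaned output against the library. It only handles integers written with thousands separators. Any other quoted field that contains a comma would still be written back quoted and would need a different approach.

```python
import csv
import io
import re

# Matches integers written with thousands separators, e.g. "2,156,624,900".
THOUSANDS = re.compile(r"-?\d{1,3}(,\d{3})+")

def strip_thousands_separators(text: str) -> str:
    """Return CSV text with commas removed from thousands-separated integers.

    Hypothetical helper for illustration; not part of csv_schema_inference.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for fields in csv.reader(io.StringIO(text)):
        # Only fields that are entirely a thousands-separated integer are changed.
        writer.writerow(
            [f.replace(",", "") if THOUSANDS.fullmatch(f) else f for f in fields]
        )
    return out.getvalue()

# Shortened row based on the sample data above.
sample = '0,"Albania",,"2,156,624,900",796,"Generation X"\n'
cleaned = strip_thousands_separators(sample)
print(cleaned)
```

Note that `csv.writer` with its default settings drops quotes that are no longer needed, so the cleaned text is not byte-identical to the input apart from the numbers.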

Expected behavior
Inference should complete and return a schema instead of raising an exception.
For example, the pretty-printed output for one column looks like this:
    0
    name: Username; Identifier;One-time password;Recovery code;First name;Last name;Department;Location
    type: STRING
    nullable: False
    ....

Desktop (please complete the following information):

  • OS: Ubuntu 22.04 and Python 3.10.12
