Description
Describe the bug
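The file contains a quoted number with embedded commas, e.g. "2,126,000,000".
The embedded commas throw off the index alignment between the types extracted from the header and those extracted from the data rows, and inference fails: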
  File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 397, in run_inference
    schemas_result = prl.parallel(records = lines,obj=dtype, d_schema = self.__schema)
  File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 165, in parallel
    return [p.get() for p in results]
  File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 165, in <listcomp>
    return [p.get() for p in results]
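A minimal sketch of the suspected cause, using only the standard library. It assumes (this is not confirmed against the library's source) that the data rows are split on the separator without honoring quotes. The `header` and `row` strings are copied from the example file in this report; the variable names are illustrative only.

```python
import csv
import io

# Header and first data row, copied from the example file in this report.
header = '"id","country","year","sex","age","suicides_no","population","country-year","HDI for year"," gdp_for_year","gdp_per_capita","generation"'
row = '0,"Albania",1987,"male","15-24 years",21,312900,"Albania1987",,"2,156,624,900",796,"Generation X"'

# Naive split: every comma is treated as a separator, even inside quotes.
# (Assumption: this mimics what the library might be doing.)
naive_header = header.split(",")
naive_row = row.split(",")

# Quote-aware parse: commas inside a quoted field stay part of that field.
parsed_header = next(csv.reader(io.StringIO(header)))
parsed_row = next(csv.reader(io.StringIO(row)))

print(len(naive_header), len(naive_row))    # 12 15
print(len(parsed_header), len(parsed_row))  # 12 12
print(parsed_row[9])                        # 2,156,624,900
```

With the naive split the row yields 15 fields against 12 header columns, which would account for the misaligned indices. The quote-aware parse keeps both at 12.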
To Reproduce
Steps to reproduce the behavior:
- See the example file below:
"id","country","year","sex","age","suicides_no","population","country-year","HDI for year"," gdp_for_year","gdp_per_capita","generation"
0,"Albania",1987,"male","15-24 years",21,312900,"Albania1987",,"2,156,624,900",796,"Generation X"
1,"Albania",1987,"male","35-54 years",16,308000,"Albania1987",,"2,156,624,900",796,"Silent"
2,"Albania",1987,"female","15-24 years",14,289700,"Albania1987",,"2,156,624,900",796,"Generation X"
3,"Albania",1987,"male","75+ years",1,21800,"Albania1987",,"2,156,624,900",796,"G.I. Generation"
4,"Albania",1987,"male","25-34 years",9,274300,"Albania1987",,"2,156,624,900",796,"Boomers"
5,"Albania",1987,"female","75+ years",1,35600,"Albania1987",,"2,156,624,900",796,"G.I. Generation"
- See the code below:
from multiprocessing import freeze_support, Process
from csv_schema_inference import csv_schema_inference

def main():
    # If the inferred data type is INTEGER and FLOAT is also present in the results, the result will be FLOAT.
    conditions = {"INTEGER": "FLOAT"}
    pathfile = "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/suicide_data.csv"
    csv_infer = csv_schema_inference.CsvSchemaInference(portion=0.9, max_length=100, batch_size=200000, acc=0.8, seed=2, header=True, sep=",", conditions=conditions)
    aprox_schema = csv_infer.run_inference(pathfile)
    csv_infer.pretty(aprox_schema)

if __name__ == '__main__':
    freeze_support()
    Process(target=main).start()
Expected behavior
Inference should complete and return an inferred schema, e.g.:
0
name
Username; Identifier;One-time password;Recovery code;First name;Last name;Department;Location
type
STRING
nullable
False
....
Desktop (please complete the following information):
- OS: Ubuntu 22.04 and Python 3.10.12