Description
I'm writing a MapReduce script (and thus am working with input/output streams).
If I use the unicodecsv module:
    #!/usr/bin/python
    import sys
    import unicodecsv as csv

    def mapper():
        # Read tab-separated rows from the streaming input.
        reader = csv.reader(sys.stdin, delimiter='\t')
        # writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)
        for line in reader:
            print line

    mapper()
Then I get the error:
    Traceback (most recent call last):
      File "scripts/streaming/adwords/mapper.py", line 30, in <module>
        mapper()
      File "scripts/streaming/adwords/mapper.py", line 10, in mapper
        for line in reader:
      File "/usr/local/lib/python2.7/dist-packages/unicodecsv/py2.py", line 117, in next
        row = self.reader.next()
    _csv.Error: line contains NULL byte
If I read the file with pandas:
data = pandas.read_csv(input_file, encoding='utf-16', sep='\t', skiprows=5, skip_footer=1, engine='python')
then everything works like a charm.
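Since pandas only succeeds with encoding='utf-16', the NULL bytes probably come from the encoding itself: UTF-16 stores each ASCII character as two bytes, one of which is zero. A minimal check, assuming Python 3 and a made-up sample row (the real file contents are not shown here):

```python
# Hypothetical sample row; the actual export file is not reproduced here.
sample = "campaign\tclicks\n"

utf16_bytes = sample.encode("utf-16")  # BOM plus two bytes per ASCII character
utf8_bytes = sample.encode("utf-8")    # one byte per ASCII character

# ASCII text encoded as UTF-16 contains zero bytes; the UTF-8 form does not.
has_nul_utf16 = b"\x00" in utf16_bytes
has_nul_utf8 = b"\x00" in utf8_bytes
print(has_nul_utf16, has_nul_utf8)
```

If that is what happens here, the csv reader is being handed raw UTF-16 bytes and trips over the zero bytes before any decoding takes place.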
I don't know how to resolve this issue. I have tried almost everything, even opening the file in LibreOffice and saving it as UTF-8, but that can't be the solution because my CSV files are too big for LibreOffice.
If I open and save the file as UTF-8 with LibreOffice and run the script again, the strings in each line are prefixed with u. I know this has something to do with encodings, but it's not clear to me how it works.
Preferably I want to read the input stream (Unicode, I guess), map it line by line (encoding it to UTF-8), and write it with something like writer.writerow((line[0] + line[2], line[5])) so that my reducer.py doesn't have to deal with encodings.
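To make the goal concrete, here is a rough sketch of the pipeline I have in mind. It rests on assumptions: it uses Python 3 and the standard csv module rather than unicodecsv on Python 2.7, the function name recode_rows is made up for illustration, and skipping rows with fewer than six columns is only a guess at how to drop the header and footer lines.

```python
import csv
import io


def recode_rows(binary_in, binary_out, in_encoding="utf-16"):
    """Read tab-separated rows in `in_encoding` and write two columns as UTF-8."""
    # Decode the raw bytes before the csv module sees them, so the parser
    # works on text and never meets the zero bytes found in UTF-16 data.
    text_in = io.TextIOWrapper(binary_in, encoding=in_encoding, newline="")
    text_out = io.TextIOWrapper(binary_out, encoding="utf-8", newline="")

    reader = csv.reader(text_in, delimiter="\t")
    writer = csv.writer(text_out, delimiter="\t", quotechar='"',
                        quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in reader:
        if len(line) < 6:
            continue  # assumption: short rows are header/footer lines
        writer.writerow((line[0] + line[2], line[5]))

    # Flush, then detach so the underlying byte streams stay open.
    text_out.flush()
    text_out.detach()
    text_in.detach()
```

In a streaming mapper this would presumably be called as recode_rows(sys.stdin.buffer, sys.stdout.buffer), so the reducer only ever receives UTF-8.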
Any help would be deeply appreciated.