working with streams

I'm writing a MapReduce script (and thus am working with input/output streams).

If I use the `unicodecsv` module:

```
#!/usr/bin/python
import sys
import unicodecsv as csv


def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    # writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

    for line in reader:
        print line
```

Then I get the error:

```
Traceback (most recent call last):
  File "scripts/streaming/adwords/mapper.py", line 30, in <module>
    mapper()
  File "scripts/streaming/adwords/mapper.py", line 10, in mapper
    for line in reader:
  File "/usr/local/lib/python2.7/dist-packages/unicodecsv/py2.py", line 117, in next
    row = self.reader.next()
_csv.Error: line contains NULL byte
```

If I read the file with `pandas`:

```
data = pandas.read_csv(input_file, encoding='utf-16', sep='\t', skiprows=5, skip_footer=1, engine='python')
```

then everything works like a charm.
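The `encoding='utf-16'` argument in the pandas call suggests a likely cause of the `NULL byte` error: in UTF-16, every ASCII character is stored as two bytes, one of which is `\x00`. A minimal illustration, using a made-up sample row rather than the real file:

```python
# Hypothetical sample row; the real AdWords file is not reproduced here.
sample = u"id\tname\n"

# The same text encoded two ways.
utf16_bytes = sample.encode("utf-16-le")
utf8_bytes = sample.encode("utf-8")

# UTF-16 pads each ASCII character with a zero byte; UTF-8 does not.
has_null_utf16 = b"\x00" in utf16_bytes
has_null_utf8 = b"\x00" in utf8_bytes

print(repr(utf16_bytes))
print(has_null_utf16, has_null_utf8)
```

A CSV reader that treats those bytes as UTF-8 or ASCII would see the zero bytes as NULL characters, which would match the error above.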

I don't know how to resolve this issue. I have tried almost everything, even opening the file in LibreOffice and saving it as UTF-8, but that can't be a solution because my CSV files are too big for LibreOffice.

If I open the file in LibreOffice, save it as `utf-8`, and run the script again, the strings in each line are prefixed with `u`. I know this has something to do with encodings, but it's not clear to me how it works.

Preferably, I want to read the input stream (Unicode, I guess), map it line by line (encoding it to UTF-8), and write it with something like `writer.writerow((line[0] + line[2], line[5]))`, so that my reducer.py doesn't have to deal with encodings.

Any help would be deeply appreciated.
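One way the desired flow could look, as a rough sketch: decode the incoming bytes from UTF-16 and re-encode each line as UTF-8 before it reaches the CSV reader. This assumes the input really is UTF-16 (as the pandas call implies); the helper name `utf16_to_utf8_lines` and the sample data are hypothetical, and an in-memory buffer stands in for `sys.stdin`.

```python
import codecs
import io


def utf16_to_utf8_lines(byte_stream):
    """Yield each line of a UTF-16 byte stream re-encoded as UTF-8 bytes."""
    # codecs.getreader returns a StreamReader class; wrapping the raw byte
    # stream decodes it incrementally (a leading BOM, if present, is consumed).
    text_stream = codecs.getreader("utf-16")(byte_stream)
    for line in text_stream:
        yield line.encode("utf-8")


# Small in-memory stand-in for sys.stdin (hypothetical sample data).
sample = io.BytesIO(u"id\tname\n1\tcaf\u00e9\n".encode("utf-16"))
utf8_lines = list(utf16_to_utf8_lines(sample))
```

In a real mapper the byte stream would be `sys.stdin` on Python 2 (or `sys.stdin.buffer` on Python 3). The yielded UTF-8 lines could then presumably be passed to `csv.reader` from `unicodecsv` with `encoding='utf-8'` and `delimiter='\t'`, though that combination is not tested here.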

