Description
FileInput and derived classes like StringFileInput can handle lists of files from directory and glob.glob parameters. Still, all file content is read and passed as a single Packet. Also, .zip files are handled by a dedicated class, ZipFileInput.
It should be possible to generalize FileInput so that derived classes can read from files regardless of whether those files come from directory structures, glob.glob-expanded file lists or .zip files. Even a mixture of these should be handled. For example, within NLExtract, https://github.com/nlextract/NLExtract/blob/master/bag/src/bagfilereader.py can handle any file structure provided.
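As a rough illustration of the idea above, the following minimal sketch expands a mixed list of file specs into a flat stream of open files. It is not the Stetl or NLExtract implementation: the names iter_files and _iter_path and the "archive.zip!member" naming convention are hypothetical, and nested .zip files are deliberately not handled.

```python
import glob
import os
import zipfile
from typing import IO, Iterator, List, Tuple


def iter_files(specs: List[str]) -> Iterator[Tuple[str, IO[bytes]]]:
    """Yield (name, binary file object) for every file matched by the specs.

    Each spec may be a plain file, a directory, a glob pattern or a .zip file.
    """
    for spec in specs:
        # glob.glob expands patterns; an existing plain path expands to itself.
        for path in sorted(glob.glob(spec)):
            yield from _iter_path(path)


def _iter_path(path: str) -> Iterator[Tuple[str, IO[bytes]]]:
    if os.path.isdir(path):
        # Walk the directory tree in a stable (sorted) order.
        for root, dirs, files in os.walk(path):
            dirs.sort()
            for fname in sorted(files):
                yield from _iter_path(os.path.join(root, fname))
    elif zipfile.is_zipfile(path):
        # Treat every archive member as a file (assumption: no nested zips).
        with zipfile.ZipFile(path) as archive:
            for info in archive.infolist():
                if not info.is_dir():
                    with archive.open(info) as member:
                        yield path + '!' + info.filename, member
    else:
        with open(path, 'rb') as f:
            yield path, f
```

A caller would loop with `for name, f in iter_files(['data/', 'extra/*.xml', 'dump.zip']):` (paths here are made up) and read from `f` inside the loop body, since each file is closed once the generator advances.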
A second aspect is file chunking: a FileInput may split up a single file into Packets containing data structures extracted from that file. For example, FileInputs like XmlElementStreamerFileInput and LineStreamerFileInput open/parse a file but pass file content (lines, parsed elements) in fine-grained chunks on each read(). Currently these classes implement this fully within their read() function, but the generic pattern is that they maintain a "context" for the open/parsed file.
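The "context" pattern might look roughly like the sketch below. LineChunkReader is a hypothetical class, not an existing Stetl class; it assumes plain UTF-8 text files and returns None once all files are consumed.

```python
from typing import IO, Iterator, List, Optional


class LineChunkReader:
    """Hypothetical reader that hands out one line per read() across files."""

    def __init__(self, paths: List[str]):
        self._paths: Iterator[str] = iter(paths)
        # The "context": the currently open file, or None between files.
        self._context: Optional[IO[str]] = None

    def read(self) -> Optional[str]:
        while True:
            if self._context is None:
                path = next(self._paths, None)
                if path is None:
                    return None  # all files consumed
                self._context = open(path, 'r', encoding='utf-8')
            line = self._context.readline()
            if line:
                return line.rstrip('\n')
            # End of this file: drop the context and move on to the next one.
            self._context.close()
            self._context = None
```

The state kept between calls is only the file iterator and the open file handle, which is the part that could be factored out of the individual read() implementations.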
So all in all, this issue addresses two general aspects:
- handle any file specs: directories, maps, globbing, zip files and any mix of these
- handle fine-grained file chunking: each invoke()/read() may supply part of a file: a line, an XML element, etc.
See also issue #49 for additional discussion which led to this issue.
The Strategy Design Pattern may be applied (many refs on the web).
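One possible reading of the Strategy pattern here: the way a file is chunked becomes an interchangeable function, separate from the code that iterates over files. The sketch below is an assumption about how that could look, not a proposed Stetl API; ChunkedInput, whole_file and by_line are hypothetical names.

```python
import io
from typing import Any, Callable, IO, Iterable, Iterator, Optional

# A chunking strategy turns one open text file into a stream of chunks.
Chunker = Callable[[IO[str]], Iterator[Any]]


def whole_file(f: IO[str]) -> Iterator[str]:
    """Strategy: the entire file content as a single chunk."""
    yield f.read()


def by_line(f: IO[str]) -> Iterator[str]:
    """Strategy: one chunk per line, without the trailing newline."""
    for line in f:
        yield line.rstrip('\n')


class ChunkedInput:
    """Hypothetical input: applies a chunking strategy to a series of files."""

    def __init__(self, files: Iterable[IO[str]], chunker: Chunker):
        # Lazily chain the chunks of all files, one file after the other.
        self._chunks = (chunk for f in files for chunk in chunker(f))

    def read(self) -> Optional[Any]:
        # Each call supplies the next chunk, or None when nothing is left.
        return next(self._chunks, None)
```

With this split, an XML-element strategy or a zip-aware file source could be added without touching ChunkedInput itself.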