Handle file/compressed/directory structures and file-chunking in FileInput classes #51

@justb4

FileInput and derived classes such as StringFileInput can already handle lists of files expanded from directory and glob.glob parameters. However, all file content is still read and passed on as a single Packet, and .zip files are handled separately by a dedicated class, ZipFileInput.

It should be possible to generalize FileInput so that derived classes can read from files regardless of whether those files came from directory structures, glob.glob-expanded file lists, or .zip files. Even a mixture of these should be handled. For example, within NLExtract, https://github.com/nlextract/NLExtract/blob/master/bag/src/bagfilereader.py can handle any file structure provided.
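To make the idea concrete, here is a minimal sketch of how mixed file specs could be expanded into one uniform stream of open files. It is not Stetl's or NLExtract's actual code: the function name `iter_file_specs` and the `archive!member` naming convention are assumptions made for illustration only.

```python
import glob
import os
import zipfile
from typing import BinaryIO, Iterator, Tuple


def iter_file_specs(spec: str) -> Iterator[Tuple[str, BinaryIO]]:
    """Yield (name, open binary file) for every file reachable from spec.

    Hypothetical helper: spec may be a plain file, a directory, a glob
    pattern, or a .zip archive. Directories are walked recursively.
    Each file object is only valid until the generator is advanced.
    """
    if os.path.isdir(spec):
        for entry in sorted(os.listdir(spec)):
            yield from iter_file_specs(os.path.join(spec, entry))
    elif os.path.isfile(spec):
        if zipfile.is_zipfile(spec):
            with zipfile.ZipFile(spec) as archive:
                for info in archive.infolist():
                    if info.is_dir():
                        continue
                    # Archives nested inside an archive are yielded as-is,
                    # not expanded further, to keep the sketch short.
                    with archive.open(info) as member:
                        yield f"{spec}!{info.filename}", member
        else:
            with open(spec, "rb") as handle:
                yield spec, handle
    else:
        # Not an existing path: treat the spec as a glob pattern.
        for match in sorted(glob.glob(spec)):
            if match != spec:  # guard against re-expanding the same spec
                yield from iter_file_specs(match)
```

A caller would read each file object inside the loop body, for example `for name, handle in iter_file_specs("data/*.zip"): process(handle.read())`, where `process` stands for whatever the derived class does with the content.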

A second aspect is file chunking: a FileInput may split up a single file into Packets containing data structures extracted from that file. For example, FileInputs like XmlElementStreamerFileInput and LineStreamerFileInput open/parse a file but pass the file content (lines, parsed elements) in fine-grained chunks on each read(). Currently these classes implement this entirely within their read() function, but the generic pattern is that they maintain a "context" for the open/parsed file.
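As a rough illustration of such a "context", the sketch below keeps the state of one open file outside of read() and hands out one line per call. The class name `LineChunkContext` and its `next_chunk()` method are hypothetical and do not exist in Stetl.

```python
from typing import IO, Optional


class LineChunkContext:
    """Hypothetical holder for the state of one open file read in chunks."""

    def __init__(self, handle: IO[str]):
        self.handle = handle
        self.line_number = 0  # number of lines handed out so far

    def next_chunk(self) -> Optional[str]:
        """Return the next line without its newline, or None at end of file."""
        line = self.handle.readline()
        if line == "":
            # readline() returns "" only at end of file; an empty line is "\n".
            self.handle.close()
            return None
        self.line_number += 1
        return line.rstrip("\n")
```

Under this assumption, a read() implementation would only need to call `next_chunk()` on the current context and move on to the next file when it returns None.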

So all in all this issue addresses two general aspects:

  • handle any file spec: directories (folders), glob patterns, .zip files, and any mix of these
  • handle fine-grained file chunking: each invoke()/read() may supply part of a file, such as a line or an XML element

See also issue #49 for the additional discussion which led to this issue.
The Strategy Design Pattern may be applied (many refs on the web).
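One possible shape for the Strategy pattern here is sketched below, with the chunking behaviour as a pluggable object. All class names (`ChunkStrategy`, `WholeFileChunks`, `LineChunks`, `XmlElementChunks`, `ChunkedReader`) are invented for this example and are not part of Stetl; only the standard-library `xml.etree.ElementTree.iterparse` call is real.

```python
import io
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from typing import BinaryIO, Iterator


class ChunkStrategy(ABC):
    """Hypothetical interface: how to cut one open file into chunks."""

    @abstractmethod
    def chunks(self, handle: BinaryIO) -> Iterator[object]:
        ...


class WholeFileChunks(ChunkStrategy):
    def chunks(self, handle):
        yield handle.read()  # the whole file as a single chunk


class LineChunks(ChunkStrategy):
    def chunks(self, handle):
        for line in handle:
            yield line.rstrip(b"\r\n")  # one line per chunk


class XmlElementChunks(ChunkStrategy):
    def __init__(self, tag: str):
        self.tag = tag

    def chunks(self, handle):
        # Stream-parse and yield each completed element with a matching tag.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == self.tag:
                yield elem
                elem.clear()  # free the element once the caller moves on


class ChunkedReader:
    """Hypothetical reader: read() returns one chunk, or None when done."""

    def __init__(self, handle: BinaryIO, strategy: ChunkStrategy):
        self._chunks = strategy.chunks(handle)

    def read(self):
        return next(self._chunks, None)
```

With this split, the reader does not need to know whether it is serving whole files, lines, or XML elements; for instance `ChunkedReader(io.BytesIO(b"x\ny\n"), LineChunks())` hands out one line per read() call.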
