stream-json is a micro-library, which provides a set of light-weight stream components to process huge JSON files with a minimal memory footprint. It can:

  • Parse JSON files far exceeding available memory.
    • Even individual primitive data items (keys, strings, and numbers) can be streamed piece-wise.
    • Processing humongous files can take minutes and even hours. Shaving even a microsecond from each operation can save a lot of time waiting for results. That's why all stream-json components were meticulously optimized.
  • Stream using a SAX-inspired event-based API.
  • Provide utilities to handle huge Django-like JSON database dumps.
  • Support the JSON Streaming protocol.
  • Follow the conventions of stream-chain, a no-dependency micro-library.

It is meant to be a set of building blocks for data processing pipelines organized around JSON and JavaScript objects. Users can easily create their own "blocks" using the provided facilities.

Documentation 1.x

This is an overview, which can be used as a cheat sheet. Click on individual components to see detailed API documentation with examples.

The main module

The main module returns a factory function, which produces instances of Parser decorated with emit().
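
For example, a minimal sketch (the event names used below also appear in the Emitter and emit() examples further down):

const makeParser = require('stream-json'); // the factory function returned by the main module
const fs = require('fs');

// the produced Parser is already decorated with emit(),
// so token events can be listened to directly on the stream
const stream = fs.createReadStream('data.json').pipe(makeParser());

let counter = 0;
stream.on('startObject', () => ++counter);
stream.on('finish', () => console.log(counter, 'objects'));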

Parser

The heart of the package is Parser — a streaming JSON parser, which consumes text and produces a stream of tokens. Both the standard JSON and JSON Streaming are supported.

const fs = require('fs');
const {parser} = require('stream-json');

const pipeline = fs.createReadStream('data.json').pipe(parser());
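
Downstream components consume these tokens as small objects. A minimal sketch of inspecting them directly (the exact {name, value} token shape shown in the comment is an assumption based on the default parser options):

pipeline.on('data', token => {
  // e.g., {name: 'startObject'} or {name: 'keyValue', value: 'total'}
  if (token.name === 'keyValue') console.log('key:', token.value);
});
pipeline.on('end', () => console.log('done'));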

Filters

Filters can edit a stream of tokens on the fly. The following filters are provided:

  • Pick picks out a subobject for processing, ignoring the rest.
    {total: 10000000, data: [...]}
    // Pick can isolate 'data' and remove the outer object completely:
    [...]
    const {pick} = require('stream-json/filters/Pick');
    const picked = pipeline.pipe(pick({filter: 'data'}));
  • Replace replaces subobjects with something else or even removes them completely. It is used to remove unnecessary details, e.g., for performance reasons.
    [
      {data: {...}, extra: {...}},
      {data: {...}, extra: {...}},
      ...
    ]
    // Replace can remove 'extra' or replace it with something else,
    // like null (the default):
    [{data: {...}, extra: null}, ...]
    const {replace} = require('stream-json/filters/Replace');
    const replaced = pipeline.pipe(replace({filter: /^\d+\.extra\b/}));
    • Ignore removes subobjects completely. It is a helper class based on Replace.
      [{data: {...}, extra: {...}}, ...]
      // Ignore can remove 'extra':
      [{data: {...}}, ...]
      const {ignore} = require('stream-json/filters/Ignore');
      const ignored = pipeline.pipe(ignore({filter: /^\d+\.extra\b/}));
  • Filter filters out subobjects while preserving the original shape of the incoming data.
    {total: 10000000, data: [...]}
    // Filter can isolate 'data' preserving the original shape
    {data: [...]}
    const {filter} = require('stream-json/filters/Filter');
    const filtered = pipeline.pipe(filter({filter: /^data\b/}));

Filters are used after Parser and can be chained to achieve the desired effect.
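
For example, a sketch of chaining two filters (it assumes the same {total, data: [{data, extra}, ...]} layout as the examples above):

const fs = require('fs');
const {chain} = require('stream-chain');
const {parser} = require('stream-json');
const {pick} = require('stream-json/filters/Pick');
const {ignore} = require('stream-json/filters/Ignore');

const tokens = chain([
  fs.createReadStream('data.json'),
  parser(),
  pick({filter: 'data'}),            // isolate the 'data' array
  ignore({filter: /^\d+\.extra\b/})  // then drop the 'extra' subobjects
]);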

Streamers

In many cases working at the token level can be tedious. Frequently, while a source file is huge, individual data items are relatively small and can fit into memory. A typical example is a database dump. stream-json provides the following streaming helpers:

  • StreamArray assumes that a token stream represents an array of objects and streams out assembled JavaScript objects.
    [1, "a", [], {}, true]
    // StreamArray will produce an object stream:
    {key: 0, value: 1}
    {key: 1, value: 'a'}
    {key: 2, value: []}
    {key: 3, value: {}}
    {key: 4, value: true}
    const {streamArray} = require('stream-json/streamers/StreamArray');
    const stream = pipeline.pipe(streamArray());
  • StreamObject assumes that a token stream represents an object and streams out its top-level properties.
    {"a": 1, "b": "a", "c": [], "d": {}, "e": true}
    // StreamObject will produce an object stream:
    {key: 'a', value: 1}
    {key: 'b', value: 'a'}
    {key: 'c', value: []}
    {key: 'd', value: {}}
    {key: 'e', value: true}
    const {streamObject} = require('stream-json/streamers/StreamObject');
    const stream = pipeline.pipe(streamObject());
  • StreamValues assumes that a token stream represents subsequent values and streams them out one by one.
    1 "a" [] {} true
    // StreamValues will produce an object stream:
    {key: 0, value: 1}
    {key: 1, value: 'a'}
    {key: 2, value: []}
    {key: 3, value: {}}
    {key: 4, value: true}
    const {streamValues} = require('stream-json/streamers/StreamValues');
    const stream = pipeline.pipe(streamValues());

Streamers are used after Parser and optional filters. All of them support efficient filtering of objects while assembling: if it is determined that an object is of no interest, it is abandoned and skipped without spending any more time on it.
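
For example, a typical database-dump pipeline puts a filter and a streamer together (a sketch assuming the {total, data: [...]} layout from the Pick example):

const fs = require('fs');
const {chain} = require('stream-chain');
const {parser} = require('stream-json');
const {pick} = require('stream-json/filters/Pick');
const {streamArray} = require('stream-json/streamers/StreamArray');

const pipeline = chain([
  fs.createReadStream('data.json'),
  parser(),
  pick({filter: 'data'}),
  streamArray()
]);

let counter = 0;
pipeline.on('data', data => {
  // data.key is the array index, data.value is the fully assembled item
  ++counter;
});
pipeline.on('end', () => console.log(counter, 'items'));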

Essentials

Classes and functions to make streaming data processing enjoyable:

  • Assembler receives a token stream and assembles JavaScript objects. It is used as a building block for streamers.
    const fs = require('fs');
    const zlib = require('zlib');
    const {chain} = require('stream-chain');
    const {parser} = require('stream-json');
    const Asm = require('stream-json/Assembler');
    
    const pipeline = chain([
      fs.createReadStream('data.json.gz'),
      zlib.createGunzip(),
      parser()
    ]);
    
    const asm = Asm.connectTo(pipeline);
    asm.on('done', asm => console.log(asm.current));
  • Stringer is a Transform stream. It receives a token stream and converts it to text representing a JSON object. It is very useful when you want to edit a stream with filters and custom code, then save it back to a file.
    const fs = require('fs');
    const zlib = require('zlib');
    const {chain} = require('stream-chain');
    const {parser} = require('stream-json');
    const {pick} = require('stream-json/filters/Pick');
    const {stringer} = require('stream-json/Stringer');
    
    chain([
      fs.createReadStream('data.json.gz'),
      zlib.createGunzip(),
      parser(),
      pick({filter: 'data'}),
      stringer(),
      zlib.createGzip(),
      fs.createWriteStream('edited.json.gz')
    ]);
  • Emitter is a Writable stream. It consumes a token stream and emits tokens as events on itself.
    const fs = require('fs');
    const {chain} = require('stream-chain');
    const {parser} = require('stream-json');
    const {emitter} = require('stream-json/Emitter');
    
    const e = emitter();
    
    chain([
      fs.createReadStream('data.json'),
      parser(),
      e
    ]);
    
    let counter = 0;
    e.on('startObject', () => ++counter);
    e.on('finish', () => console.log(counter, 'objects'));

Utilities

The following functions are included:

  • emit() listens to a token stream and emits tokens as events on that stream. This is a light-weight version of Emitter.
    const fs = require('fs');
    const {chain} = require('stream-chain');
    const {parser} = require('stream-json');
    const emit = require('stream-json/utils/emit');
    
    const pipeline = chain([
      fs.createReadStream('data.json'),
      parser()
    ]);
    emit(pipeline);
    
    let counter = 0;
    pipeline.on('startObject', () => ++counter);
    pipeline.on('finish', () => console.log(counter, 'objects'));
    When the main module is required, it returns a factory function that creates a Parser instance and applies emit() to it, so users can use this simple API for immediate processing.
  • withParser() creates an instance of Parser, creates an instance of a data stream with a provided function, connects them, and returns them as a chain.
    const {pick} = require('stream-json/filters/Pick');
    const withParser = require('stream-json/utils/withParser');
    
    const pipeline = withParser(pick, {filter: 'data'});
    Each stream provided by stream-json implements withParser(options) as a static method:
    const StreamArray = require('stream-json/streamers/StreamArray');
    
    const pipeline = StreamArray.withParser();
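
A usage sketch (it assumes the chain returned by withParser() can be piped to like any other duplex stream, and that the output items have the same {key, value} shape as in the StreamArray example above):

const fs = require('fs');
const StreamArray = require('stream-json/streamers/StreamArray');

const pipeline = fs.createReadStream('data.json').pipe(StreamArray.withParser());
pipeline.on('data', data => console.log(data.key, data.value));
pipeline.on('end', () => console.log('done'));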

Advanced use

Performance tuning

Performance considerations are discussed in a separate document dedicated to Performance.

Migrating from a previous version

Documentation 0.6.x

README, which includes the documentation.

Credits

The test file tests/sample.json.gz is a combination of several publicly available datasets merged and compressed with gzip.
