
Commit 25d49ee

kamiazya and claude authored
perf: JavaScript parser performance improvements (#614)
* perf: JavaScript parser performance improvements

  - Token type constants: changed from Symbol to numeric constants for faster comparison
  - Location tracking: now disabled by default (add trackLocation: true for token location info)
  - Lazy position tracking: error messages include position info even when trackLocation: false
  - Unified token structure: reduced token count by combining delimiter info into field tokens
  - Object.create(null) for records: ~3.6x faster object creation
  - Non-destructive buffer reading: reduced GC pressure by 46%
  - Regex removal for unquoted fields: ~15% faster parsing
  - String comparison optimization: ~10% faster
  - Quoted field parsing optimization: eliminated overhead

  Benchmark results:
  - 1,000 rows: 3.57 ms → 1.77 ms (50% faster)
  - 5,000 rows: 19.47 ms → 8.96 ms (54% faster)
  - Throughput: 23.8 MB/s → 49.0 MB/s (2.06x improvement)

  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <noreply@anthropic.com>

* fix: update new factory function tests to use unified token format

  Update createStringCSVLexerTransformer.spec.ts and createCSVRecordAssemblerTransformer.spec.ts to use the new unified token structure with the Delimiter enum instead of separate Field/FieldDelimiter/RecordDelimiter token types.

* style: format code

* fix: update test expectations for new parser behavior

  Missing CSV fields now return an empty string instead of undefined. Updated test expectations in:
  - StringCSVParserStream.test.ts
  - BinaryCSVParserStream.test.ts
  - parseBinaryToIterableIterator.test.ts
  - parse.spec.ts

* fix: change toStrictEqual to toEqual in browser tests

  Object.create(null) records don't match toStrictEqual due to prototype differences. Changed to toEqual to properly compare object values. Updated test files:
  - parseResponse.browser.spec.ts
  - parseBinaryStream.browser.spec.ts
  - parseBinary.browser.spec.ts

* fix: remove concurrent test execution in stream tests

  Stream tests were hanging in CI and locally when run in parallel. Removed .concurrent from describe and it blocks in:
  - StringCSVParserStream.test.ts
  - BinaryCSVParserStream.test.ts

* refactor: use direct vitest imports in stream tests

  Removed unnecessary aliases for describe and it. Importing directly from vitest is cleaner and more standard.

* style: format test files

  Applied biome formatting to:
  - src/parser/stream/StringCSVParserStream.test.ts
  - src/parser/stream/BinaryCSVParserStream.test.ts

* feat(parser): enhance FlexibleStringCSVLexer with reusable array pooling

  - Implemented ReusableArrayPool to optimize memory usage for string segments in FlexibleStringCSVLexer.
  - Updated FlexibleStringCSVLexer to use the new array pool for segment management, improving parsing performance.
  - Adjusted token handling to ensure correct delimiter usage, replacing EOF with Record where appropriate.
  - Modified tests in CSVRecordAssemblerTransformer and StringCSVLexerTransformer to reflect changes in token structure and delimiter handling.
  - Cleaned up test cases for better readability and consistency.

* Pad record-view fill rows in fill strategy

* Add record-view assembler and tighten column strategies

* revert: remove CSVRecordView feature

  Removed the CSVRecordView feature, rolling back an over-aggressive optimization. The object-format columnCountStrategy restriction (fill/strict only) is retained.
  - Deleted the FlexibleCSVRecordViewAssembler files
  - Removed the CSVRecordView type and the 'record-view' outputFormat from types.ts
  - Removed the record-view feature from FlexibleCSVObjectRecordAssembler
  - Updated the related tests and documentation

* feat: apply const type parameters consistently across all parser classes

  Applied const type parameters consistently across the project:
  - FlexibleCSVArrayRecordAssembler
  - FlexibleCSVObjectRecordAssembler
  - FlexibleStringArrayCSVParser
  - FlexibleStringObjectCSVParser
  - FlexibleBinaryArrayCSVParser
  - FlexibleBinaryObjectCSVParser

  Changed `Header extends ReadonlyArray<string>` to `const Header extends ReadonlyArray<string>` in every class, so literal types are inferred correctly without the user writing `as const`:

  Before: header: ['name', 'age'] as const
  After: header: ['name', 'age']

  Tests: all 1341 pass.

* fix: address code review comments

  1. parseResponse.spec.ts — fixed the describe block name (parseRequest → parseResponse)
  2. column-count-strategy-rename.md — changed the breaking change to a minor release and added a migration guide
  3. performance-improvements.md — corrected the benchmark result description (46% → 83% faster)
  4. vitest.setup.ts — corrected the endOnFailure comment (the default is false)
  5. createStringCSVLexer.ts — added the TrackLocation generic to the return type

  All 1341 tests pass.

* test: improve StringCSVLexerTransformer test quality and update benchmarks

  - Remove debug options ({ numRuns: 10 }) to restore full property-based test coverage
  - Add .flat() to the transform result in the third test for correct array shape matching
  - Fix edge case: an empty CSV string should produce no tokens
  - Update performance numbers with final benchmark measurements:
    - Object format (1,000 rows): 61.2 MB/s (was 49.0 MB/s)
    - Object format (5,000 rows): 67.9 MB/s (was 53.3 MB/s)
    - Array format (1,000 rows): 87.6 MB/s (was 89.6 MB/s)
    - Array format (5,000 rows): 86.4 MB/s (new measurement)
  - Array format is 43% faster (1.43× throughput) than Object format

* style: improve formatting of expected tokens in StringCSVLexerTransformer tests

* test: consolidate test files and improve coverage config

  - Consolidate FlexibleCSVRecordAssembler test files (6 → 2 files)
    - Merged array-output, field-count-limit, object-output, and prototype-safety tests into .test.ts
    - All 77 tests passing
  - Consolidate FlexibleStringCSVLexer test files (4 → 2 files)
    - Merged buffer-overflow and undefined-check tests into .test.ts
    - All 33 tests passing
  - Improve coverage configuration in vite.config.ts
    - Add explicit exclude patterns for test files
    - Add reporter configuration
    - Set reportsDirectory
  - Add @vitest/coverage-v8 to devDependencies for future migration

  Total: 110 test cases passing

* refactor: enhance comments and improve header handling in FlexibleCSVRecordAssembler tests

---------

Co-authored-by: Claude <noreply@anthropic.com>
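The `const` type-parameter change in the commit above can be sketched as follows. This is an illustrative standalone snippet, not the library's actual classes; only the `const Header extends ReadonlyArray<string>` signature shape comes from the commit.

```typescript
// Maps a header tuple to a record type keyed by its entries.
type CSVRecord<Header extends ReadonlyArray<string>> = {
  [K in Header[number]]: string;
};

// Before: Header widens to string[] unless the caller writes `as const`.
function makeAssemblerOld<Header extends ReadonlyArray<string>>(header: Header) {
  return { header };
}

// After: `const Header` keeps the literal tuple type automatically.
function makeAssemblerNew<const Header extends ReadonlyArray<string>>(header: Header) {
  return { header };
}

const a = makeAssemblerNew(["name", "age"]); // no `as const` needed
// a.header is inferred as readonly ["name", "age"], so the record type
// has exactly the keys "name" and "age".
const record: CSVRecord<typeof a.header> = { name: "Alice", age: "30" };
```

This is why the commit message's "Before/After" shows `as const` becoming unnecessary: with a `const` type parameter, the compiler infers the narrowest (literal tuple) type at the call site.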
1 parent 8adf5d9 commit 25d49ee

File tree

65 files changed: +4302 −3165 lines changed

Lines changed: 11 additions & 0 deletions

---
"web-csv-toolbox": minor
---

**BREAKING CHANGE**: Restrict `columnCountStrategy` options for object output to `fill`/`strict` only.

Object format now rejects `keep` and `truncate` strategies at runtime, as these strategies are incompatible with object output semantics. Users relying on `keep` or `truncate` with object format must either:

- Switch to `outputFormat: 'array'` to use these strategies, or
- Use `fill` (default) or `strict` for object output

This change improves API clarity by aligning strategy availability with format capabilities and documenting the purpose-driven strategy matrix (including sparse/header requirements).
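The runtime restriction described above can be sketched as follows. This is an illustrative validation function mirroring the documented behavior, not the library's internal code:

```typescript
type OutputFormat = "object" | "array";
type ColumnCountStrategy = "fill" | "strict" | "keep" | "truncate" | "sparse";

// Object output only accepts "fill" or "strict"; array output accepts all strategies.
function validateStrategy(format: OutputFormat, strategy: ColumnCountStrategy): void {
  if (format === "object" && strategy !== "fill" && strategy !== "strict") {
    throw new RangeError(
      `columnCountStrategy "${strategy}" is not supported with object output; ` +
        `use "fill"/"strict", or switch to outputFormat: "array"`,
    );
  }
}

validateStrategy("object", "fill"); // ok
validateStrategy("array", "truncate"); // ok: array output accepts all strategies

let rejected = false;
try {
  validateStrategy("object", "keep"); // rejected at runtime
} catch {
  rejected = true;
}
```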

.changeset/lexer-api-changes.md

Lines changed: 19 additions & 0 deletions

---
"web-csv-toolbox": minor
---

## Lexer API Changes

This release includes low-level Lexer API changes for performance optimization.

### Breaking Changes (Low-level API only)

These changes only affect users of the low-level Lexer API. **High-level APIs (`parseString`, `parseBinary`, etc.) are unchanged.**

1. **Token type constants**: Changed from `Symbol` to numeric constants.
2. **Location tracking**: Now disabled by default. Add `trackLocation: true` to Lexer options if you need token location information. Note: error messages still include position information even when `trackLocation: false` (it is computed lazily, only when errors occur).
3. **Unified token structure**: Token properties changed to improve performance; the number of tokens is reduced by combining delimiter and newline information into each field token.

### Who is affected?

**Most users are NOT affected.** Only users who directly use `FlexibleStringCSVLexer` and rely on `token.location` or `Symbol`-based token type comparison need to update their code.
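A migration sketch for the unified token format follows. The `Delimiter` enum values match `src/core/constants.ts` in this commit; the token property names (`value`, `delimiter`, `location`) are illustrative assumptions, not the exact API.

```typescript
enum Delimiter {
  Field = 0, // a field delimiter (e.g. comma) follows this field
  Record = 1, // a record delimiter (newline) follows this field
}

interface FieldToken {
  value: string;
  delimiter: Delimiter;
  location?: { offset: number }; // only present with trackLocation: true
}

// Before: token.type === Symbol.for("web-csv-toolbox.RecordDelimiter")
// After: a plain numeric comparison, which is cheaper than Symbol lookup.
function endsRow(token: FieldToken): boolean {
  return token.delimiter === Delimiter.Record;
}

const tokens: FieldToken[] = [
  { value: "name", delimiter: Delimiter.Field },
  { value: "age", delimiter: Delimiter.Record },
];
const rowEnders = tokens.filter(endsRow);
```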
Lines changed: 47 additions & 0 deletions

---
"web-csv-toolbox": patch
---

## JavaScript Parser Performance Improvements

This release includes significant internal optimizations that improve JavaScript-based CSV parsing performance.

### Before / After Comparison

| Metric | Before (v0.14) | After | Improvement |
|--------|----------------|-------|-------------|
| 1,000 rows parsing | 3.57 ms | 1.42 ms | **60% faster** |
| 5,000 rows parsing | 19.47 ms | 7.03 ms | **64% faster** |
| Throughput (1,000 rows) | 24.3 MB/s | 61.2 MB/s | **2.51x** |
| Throughput (5,000 rows) | 24.5 MB/s | 67.9 MB/s | **2.77x** |

### Optimization Summary

| Optimization | Target | Improvement |
|--------------|--------|-------------|
| Array copy method improvement | Assembler | -8.7% |
| Quoted field parsing optimization | Lexer | Overhead eliminated |
| Object assembler loop optimization | Assembler | -5.4% |
| Regex removal for unquoted fields | Lexer | -14.8% |
| String comparison optimization | Lexer | ~10% |
| Object creation optimization | Lexer | ~20% |
| Non-destructive buffer reading | GC | -46% |
| Token type numeric conversion | Lexer/GC | -7% / -13% |
| Location tracking made optional | Lexer | -19% to -31% |
| Object.create(null) for records | Assembler | -31% |
| Empty-row template cache | Assembler | ~4% faster on sparse CSV |
| Row buffer reuse (no per-record slice) | Assembler | ~6% faster array format |
| Header-length builder preallocation | Assembler | Capacity stays steady on wide CSV |
| Object assembler row buffer pooling | Assembler | Lower GC spikes on object output |
| Lexer segment-buffer pooling | Lexer | Smoother GC for quoted-heavy input |

### Final Performance Results (Pure JavaScript)

| Format | Throughput |
|--------|------------|
| Object format (1,000 rows) | **61.2 MB/s** |
| Array format (1,000 rows) | **87.6 MB/s** |
| Object format (5,000 rows) | **67.9 MB/s** |
| Array format (5,000 rows) | **86.4 MB/s** |

Array format is approximately 43% faster (1.43× throughput) than Object format for the same data.
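The `Object.create(null)` record optimization listed above can be sketched as follows. This is an illustrative reimplementation of the technique, not the library's internal assembler; it also shows why the browser tests had to switch from `toStrictEqual` to `toEqual` (the records have no prototype):

```typescript
// Build a record with a null prototype: header names like "toString" or
// "__proto__" cannot collide with Object.prototype members.
function createRecord(header: readonly string[], row: readonly string[]) {
  const record: Record<string, string> = Object.create(null);
  for (let i = 0; i < header.length; i++) {
    record[header[i]] = row[i] ?? ""; // fill semantics: missing fields become ""
  }
  return record;
}

const record = createRecord(["name", "__proto__"], ["Alice", "x"]);
// With a plain {} literal, assigning "__proto__" would trigger the inherited
// setter and mutate the prototype; with a null prototype it is just data.
```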

benchmark/package.json

Lines changed: 8 additions & 3 deletions

@@ -4,8 +4,13 @@
   "private": true,
   "type": "module",
   "scripts": {
-    "start": "tsx main.ts",
-    "queuing-strategy": "tsx queuing-strategy.bench.ts"
+    "start": "node --import tsx main.ts",
+    "queuing-strategy": "node --import tsx queuing-strategy.bench.ts",
+    "quick": "node --import tsx scripts/quick-bench.mts",
+    "unified": "node --import tsx scripts/unified-token-bench.mts",
+    "profile:cpu": "node --cpu-prof --cpu-prof-dir=./profiles --import tsx scripts/profile-cpu.mts",
+    "profile:memory": "node --heap-prof --heap-prof-dir=./profiles --import tsx scripts/profile-memory.mts",
+    "profile:memory:gc": "node --heap-prof --heap-prof-dir=./profiles --expose-gc --import tsx scripts/profile-memory.mts"
   },
   "license": "MIT",
   "dependencies": {
@@ -14,4 +19,4 @@
   "tsx": "catalog:",
   "web-csv-toolbox": "workspace:*"
   }
-}
+}

config/vitest.setup.ts

Lines changed: 3 additions & 1 deletion

@@ -1,5 +1,7 @@
 import fc from "fast-check";

 fc.configureGlobal({
-  // This is the default value, but we set it here to be explicit.
+  // Set to true to stop property tests on first failure (default is false).
+  // This speeds up test runs by avoiding unnecessary iterations after a counterexample is found.
   endOnFailure: true,
 });
Lines changed: 46 additions & 0 deletions

# ColumnCountStrategy Guide

`columnCountStrategy` controls how the parser handles rows whose column counts differ from the header. The available strategies depend on the output format and whether a header is known in advance.

## Compatibility Matrix

| Strategy | Short rows | Long rows | Object | Array (explicit header) | Array (header inferred) | Headerless (`header: []`) |
|------------|--------------------------|------------------------|--------|-------------------------|-------------------------|----------------------------|
| `fill` | Pad with `""` | Trim excess columns | ✅ | ✅ | ✅ | ❌ |
| `strict` | Throw error | Throw error | ✅ | ✅ | ✅ | ❌ |
| `keep` | Keep as-is (ragged rows) | Keep as-is | ❌ | ✅ | ✅ | ✅ (mandatory) |
| `truncate` | Keep as-is | Trim to header length | ❌ | ✅ | ❌ (requires header) | ❌ |
| `sparse` | Pad with `undefined` | Trim excess columns | ❌ | ✅ | ❌ (requires header) | ❌ |

## Strategy Details

### `fill` (default)
- Guarantees fixed-length records matching the header.
- Object output: missing values become `""`, enabling consistent string-based models.
- Array output: missing values also become empty strings.

### `strict`
- Treats any column-count mismatch as a fatal error, useful for schema validation.
- Requires a header (explicit or inferred).

### `keep`
- Leaves each row untouched. Arrays can vary in length, making it ideal for ragged data or headerless CSVs.
- Headerless mode (`header: []`) enforces `keep`.

### `truncate`
- Drops trailing columns that exceed the header length while leaving short rows untouched.
- Only available when a header is provided (array output).

### `sparse`
- Similar to `fill`, but pads missing entries with `undefined`, which is useful when you want to distinguish missing values from empty ones.
- Requires an explicit header to determine the target length.

## Choosing a Strategy

1. **Need strict schema enforcement?** Use `strict`.
2. **Need consistent string values?** Use `fill` (object default).
3. **Need ragged rows / headerless CSV?** Use `keep` (array output).
4. **Need to ignore trailing columns?** Use `truncate` (array output with header).
5. **Need optional columns?** Use `sparse` (array output with header).

Pair this guide with the [Output Format Guide](./output-format-guide.md) to decide which combination best fits your workload.
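The five behaviors in the matrix above can be sketched in a few lines. This is an illustrative, standalone function (the real assembler is token-based and streaming):

```typescript
type Strategy = "fill" | "strict" | "keep" | "truncate" | "sparse";

function applyStrategy(
  row: string[],
  headerLength: number,
  strategy: Strategy,
): (string | undefined)[] {
  if (strategy === "keep" || row.length === headerLength) {
    return row; // keep: rows pass through untouched, ragged or not
  }
  switch (strategy) {
    case "strict":
      throw new RangeError(`expected ${headerLength} columns, got ${row.length}`);
    case "truncate":
      // Trim long rows to the header length; leave short rows untouched.
      return row.length > headerLength ? row.slice(0, headerLength) : row;
    case "fill":
    case "sparse": {
      // Trim excess columns, then pad short rows with "" (fill) or undefined (sparse).
      const pad = strategy === "fill" ? "" : undefined;
      const out: (string | undefined)[] = row.slice(0, headerLength);
      while (out.length < headerLength) out.push(pad);
      return out;
    }
  }
}

const short = ["a"]; // 1 column against a 3-column header
const long = ["a", "b", "c", "d"];

applyStrategy(short, 3, "fill"); // ["a", "", ""]
applyStrategy(short, 3, "sparse"); // ["a", undefined, undefined]
applyStrategy(long, 3, "truncate"); // ["a", "b", "c"]
applyStrategy(long, 3, "keep"); // ["a", "b", "c", "d"] (unchanged)
```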
Lines changed: 47 additions & 0 deletions

# Output Format Guide

Many APIs (e.g. `parseString`, `createCSVRecordAssembler`, stream transformers) expose an `outputFormat` option so you can choose the most suitable record representation for your workload. This guide summarizes each format's behavior, strengths, and constraints.

## Quick Comparison

| Format | Representation | Best for | ColumnCountStrategy support | Headerless (`header: []`) | `includeHeader` | Notes |
|----------|-------------------------------------|-----------------------------------------|-----------------------------|---------------------------|-----------------|-------|
| `object` | Plain object `{ headerKey: value }` | JSON interoperability, downstream libs | `fill`, `strict` | ❌ | ❌ | Default output. Values are always strings. |
| `array` | Readonly array / named tuple | Maximum throughput, flexible schemas | All strategies (`fill`, `keep`, `truncate`, `sparse`, `strict`) | ✅ (with `keep`) | ✅ | Headerless mode requires `outputFormat: "array"` + `columnCountStrategy: "keep"`. |

## Object Format (`"object"`)

- Produces plain objects keyed by header names.
- Missing columns are padded with empty strings in `fill` mode, or rejected in `strict`.
- Recommended when you plan to serialize to JSON, access fields by name exclusively, or hand records to other libraries.

```ts
const assembler = createCSVRecordAssembler({
  header: ["name", "age"] as const,
  // outputFormat defaults to "object"
});
for (const record of assembler.assemble(tokens)) {
  record.name; // string
}
```

## Array Format (`"array"`)

- Emits header-ordered arrays (typed as named tuples when a header is provided).
- Supports every columnCountStrategy, including `keep` for ragged rows and `sparse` for optional columns.
- The only format that supports headerless mode.

```ts
const assembler = createCSVRecordAssembler({
  header: ["name", "age"] as const,
  outputFormat: "array",
  columnCountStrategy: "truncate",
});
const [row] = assembler.assemble(tokens);
row[0]; // "Alice"
```

## Choosing the Right Format

1. **Need plain JS objects / JSON serialization?** Use `object`.
2. **Need the fastest throughput or ragged rows?** Use `array` with the appropriate `columnCountStrategy`.

For more details on column-count handling, see the [ColumnCountStrategy guide](./column-count-strategy-guide.md).
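The headerless constraint described above (`header: []` requires array output with `keep`) can be sketched as a small option check. This is an illustrative standalone function mirroring the documented rule, not the library's validation code:

```typescript
interface AssemblerOptions {
  header: readonly string[];
  outputFormat: "object" | "array";
  columnCountStrategy: "fill" | "strict" | "keep" | "truncate" | "sparse";
}

// Returns true when the option combination is valid for headerless input.
function headerlessOptionsValid(options: AssemblerOptions): boolean {
  const headerless = options.header.length === 0;
  if (!headerless) return true;
  // Headerless mode: rows may be ragged, so only array output with "keep" works.
  return options.outputFormat === "array" && options.columnCountStrategy === "keep";
}

const ok = headerlessOptionsValid({
  header: [],
  outputFormat: "array",
  columnCountStrategy: "keep",
});
const bad = headerlessOptionsValid({
  header: [],
  outputFormat: "object",
  columnCountStrategy: "fill",
});
```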

package.json

Lines changed: 1 addition & 0 deletions

@@ -218,6 +218,7 @@
   "@types/node": "^24.10.1",
   "@vitest/browser-webdriverio": "^4.0.3",
   "@vitest/coverage-istanbul": "4.0.3",
+  "@vitest/coverage-v8": "4.0.3",
   "@wasm-tool/rollup-plugin-rust": "^3.0.5",
   "changesets-github-release": "^0.1.0",
   "fast-check": "^4.1.1",

pnpm-lock.yaml

Lines changed: 49 additions & 0 deletions
(Generated file; diff not rendered.)

src/core/constants.ts

Lines changed: 13 additions & 12 deletions

@@ -100,17 +100,18 @@ export const DEFAULT_STREAM_BACKPRESSURE_CHECK_INTERVAL = 100;
 export const DEFAULT_ASSEMBLER_BACKPRESSURE_CHECK_INTERVAL = 10;

 /**
- * FiledDelimiter is a symbol for field delimiter of CSV.
- * @category Constants
- */
-export const FieldDelimiter = Symbol.for("web-csv-toolbox.FieldDelimiter");
-/**
- * RecordDelimiter is a symbol for record delimiter of CSV.
- * @category Constants
- */
-export const RecordDelimiter = Symbol.for("web-csv-toolbox.RecordDelimiter");
-/**
- * Field is a symbol for field of CSV.
+ * Delimiter type enumeration for unified token format.
+ *
+ * Used in the new FieldToken format to indicate what follows the field value.
+ * This enables a more efficient token format where only field tokens are emitted.
+ *
  * @category Constants
  */
-export const Field = Symbol.for("web-csv-toolbox.Field");
+export enum Delimiter {
+  /** Next token is a field (followed by field delimiter like comma) */
+  Field = 0,
+  /** Next token is a record delimiter (newline) */
+  Record = 1,
+  // /** End of file/stream */
+  // EOF = 2,
+}