bug(csv): Add comprehensive test cases and fix bug for CRLF/LF line ending compatibility #536
base: main
Conversation
Note: Other AI code review bot(s) detected
CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough
The lexer no longer trims trailing CRLF/LF on flush and now slices buffers by the full delimiter length when emitting delimiters; tests and parse helpers were updated to expect RecordDelimiter tokens and consistent LOCATION_SHAPE metadata across LF, CRLF, and mixed line endings.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller as Parser/Reader
    participant Lexer as Lexer
    participant TG as TokenGenerator
    rect rgba(243,247,255,0.9)
        note right of Lexer: Old flow (pre-change)
        Caller->>Lexer: provide buffer "field\r\n"
        Lexer->>TG: flush(buffer)
        TG->>TG: trim trailing CRLF/LF (removed)
        TG->>Caller: emit Field(token) -- RecordDelimiter omitted
    end
    rect rgba(246,255,243,0.9)
        note right of Lexer: New flow (post-change)
        Caller->>Lexer: provide buffer "field\r\n"
        Lexer->>TG: flush(buffer)
        TG->>TG: slice buffer by full delimiter length
        TG->>Caller: emit Field(token with LOCATION_SHAPE)
        TG->>Caller: emit RecordDelimiter(token with LOCATION_SHAPE)
    end
```
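To make the post-change flow concrete, here is a minimal sketch of the expected token stream for a buffer ending in CRLF. It uses the Lexer API as it appears in this PR's tests (`new Lexer(options)` and spreading `lexer.lex(csv)`); the location values in the comments are illustrative assumptions, not verified output.

```ts
import { Lexer } from "./Lexer";
import { Field, RecordDelimiter } from "./common/constants";

// After the fix, the trailing CRLF is no longer trimmed on flush,
// so a final RecordDelimiter token is emitted after the field.
const lexer = new Lexer({ delimiter: ",", quotation: '"' });
const tokens = [...lexer.lex("field\r\n")];

// Expected shape (location details abbreviated and assumed):
// [
//   { type: Field,           value: "field", location: { rowNumber: 1, ... } },
//   { type: RecordDelimiter, value: "\r\n",  location: { rowNumber: 1, ... } },
// ]
```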
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Summary of Changes
Hello @kanishka1804, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the robustness of the CSV parsing library by addressing critical compatibility issues with different line ending conventions (CRLF and LF). It introduces extensive testing to ensure consistent and correct behavior across various EOL scenarios, including mixed formats and EOLs embedded within fields. A key bug in the Lexer's EOL trimming logic has been rectified, allowing for accurate parsing of trailing empty records. The overall impact is a more reliable and standard-compliant CSV parser.

Highlights
Code Review
This pull request introduces a significant improvement to the CSV parser by ensuring compatibility with both LF and CRLF line endings. The changes include a crucial bug fix in the Lexer to correctly handle trailing EOLs, which is well-supported by a comprehensive new suite of tests, including unit, end-to-end, and property-based tests. The added tests greatly enhance confidence in the parser's robustness. My review includes a couple of suggestions to further improve the completeness and correctness of the new test cases. Overall, this is a high-quality contribution.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/Lexer.ts (2)
159-173: Fix delimiter slicing for multi-character delimiters.
Using slice(1) corrupts tokenization when delimiter length > 1 (e.g., "::"), causing the remaining delimiter chars to be treated as data. Use #fieldDelimiterLength consistently.
```diff
-    if (this.#buffer.startsWith(this.#delimiter)) {
-      this.#buffer = this.#buffer.slice(1);
+    if (this.#buffer.startsWith(this.#delimiter)) {
+      this.#buffer = this.#buffer.slice(this.#fieldDelimiterLength);
       const start: Position = { ...this.#cursor };
-      this.#cursor.column += this.#fieldDelimiterLength;
-      this.#cursor.offset += this.#fieldDelimiterLength;
+      this.#cursor.column += this.#fieldDelimiterLength;
+      this.#cursor.offset += this.#fieldDelimiterLength;
       return {
         type: FieldDelimiter,
         value: this.#delimiter,
         location: {
           start,
           end: { ...this.#cursor },
           rowNumber: this.#rowNumber,
         },
       };
     }
```

Also consider adding a unit/property test with multi-char delimiters to prevent regression (see suggestion in Lexer.spec.ts comment).
277-281: Comment contradicts the condition (buffering vs flushing).
The code defers emission when NOT flushing and the match fully consumes the buffer. Update the comment for clarity.
```diff
-    // If we're flushing and the match doesn't consume the entire buffer,
-    // then return null
+    // When buffering (not flushing) and the match fully consumes the buffer,
+    // defer emission to wait for the next chunk (field may be followed by delimiter/quote)
     if (this.#flush === false && match[0].length === this.#buffer.length) {
       return null;
     }
```
🧹 Nitpick comments (3)
src/Lexer.test.ts (1)
164-164: Minor test description typos.
- Rename “RCLF” → “CRLF”.
- Fix “thorw” → “throw”.
- test("should parse a field with escaped and delimiter and record delimiter and EOF(RCLF)", () => { + test("should parse a field with escaped and delimiter and record delimiter and EOF(CRLF)", () => { @@ - test("should thorw an error if the field is not closed", () => { + test("should throw an error if the field is not closed", () => { @@ - test("should thorw DOMException named AbortError if the signal is aborted", () => { + test("should throw DOMException named AbortError if the signal is aborted", () => {Also applies to: 280-281, 296-296
src/parse.spec.ts (2)
371-381: Tighten test typing to avoid any[].
Optional: define a Row alias for readability.
```diff
-async function toArray(csv: string | ReadableStream<string>): Promise<any[]>;
-async function toArray(csv: Uint8Array | ArrayBuffer | ReadableStream<Uint8Array> | Response): Promise<any[]>;
+type Row = Record<string, string>;
+async function toArray(csv: string | ReadableStream<string>): Promise<Row[]>;
+async function toArray(csv: Uint8Array | ArrayBuffer | ReadableStream<Uint8Array> | Response): Promise<Row[]>;
 async function toArray(
-  csv: string | Uint8Array | ArrayBuffer | ReadableStream<any> | Response
-): Promise<any[]> {
+  csv: string | Uint8Array | ArrayBuffer | ReadableStream<any> | Response
+): Promise<Row[]> {
```
423-435: Add a CRLF+CRLF double-trailing case for completeness.
You already test LF+LF. Add CRLF+CRLF to assert an empty trailing record is created too.
```diff
 it('should ignore a single trailing line ending (LF or CRLF) but handle double trailing EOL', async () => {
@@
   const expectedWithEmpty = [...expected, { name: '', age: '' }];
   expect(await toArray(csvWithDoubleTrailing)).toEqual(expectedWithEmpty);
+
+  // CRLF + CRLF should also produce an empty trailing record
+  const csvWithDoubleTrailingCRLF = 'name,age\r\nAlice,42\r\n\r\n';
+  expect(await toArray(csvWithDoubleTrailingCRLF)).toEqual(expectedWithEmpty);
 });
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- src/Lexer.spec.ts (3 hunks)
- src/Lexer.test.ts (3 hunks)
- src/Lexer.ts (1 hunks)
- src/parse.spec.ts (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2024-06-10T19:31:23.296Z
Learnt from: kamiazya
PR: kamiazya/web-csv-toolbox#32
File: src/__tests__/helper.ts:0-0
Timestamp: 2024-06-10T19:31:23.296Z
Learning: The `csv` function that was previously commented out in `src/__tests__/helper.ts` has been confirmed as removed in commit `f567442793c67651e6dbb934623b05540c33c8d1`.
Applied to files:
src/parse.spec.ts
📚 Learning: 2024-06-10T19:31:23.296Z
Learnt from: kamiazya
PR: kamiazya/web-csv-toolbox#29
File: src/transformers/LexerTransformer.ts:146-150
Timestamp: 2024-06-10T19:31:23.296Z
Learning: The documentation within the `LexerTransformer` class was corrected to accurately describe the behavior of the `#nextToken` method when not flushing and the buffer does not end with a quotation mark, as per commit `d5c7acd6cd074296ab8dd39c62acb3efd84d4707`.
Applied to files:
src/Lexer.ts
🧬 Code graph analysis (3)
src/parse.spec.ts (1)
src/parse.ts (1)
parse(186-210)
src/Lexer.spec.ts (4)
src/__tests__/helper.ts (3)
eol (219-221), row (228-237), field (141-150)
src/escapeField.ts (1)
escapeField (23-65)
src/common/constants.ts (3)
Field (15-15), FieldDelimiter (5-5), RecordDelimiter (10-10)
src/Lexer.ts (1)
Lexer (19-301)
src/Lexer.test.ts (1)
src/common/constants.ts (3)
Field (15-15), FieldDelimiter (5-5), RecordDelimiter (10-10)
🔇 Additional comments (5)
src/Lexer.test.ts (3)
11-33: Mixed CRLF/LF tokenization test looks correct.
Offsets, line/column, and rowNumber transitions are consistent. Good coverage.
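As a condensed illustration of what these consistent transitions look like, a hedged sketch follows; the token order is grounded in the diff under review, while the exact rowNumber placement on each token is an assumption.

```ts
import { Lexer } from "./Lexer";

// Mixed CRLF/LF input: each RecordDelimiter carries its own EOL verbatim,
// and the row number advances at every record boundary.
const tokens = [...new Lexer({ delimiter: ",", quotation: '"' }).lex("a\r\nb\nc")];
// [
//   { type: Field,           value: "a",    location: { rowNumber: 1, ... } },
//   { type: RecordDelimiter, value: "\r\n", location: { rowNumber: 1, ... } },
//   { type: Field,           value: "b",    location: { rowNumber: 2, ... } },
//   { type: RecordDelimiter, value: "\n",   location: { rowNumber: 2, ... } },
//   { type: Field,           value: "c",    location: { rowNumber: 3, ... } },
// ]
```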
152-160: EOF(LF): expecting a final RecordDelimiter is correct.
Matches the updated Lexer semantics when flushing.
194-202: EOF(CRLF): final RecordDelimiter expectation is correct.
Consistent with CRLF handling and location updates.
src/Lexer.spec.ts (2)
25-79: Property test for EOL tokenization looks solid.
Good use of toMatchObject with LOCATION_SHAPE and quoted fields to isolate RecordDelimiter behavior.
295-326: EOF final RecordDelimiter expectation is appropriate.
This matches the “no trimming on flush” change.
To guard against regressions, consider adding a targeted test for multi-character delimiters (would have caught a slicing bug in Lexer):
+ it("should tokenize correctly with multi-character delimiters", () => { + const options = { delimiter: "::" as const, quotation: '"' as const }; + const lexer = new Lexer(options); + const csv = '"a"::"b"\r\n"c"::"d"'; + const tokens = [...lexer.lex(csv)]; + expect(tokens).toMatchObject([ + { type: Field, value: "a", location: LOCATION_SHAPE }, + { type: FieldDelimiter, value: "::", location: LOCATION_SHAPE }, + { type: Field, value: "b", location: LOCATION_SHAPE }, + { type: RecordDelimiter, value: "\r\n", location: LOCATION_SHAPE }, + { type: Field, value: "c", location: LOCATION_SHAPE }, + { type: FieldDelimiter, value: "::", location: LOCATION_SHAPE }, + { type: Field, value: "d", location: LOCATION_SHAPE }, + ]); + });
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/Lexer.spec.ts (1)
25-79: Assert location on Field tokens for consistency and stronger checks.
Other tests validate location on Field tokens; this new property test doesn’t. Add location: LOCATION_SHAPE to Field entries to ensure position metadata is consistently validated.

```diff
-        { type: Field, value: field },
+        { type: Field, value: field, location: LOCATION_SHAPE },
```
🧹 Nitpick comments (1)
src/LexerTransformer.spec.ts (1)
164-177: Great regression example; add a CRLF twin for completeness.
The lone LF example is perfect to lock in the fix. Consider adding an explicit CRLF example too to harden cross‑platform coverage.
```diff
   {
     options: { delimiter: ",", quotation: '"' },
     chunks: ["\n"],
     expected: [
       {
         type: RecordDelimiter,
         value: "\n",
         location: LOCATION_SHAPE,
       },
     ],
   },
+  {
+    options: { delimiter: ",", quotation: '"' },
+    chunks: ["\r\n"],
+    expected: [
+      {
+        type: RecordDelimiter,
+        value: "\r\n",
+        location: LOCATION_SHAPE,
+      },
+    ],
+  },
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- src/Lexer.spec.ts (3 hunks)
- src/Lexer.ts (2 hunks)
- src/LexerTransformer.spec.ts (1 hunks)
- src/parse.spec.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/Lexer.ts
- src/parse.spec.ts
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-06-10T19:31:23.296Z
Learnt from: kamiazya
PR: kamiazya/web-csv-toolbox#29
File: src/transformers/LexerTransformer.ts:146-150
Timestamp: 2024-06-10T19:31:23.296Z
Learning: The documentation within the `LexerTransformer` class was corrected to accurately describe the behavior of the `#nextToken` method when not flushing and the buffer does not end with a quotation mark, as per commit `d5c7acd6cd074296ab8dd39c62acb3efd84d4707`.
Applied to files:
src/LexerTransformer.spec.ts
🧬 Code graph analysis (2)
src/LexerTransformer.spec.ts (1)
src/common/constants.ts (1)
RecordDelimiter (10-10)
src/Lexer.spec.ts (4)
src/__tests__/helper.ts (3)
eol (219-221), row (228-237), field (141-150)
src/escapeField.ts (1)
escapeField (23-65)
src/common/constants.ts (3)
Field (15-15), FieldDelimiter (5-5), RecordDelimiter (10-10)
src/Lexer.ts (1)
Lexer (19-301)
🔇 Additional comments (2)
src/Lexer.spec.ts (2)
292-298: LGTM — location added to Field tokens.
Adding location: LOCATION_SHAPE here aligns expectations with the lexer’s position tracking.
321-327: LGTM — explicit final RecordDelimiter at EOF.
Pushing a trailing RecordDelimiter when EOF is true correctly mirrors the lexer’s behavior post‑fix.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/Lexer.spec.ts (1)
25-79: Well-structured property test for EOL tokenization.
The test correctly validates that the Lexer emits proper tokens (Field, FieldDelimiter, RecordDelimiter) with location metadata for CSVs using either LF or CRLF. The test data constraints (minLength: 1) ensure non-empty inputs, and the expected token array logic properly accounts for fields, delimiters, and record boundaries.

Optional: Consider clarifying the test scope in the name.
The test name mentions "for various EOL", which could imply mixed line endings within a single CSV. However, each generated CSV uses a consistent EOL (const eol = g(FC.eol)), and the property test exercises different EOL types across multiple runs. If mixed EOLs are tested elsewhere (e.g., in Lexer.test.ts as noted in the PR summary), consider renaming to "for LF and CRLF delimiters" to avoid ambiguity.
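For readers unfamiliar with the spec's helpers, the property under discussion can be approximated with plain fast-check as below. This is a sketch, not the repo's test: the FC helpers (eol, row, field) are not reproduced, and the field values are simplified stand-ins.

```ts
import fc from "fast-check";
import { Lexer } from "./Lexer";
import { RecordDelimiter } from "./common/constants";

fc.assert(
  fc.property(
    // One consistent EOL per generated CSV, varied across runs.
    fc.constantFrom("\n", "\r\n"),
    fc.array(fc.constantFrom("alpha", "beta", "gamma"), { minLength: 1 }),
    (eol, fields) => {
      // One single-field, quoted record per line, with a trailing EOL.
      const csv = fields.map((f) => `"${f}"`).join(eol) + eol;
      const tokens = [...new Lexer({ delimiter: ",", quotation: '"' }).lex(csv)];
      // Post-fix: every EOL, including the trailing one, yields a RecordDelimiter.
      const delims = tokens.filter((t) => t.type === RecordDelimiter);
      return delims.length === fields.length && delims.every((t) => t.value === eol);
    },
  ),
);
```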
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/Lexer.spec.ts (3 hunks)
- src/parse.spec.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/parse.spec.ts
🧰 Additional context used
🧬 Code graph analysis (1)
src/Lexer.spec.ts (4)
src/__tests__/helper.ts (3)
eol (219-221), row (228-237), field (141-150)
src/escapeField.ts (1)
escapeField (23-65)
src/common/constants.ts (3)
Field (15-15), FieldDelimiter (5-5), RecordDelimiter (10-10)
src/Lexer.ts (1)
Lexer (19-301)
🔇 Additional comments (2)
src/Lexer.spec.ts (2)
296-296: LGTM: Location metadata now consistent.
Adding location: LOCATION_SHAPE aligns this expectation with other tests and ensures location data is validated for Field tokens when fields are quoted or non-empty.
321-327: Correct handling of trailing EOL.
When EOF is true, the CSV ends with an EOL character, and the Lexer should emit a corresponding RecordDelimiter token. The data.length > 0 guard is defensive (the test constraint ensures minLength: 1) but harmless and makes the intent explicit.
@kanishka1804 Please check the CI failure; it seems like there are some formatting issues.
Hello @kamiazya,
Bundle Report
Changes will decrease total bundle size by 1.35kB (-0.17%) ⬇️. This is within the configured threshold ✅

Detailed changes for bundle web-csv-toolbox-CSV-esm (Assets Changed):
CodSpeed Performance Report
Merging #536 will not alter performance.
…ending compatibility: force-pushed from e45035d to 80e0b50
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/parse.spec.ts (1)
385-463: Line-ending compatibility suite is comprehensive and matches #515 acceptance criteria.
LF vs CRLF parity, mixed endings, quoted EOL preservation, and all four double‑EOL combinations are covered. Good job.
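At the parse level, the behavior this suite locks in can be summarized with a short sketch. It assumes the public parse API behaves exactly as these specs assert; the package name is taken from this repo, and toRows is a hypothetical helper mirroring the spec's toArray.

```ts
import { parse } from "web-csv-toolbox";

async function toRows(csv: string): Promise<Record<string, string>[]> {
  const rows: Record<string, string>[] = [];
  for await (const row of parse(csv)) rows.push(row as Record<string, string>);
  return rows;
}

// LF and CRLF inputs parse identically:
await toRows("name,age\nAlice,42\n");     // [{ name: "Alice", age: "42" }]
await toRows("name,age\r\nAlice,42\r\n"); // same result

// A double trailing EOL yields one extra empty record:
await toRows("name,age\r\nAlice,42\r\n\r\n");
// [{ name: "Alice", age: "42" }, { name: "", age: "" }]
```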
🧹 Nitpick comments (3)
src/LexerTransformer.spec.ts (1)
165-175: Cover single CRLF example as well for symmetry.
Add a matching example for "\r\n" so both LF and CRLF single‑EOL cases are locked in.
```diff
   {
     options: { delimiter: ",", quotation: '"' },
     chunks: ["\n"],
     expected: [
       {
         type: RecordDelimiter,
         value: "\n",
         location: LOCATION_SHAPE,
       },
     ],
   },
+  {
+    options: { delimiter: ",", quotation: '"' },
+    chunks: ["\r\n"],
+    expected: [
+      {
+        type: RecordDelimiter,
+        value: "\r\n",
+        location: LOCATION_SHAPE,
+      },
+    ],
+  },
```

src/Lexer.test.ts (1)
282-290: Nit: fix test title typo “RCLF” → “CRLF”.
Purely cosmetic; avoids confusion.
- test("should parse a field with escaped and delimiter and record delimiter and EOF(RCLF)", () => { + test("should parse a field with escaped and delimiter and record delimiter and EOF(CRLF)", () => {src/parse.spec.ts (1)
371-384: Type the helper’s return to improve checks.
Returning a structured type helps catch expectation mistakes in tests.
```diff
-async function toArray(csv: string | ReadableStream<string>): Promise<any[]>;
+async function toArray(csv: string | ReadableStream<string>): Promise<Record<string, string>[]>;
 async function toArray(
   csv: Uint8Array | ArrayBuffer | ReadableStream<Uint8Array> | Response,
-): Promise<any[]>;
+): Promise<Record<string, string>[]>;
 async function toArray(
   csv: string | Uint8Array | ArrayBuffer | ReadableStream<any> | Response,
-): Promise<any[]> {
+): Promise<Record<string, string>[]> {
   const records = [];
   for await (const row of parse(csv as any)) {
     records.push(row);
   }
   return records;
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- src/Lexer.spec.ts (3 hunks)
- src/Lexer.test.ts (3 hunks)
- src/Lexer.ts (2 hunks)
- src/LexerTransformer.spec.ts (1 hunks)
- src/parse.spec.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/Lexer.ts
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2024-06-10T19:31:23.296Z
Learnt from: kamiazya
PR: kamiazya/web-csv-toolbox#32
File: src/__tests__/helper.ts:0-0
Timestamp: 2024-06-10T19:31:23.296Z
Learning: The `csv` function that was previously commented out in `src/__tests__/helper.ts` has been confirmed as removed in commit `f567442793c67651e6dbb934623b05540c33c8d1`.
Applied to files:
src/parse.spec.ts
📚 Learning: 2024-06-10T19:31:23.296Z
Learnt from: kamiazya
PR: kamiazya/web-csv-toolbox#29
File: src/transformers/LexerTransformer.ts:146-150
Timestamp: 2024-06-10T19:31:23.296Z
Learning: The documentation within the `LexerTransformer` class was corrected to accurately describe the behavior of the `#nextToken` method when not flushing and the buffer does not end with a quotation mark, as per commit `d5c7acd6cd074296ab8dd39c62acb3efd84d4707`.
Applied to files:
src/LexerTransformer.spec.ts
🧬 Code graph analysis (4)
src/parse.spec.ts (1)
src/parse.ts (1)
parse(186-210)
src/Lexer.test.ts (1)
src/common/constants.ts (3)
Field (15-15), FieldDelimiter (5-5), RecordDelimiter (10-10)
src/Lexer.spec.ts (4)
src/__tests__/helper.ts (3)
eol (219-221), row (228-237), field (141-150)
src/escapeField.ts (1)
escapeField (23-65)
src/common/constants.ts (3)
Field (15-15), FieldDelimiter (5-5), RecordDelimiter (10-10)
src/Lexer.ts (1)
Lexer (19-302)
src/LexerTransformer.spec.ts (1)
src/common/constants.ts (1)
RecordDelimiter (10-10)
🔇 Additional comments (4)
src/Lexer.test.ts (2)
11-121: Mixed CRLF/LF tokenization test reads correctly.
Locations/offsets and delimiter values align with the Lexer behavior. Nice addition.
240-248: EOF(LF) RecordDelimiter assertion matches new no-trim behavior.
Expectation is correct post-fix.
src/Lexer.spec.ts (2)
24-77: Property test for EOL tokenization is solid.
Good coverage across options/data with quoted fields; expected tokens and locations look right.
294-325: EOF RecordDelimiter expectation is correct with the trailing‑EOL fix.
This guards against regressions in end‑of‑file handling.
@kanishka1804 Sorry, I haven't had a chance to look into it in detail either, but it seems highly likely that there's some problem with the automated tests. The GitHub Action job timed out. It's probable that the overall behavior has become inconsistent due to the bug fix. The original issue was just to add tests, so if you were only adding tests, I think it could be merged. Do you want to continue development with the current scope, which includes this bug fix? Of course, I would appreciate it if you continue as is. However, I suspect you became interested in this project through Hacktoberfest, so changing this to a smaller-scoped PR might be a good idea for you.
@kanishka1804 Hi, just a reminder, since it's the last day of Hacktoberfest.
Closes #515
Summary of Changes
Feature Implementation (Tests Added/Updated):
- Added a describe('Line Ending Compatibility...') block with four end-to-end tests covering LF vs CRLF parity, mixed line endings, quoted EOL preservation, and double-trailing EOLs.
- Added property-based tests using fast-check to verify the Lexer's behavior across random EOL combinations.

Bug Fixes (Required for Tests to Pass):
- Fixed the #tokens() method that was prematurely trimming the final EOL from the buffer. This fix ensures the Lexer properly produces a RecordDelimiter token for every EOL found, allowing the parser to correctly generate empty records for double-trailing EOLs (which was failing before); see the sketch after this list.
- Updated existing tests (src/Lexer.test.ts, src/Lexer.spec.ts) that were incorrectly expecting the final EOL to be trimmed.
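To illustrate why that trimming bug mattered downstream, here is a deliberately simplified, hypothetical record assembler (not the library's actual internals) consuming the token stream described above:

```ts
// Hypothetical sketch: assemble rows from lexer tokens. The real library's
// assembler differs; this only demonstrates the trailing-EOL semantics.
type Token = { type: "Field" | "FieldDelimiter" | "RecordDelimiter"; value: string };

function assemble(tokens: Token[], header: string[]): Record<string, string>[] {
  const rows: string[][] = [];
  let current: string[] = [];
  for (const t of tokens) {
    if (t.type === "Field") current.push(t.value);
    if (t.type === "RecordDelimiter") {
      rows.push(current); // every EOL closes a record...
      current = [];
    }
  }
  if (current.length > 0) rows.push(current); // ...so a lone trailing EOL adds nothing extra
  return rows.map((r) =>
    Object.fromEntries(header.map((h, i) => [h, r[i] ?? ""] as [string, string])),
  );
}

// Pre-fix, the lexer trimmed the final EOL, so "Alice,42\n\n" lost its second
// RecordDelimiter and the trailing empty record ({ name: "", age: "" }) vanished.
// Post-fix, both EOLs produce RecordDelimiter tokens and the empty record survives.
```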
Summary by CodeRabbit
- Bug Fixes
- Tests