Add match input stream #214

konefah · 2025-11-17T09:06:36Z

This library meets part of my needs. It provides a solid foundation and already offers several useful features. However, it does not support using an InputStream to handle a continuous data stream without loading everything into memory — a feature that is essential for my use case. Without this capability, I cannot fully leverage the library in my context.

To address this limitation, I propose an update to the library that adds this missing functionality. With this modification, the matchAndReport method could take an InputStream as input.

marianobarrios · 2025-11-17T22:52:52Z

Hi konefah,

Thanks for your comments and the pull request.

I checked it and I saw some issues regarding Unicode and case normalization, which would have to be done differently if we never had the whole string in memory. That said, I think it could be done, but it's not completely trivial.
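
To illustrate part of the Unicode concern, here is a minimal sketch that is not taken from the pull request. It assumes the input is UTF-8 and that matching should operate on whole code points rather than on bytes or single `char` values. The class and method names (`CodePointStreamSketch`, `nextCodePoint`, `decodeUtf8`) are hypothetical. Normalization is harder still, because combining sequences span several code points and need lookahead, and this sketch does not attempt it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the library or of this pull request.
final class CodePointStreamSketch {

    // Returns the next Unicode code point from the reader, or -1 at end of stream.
    static int nextCodePoint(Reader in) throws IOException {
        int first = in.read();
        if (first == -1 || !Character.isHighSurrogate((char) first)) {
            return first; // end of stream, or a code point that fits in one char
        }
        // A high surrogate must be followed by a low surrogate to form one code point.
        int second = in.read();
        if (second == -1 || !Character.isLowSurrogate((char) second)) {
            throw new IOException("unpaired high surrogate in input");
        }
        return Character.toCodePoint((char) first, (char) second);
    }

    // Convenience wrapper for demonstration: decodes UTF-8 bytes into code points.
    static List<Integer> decodeUtf8(byte[] bytes) {
        List<Integer> codePoints = new ArrayList<>();
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            int cp;
            while ((cp = nextCodePoint(in)) != -1) {
                codePoints.add(cp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return codePoints;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which needs two chars (a surrogate pair) in Java.
        byte[] bytes = "a\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8(bytes)); // prints [97, 128512]
    }
}
```

The point of the sketch is that even the simplest streaming reader has to carry state across reads (here, a pending high surrogate), which is the kind of extra complexity a streaming matcher would need to handle.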

That's why I would ask you about your use case if possible. Where are you reading the strings from? How long are they?

Best regards.

konefah · 2025-11-18T12:22:35Z

Hi marianobarrios & thank you for your feedback!
Indeed, this is not completely trivial. My goal is to stream a potentially very large file without loading it entirely into memory, while validating its content using a DFA library.
During processing, as soon as a character does not match the compiled regex, the file should be rejected; otherwise, it is accepted at the end of the stream.
I’ve added a test (testSimpleFileInputStream) that illustrates my requirement using a simple file log.json.
I’m aware that this example doesn’t cover all cases related to Unicode. It’s just meant to clarify my need as much as possible. Your suggestions regarding Unicode handling and normalization cases would be greatly appreciated.
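
The reject-early behaviour described above can be sketched with a small hand-written automaton. This is only an illustration under stated assumptions: it does not use the library's API, the regex `[0-9]+(,[0-9]+)*` is an arbitrary example rather than the one in the test, and the names (`StreamingDfaSketch`, `step`, `accepts`, `acceptsString`) are hypothetical. It also reads one `char` at a time, so it ignores the Unicode questions raised earlier.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch: a hand-written DFA for [0-9]+(,[0-9]+)* run over a Reader.
final class StreamingDfaSketch {
    private static final int NEED_DIGIT = 0; // start state, or just after a comma
    private static final int IN_NUMBER = 1;  // inside a number (the only accepting state)
    private static final int DEAD = -1;      // no continuation can lead to a match

    // Transition function: next state for the given state and input char.
    static int step(int state, int ch) {
        boolean digit = ch >= '0' && ch <= '9';
        switch (state) {
            case NEED_DIGIT:
                return digit ? IN_NUMBER : DEAD;
            case IN_NUMBER:
                if (digit) return IN_NUMBER;
                return ch == ',' ? NEED_DIGIT : DEAD;
            default:
                return DEAD;
        }
    }

    // Consumes the reader char by char; stops reading as soon as the DFA is dead.
    static boolean accepts(Reader in) throws IOException {
        int state = NEED_DIGIT;
        int ch;
        while ((ch = in.read()) != -1) {
            state = step(state, ch);
            if (state == DEAD) {
                return false; // reject early, without reading the rest of the stream
            }
        }
        return state == IN_NUMBER; // accept only if the stream ends in an accepting state
    }

    // Convenience wrapper for demonstration with in-memory strings.
    static boolean acceptsString(String s) {
        try (Reader in = new StringReader(s)) {
            return accepts(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(acceptsString("12,345,6")); // prints true
        System.out.println(acceptsString("12,x45"));   // prints false
    }
}
```

In a real use, the `Reader` would typically wrap the `InputStream` with an explicit charset (for example via `InputStreamReader`), so memory use stays constant regardless of file size.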

marianobarrios · 2025-11-18T12:47:08Z

Thanks for the explanation.

Again, I'm trying to understand the use case (to see if the extra complexity is justified). In your case, I see that you could split the file on newlines, as each JSON object is on its own line. Is it possible to rely on that?

Additionally, would you mind sharing the regular expression that you are using?

konefah · 2025-11-18T13:23:27Z

Thanks for your feedback!
No, we cannot rely on that. It could be any type of file, not necessarily JSON. We are not required to split the file using new lines.
I’ve added a test at the very bottom of the MatchTest class, named testSimpleFileInputStream. You’ll find the regular expression there (commit 5cf5f7a).

marianobarrios · 2025-11-18T13:46:04Z

But what about the newlines? Additionally, parsing JSON using a regular expression is not really possible (JSON is not a regular language). It only works in your case because you are asking for a specific JSON.

Sorry, but I really need to understand the actual use case: in which context will this program run, and where do these requirements come from?
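
As a rough illustration of why arbitrary JSON is out of reach for a regular expression: JSON arrays can nest to any depth, and checking that brackets balance needs an unbounded counter (or a stack), which a finite automaton does not have. The sketch below is an assumption-laden toy, not code from the library; it only looks at `[` and `]` and ignores everything else JSON requires, and the names `NestingDepthSketch` and `balanced` are hypothetical.

```java
// Hypothetical toy: checks bracket balance with a counter, which a DFA cannot emulate
// for unbounded nesting depth because it has only finitely many states.
final class NestingDepthSketch {

    static boolean balanced(String s) {
        int depth = 0; // current nesting depth; can grow without bound
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '[') {
                depth++;
            } else if (c == ']') {
                depth--;
                if (depth < 0) {
                    return false; // closing bracket with no matching opener
                }
            }
        }
        return depth == 0; // every opener must have been closed
    }

    public static void main(String[] args) {
        System.out.println(balanced("[[[]]]")); // prints true
        System.out.println(balanced("[[]"));    // prints false
    }
}
```

A regex can still validate one fixed JSON shape with bounded nesting, which is why the example in the test works.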

konefah · 2025-11-20T10:33:54Z

Hi @marianobarrios,
I’m afraid I can’t say much more about our use case because the subject is somewhat sensitive. We do not have control over the input string: it may span one or several lines, inside a small or large file. The choice of JSON format is arbitrary — it just as easily could have been a TXT, CSV, or another type of file.

marianobarrios · 2025-11-20T11:18:11Z

OK, but you cannot parse arbitrary JSON with a regular expression...

konefah · 2025-11-20T16:36:24Z

Thank you for your feedback! I will replace the JSON file with the TXT file for the test. Your suggestions regarding Unicode handling and normalization cases would be greatly appreciated.

marianobarrios · 2025-11-21T09:08:30Z

What I am asking for is some real-world use case in which this is useful. I cannot imagine any. It's important to have a use case that justifies the added complexity.

konefah · 2025-11-28T15:32:07Z

Hi @marianobarrios,
I will attempt to provide a few points to address your request, hoping this will allow us to move ahead.

Let me start by comparing the situation before and after my modification:

Before: The matches method could only accept a parameter of type CharSequence.
After: My modification allows the same method to also accept an InputStream as a parameter.

The matches method is used to validate text.
To better understand:
CharSequence is an in-memory representation of characters (text). It corresponds to a sequence of characters already loaded in memory.
InputStream, on the other hand, is a low-level byte stream used to read raw data from files, network sockets, or resources. It represents a stream of bytes coming from a source (not yet in memory).

Analogy:
InputStream = water pipe bringing raw water (bytes)
CharSequence = water already filled in a glass (characters ready to use)

With this modification, it becomes possible to validate any data source and any data size without loading it into memory.

Here are some examples:

Network Stream Validation: reading data from a TCP socket or WebSocket.
Validation examples:
Ensure the stream follows a protocol (e.g., valid JSON).
Detect injection attempts.

Embedded Resource Validation: loading a configuration file from the classpath or a JAR.
Validation examples:
Validate syntax (e.g., YAML, JSON).
Check for mandatory keys.

Data Pipeline Validation: ETL or streaming data processing.
Validation examples:
Validate structure before ingestion.
Check data quality (e.g., missing values, wrong formats).

Nixos NIXOS added 2 commits November 16, 2025 16:03

add implementation for the match-and-report input stream

521450e

update readme file

4f1a3e4

add a test Simple FileInputStream

5cf5f7a

use log.txt file instead

b34d931
