@konefah konefah commented Nov 17, 2025

This library meets part of my needs. It provides a solid foundation and already offers several useful features. However, it does not support using an InputStream to handle a continuous data stream without loading everything into memory — a feature that is essential for my use case. Without this capability, I cannot fully leverage the library in my context.

To address this limitation, I propose an update to the library that adds this missing functionality. With this modification, the matchAndReport method could take an InputStream as input.

@marianobarrios
Owner

Hi konefah,

Thanks for your comments and the pull request.

I checked it and I saw some issues regarding Unicode and case normalization, which would have to be done differently if we never had the whole string in memory. That said, I think it could be done, but it's not completely trivial.
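As a minimal illustration of the normalization concern (a sketch only, assuming canonical NFC normalization is the kind involved; the class and method names are hypothetical and not part of this library): normalizing chunks independently can give a different result from normalizing the whole string when a chunk boundary falls between a base character and its combining mark.

```java
import java.text.Normalizer;

public class ChunkedNormalizationSketch {

    /** Normalizes the whole text at once, which requires it to be fully in memory. */
    public static String normalizeWhole(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    /** Naively normalizes fixed-size chunks one by one and concatenates the results. */
    public static String normalizeChunked(String text, int chunkSize) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i += chunkSize) {
            String chunk = text.substring(i, Math.min(text.length(), i + chunkSize));
            out.append(Normalizer.normalize(chunk, Normalizer.Form.NFC));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form of U+00E9)
        String text = "cafe\u0301";
        String whole = normalizeWhole(text);        // the pair composes into one char
        String chunked = normalizeChunked(text, 4); // the boundary splits 'e' from its accent
        System.out.println(whole.length());         // 4
        System.out.println(chunked.length());       // 5
        System.out.println(whole.equals(chunked));  // false
    }
}
```

A streaming implementation would presumably have to hold back the tail of each chunk until it knows no further combining marks follow, which is part of why this is not trivial.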

That's why I would ask you about your use case if possible. Where are you reading the strings from? How long are they?

Best regards.

@konefah
Author

konefah commented Nov 18, 2025

Hi marianobarrios & thank you for your feedback!
Indeed, this is not completely trivial. My goal is to stream a potentially very large file without loading it entirely into memory, while validating its content using a DFA library.
During processing, as soon as a character does not match the compiled regex, the file should be rejected; otherwise, it is accepted at the end of the stream.
I’ve added a test (testSimpleFileInputStream) that illustrates my requirement using a simple file, log.json.
I’m aware that this example doesn’t cover all cases related to Unicode. It’s just meant to clarify my need as much as possible. Your suggestions regarding Unicode handling and normalization cases would be greatly appreciated.
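The accept/reject behaviour described above (reject at the first character that cannot match, otherwise decide at the end of the stream) can be sketched as follows. This is a hand-written DFA for the hypothetical pattern `[0-9]+(,[0-9]+)*`, not the library's compiled automaton or its API, and it works on UTF-16 code units, so it deliberately ignores the Unicode questions raised earlier.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class StreamingDfaSketch {

    private static final int DEAD = -1;

    /** Transition function of a hand-written DFA for [0-9]+(,[0-9]+)* (illustrative only). */
    static int step(int state, int ch) {
        boolean digit = ch >= '0' && ch <= '9';
        switch (state) {
            case 0:  return digit ? 1 : DEAD;                     // expecting the first digit of a number
            case 1:  return digit ? 1 : (ch == ',' ? 0 : DEAD);   // inside a number
            default: return DEAD;
        }
    }

    static boolean isAccepting(int state) {
        return state == 1;
    }

    /** Consumes the reader one char at a time; stops reading as soon as no match is possible. */
    public static boolean matches(Reader input) {
        try {
            int state = 0;
            int ch;
            while ((ch = input.read()) != -1) {
                state = step(state, ch);
                if (state == DEAD) {
                    return false; // early rejection: the rest of the stream is never read
                }
            }
            return isAccepting(state); // end of stream: accept only in an accepting state
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(matches(new StringReader("12,345"))); // true
        System.out.println(matches(new StringReader("12,,3")));  // false
    }
}
```

Only the current state is kept between reads, so memory use does not depend on the size of the input.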

@marianobarrios
Owner

Thanks for the explanation.

Again, I am trying to understand the use case (to see if the extra complexity is justified). In your case, I see that you could split the file on new-lines, as each JSON object is on its own line. Is it possible to rely on that?

Additionally, would you mind sharing the regular expression that you are using?

@konefah
Author

konefah commented Nov 18, 2025

Thanks for your feedback!
No, we cannot rely on that. It could be any type of file, not necessarily JSON. We are not required to split the file using new lines.
I’ve added a test at the very bottom of the MatchTest class, named testSimpleFileInputStream. You’ll find the regular expression there: 5cf5f7a

@marianobarrios
Owner

But what about the new-lines? Additionally, parsing JSON using regular expressions is not really possible (JSON is not a regular language). It only works in your case because you are asking for a specific JSON.

Sorry, but I really need to understand the actual use case: in which context will this program run, and where do these requirements come from?

@konefah
Author

konefah commented Nov 20, 2025

Hi @marianobarrios ,
I’m afraid I can’t say much more about our use case because the subject is somewhat sensitive. We do not have control over the input string: it may span one or several lines, inside a small or large file. The choice of JSON format is arbitrary — it just as easily could have been a TXT, CSV, or another type of file.

@marianobarrios
Owner

OK, but you cannot parse an arbitrary JSON with a regular expression...
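To make the point concrete, here is a simplified sketch (hypothetical names, unrelated to this library's code): checking that brackets nest correctly needs a depth counter that can grow without bound, whereas a DFA has a fixed, finite set of states. The sketch only tracks depth; it ignores bracket types and string literals, which real JSON parsing must also handle.

```java
public class NestingDepthSketch {

    /** Returns true if every closing bracket has an earlier opening one and none is left open. */
    public static boolean isBalanced(CharSequence text) {
        long depth = 0; // unbounded in principle: this is what a finite automaton cannot track
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                depth--;
                if (depth < 0) {
                    return false; // a closing bracket with no matching opening bracket
                }
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("{[{}]}")); // true
        System.out.println(isBalanced("{{}"));    // false
    }
}
```

A regular expression can still validate one fixed JSON shape with a bounded nesting depth, which matches the observation that it works for a specific JSON.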

@konefah
Author

konefah commented Nov 20, 2025

Thank you for your feedback! I will replace the JSON file with the TXT file for the test. Your suggestions regarding Unicode handling and normalization cases would be greatly appreciated.

@marianobarrios
Owner

What I am asking for is a real-world use case in which this is useful. I cannot imagine any. It's important to have a use case that justifies the added complexity.

@konefah
Author

konefah commented Nov 28, 2025

Hi @marianobarrios,
I will attempt to provide a few points to address your request, hoping this will allow us to move ahead.

Let me start by comparing the situation before and after my modification:

Before: The matches method could only accept a parameter of type CharSequence.
After: My modification allows the same method to also accept an InputStream as a parameter.

The matches method is used to validate text.
To better understand:
CharSequence is an in-memory representation of characters (text). It corresponds to a sequence of characters already loaded in memory.
InputStream, on the other hand, is a low-level byte stream used to read raw data from files, network sockets, or resources.
It represents a stream of bytes coming from a source (not yet in memory).

Analogy:
InputStream = water pipe bringing raw water (bytes)
CharSequence = water already filled in a glass (characters ready to use)

With this modification, it becomes possible to validate any data source and any data size without loading it into memory.
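One detail worth spelling out: because an InputStream yields bytes, a charset has to be chosen before any character can be matched, and even then Java chars are UTF-16 code units rather than whole code points. The following is a minimal sketch (hypothetical class name, UTF-8 assumed, not the code in this pull request) that decodes a stream incrementally and counts code points without ever holding the full text in memory.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class CodePointStreamSketch {

    /** Decodes the stream as UTF-8 on the fly and counts Unicode code points. */
    public static long countCodePoints(InputStream in) {
        try (Reader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            long count = 0;
            int unit;
            while ((unit = reader.read()) != -1) {
                // A low surrogate is the second half of a pair whose first half
                // was already counted, so it does not start a new code point.
                if (!Character.isLowSurrogate((char) unit)) {
                    count++;
                }
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (encoded as a surrogate pair in UTF-16), 'b'
        String text = "a\uD83D\uDE00b";
        InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        System.out.println(text.length());       // 4 UTF-16 code units
        System.out.println(countCodePoints(in)); // 3 code points
    }
}
```

A matcher fed from a stream would need the same kind of care at surrogate-pair boundaries, in addition to the normalization and case-folding issues mentioned earlier in the thread.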

Here are some examples:
Network Stream Validation: reading data from a TCP socket or WebSocket.
Validation examples:
Ensure the stream follows a protocol (e.g., valid JSON).
Detect injection attempts.

Embedded Resource Validation: loading a configuration file from the classpath or a JAR.
Validation examples:
Validate syntax (e.g., YAML, JSON).
Check for mandatory keys.

Data Pipeline Validation: ETL or streaming data processing.
Validation examples:
Validate structure before ingestion.
Check data quality (e.g., missing values, wrong formats).


2 participants