
Conversation

@shewitt-au (Contributor) commented Aug 3, 2025

In Progress Experimental Lexer

I have been working on a new lexer. I'm in no hurry, just hacking away at it when time permits. It's nowhere near ready, but it's now in a state where it seems to work, as far as I can tell, and I haven't looked very hard. I've no doubt there are still issues. Error handling is minimal and there's debugging code left in. It uses an external header-only lexing library (lexertl17); I've never used lexertl before, I picked it because it's the lexer Boost uses. But it's fast! Really fast! I would not normally post messy dev code like this, but I feel the speed difference justifies it.

I am under no illusions that this should be merged in its current state. A PR seems the only vehicle to share this kind of stuff, however.

ImHex changes here

Timing Tests

These tests are from the "hex.builtin.task.analyzing_data" background task.

Release build
-------------

Old lexer
 [03:40:28] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 4806ms
 [03:41:20] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 5056ms
 [03:42:11] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 4431ms
 [03:43:07] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 4742ms
 [03:44:05] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 3852ms

 Average: 4577.4

New lexer
 [03:45:47] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 2326ms
 [03:46:33] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 2041ms
 [03:47:14] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 2208ms
 [03:47:51] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 2321ms
 [03:48:42] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 1965ms

 Average: 2172.2

 Result: 4577.4/2172.2 = 2.10726452444526

Debug build
----------

Old lexer
 [03:57:40] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 17950ms
 [03:59:33] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 17307ms
 [04:00:29] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 17155ms
 [04:01:14] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 17140ms
 [04:02:08] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 21217ms

 Average: 18153.8

New lexer
 [04:04:11] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 7381ms
 [04:05:09] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 6991ms
 [04:05:56] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 6600ms
 [04:06:46] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 5551ms
 [04:07:21] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 5669ms

 Average: 6438.4

 Result: 18153.8/6438.4 = 2.81961356858847

Thoughts

There are CI build errors. Some of them are my fault; I think, but can't be sure, that some are not. I'm getting better at using Git, but I'm still crap at it.

I can't be sure that, as I make the lexer more fit for purpose, the 2x+ performance gains won't be whittled away. And although I can't be sure exactly what makes it faster, it seems obvious it's lexertl.

I've added a pre-build step to the build system. lexertl supports generating a "static lexer" (it generates source code that builds the state machine at compile time). I was planning on using this in release builds, but it's causing problems on some platforms. I've never used CMake before ImHex, so I could use some help here.
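For reference, the usual CMake pattern for a pre-build code-generation step is `add_custom_command` with an `OUTPUT` that the consuming target lists among its sources. This is only a sketch; the generator tool, target names, and paths below are all hypothetical, and lexertl's actual static-lexer generation interface may differ:

```cmake
# Hypothetical helper binary that writes the generated state-machine source.
add_executable(generate_static_lexer tools/generate_static_lexer.cpp)

set(GENERATED_LEXER ${CMAKE_CURRENT_BINARY_DIR}/static_lexer.hpp)
add_custom_command(
    OUTPUT ${GENERATED_LEXER}
    COMMAND generate_static_lexer ${GENERATED_LEXER}
    DEPENDS generate_static_lexer
    COMMENT "Generating static lexer tables")

# The consuming target (name hypothetical) depends on the generated header.
target_sources(imhex_lexer PRIVATE ${GENERATED_LEXER})
target_include_directories(imhex_lexer PRIVATE ${CMAKE_CURRENT_BINARY_DIR})
```

One common pitfall with this pattern, and a plausible cause of platform-specific failures: the generator runs on the *host* at build time, so when cross-compiling it must be built for the host (e.g. in a separate host-tools build step) rather than for the target.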

I guess I'm posting this to see if anyone is interested. Without meaning to reopen old wounds, I don't want to waste my time: I've made a few PRs that I thought deserved consideration that were rejected out of hand. That said, I'll probably complete it anyway if I'm honest. It's an interesting problem.

I was planning on rewriting the whole lex/pre-process/parse stack eventually, but I'm in no hurry.

There are some decisions I've made in the lexer that I would move to the parser; <= and >= being lexed as two separate tokens, for example (matching the old lexer). The lexer could be simpler. The line is blurry, but at times it feels like the lexer is straying into parser-land.

Part of me suspects I've made some stupid mistake. 2x+ seems too good to be true.

The Deferred Confusion Anti-Pattern

We've all written code that we later find hard to understand ourselves. Hand-written lexers/parsers excel here. I feel a more formal framework, once you get your head around it, is beneficial.

@shewitt-au closed this Aug 31, 2025