
Conversation

@shewitt-au (Contributor) commented Aug 3, 2025

In Progress Experimental Lexer

I have been working on a new lexer. I'm in no hurry, just hacking away at it when time permits. It's nowhere near ready, but it's now in a state where it seems to work, as far as I can tell, and I haven't looked very hard. I've no doubt there are still issues. Error handling is minimal and there's debugging code left in. It uses an external header-only lexing library (lexertl17); I've never used lexertl before, I picked it because it's the lexer Boost uses. But it's fast! Really fast! I would not normally post messy dev code like this, but I feel the speed difference justifies it.

I am under no illusions that this should be merged in its current state. A PR seems the only vehicle to share this kind of stuff, however.

ImHex changes here

Timing Tests

These tests are from the "hex.builtin.task.analyzing_data" background task.

Release build
-------------

Old lexer
 [03:40:28] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 4806ms
 [03:41:20] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 5056ms
 [03:42:11] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 4431ms
 [03:43:07] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 4742ms
 [03:44:05] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 3852ms

 Average: 4577.4

New lexer
 [03:45:47] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 2326ms
 [03:46:33] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 2041ms
 [03:47:14] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 2208ms
 [03:47:51] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 2321ms
 [03:48:42] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 1965ms

 Average: 2172.2

 Result: 4577.4/2172.2 = 2.10726452444526

Debug build
----------

Old lexer
 [03:57:40] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 17950ms
 [03:59:33] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 17307ms
 [04:00:29] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 17155ms
 [04:01:14] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 17140ms
 [04:02:08] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 21217ms

 Average: 18153.8

New lexer
 [04:04:11] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 7381ms
 [04:05:09] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 6991ms
 [04:05:56] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 6600ms
 [04:06:46] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 5551ms
 [04:07:21] [INFO]  [builtin | Analyzing data.] ***analyzing_data***: 5669ms

 Average: 6438.4

 Result: 18153.8/6438.4 = 2.81961356858847

Thoughts

There are CI build errors. Some of them are my fault; I think, but can't be sure, that some are not. I'm getting better at using Git, but I'm still crap at it.

I can't be sure that, as I make the lexer more fit for purpose, the 2x+ performance gains won't be whittled away. And although I can't be sure exactly what makes it faster, it seems obvious it's lexertl.

I've added a pre-build step to the build system. lexertl supports generating a "static lexer" (it generates source code that builds the state machine at compile time). I was planning on using this in release builds, but it's causing problems on some platforms. I've never used CMake before ImHex, so I could use some help here.
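For reference, the usual CMake pattern for a pre-build code-generation step is `add_custom_command` with an `OUTPUT` that the consuming target lists among its sources. This is only a sketch; the generator tool, target names, and paths below are all hypothetical, and lexertl's actual static-lexer generation interface may differ:

```cmake
# Hypothetical helper binary that writes the generated state-machine source.
add_executable(generate_static_lexer tools/generate_static_lexer.cpp)

set(GENERATED_LEXER ${CMAKE_CURRENT_BINARY_DIR}/static_lexer.hpp)
add_custom_command(
    OUTPUT ${GENERATED_LEXER}
    COMMAND generate_static_lexer ${GENERATED_LEXER}
    DEPENDS generate_static_lexer
    COMMENT "Generating static lexer tables")

# The consuming target (name hypothetical) depends on the generated header.
target_sources(imhex_lexer PRIVATE ${GENERATED_LEXER})
target_include_directories(imhex_lexer PRIVATE ${CMAKE_CURRENT_BINARY_DIR})
```

One common pitfall with this pattern, and a plausible cause of platform-specific failures: the generator runs on the *host* at build time, so when cross-compiling it must be built for the host (e.g. in a separate host-tools build step) rather than for the target.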

I guess I'm posting this to see if anyone is interested. Without meaning to reopen old wounds, I don't want to waste my time: I've made a few PRs that I thought deserved consideration that were rejected out of hand. That said, I'll probably complete it anyway if I'm honest. It's an interesting problem.

I was planning on rewriting the whole lex/pre-process/parse stack eventually, but I'm in no hurry.

There are some decisions I've made in the lexer that I would move to the parser; <= and >= being lexed as two separate tokens, for example (matching the old lexer). The lexer could be simpler. The line is blurry, but at times it feels like the lexer is straying into parser-land.

Part of me suspects I've made some stupid mistake. 2x+ seems too good to be true.

The Deferred Confusion Anti-Pattern

We've all written code that we later find hard to understand ourselves. Hand-written lexers/parsers excel here. I feel a more formal framework, once you get your head around it, is beneficial.

@shewitt-au closed this Aug 31, 2025