Hello there!
So today I tried to push a 2.5 megabyte file to my team’s Git server and it said “the pre-receive hook timed out after 120 seconds”.
I asked the CI priests (don’t laugh, it’s their official job description) what the 🔥 it could be doing, and they said “the only thing we do on JSON files is this” and supplied a small script involving the Python version of `json_minify`.
I ran it on my file. It finished in 76 minutes 12 seconds.
So I looked at the algorithm and found the cause.
- On the top level, it has a loop going through “interesting” places in the input, where “interesting” is defined as “a potential string delimiter, comment delimiter, or a newline”. It also maintains several flags reflecting the syntax at the current position.
- If the current syntax is “inside string” and it finds a double quote, it might be a closing delimiter. So it checks if there is an even number (incl. zero) of backslashes preceding the double quote. If so, it is indeed a closing string delimiter.
- The way it checks for an even number of backslashes is… curious. It searches for the regular expression `\\*$` in the whole substring from position 0 to the current position. This always matches just before the end (possibly as an empty span) and captures the longest run of backslashes there, but the engine tries every possible starting position from 0 to the current one, unless it is smart enough to notice the anchor at the end and match backwards. (Spoiler warning: most of them are not.) A small timing sketch of this behaviour follows the list.
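To see the cost in isolation, here is a minimal timing sketch of my own; it is not the minifier’s code, it only recompiles the same pattern and compares the regex check against a plain backward scan at every quote position:

```python
import re
import time

end_slashes_re = re.compile(r'\\*$')  # the same pattern the minifier compiles

def check_regex(s, pos):
    # Original approach: anchored-at-end search over s[:pos].  The engine
    # retries the pattern from every starting offset, so one call is O(pos).
    m = end_slashes_re.search(s, 0, pos)
    return len(m.group())

def check_scan(s, pos):
    # Plain backward scan: walk left over the backslash run, so one call
    # is O(length of the run) -- zero here, since the input has none.
    start = pos
    while start > 0 and s[start - 1] == '\\':
        start -= 1
    return pos - start

# Ordinary JSON-ish text with many quotes and no backslashes at all.
# Bump the multiplier and watch the regex version grow quadratically.
s = '"key": "value", ' * 1_000

for check in (check_regex, check_scan):
    t0 = time.perf_counter()
    for m in re.finditer('"', s):
        check(s, m.start())
    print(check.__name__, f'{time.perf_counter() - t0:.3f}s')
```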
As a proof of concept, I made this change:
if val == '"' and not (in_multi or in_single):
- escaped = end_slashes_re.search(string, 0, match.start())
+ pos = match.start()
+ while pos > 0 and string[pos - 1] == '\\':
+ pos -= 1
# start of string or unescaped quote character to end string
- if not in_string or (escaped is None or len(escaped.group()) % 2 == 0): # noqa
+ if not in_string or ((match.start() - pos) % 2 == 0): # noqa
in_string = not in_string
index -= 1 # include " character in next catch
It passed all existing tests and processed my 2.5 MB file in 0.697 seconds. It is definitely a lot less Pythonic than the original way, but it sure does the job.
You might want to review all ports where this check is done via the same regexp, and verify which of them exhibit quadratic complexity.
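For anyone porting the fix, a brute-force comparison of the two predicates on random input (just a sketch of mine, not part of any test suite) can catch a mistranslation quickly:

```python
import random
import re

end_slashes_re = re.compile(r'\\*$')

def escaped_regex(s, pos):
    # Original predicate: an odd number of backslashes right before pos.
    m = end_slashes_re.search(s, 0, pos)
    return m is not None and len(m.group()) % 2 == 1

def escaped_scan(s, pos):
    # Replacement predicate from the patch above.
    start = pos
    while start > 0 and s[start - 1] == '\\':
        start -= 1
    return (pos - start) % 2 == 1

random.seed(0)
for _ in range(10_000):
    # Newlines are deliberately left out of the alphabet: '$' also matches
    # before a trailing newline, which would muddy the comparison.
    s = ''.join(random.choice('a\\"') for _ in range(random.randrange(1, 30)))
    for pos in range(len(s) + 1):
        assert escaped_regex(s, pos) == escaped_scan(s, pos)
print('both predicates agree')
```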
There is nothing peculiar about my JSON file; just generate one that has a moderately large number of double quote characters.
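For example, something along these lines (file name and sizes are arbitrary) produces an unremarkable JSON document big enough to show the slowdown:

```python
import json

# Roughly 2-3 MB of perfectly ordinary JSON: many short keys and values,
# hence many double quote characters for the minifier to re-check.
doc = {f"key_{i}": f"value_{i}" for i in range(100_000)}
with open("big.json", "w") as f:
    json.dump(doc, f, indent=2)
```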