Algorithm is accidentally quadratic #64

@yurikhan

Description

Hello there!

So today I tried to push a 2.5 megabyte file to my team’s Git server and it said “the pre-receive hook timed out after 120 seconds”.

I asked the CI priests (don’t laugh, it’s their official job description) what the 🔥 it could be doing, and they said “the only thing we do on JSON files is this” and supplied a small script involving the Python version of json_minify.

I ran it on my file. It finished in 76 minutes 12 seconds.

So I looked at the algorithm and found the cause.

  • On the top level, it has a loop going through “interesting” places in the input, where “interesting” is defined as “a potential string delimiter, comment delimiter, or a newline”. It also maintains several flags reflecting the syntax at the current position.
  • If the current syntax is “inside string” and it finds a double quote, it might be a closing delimiter. So it checks if there is an even number (incl. zero) of backslashes preceding the double quote. If so, it is indeed a closing string delimiter.
  • The way it checks for an even number of backslashes is… curious. It searches for the regular expression \\*$ in the whole substring from position 0 to the current position. Because the pattern is anchored at the end, this always succeeds and matches the longest trailing run of backslashes; but the engine tries every possible starting position from 0 up to the current position, unless it is smart enough to notice the end anchor and match backwards. (Spoiler warning: most of them are not.)
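To make the list above concrete, here is a sketch of the check as described; the regex and the function name are illustrative, not lifted verbatim from the library:

```python
import re

# Assumed to mirror the library's check: longest run of backslashes
# anchored at the end of the inspected prefix.
end_slashes_re = re.compile(r'(\\)*$')

def quote_is_escaped(string, pos):
    """True if the quote at `pos` is preceded by an odd number of
    backslashes. The search is bounded to [0, pos), but a naive
    engine retries every start offset in that range before the
    end-anchored match succeeds."""
    m = end_slashes_re.search(string, 0, pos)
    return m is not None and len(m.group()) % 2 == 1

# Every quote in the input triggers a fresh scan of the whole prefix,
# so q quotes over n characters cost on the order of n * q, i.e.
# quadratic for quote-heavy input.
```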

As a proof of concept, I made this change:

         if val == '"' and not (in_multi or in_single):
-            escaped = end_slashes_re.search(string, 0, match.start())
+            pos = match.start()
+            while pos > 0 and string[pos - 1] == '\\':
+                pos -= 1

             # start of string or unescaped quote character to end string
-            if not in_string or (escaped is None or len(escaped.group()) % 2 == 0):  # noqa
+            if not in_string or ((match.start() - pos) % 2 == 0):  # noqa
                 in_string = not in_string
             index -= 1  # include " character in next catch

and it passed all existing tests and processed my 2.5M file in 0.697 seconds. It is definitely a lot less pythonic than the original way, but it sure does the job.
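Lifting the patched branch into a standalone helper makes it easy to cross-check against the regex version (both function names here are illustrative):

```python
import re

end_slashes_re = re.compile(r'(\\)*$')  # the original check

def unescaped_regex(string, pos):
    """Original logic: even (incl. zero) trailing backslashes
    means the quote at `pos` is a real delimiter."""
    m = end_slashes_re.search(string, 0, pos)
    return m is None or len(m.group()) % 2 == 0

def unescaped_scan(string, pos):
    """Patched logic: walk backwards over the run of backslashes
    immediately preceding `pos`; cost is bounded by the run length."""
    start = pos
    while start > 0 and string[start - 1] == '\\':
        start -= 1
    return (pos - start) % 2 == 0

# Cross-check: vary the length of the backslash run before a quote.
for k in range(6):
    s = 'xxx' + '\\' * k + '"'
    pos = len(s) - 1
    assert unescaped_regex(s, pos) == unescaped_scan(s, pos) == (k % 2 == 0)
```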

You might want to review all ports where this check is done via the same regexp, and verify which of them exhibit quadratic complexity.

There is nothing peculiar about my JSON file; just generate one that has a moderately large number of double quote characters.
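A quick way to generate such a file, assuming nothing beyond the standard library (the sizes are arbitrary):

```python
import json

def quote_heavy_json(pairs=50_000):
    """Many short string keys and values means many double quotes,
    each of which makes the \\*$ search rescan the growing prefix."""
    return json.dumps({f"k{i}": f"v{i}" for i in range(pairs)})

sample = quote_heavy_json(1_000)  # 4 quotes per key/value pair
```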
