Algorithm is accidentally quadratic #64

@yurikhan

Description

Hello there!

So today I tried to push a 2.5 megabyte file to my team’s Git server and it said “the pre-receive hook timed out after 120 seconds”.

I asked the CI priests (don’t laugh, it’s their official job description) what the 🔥 it could be doing, and they said “the only thing we do on JSON files is this” and supplied a small script involving the Python version of json_minify.

I ran it on my file. It finished in 76 minutes 12 seconds.

So I looked at the algorithm and found the cause.

  • On the top level, it has a loop going through “interesting” places in the input, where “interesting” is defined as “a potential string delimiter, comment delimiter, or a newline”. It also maintains several flags reflecting the syntax at the current position.
  • If the current syntax is “inside string” and it finds a double quote, it might be a closing delimiter. So it checks if there is an even number (incl. zero) of backslashes preceding the double quote. If so, it is indeed a closing string delimiter.
  • The way it checks for an even number of backslashes is… curious. It searches for the regular expression \\*$ in the whole substring from position 0 to the current position. Because the pattern is anchored at the end, this always succeeds and matches the longest trailing run of backslashes; but the engine tries every possible starting position from 0 up to the current position, unless it is smart enough to notice the end anchor and match backwards. (Spoiler warning: most of them are not.)
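To make the list above concrete, here is a sketch of the check as described; the regex and the function name are illustrative, not lifted verbatim from the library:

```python
import re

# Assumed to mirror the library's check: longest run of backslashes
# anchored at the end of the inspected prefix.
end_slashes_re = re.compile(r'(\\)*$')

def quote_is_escaped(string, pos):
    """True if the quote at `pos` is preceded by an odd number of
    backslashes. The search is bounded to [0, pos), but a naive
    engine retries every start offset in that range before the
    end-anchored match succeeds."""
    m = end_slashes_re.search(string, 0, pos)
    return m is not None and len(m.group()) % 2 == 1

# Every quote in the input triggers a fresh scan of the whole prefix,
# so q quotes over n characters cost on the order of n * q, i.e.
# quadratic for quote-heavy input.
```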

As a proof of concept, I made this change:

         if val == '"' and not (in_multi or in_single):
-            escaped = end_slashes_re.search(string, 0, match.start())
+            pos = match.start()
+            while pos > 0 and string[pos - 1] == '\\':
+                pos -= 1

             # start of string or unescaped quote character to end string
-            if not in_string or (escaped is None or len(escaped.group()) % 2 == 0):  # noqa
+            if not in_string or ((match.start() - pos) % 2 == 0):  # noqa
                 in_string = not in_string
             index -= 1  # include " character in next catch

and it passed all existing tests and processed my 2.5M file in 0.697 seconds. It is definitely a lot less pythonic than the original way, but it sure does the job.
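Lifting the patched branch into a standalone helper makes it easy to cross-check against the regex version (both function names here are illustrative):

```python
import re

end_slashes_re = re.compile(r'(\\)*$')  # the original check

def unescaped_regex(string, pos):
    """Original logic: even (incl. zero) trailing backslashes
    means the quote at `pos` is a real delimiter."""
    m = end_slashes_re.search(string, 0, pos)
    return m is None or len(m.group()) % 2 == 0

def unescaped_scan(string, pos):
    """Patched logic: walk backwards over the run of backslashes
    immediately preceding `pos`; cost is bounded by the run length."""
    start = pos
    while start > 0 and string[start - 1] == '\\':
        start -= 1
    return (pos - start) % 2 == 0

# Cross-check: vary the length of the backslash run before a quote.
for k in range(6):
    s = 'xxx' + '\\' * k + '"'
    pos = len(s) - 1
    assert unescaped_regex(s, pos) == unescaped_scan(s, pos) == (k % 2 == 0)
```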

You might want to review all ports where this check is done via the same regexp, and verify which of them exhibit quadratic complexity.

There is nothing peculiar about my JSON file; just generate one that has a moderately large number of double quote characters.
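A quick way to generate such a file, assuming nothing beyond the standard library (the sizes are arbitrary):

```python
import json

def quote_heavy_json(pairs=50_000):
    """Many short string keys and values means many double quotes,
    each of which makes the \\*$ search rescan the growing prefix."""
    return json.dumps({f"k{i}": f"v{i}" for i in range(pairs)})

sample = quote_heavy_json(1_000)  # 4 quotes per key/value pair
```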
