
Debugging Catastrophic Backtracking for Regular Expressions in Python

2023-09-07 12:22:12

Not too long ago, I was debugging a Python application that had become stuck while processing certain inputs. The process was using 100% CPU time but not making progress. To figure out where the application was getting stuck, I turned to a useful profiling tool called py-spy.

py-spy is a sampling profiler for Python that lets you see where your code is spending time without modifying it. It can be used to get information from running Python processes, which is very useful! I used the py-spy dump command to attach to the stuck Python process and print out a snapshot of the current call stack across all threads.

You can use py-spy on a specific process ID by running the dump command against that PID, for example:
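py-spy dump --pid 12345

(Here 12345 is a placeholder; pass the PID of the stuck process. The dump command attaches to that process and prints the current call stack for each thread.)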

In my case, the result looked like this:

  %Own   %Total  OwnTime  TotalTime  Function (filename)
100.00% 100.00%   18.00s    18.00s   search (re.py)
  0.00% 100.00%   0.000s    18.00s   process_item (pipelines/summarization.py)
  0.00% 100.00%   0.000s    18.00s   execute (scrapy/cmdline.py)
  0.00% 100.00%   0.000s    18.00s   <module> (pex)
  0.00% 100.00%   0.000s    18.00s   _run_command (scrapy/cmdline.py)
  0.00% 100.00%   0.000s    18.00s   run_module (runpy.py)
  0.00% 100.00%   0.000s    18.00s   <module> (scrapy/__main__.py)
  0.00% 100.00%   0.000s    18.00s   _run_code (runpy.py)
  0.00% 100.00%   0.000s    18.00s   run (scrapy/commands/crawl.py)
  0.00% 100.00%   0.000s    18.00s   get_quotes_summary (ml/summarization.py)
  0.00% 100.00%   0.000s    18.00s   _run_module_code (runpy.py)
  0.00% 100.00%   0.000s    18.00s   run (twisted/web/asyncioreactor.py)
  0.00% 100.00%   0.000s    18.00s   start (scrapy/crawler.py)
  0.00% 100.00%   0.000s    18.00s   _run_print_help (scrapy/cmdline.py)

This application is a Scrapy spider, which explains why Scrapy and Twisted show up in the stack trace.

Here’s what the columns mean:

  • %Own: Percentage of time currently spent in the function itself.
  • %Total: Percentage of time currently spent in the function and all of the functions it called (children).
  • OwnTime: Total amount of CPU time that was spent in the function itself, excluding any functions it called.
  • TotalTime: Total amount of CPU time that was spent in the function and all its children.
  • Function (filename): The name of the function and the file where it is defined.

From the output above, we can see that the re.search function from Python’s built-in re module takes up 100% of the process time, and that the execution time is spent in this function directly rather than in any children that it calls.

Now that I knew the culprit was a regex and where it was being called, I looked at the code:

pattern_str = r'<summary>((\n*.*\n*)*)</summary>'
regex_match = re.search(pattern_str, content.strip())

This regex parses output that is returned from a large language model, and tries to match text inside <summary> tags as output by the LLM. It also explicitly matches newlines before and after each block of text. It defines capture groups for the entire content of the summary tags, as well as for each newline-delimited match group.
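To make that concrete, here is a small, hypothetical example of well-formed LLM output (not taken from the real pipeline) and the text the first capture group extracts:

import re

# Hypothetical, well-formed LLM output, assumed purely for illustration
content = "<summary>\nThe stock rose 5% today.\nAnalysts expect further gains.\n</summary>"

pattern_str = r'<summary>((\n*.*\n*)*)</summary>'
regex_match = re.search(pattern_str, content.strip())

# Group 1 holds everything between the tags, including the surrounding newlines
print(repr(regex_match.group(1)))
# '\nThe stock rose 5% today.\nAnalysts expect further gains.\n'

On a small, well-formed input like this, the pattern works and returns quickly.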

So what’s the problem here?

By default, the quantifiers * (zero or more), + (one or more), and ? (zero or one) match greedily. If we look at the Python documentation, we see:

*?, +?, ??

The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<a> b <c>', it will match the entire string, and not just '<a>'. Adding ? after the quantifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only '<a>'.
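To see the difference concretely, the documentation’s own example can be run directly:

import re

text = "<a> b <c>"

# Greedy: .* grabs as much as possible, so the match spans the whole string
print(re.search(r"<.*>", text).group())   # <a> b <c>

# Non-greedy: .*? stops at the first closing '>', matching only the first tag
print(re.search(r"<.*?>", text).group())  # <a>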

The problem is the expression ((\n*.*\n*)*): by repeating a newline-sandwiched pattern greedily, it leads to exponential runtime on strings with many newlines, because the engine tries every possible way to distribute the newlines across repetitions.

This is known as catastrophic backtracking. The regex engine initially matches as far as it can with a greedy subpattern, then backtracks, trying every possible matching combination before failing.

Fixing with a Non-Greedy Pattern

What we really want here is simply to match all of the text inside <summary> tags in a non-greedy fashion; if there are multiple <summary> tags, their contents should match separately.

To avoid catastrophic backtracking, the key is to make the repeating subpattern non-greedy by adding a ? after the quantifier, as shown above in the documentation. Additionally, since we want to match newlines inside the tags, we add the re.DOTALL flag so that the . character matches newlines as well.

pattern_str = r"<abstract>(.*?)</abstract>"
regex_match = re.search(pattern_str, content material.strip(), flags=re.DOTALL)

Now, we get the smallest possible sequence of characters between <summary> tags. This avoids getting stuck trying to match newlines exhaustively, and it is also much easier to read!
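As a quick sanity check, here is a hypothetical input with two <summary> blocks (again, not real pipeline output) showing that the non-greedy pattern captures each block separately:

import re

sample = "<summary>\nFirst summary\n</summary>\nsome other text\n<summary>Second summary</summary>"
pattern = re.compile(r"<summary>(.*?)</summary>", re.DOTALL)

# finditer yields one match per <summary> block
print([m.group(1).strip() for m in pattern.finditer(sample)])
# ['First summary', 'Second summary']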

Making the pattern non-greedy prevents the combinatorial explosion and makes the runtime linear rather than exponential in the worst case.

Differences in performance

Test script

Here is a small script I ran to compare the performance of both regexes. The idea behind this script is to take a pattern with several newlines (a\nb\nc) and multiply it by some factor to produce a string with many newlines that will get stuck in the repeating greedy match of the original regex. Additionally, we remove the <summary> tags surrounding the input, since they are not needed to demonstrate the behavior.

Additionally, we use re.compile to create a regular expression object that can be reused many times in the same program more efficiently. This call also lets us pass in the re.DOTALL flag once when the regex is created rather than on every invocation.

We use time.perf_counter_ns() to measure the run time of each regex search in nanoseconds and get as accurate a value for the execution time as possible.

Note that both regexes are compiled before they are used, to ensure consistency.

import re
import time


def measure_regex_time_ns(multiplier: int):
    # Create the input string based on the multiplier
    input_str = 'a\nb\nc' * multiplier

    # Problematic regex pattern with potential for catastrophic backtracking
    problematic_pattern = re.compile(r'((\n*.*\n*)*)')

    # Simplified pattern that doesn't cause backtracking, with the re.DOTALL flag and non-greedy matching
    simple_pattern_non_greedy = re.compile(r'(.*?)', re.DOTALL)

    # Measure time taken for matching with the problematic pattern using re.search
    start_time = time.perf_counter_ns()
    regex_match = problematic_pattern.search(input_str)
    end_time = time.perf_counter_ns()
    problematic_time = end_time - start_time

    # Measure time taken for matching with the simplified pattern using re.search
    start_time = time.perf_counter_ns()
    regex_match = simple_pattern_non_greedy.search(input_str)
    end_time = time.perf_counter_ns()
    simple_time_non_greedy = end_time - start_time

    return problematic_time, simple_time_non_greedy


def main():
    # Test the function with multipliers for string length (1, 10, 100, 1000, 10000, 100000)
    multipliers = [int(10 ** i) for i in range(6)]
    for multiplier in multipliers:
        problematic_time, simple_time = measure_regex_time_ns(multiplier)
        print("Multiplier:", multiplier)
        print("Problematic Regex Time (ns):", problematic_time)
        print("Simple Regex Time (ns):", simple_time)
        print("Ratio of Times:", problematic_time / simple_time)
        print("-" * 30 + "\n" * 3)


if __name__ == "__main__":
    main()

Results

Here were the results for various values of multiplier, from running on my laptop (MacBook Pro M1 Max, 32GB RAM) using Python 3.11.3.

Multiplier: 1
Problematic Regex Time: 2708
Simple Regex Time: 500
Ratio of Times: 5.416
------------------------------



Multiplier: 10
Problematic Regex Time: 3292
Simple Regex Time: 292
Ratio of Times: 11.273972602739725
------------------------------



Multiplier: 100
Problematic Regex Time: 19042
Simple Regex Time: 250
Ratio of Times: 76.168
------------------------------



Multiplier: 1000
Problematic Regex Time: 160291
Simple Regex Time: 250
Ratio of Times: 641.164
------------------------------



Multiplier: 10000
Problematic Regex Time: 1809958
Simple Regex Time: 708
Ratio of Times: 2556.437853107345
------------------------------



Multiplier: 100000
Problematic Regex Time: 20424917
Simple Regex Time: 2792
Ratio of Times: 7315.514684813754
------------------------------

As you can see, the problematic regex demonstrates increasingly poor performance on long input sequences, and scales extremely poorly compared to the simple regex.

Below is a log-log plot of multiplier vs. execution time which illustrates this relationship. Both the X and Y axes are scaled logarithmically to make the relationship between the variables clearer.

Regex performance graph

As you can see, the execution time for the problematic regex blows up as the input size grows. In contrast, the simple regex’s execution time grows far more slowly, increasing by only about one order of magnitude even as the input size grows by five orders of magnitude!

Plotting code

Here is the code used to generate this plot:

import matplotlib.pyplot as plt
import numpy as np

# Data to plot
multipliers = np.array([1, 10, 100, 1000, 10000, 100000])
problematic_times = np.array([2000, 2583, 19250, 164875, 1860708, 25549500])
simple_times = np.array([458, 250, 250, 208, 11625, 19000])

# Plotting the data on log-log axes
plt.figure(figsize=(10, 6))
plt.loglog(multipliers, problematic_times, label='Problematic Regex Time (ns)', marker='o')
plt.loglog(multipliers, simple_times, label='Simple Regex Time (ns)', marker='x')
plt.xlabel('Multiplier')
plt.ylabel('Execution Time (ns)')
plt.title('Execution Time of Regex Patterns vs. Multiplier')
plt.legend()
plt.grid(True, which="both", ls="--")
plt.show()

Takeaways

  • Multiple levels of greedy matching can cause regex performance to spin out of control. Avoid nesting greedy quantifiers if at all possible.
  • Use non-greedy matching and flags such as re.DOTALL to get better results.
  • Use re.compile to compile regexes that are used many times in the same application and improve performance.

Thanks to GPT-4 and Code Interpreter for helping me generate the plots for this blog post. I’m notoriously bad at matplotlib, so having GPT-4 generate the charts made it a lot easier for me.

Also thanks to Ben Frederickson, who created the py-spy library and also wrote a great post on catastrophic backtracking (linked below).

Sources
