[SOLVED] Is the for loop in my code the speed bottleneck?

Issue

The following code looks through 2500 markdown files with a total of 76475 lines, to check each one for the presence of two strings.

#!/usr/bin/env python3
# encoding: utf-8

import re
import os

zettelkasten = '/Users/will/Dropbox/zettelkasten'

def zsearch(s, *args):
    for x in args:
        r = (r"(?=.* " + x + ")")
        p = re.search(r, s, re.IGNORECASE)
        if p is None:
            return None
    return s

for filename in os.listdir(zettelkasten):
    if filename.endswith('.md'):
        with open(os.path.join(zettelkasten, filename),"r") as fp:
            for line in fp:
                result_line = zsearch(line, "COVID", "vaccine")
                if result_line is not None:
                    UUID = filename[-15:-3]
                    print(f'›[[{UUID}]] OR', end=" ")

This correctly gives output like:

›[[202202121717]] OR ›[[202003311814]] OR 

However, it takes almost two seconds to run on my machine, which seems much too slow. What, if anything, can be done to make it faster?

Solution

The main bottleneck is the regular expressions you’re building.

If we add print(f"{r=}") inside the zsearch function, we can see the patterns it builds:

>>> zsearch("line line covid line", "COVID", "vaccine")
r='(?=.* COVID)'
r='(?=.* vaccine)'

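The (?=.*) lookahead is what causes the slowdown, and it isn't needed. On a line that does not contain the term, re.search retries the lookahead at every starting position, and each attempt scans to the end of the line and then backtracks, so the work grows roughly quadratically with line length. re.search already looks for a match anywhere in the string, so the lookahead adds nothing.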
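To see how the cost of the lookahead grows, here is a minimal timing sketch. The helper name time_search and the line lengths are assumptions chosen for illustration, not part of the original code. It uses lines that do not contain the term, since that is the case where every starting position has to be tried. Absolute numbers depend on the machine; the point is that the time should grow much faster than the line length.

```python
import re
import timeit


def time_search(pattern, line, repeats=200):
    """Hypothetical helper: total seconds for `repeats` searches of `line`."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return timeit.timeit(lambda: compiled.search(line), number=repeats)


# Lines without the search term: the worst case for the lookahead pattern.
for length in (100, 200, 400):
    line = "x" * length
    elapsed = time_search(r"(?=.* COVID)", line)
    print(f"{length:4d} chars: {elapsed:.5f}s")
```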

You can achieve the same result by searching for:

r=' COVID'
r=' vaccine'
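As a sketch of how zsearch could look with the plain patterns, assuming the leading space before each term is intentional: the patterns are compiled once up front rather than rebuilt for every line. The helper name compile_terms and the use of re.escape are additions for illustration and are not in the original code.

```python
import re


def compile_terms(*terms):
    # re.escape (an addition here) keeps characters such as "+" or "." literal.
    return [re.compile(" " + re.escape(term), re.IGNORECASE) for term in terms]


def zsearch(line, patterns):
    # Return the line only if every pattern occurs somewhere in it.
    if all(p.search(line) for p in patterns):
        return line
    return None


patterns = compile_terms("COVID", "vaccine")
print(zsearch("a covid vaccine trial", patterns))  # both terms present
print(zsearch("a covid trial", patterns))          # "vaccine" missing: None
```

In the file loop, patterns would be built once before the loop and passed to zsearch for each line. If the terms are always plain strings, a check such as " covid" in line.lower() is another option that avoids regular expressions entirely.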

Answered By – KarlT

Answer Checked By – Gilberto Lyons (BugsFixing Admin)
