Convert regex

convert_regex

This module converts Microsoft VB / COM regular expressions to Python regular expressions to bridge the differences between the two syntaxes. Basic Eval() functionality is supported as well, i.e. basic formatting of matched groups/elements.

The original use case was a flexible file renaming tool that includes some boolean logic. It was used, for example, to assign a report created at the beginning of a month to the previous month. In effect, all renaming logic is encoded in regular expressions.
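To make the use case concrete, here is a minimal sketch of how a converted expression could drive a rename. The expression string is the converted output from the module's doctest; everything else is an assumption for illustration: the file-name layout, the names `FILE_PATTERN`, `TEMPLATE`, `strip_leading_zeros` and `rename`, and the choice of `Match.expand()` for group substitution. The module's own `clean_for_eval` is replaced by a small stand-in so the block runs without the package.

```python
import re

# Hypothetical file-name layout: <id>_<report month>_<dd>.<mm>.<yyyy><ext>
FILE_PATTERN = re.compile(r"(\d+)_(\d{2})_(\d{2})\.(\d{2})\.(\d{4})(\.pdf)")

# Converted expression, copied from the module's doctest output.
TEMPLATE = r'"\1_Report_" + str(\5+(\2>\4)*-1) + "_" + "{:02d}".format(\2) + "_\5\4\3\6"'


def strip_leading_zeros(expression: str) -> str:
    # Stand-in for clean_for_eval (assumption): drop zeros that directly
    # follow an operator character and precede another digit.
    return re.sub(r"(?<=[ <=>+\-*/%(])0+(?=\d)", "", expression)


def rename(filename: str) -> str:
    match = FILE_PATTERN.fullmatch(filename)
    if match is None:
        return filename  # leave non-matching names untouched
    expression = match.expand(TEMPLATE)  # substitute \1 ... \6 with the groups
    # eval() is only acceptable here because the template is trusted input.
    return eval(strip_leading_zeros(expression))


print(rename("123456_08_01.09.2021.pdf"))  # 123456_Report_2021_08_20210901.pdf
print(rename("123456_12_05.01.2022.pdf"))  # 123456_Report_2021_12_20220105.pdf
```

In the second call the report month (12) is greater than the file's month (01), so the comparison subtracts one from the year and the December report lands in 2021.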

Example / doctest (note escaping in output to avoid error):

>>> from utils_mystuff import convertRegExVB2Python

>>> # test - conversion
>>> print(convertRegExVB2Python('"$1_Report_" & $5+($2>$4)*1 & "_" & Format($2, "00") & "_$5$4$3$6"'))
"\1_Report_" + str(\5+(\2>\4)*-1) + "_" + "{:02d}".format(\2) + "_\5\4\3\6"
>>> print(convertRegExVB2Python('"$1_Report_" & $2+($3+($4<28)=0) & Format($3+($4<28)-12*($3+($4<28)=0), "00") & "_Korr.pdf"'))
"\1_Report_" + str(\2+(\3+(\4<28)*-1==0)*-1) + "{:02d}".format(\3+(\4<28)*-1-12*(\3+(\4<28)*-1==0)*-1) + "_Korr.pdf"
>>> print(convertRegExVB2Python("$1_$2"))
\1_\2
>>> print(convertRegExVB2Python('"$1_Report_" & $5+($2>$4)*1 & "_" & Format($2, "00") & "_$5$4$3$6"'))
"\1_Report_" + str(\5+(\2>\4)*-1) + "_" + "{:02d}".format(\2) + "_\5\4\3\6"
>>> print(convertRegExVB2Python('"$1_Report_" & $5+($2>$4)*1 + "_" + Format($2, "00") & "_$5$4$3$6"'))
"\1_Report_" + str(\5+(\2>\4)*-1) + "_" + "{:02d}".format(\2) + "_\5\4\3\6"
>>> print(convertRegExVB2Python('"$1$2_Report_" & $3+($4+($5<28)=0) & Format($4+($5<28)-12*($4+($5<28)=0), "00") & "$6"'))
"\1\2_Report_" + str(\3+(\4+(\5<28)*-1==0)*-1) + "{:02d}".format(\4+(\5<28)*-1-12*(\4+(\5<28)*-1==0)*-1) + "\6"

>>> # test - clean leading zeros
>>> print(clean_for_eval('"(-1*1*0) + "_" + "{:02d}".format(08) + "_20210901.pdf"'))
"(-1*1*0) + "_" + "{:02d}".format(8) + "_20210901.pdf"
>>> print(clean_for_eval('"123456_Report_" + str(2021+(08>09)*-01*1*0) + "_" + "{:02d}".format(08) + "_20210901.pdf"'))
"123456_Report_" + str(2021+(8>9)*-1*1*0) + "_" + "{:02d}".format(8) + "_20210901.pdf"
>>> print(clean_for_eval('"123456_Report_" + str(2022+(4+(02<28)*-1==0)*-1) + "{:02d}".format(04+(02<28)*-1-12*(04+(02<28)*-1==0)*-1) + ".pdf"'))
"123456_Report_" + str(2022+(4+(2<28)*-1==0)*-1) + "{:02d}".format(4+(2<28)*-1-12*(4+(2<28)*-1==0)*-1) + ".pdf"

clean_for_eval(expression: str) -> str

clean_for_eval - clean regular expression after substitution

Deletes leading zeros from integer constants, because eval() rejects literals such as 08 with a SyntaxError.

Parameters:

Name          Type    Description                         Default
expression    str     regular expression to be cleaned    required

Returns:

Name    Type    Description
str     str     cleaned regular expression
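The reason the cleaning step is needed can be shown in a few lines. This is an illustrative sketch; the helper name `can_eval` is hypothetical and not part of the module. Python 3 does not permit leading zeros in decimal integer literals, so a substituted group such as 08 makes the whole expression fail to compile.

```python
def can_eval(expression: str) -> bool:
    # Return True if eval() accepts the expression, False on a SyntaxError.
    try:
        eval(expression)
    except SyntaxError:
        return False
    return True


print(can_eval('"{:02d}".format(08)'))  # False: 08 is not a valid literal
print(can_eval('"{:02d}".format(8)'))   # True
```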

Source code in src/utils_mystuff/convert_regex.py
def clean_for_eval(expression: str) -> str:
    """
    clean_for_eval - clean regular expression after substitution

    delete leading zeros from integer constants to overcome eval() error

    Arguments:
        expression (str): regular expression to be cleaned

    Returns:
        str: cleaned regular expression
    """

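    # a zero counts as "leading" only when it directly follows one of these characters
    operator_char = [" ", "<", "=", ">", "+", "-", "*", "/", "%", "("]

    for char in operator_char:
        pos = expression.find(f"{char}0")
        while pos >= 0:
            next_char = expression[pos + 2:pos + 3]
            # the emptiness check keeps a single zero at the very end of the string
            if next_char != "" and next_char in "0123456789":
                # drop the zero and rescan from the same position to catch "007"
                expression = expression[0:pos + 1] + expression[pos + 2:]
                find_start = pos
            else:
                find_start = pos + 1
            pos = expression.find(f"{char}0", find_start)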

    return expression

convertRegExVB2Python(reVB: str) -> str

convertRegExVB2Python - convert Microsoft VB / COM regular expression to Python regular expression

Problem: syntax for regular expressions in the Microsoft VB / COM standard is different from Python.

Differences in the area of the standard syntax:

  • in replacement strings, group references are written \n instead of $n
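A minimal sketch of this difference, using only Python's standard `re` module. The helper `dollar_to_backslash` is hypothetical and only mirrors the idea; it is not the module's implementation.

```python
import re


def dollar_to_backslash(replacement: str) -> str:
    # Hypothetical helper: rewrite $1..$9 as backslash references for Python's re module.
    return re.sub(r"\$([1-9])", r"\\\1", replacement)


vb_replacement = "$2_$1"
py_replacement = dollar_to_backslash(vb_replacement)

print(py_replacement)
print(re.sub(r"(\w+)-(\w+)", py_replacement, "report-2021"))  # 2021_report
```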

Differences if regular expressions are used together with Eval():

  • different numerical value for True: VB environment = -1, Python = 1 -> boolean expressions need to be adjusted
  • string concatenation is '&' in VB and '+' in Python; assumption: source expressions do not use '+' for concatenation, even though VB allows it
  • '+' cannot be applied to mixed int/str operands in Python -> numerical expressions must be wrapped in str()
  • language functions have to be replaced; implemented so far: "Format" for numerical values
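The points above can be illustrated with a small sketch. The function name `report_label` and its parameters are assumptions chosen for this example; the arithmetic mirrors the converted doctest expressions.

```python
def report_label(year: int, report_month: int, file_month: int) -> str:
    # VB: year + (report_month > file_month); True is -1 there, so the year drops by one.
    # Python: True is 1, so the comparison is multiplied by -1 to get the same effect.
    adjusted_year = year + (report_month > file_month) * -1
    # VB's & joins numbers and strings directly; Python's + needs str() around the number.
    # VB's Format(x, "00") corresponds to "{:02d}".format(x).
    return "Report_" + str(adjusted_year) + "_" + "{:02d}".format(report_month)


print(report_label(2022, 12, 1))  # Report_2021_12
print(report_label(2021, 8, 9))   # Report_2021_08
```

Without the str() wrapper, "Report_" + adjusted_year would raise a TypeError in Python.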

Parameters:

Name    Type    Description                                                                    Default
reVB    str     regular expression according to VB / COM standard for Microsoft RegEx engine    required

Returns:

Name    Type    Description
str     str     converted regular expression for Python RegEx engine

Source code in src/utils_mystuff/convert_regex.py
def convertRegExVB2Python(reVB: str) -> str:
    """
    convertRegExVB2Python - convert Microsoft VB / COM regular expression to Python regular expression

    Problem: syntax for regular expressions in the Microsoft VB / COM standard is different from Python.

    Differences in the area of the standard syntax:

    - in replacement strings, group references are written \\n instead of $n

    Differences if regular expressions are used together with Eval():

    - different numerical value for True: VB environment = -1, Python = 1
      -> boolean expressions need to be adjusted
    - string concatenation is '&' in VB and '+' in Python
      assumption: source expressions do not use '+' for concatenation,
      even though VB allows it
    - '+' cannot be applied to mixed int/str operands in Python
      -> numerical expressions must be wrapped in str()
    - language functions have to be replaced. Implemented so far:
      "Format" for numerical values

    Arguments:
        reVB (str): regular expression according to VB / COM standard for Microsoft RegEx engine

    Returns:
        str: converted regular expression for Python RegEx engine
    """

    def split_terms(regex: str, quotemark, delim: str = "&+") -> list[str]:

        terms: list[str] = []
        regex_idx: int = 0
        term_start: int = 0
        quotemark_cnt: int = 0
        bracket_cnt: int = 0

        regex += " "
        while regex_idx < len(regex):
            if regex[regex_idx] == quotemark:
                quotemark_cnt += 1
            elif regex[regex_idx] in "(){}[]":
                bracket_cnt += 1
            elif (
                (regex[regex_idx] in delim or regex_idx >= len(regex) - 1) and
                quotemark_cnt % 2 == 0 and
                bracket_cnt % 2 == 0
            ):
                term = regex[term_start:regex_idx].strip()
                if (term[-1] in quotemark) or (terms[-1][-1] in quotemark) or (term.lower().find("format(") >= 0):
                    terms.append(term)
                else:
                    terms[-1] = terms[-1] + "+" + term
                term_start = regex_idx + 1
            regex_idx += 1

        return terms

    def convert_other(term: str, embedstr: bool = True) -> str:

        compare_operator = [">", "<", "="]
        foundbool = False

        term_idx = 0
        while term_idx < len(term):
            if term[term_idx] in compare_operator:
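                if term[term_idx:term_idx + 2] == "<>":
                    # VB inequality "<>" becomes "!="
                    term = term[:term_idx] + "!=" + term[term_idx + 2:]
                elif term[term_idx] == "=":
                    # VB equality "=" becomes "=="
                    term = term[:term_idx] + "==" + term[term_idx + 1:]
                    term_idx += 1
                # advance to the bracket that closes the comparison; check the bound first
                while term_idx < len(term) and term[term_idx] != ")":
                    term_idx += 1
                if term_idx < len(term):
                    # VB True is -1, Python True is 1 -> multiply the comparison by -1
                    term = term[0:term_idx] + ")*-1" + term[term_idx + 1:]
                    term_idx += 3
                    foundbool = True
                else:
                    # comparison without a closing bracket -> wrap the whole term
                    term = "(" + term + ")*-1"
                    foundbool = True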
            term_idx += 1

        if foundbool:
            term = term.replace("*-1*1", "*-1").replace("*-1*-1", "*1")

        if embedstr:
            return "str(" + term + ")"
        else:
            return term

    def convert_term(term: str, quotemark) -> str:

        if term[0] == quotemark:
            return term
        elif term.lower()[0:len("format(")] == "format(":
            # only simple integer formatting accepted
            params = term[len("format") + 1:len(term) - 1].split(",")
            params[0] = convert_other(params[0], False)
            return f"\"{{:0{params[1].count('0')}d}}\".format({params[0]})"
        else:
            return convert_other(term)

    # check quotation mark -> RegEx for use with Eval()
    if reVB[0] == "'":
        quotemark = "'"
    elif reVB[0] == "\"":
        quotemark = "\""
    else:
        quotemark = ""

    if quotemark != "":
        # top-level split of  regular expression into string terms
        terms = split_terms(reVB, quotemark)
        rePy = ""
        for term in terms:
            rePy = convert_term(term, quotemark) if rePy == "" else rePy + " + " + convert_term(term, quotemark)
    else:
        rePy = reVB

    # replace groups
    for grp_idx in range(9, 0, -1):
        rePy = rePy.replace(f"${grp_idx}", rf"\{grp_idx}")

    return rePy

convert_regexVB2python(reVB: str) -> str

convert_regexVB2python - convert Microsoft VB / COM regular expression to Python regular expression

Alternative caller for convertRegExVB2Python. See details there.

Parameters:

Name    Type    Description                                                                    Default
reVB    str     regular expression according to VB / COM standard for Microsoft RegEx engine    required

Returns:

Name    Type    Description
str     str     converted regular expression for Python RegEx engine

Source code in src/utils_mystuff/convert_regex.py
def convert_regexVB2python(reVB: str) -> str:
    """
    convert_regexVB2python - convert Microsoft VB / COM regular expression to Python regular expression

    Alternative caller for convertRegExVB2Python. See details there.

    Arguments:
        reVB (str): regular expression according to VB / COM standard for Microsoft RegEx engine

    Returns:
        str: converted regular expression for Python RegEx engine
    """

    return convertRegExVB2Python(reVB)

convert_regexVB_2_python(reVB: str) -> str

convert_regexVB_2_python - convert Microsoft VB / COM regular expression to Python regular expression

Alternative caller for convertRegExVB2Python. See details there.

Parameters:

Name    Type    Description                                                                    Default
reVB    str     regular expression according to VB / COM standard for Microsoft RegEx engine    required

Returns:

Name    Type    Description
str     str     converted regular expression for Python RegEx engine

Source code in src/utils_mystuff/convert_regex.py
def convert_regexVB_2_python(reVB: str) -> str:
    """
    convert_regexVB_2_python - convert Microsoft VB / COM regular expression to Python regular expression

    Alternative caller for convertRegExVB2Python. See details there.

    Arguments:
        reVB (str): regular expression according to VB / COM standard for Microsoft RegEx engine

    Returns:
        str: converted regular expression for Python RegEx engine
    """

    return convertRegExVB2Python(reVB)