[SOLVED] Overlapping regular expression substitution in Python, but contingent on values of capture groups

Issue

I’m currently writing a program in Python that transliterates all the characters in a language from one orthography into another. There are two steps involved: the first is already solved, and the second is where the problem lies. In the first step, characters from the source orthography are converted into the target orthography, e.g.

š -> sh

ł -> lh

m̓ -> m’

l̓ -> l’

(For reference: the apostrophe-looking character is a single closing quotation mark, ’.)

Getting closer to the problem: some graphemes that are normally written one way (m’, n’, l’, y’, w’) are written differently depending on what is immediately around them. Specifically, the ’ character may move to precede the consonant if the grapheme sits directly between two vowels and the vowel before it is at a higher ‘level’ in a hierarchy than the vowel after it. The rule is a bit complicated to explain, so here are some examples, including the first stage of the transliteration:

əm̓ə -> um’u -> um’u (no change)
əm̓i -> um’i -> um’i (no change)
im̓ə -> im’u -> i’mu (’ character moves to precede; i > u)
em̓i -> em’i -> e’mi (’ character moves to precede; e > i)

The hierarchy is as follows; the ’ character moves towards whichever neighbouring vowel ranks higher:
e > i > a > u

Here is the code that I have that deals with this second step pretty much as well as it can be done. It is pretty clean and takes care of the problem succinctly:

import re

def glottalized_resonant_mover(linestring):
    
    '''
    moves glottal character over according to glottalized resonant 
    hierarchy:

    case description: VR’W for some vowels V, W; some glottalized 
    resonant R’

    hierarchy: e > i > a > u
               3 > 2 > 1 > 0

    if h(V) > h(W), then string is V’RW
    
    '''

    hi_scores = {'e' : 3,
                'i' : 2,
                'a' : 1,
                'u' : 0}

    def hierarchy_sub(matchobj):
        '''moves glottalized resonant if a vowel pulls it one way
        or the other
        '''

        if hi_scores[matchobj.group(1)] > hi_scores[matchobj.group(4)]:

            swap_string = ''.join(
                [
                matchobj.group(1),
                matchobj.group(3),
                matchobj.group(2),
                matchobj.group(4)
                ]
            )
            return swap_string

        else:
            return matchobj.group(0)


    glot_res_re = re.compile('(a|e|i|u)(l|m|n|w|y)(’)(a|e|i|u)')
    swapstring = glot_res_re.sub(hierarchy_sub, linestring)
    
    return swapstring

sample = ['’im’ush', 'ttham’uqwus', 'xwtsekwul’im’us']

for i in sample:
    print(glottalized_resonant_mover(i))

So, when this code is given the transliterated words ’im’ush, ttham’uqwus, and xwtsekwul’im’us, it works perfectly for the first two words but not the third. To summarize:

’im’ush          -> ’i’mush √
ttham’uqwus      -> ttha’muqwus √
xwtsekwul’im’us  -> xwtsekwul’im’us X should be: xwtsekwul’i’mus

The problem is that there are two overlapping matches in the third word: ul’i and im’u, which share the i. Since re.sub only replaces non-overlapping matches, the i is consumed by the first match and im’u is never seen.
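
A quick way to see the overlap is to list what the original pattern actually finds in the third word. This is only a diagnostic sketch; the names consuming, word, and matches are made up for illustration and are not part of the program above.

```python
import re

# Original pattern: all four characters (V, R, ’, W) are consumed by a match.
consuming = re.compile('(a|e|i|u)(l|m|n|w|y)(’)(a|e|i|u)')

word = 'xwtsekwul’im’us'

# finditer scans left to right and resumes after the end of each match,
# so a character consumed by one match cannot begin the next one.
matches = [m.group(0) for m in consuming.finditer(word)]
print(matches)
```

Only ul’i is reported: the scan resumes at m’us, so im’u is never tried.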

Now, this program is being fed lines of text, where the first stage of transliteration occurs, and then this second step should occur. Some documents are thousands of lines long, and there’s a lot of these documents. There are also other things that I mean to implement (checking against wordlists, etc.) that will take up much computational power, so I’d like to keep this as quick as possible while still being comprehensible.

Also, it is true that I could just write a sequence for each and just have another big list of character sequences to replace, but then I lose some of the portability as well as the ability to easily make edits later.

So, if there’s supposed to be a question: what is the best way to solve this problem that still preserves the approach and some of the qualities of my original solution?

Solution

Yes, only very small changes are needed.

  VR’W -> V’RW

In fact, only the first three characters need to be rewritten; the following vowel W is just a necessary condition. So the problem becomes:

  VR’(W) -> V’R

A lookahead assertion, (?=...), can check for W without consuming it.

Previous pattern, which consumes all of VR’W:

(a|e|i|u)(l|m|n|w|y)(’)(a|e|i|u)

New pattern, which consumes only the three characters VR’ and looks ahead for W:

(a|e|i|u)(l|m|n|w|y)(’)(?=(a|e|i|u))

Because W is only a condition and is not part of the matched span, it remains available to be matched again as the V of the next match. The group inside the lookahead still captures, so matchobj.group(4) is still W.
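
To check that the lookahead really leaves W available, the same kind of diagnostic can be run with the new pattern. Again this is just a sketch; lookahead, word, and found are illustrative names.

```python
import re

# New pattern: V, R and ’ are consumed; W is only inspected by the lookahead.
lookahead = re.compile('(a|e|i|u)(l|m|n|w|y)(’)(?=(a|e|i|u))')

word = 'xwtsekwul’im’us'

# For each match, record the consumed text (group 0) and the vowel seen
# by the lookahead (group 4), which is captured but not consumed.
found = [(m.group(0), m.group(4)) for m in lookahead.finditer(word)]
print(found)
```

Both ul’ (followed by i) and im’ (followed by u) are now found, because the i is no longer swallowed by the first match.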

import re

def glottalized_resonant_mover(linestring):
    
    '''
    moves glottal character over according to glottalized resonant 
    hierarchy:

    case description: VR’W for some vowels V, W; some glottalized 
    resonant R’

    hierarchy: e > i > a > u
               3 > 2 > 1 > 0

    if h(V) > h(W), then string is V’RW
    
    '''

    hi_scores = {'e' : 3,
                'i' : 2,
                'a' : 1,
                'u' : 0}

    def hierarchy_sub(matchobj):
        '''moves glottalized resonant if a vowel pulls it one way
        or the other
        '''
        if hi_scores[matchobj.group(1)] > hi_scores[matchobj.group(4)]:

            swap_string = ''.join(
                [
                matchobj.group(1),
                matchobj.group(3),
                matchobj.group(2),
                #matchobj.group(4) <- Don't need the last one because 'lookahead'
                ]
            )
            return swap_string

        else:
            return matchobj.group(0)
       
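    # The lookahead (?=...) checks for W without consuming it,
    # so W can still serve as V in the next match.
    glot_res_re = re.compile('(a|e|i|u)(l|m|n|w|y)(’)(?=(a|e|i|u))')
    # Previous pattern, which consumed W:
    # glot_res_re = re.compile('(a|e|i|u)(l|m|n|w|y)(’)(a|e|i|u)')
    swapstring = glot_res_re.sub(hierarchy_sub, linestring)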
    
    return swapstring

sample = ['’im’ush', 'ttham’uqwus', 'xwtsekwul’im’us']
answer = ['’i’mush', 'ttha’muqwus', 'xwtsekwul’i’mus']
# Print each input, the function's result, and the expected result side by side.
for word, expected in zip(sample, answer):
    print(word, '->', glottalized_resonant_mover(word), "==", expected)

Output:

’im’ush -> ’i’mush == ’i’mush
ttham’uqwus -> ttha’muqwus == ttha’muqwus
xwtsekwul’im’us -> xwtsekwul’i’mus == xwtsekwul’i’mus
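
Since the question mentions speed over thousands of lines, one possible variation (a sketch, not the code from the answer above) is to compile the pattern once at module level and use character classes instead of alternations. The names HI_SCORES, GLOT_RES_RE, _hierarchy_sub, and move_glottal are hypothetical; the logic is meant to be the same as glottalized_resonant_mover.

```python
import re

# Same hierarchy as in the question: e > i > a > u.
HI_SCORES = {'e': 3, 'i': 2, 'a': 1, 'u': 0}

# Compiled once at import time; [aeiu] is equivalent to (a|e|i|u) here.
GLOT_RES_RE = re.compile('([aeiu])([lmnwy])(’)(?=([aeiu]))')


def _hierarchy_sub(m):
    """Return V’R if V outranks the following vowel W, else the match unchanged."""
    v, r, glottal, w = m.groups()
    if HI_SCORES[v] > HI_SCORES[w]:
        return v + glottal + r
    return m.group(0)


def move_glottal(line):
    """Apply the glottalized-resonant rule to every VR’(W) sequence in line."""
    return GLOT_RES_RE.sub(_hierarchy_sub, line)


for word in ['’im’ush', 'ttham’uqwus', 'xwtsekwul’im’us']:
    print(word, '->', move_glottal(word))
```

Whether this is measurably faster would need to be timed on real documents; the re module also caches compiled patterns internally, so the gain may be small.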

Answered By – Tong Jiye

Answer Checked By – Marie Seifert (BugsFixing Admin)
