[SOLVED] Regex to parse simple markdown with escaped characters without look-behind

Issue

Note: This has to work in JavaScript RegExp

I have to parse a string like this:

yo (p:abc-123-def) meets  \(p:2) \(in the cinema\) \\ (p:3) (p:4\) won't 

What I need to extract are all (<entity>:<id>) markups, while ignoring escaped things like \(in the cinema\) or \\. From the above example, the regex should only match

(p:abc-123-def)
(p:3)

but not \(p:2) or (p:4\) since one of the brackets is escaped in each.

Now, I am still able to modify that markup so if there is a simpler way to do the whole thing I’m open to suggestions. If not, I’d need to be able to get those (<entity>:<id>) markups from a regex.

Something like this

(?<!\\)\([^(?<!\\)\(]*\)

would work but look-behind groups are not supported by all browsers.

Solution

It can get complex when backslashes are repeated many times, like: \\\\\\\\\\\\\\(p:1). You would need to know whether the number of backslashes is even or odd in order to know whether the ( is escaped or not.

Secondly, a colon occurring within the parentheses might be escaped as well, and would then presumably not count as the separator.
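The even/odd rule applies to any character that can be escaped, whether it is a ( or a :. A minimal sketch of that check, assuming a hypothetical helper named isEscaped (it is not part of the answer's regex, only a way to reason about it):

```javascript
// Hypothetical helper: true when the character at `index` is preceded
// by an odd number of consecutive backslashes, i.e. it is escaped.
function isEscaped(text, index) {
  let backslashes = 0;
  for (let i = index - 1; i >= 0 && text[i] === "\\"; i--) {
    backslashes++;
  }
  return backslashes % 2 === 1;
}

// String.raw keeps the backslashes exactly as typed.
const evenCase = String.raw`\\(p:1)`;  // two backslashes before "("
const oddCase = String.raw`\\\(p:1)`;  // three backslashes before "("

console.log(isEscaped(evenCase, evenCase.indexOf("("))); // false
console.log(isEscaped(oddCase, oddCase.indexOf("(")));   // true
```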

So I would suggest working with something like (?:\\.|[^:)\\])* which consumes each escaped character together with its backslash (\\.) and restricts unescaped characters to [^:)\\].
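To see what this building block consumes on its own, here is a small sketch. The sample text and the ^ anchor are assumptions added purely for illustration:

```javascript
// The building block, anchored at the start for demonstration only.
const segment = /^(?:\\.|[^:)\\])*/;

// The first colon is escaped, so it is swallowed as part of a "\:" pair;
// matching stops at the first unescaped ":".
const sample = String.raw`p\:x:1`;
console.log(sample.match(segment)[0]); // p\:x
```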

So this is the result:

(?<!\\)(?:\\.)*\((?:\\.|[^:)\\])*:(?:\\.|[^:)\\])*\)

This uses look-behind, which is supported in recent versions of the popular browsers.
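A quick way to try the look-behind version against the string from the question, assuming an engine that supports look-behind. The variable names are only illustrative:

```javascript
// String.raw keeps the backslashes exactly as typed in the question.
const input = String.raw`yo (p:abc-123-def) meets  \(p:2) \(in the cinema\) \\ (p:3) (p:4\) won't`;

// Look-behind version of the pattern, with the global flag to find all matches.
const markup = /(?<!\\)(?:\\.)*\((?:\\.|[^:)\\])*:(?:\\.|[^:)\\])*\)/g;

// match() returns null when nothing is found, hence the fallback.
const found = input.match(markup) ?? [];
console.log(found); // [ '(p:abc-123-def)', '(p:3)' ]
```

Note that a match also includes any escaped pairs sitting directly in front of the opening parenthesis, so for an input like \\(p:1) the matched text is \\(p:1) and the leading backslashes may need to be stripped afterwards.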

If look-behind is not an option, then capture the character that precedes the potential backslashes, and make a capture group for the part you need:

(?:[^\\]|^)((?:\\.)*\((?:\\.|[^:)\\])*:(?:\\.|[^:)\\])*\))

Here you need to work with the first capture group, since the full match also includes the preceding character.
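A sketch of how the capture-group version might be used, again on the string from the question. The variable names are assumptions for illustration:

```javascript
// String.raw keeps the backslashes exactly as typed in the question.
const text = String.raw`yo (p:abc-123-def) meets  \(p:2) \(in the cinema\) \\ (p:3) (p:4\) won't`;

// Version without look-behind: the wanted part is capture group 1.
const pattern = /(?:[^\\]|^)((?:\\.)*\((?:\\.|[^:)\\])*:(?:\\.|[^:)\\])*\))/g;

// matchAll() yields one match object per hit; keep only group 1.
const markups = Array.from(text.matchAll(pattern), (m) => m[1]);
console.log(markups); // [ '(p:abc-123-def)', '(p:3)' ]

// Caveat: the preceding character is consumed by each match, so a markup
// that directly follows another one, with nothing in between, is skipped.
const adjacent = Array.from("(p:1)(p:2)".matchAll(pattern), (m) => m[1]);
console.log(adjacent); // [ '(p:1)' ]
```

If markups can appear back to back, the look-behind version does not have this limitation, because it does not consume the character before the match.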

Answered By – trincot

Answer Checked By – Cary Denson (BugsFixing Admin)
