[SOLVED] Convert Cyrillic to Latin – Latin intruders/exception

Issue

I am using a simple dictionary to replace Cyrillic letters with Latin ones. Most of the time it works fine, but I have issues when the input already contains some Latin letters, most often in company names.

A few examples:

PROCRED is being converted as RROSRED

ОВЕХ as OVEH

CITY as SITU

What can I do about this?

This is the dictionary I am using

public string ConvertCyrillicToLatin(string text)
        {
            Dictionary<string, string> words = new Dictionary<string, string>();

            words.Add("А", "A");
            words.Add("Б", "B");
            words.Add("В", "V");
            words.Add("Г", "G");
            words.Add("Д", "D");
            words.Add("Ђ", "Đ");
            words.Add("Е", "E");
            words.Add("Ж", "Ž");
            words.Add("З", "Z");
            words.Add("И", "I");
            words.Add("Ј", "J");
            words.Add("К", "K");
            words.Add("Л", "L");
            words.Add("Љ", "Lj");
            words.Add("М", "M");
            words.Add("Н", "N");
            words.Add("Њ", "Nj");
            words.Add("О", "O");
            words.Add("П", "P");
            words.Add("Р", "R");
            words.Add("С", "S");
            words.Add("Т", "T");
            words.Add("Ћ", "Ć");
            words.Add("У", "U");
            words.Add("Ф", "F");
            words.Add("Х", "H");
            words.Add("Ц", "C");
            words.Add("Ч", "Č");
            words.Add("Џ", "Dž");
            words.Add("Ш", "Š");
            words.Add("а", "a");
            words.Add("б", "b");
            words.Add("в", "v");
            words.Add("г", "g");
            words.Add("д", "d");
            words.Add("ђ", "đ");
            words.Add("е", "e");
            words.Add("ж", "ž");
            words.Add("з", "z");
            words.Add("и", "i");
            words.Add("ј", "j");
            words.Add("к", "k");
            words.Add("л", "l");
            words.Add("љ", "lj");
            words.Add("м", "m");
            words.Add("н", "n");
            words.Add("њ", "nj");
            words.Add("о", "o");
            words.Add("п", "p");
            words.Add("р", "r");
            words.Add("с", "s");
            words.Add("т", "t");
            words.Add("ћ", "ć");
            words.Add("у", "u");
            words.Add("ф", "f");
            words.Add("х", "h");
            words.Add("ц", "c");
            words.Add("ч", "č");
            words.Add("џ", "dž");
            words.Add("ш", "š");

            var source = text;
            foreach (KeyValuePair<string, string> pair in words)
            {
                source = source.Replace(pair.Key, pair.Value);
            }

            return source;
        }

UPDATE 1

As requested in the comment, here is my exemption list:

"СIТУ":"CITY",
"OBEX":"OBEX"

For now it has just these two entries, for testing, but it is impossible to maintain a truly functional exemption list with so many possibilities.

I expect that if the application comes across a Latin letter, it should just ignore it and leave it as it is. It already works like that for Latin letters that don't exist in Cyrillic, or that exist but have the same meaning, like the letters AEODGTEJKLMN… I am having issues with letters that look the same in the Latin and Cyrillic alphabets but have different meanings, like С(S), Х(H), У(Y), P(R)…

UPDATE 2

Here are a few examples of input, as asked in the comment. The slash sign of course doesn't exist in the input; I just added it so that you can distinguish the Latin part.

…ПОВЕРИОЦ /LЕNS OBEX DОО/, У СКЛАДУ СА ОДРЕДБОМ…

…ИЗЈАВА ПРИВРЕДНОГ ДРУШТВА /GRАDЈЕVINSКО РRЕDUZЕСЕ IМРЕХ LОZNIСА/ СА АДРЕСОМ…

…ЗА УГОВОР О ОТВАРАЊУ КРЕДИТНЕ ЛИНИЈЕ СА КОМПАНИЈОМ /"DOWN CITУ"/ И РАСПОН МЕСЕЧНЕ КАМАТНЕ СТОПЕ…

…КОРИСТ ПОВЕРИОЦА /ATР BANK TOUR/, СА СЕДИШТЕМ…

Solution

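The code below uses two dictionaries to convert text containing Cyrillic characters to Latin. If a word contains any Latin characters, the first dictionary, LatinType, is used; otherwise the second, CyrillicType, is used. LatinType maps each look-alike Cyrillic letter to the Latin letter of the same shape, while CyrillicType performs the regular transliteration.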

using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        var text = "...ПОВЕРИОЦ / LЕNS OBEX DОО/, У СКЛАДУ СА ОДРЕДБОМ..."
              + "...ИЗЈАВА ПРИВРЕДНОГ ДРУШТВА / GRАDЈЕVINSКО РRЕDUZЕСЕ IМРЕХ LОZNIСА / СА АДРЕСОМ..."
              + "...ЗА УГОВОР О ОТВАРАЊУ КРЕДИТНЕ ЛИНИЈЕ СА КОМПАНИЈОМ / DOWN CITУ / И РАСПОН МЕСЕЧНЕ КАМАТНЕ СТОПЕ...";

        var result = CyrillicToLatin.Convert(text);
    }
    public static class CyrillicToLatin
    {
        private static readonly Dictionary<string, string> ExclusionList = new()
            {
                { "ОТР COMPANY", "OTP COMPANY" }
            };

        private static readonly Dictionary<char, string> LatinType = new()
        {
            {'А', "A"},
            {'В', "B"},
            {'Е', "E"},
            {'К', "K"},
            {'М', "M"},
            {'Н', "H"},
            {'О', "O"},
            {'Р', "P"},
            {'С', "C"},
            {'Т', "T"},
            {'У', "Y"},
            {'Х', "X"}
        };

        private static readonly Dictionary<char, string> CyrillicType = new()
        {
            { 'А', "A" },
            { 'Б', "B" },
            { 'В', "V" },
            { 'Г', "G" },
            { 'Д', "D" },
            { 'Ђ', "Đ" },
            { 'Е', "E" },
            { 'Ж', "Ž" },
            { 'З', "Z" },
            { 'И', "I" },
            { 'Ј', "J" },
            { 'К', "K" },
            { 'Л', "L" },
            { 'Љ', "Lj" },
            { 'М', "M" },
            { 'Н', "N" },
            { 'Њ', "Nj" },
            { 'О', "O" },
            { 'П', "P" },
            { 'Р', "R" },
            { 'С', "S" },
            { 'Т', "T" },
            { 'Ћ', "Ć" },
            { 'У', "U" },
            { 'Ф', "F" },
            { 'Х', "H" },
            { 'Ц', "C" },
            { 'Ч', "Č" },
            { 'Џ', "Dž" },
            { 'Ш', "Š" },
            { 'а', "a" },
            { 'б', "b" },
            { 'в', "v" },
            { 'г', "g" },
            { 'д', "d" },
            { 'ђ', "đ" },
            { 'е', "e" },
            { 'ж', "ž" },
            { 'з', "z" },
            { 'и', "i" },
            { 'ј', "j" },
            { 'к', "k" },
            { 'л', "l" },
            { 'љ', "lj" },
            { 'м', "m" },
            { 'н', "n" },
            { 'њ', "nj" },
            { 'о', "o" },
            { 'п', "p" },
            { 'р', "r" },
            { 'с', "s" },
            { 'т', "t" },
            { 'ћ', "ć" },
            { 'у', "u" },
            { 'ф', "f" },
            { 'х', "h" },
            { 'ц', "c" },
            { 'ч', "č" },
            { 'џ', "dž" },
            { 'ш', "š" }
        };

        public static string Convert(string text)
        {
            // Apply the exclusion list first
            foreach (KeyValuePair<string, string> pair in ExclusionList)
            {
                text = text.Replace(pair.Key, pair.Value);
            }

            // A word is any run of characters other than these delimiters
            string pattern = @"[^,;()\s]+";

            var sb = new StringBuilder();
            var index = 0;

            foreach (Match match in Regex.Matches(text, pattern))
            {
                // Words that already contain Latin letters get the look-alike table
                var dictionary = IsContainLatin(match.Value) ? LatinType : CyrillicType;
                var word = ConvertWord(match.Value, dictionary);

                // Copy the delimiters between the previous word and this one
                if (index < match.Index)
                {
                    sb.Append(text[index..match.Index]);
                }
                sb.Append(word);
                index = match.Index + match.Length;
            }

            // Copy any delimiters that follow the last word
            if (index < text.Length)
            {
                sb.Append(text[index..]);
            }
            return sb.ToString();
        }

        private static string ConvertWord(string word, Dictionary<char, string> coding)
        {
            var result = new StringBuilder();
            foreach(char c in word)
            {
                string s = c.ToString();
                if (coding.TryGetValue(c, out string val))
                    s = val;
                result.Append(s);
            }
            return result.ToString();
        }

        private static bool IsContainLatin(string s)
        {
            foreach (char c in s)
                if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
                    return true;
            return false;
        }
    }
}

With this code, the text from UPDATE 2 of the question is converted to the following:

…POVERIOC / LENS OBEX DOO/, U SKLADU SA ODREDBOM……IZJAVA
PRIVREDNOG DRUŠTVA / GRADЈEVINSKO PREDUZECE IMPEX LOZNICA / SA
ADRESOM……ZA UGOVOR O OTVARANjU KREDITNE LINIJE SA KOMPANIJOM /
DOWN CITY / I RASPON MESEČNE KAMATNE STOPE…
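
One detail in this output: the word GRADЈEVINSKO appears to still contain the Cyrillic Ј (U+0408), which would be expected because LatinType has no entry for that letter. Below is a minimal sketch of a wider look-alike table, assuming the Cyrillic Ј and the lowercase look-alikes should be treated like the uppercase ones. The class name LookAlikes and the method FixMixedWord are hypothetical and not part of the answer. The keys are written as \u escapes so the Cyrillic and Latin letters cannot be confused in the source file.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical extension of the answer's LatinType table (an assumption, not the author's code).
public static class LookAlikes
{
    // Cyrillic letter (as a \u escape) -> Latin letter with the same shape.
    private static readonly Dictionary<char, char> ToLatin = new Dictionary<char, char>
    {
        ['\u0410'] = 'A',
        ['\u0412'] = 'B',
        ['\u0415'] = 'E',
        ['\u0408'] = 'J', // Cyrillic Je, missing from LatinType
        ['\u041A'] = 'K',
        ['\u041C'] = 'M',
        ['\u041D'] = 'H',
        ['\u041E'] = 'O',
        ['\u0420'] = 'P',
        ['\u0421'] = 'C',
        ['\u0422'] = 'T',
        ['\u0423'] = 'Y',
        ['\u0425'] = 'X',
        // Lowercase look-alikes (assumed to be wanted as well)
        ['\u0430'] = 'a',
        ['\u0435'] = 'e',
        ['\u0458'] = 'j',
        ['\u043E'] = 'o',
        ['\u0440'] = 'p',
        ['\u0441'] = 'c',
        ['\u0443'] = 'y',
        ['\u0445'] = 'x',
    };

    // Replaces look-alike Cyrillic letters in a word that is already known
    // to contain Latin letters; every other character is copied unchanged.
    public static string FixMixedWord(string word)
    {
        var sb = new StringBuilder(word.Length);
        foreach (char c in word)
        {
            sb.Append(ToLatin.TryGetValue(c, out char latin) ? latin : c);
        }
        return sb.ToString();
    }
}
```

In the answer's own code the same effect could probably be had by adding the corresponding entries to LatinType. A word typed entirely in look-alike Cyrillic letters still contains no Latin letter, so it would go through CyrillicType and would need an ExclusionList entry.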

Answered By – Jackdaw

Answer Checked By – Mildred Charles (BugsFixing Admin)
