Issue
I ran into trouble converting a byte array to Korean characters in Java.
Wikipedia states that 3 bytes are used for each character, but that not all bits are taken into account.
Is there a simple way of converting this very special…format? I don’t want to write loops and counters to keep track of bits and bytes, as that would get messy, and I can’t imagine that there is no simple solution. A native Java library would be perfect, or maybe someone has figured out some smart bit-shift logic.
UPDATE 2:
A working solution has been posted by @DavidConrad below; I was wrong to assume the data was UTF-8 encoded.
UPDATE:
These bytes
[91, -80, -8, -69, -25, 93, 32, -64, -78, -80, -18, -73, -50]
should output this:
[공사] 율곡로
But using
new String(shortStrBytes,"UTF8"); // or
new String(shortStrBytes,StandardCharsets.UTF_8);
turns them to this:
[����] �����
The returned string has 50% more characters than expected (12 instead of 8).
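The replacement characters above suggest the bytes are simply not valid UTF-8. As a minimal sketch of how to check that (the class name `StrictDecode` and the helper `isValid` are made up for illustration), a `CharsetDecoder` can be told to report malformed input instead of silently substituting `�` the way `new String(bytes, charset)` does:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    // Hypothetical helper: true if the bytes decode cleanly in the given charset.
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of inserting U+FFFD
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] data = {91, -80, -8, -69, -25, 93, 32, -64, -78, -80, -18, -73, -50};
        // The second byte (0xB0) is a UTF-8 continuation byte with no lead byte before it.
        System.out.println(isValid(data, StandardCharsets.UTF_8)); // false
    }
}
```

A `false` result here only rules UTF-8 out; it does not say which encoding the data actually uses.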
Solution
Since you added the bytes to the question, I have done a little research and some experimenting, and I believe that the text you have is encoded as EUC-KR. I got the expected Korean characters when interpreting them as that encoding.
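import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HexFormat; // requires Java 17 or later

// convert bytes to a Java String, decoding them as EUC-KR
byte[] data = {91, -80, -8, -69, -25, 93, 32, -64, -78, -80, -18, -73, -50};
String str = new String(data, Charset.forName("EUC-KR")); // Charset overload avoids the checked UnsupportedEncodingException
// now convert String to UTF-8 bytes
byte[] utf8 = str.getBytes(StandardCharsets.UTF_8);
System.out.println(HexFormat.ofDelimiter(" ").formatHex(utf8));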
This prints the following hexadecimal values:
5b ea b3 b5 ec 82 ac 5d 20 ec 9c a8 ea b3 a1 eb a1 9c
This is the proper UTF-8 encoding of those Korean characters, and in a terminal that supports them, printing the string should display them properly, too.
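To tie this back to the byte counts in the question, here is a small sketch (the class `Transcode` and its `transcode` method are illustrative names, not part of any library) that re-encodes the bytes and compares sizes. It assumes the JDK in use ships the EUC-KR charset, which standard JDK builds normally do:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcode {
    // Hypothetical helper: re-encode bytes from one charset to another via a String.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        byte[] eucKr = {91, -80, -8, -69, -25, 93, 32, -64, -78, -80, -18, -73, -50};
        byte[] utf8 = transcode(eucKr, Charset.forName("EUC-KR"), StandardCharsets.UTF_8);
        String text = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(eucKr.length);  // 13 bytes: 2 per Hangul syllable in EUC-KR
        System.out.println(utf8.length);   // 18 bytes: 3 per Hangul syllable in UTF-8
        System.out.println(text.length()); // 8 characters either way
    }
}
```

The 3-bytes-per-character figure from Wikipedia applies to UTF-8; the original data uses 2 bytes per Hangul syllable, which is why the five syllables plus three ASCII characters fit into 13 bytes.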
Answered By – David Conrad
Answer Checked By – Senaida (BugsFixing Volunteer)