Decoding strings with an unknown encoding

Unicode SnowmanWe’ve all been there, the system is humming along when you bring up the UI only to see “Espa?ol” staring you back in the face. What is this? Why is there a question mark in there? Well, you’ve been bitten by they mystical world of unicode. Usually this issue is an easy fix; always use UTF-8 encoding. However, what to do if you don’t own the whole code path? What if you have to support poorly written third party client libraries? What if you have client libraries in the wild that will never be updated? What if these client libraries out right lie about what encoding they are using? Follow me and we’ll find out.

Bytes To String

On the JVM there are two standard ways of converting bytes to Strings; new String(bytes[], encoding) and using the NIO Charset. Since they both accomplish the same feat my decision came down to performance. Luckily someone else did the heavy lifting and figured out that new String(bytes[], encoding) ekes out a small win over NIO Charset. However, this option poses a challenge. How do we know if the decoding succeeded? The NIO option can throw an Exception (effective but slow) or insert a Unicode Replacement Character (easy to find with a String.contains()) if it encounters a some bytes that cannot be decoded . new String(bytes[], encoding) does no such thing. It blindly decodes characters and will output random garbage in the resulting string if the decoding chokes on some bad byte values. We need a way to find those garbage characters.

Regex for non unicode characters

The clue that got me on the right track was an odd looking pattern in the javadoc for Pattern.

[\p{L}&&[^\p{Lu}]] Any letter except an uppercase letter (subtraction)

It seemed that \p{L} was some magical regex incantation for any unicode letter and after some additional searching it appeared that that was exactly the case. Of course what we really want to find are characters that are not unicode letters, spaces, punctuation or digits. Lucky for us there are matching groups for all of these, leading us to this regex:

[^\p{L}\p{Space}\p{Punct}\p{Digit}]

Performance Testing

Since performance is of particular concern it was important to test the overhead of this regex check. I fired up the Scala REPL and tested the using new String(bytes[], encoding) with the above regex as compared to NIO Coded using String.contains() to check for the replacement character. After all that work it turns out that the regex was significantly more expensive than String.contains(). So much so that the NIO code was about 2x as fast. So, in the end, I ended up going with the simpler NIO option.

The Code

import io.Codec
import java.nio.ByteBuffer

val UTF8 = "UTF-8"
val ISO8859 = "ISO-8859-1"
val REPLACEMENT_CHAR = '\uFFFD'

def bytesToString(bytes: Array[Byte], encoding: String) = {
  val upper = encoding.toUpperCase
  val codec = if (ISO8859 == upper) Codec.ISO8859 else Codec.UTF8
  val decoded = codec.decode(ByteBuffer.wrap(bytes)).toString
  if (!decoded.contains(REPLACEMENT_CHAR)) {
    decoded
  } else {
    val otherCodec = if (ISO8859 == upper) Codec.UTF8 else Codec.ISO8859
    val otherDecoded = otherCodec.decode(ByteBuffer.wrap(bytes)).toString
    if (!otherDecoded.contains(REPLACEMENT_CHAR)) {
      otherDecoded
    } else {
      val utf8 = if (ISO8859 == upper) otherDecoded else decoded
      utf8.replace(REPLACEMENT_CHAR, '?')
    }
  }
}

Update

Thanks to Joni Salonen for pointing out that I was wrong about new String() not inserting the unicode replacement char. In light of that info the following code is just a touch faster.

val UTF8 = "UTF-8"
val ISO8859 = "ISO-8859-1"
val REPLACEMENT_CHAR = '\uFFFD'
def bytesToString(bytes: Array[Byte], encoding: String) = {
  val upper = encoding.toUpperCase
  val firstEncoding = if (ISO8859 == upper) ISO8859 else UTF8
  val firstDecoded = new String(bytes, firstEncoding)
  if (!firstDecoded.contains(REPLECEMENT_CHAR)) {
    firstDecoded
  } else {
    val secondEncoding = if (ISO8859 == upper) UTF8 else ISO8859
    val secondDecoded = new String(bytes, secondEncoding)
    if (!secondDecoded.contains(REPLECEMENT_CHAR)) {
      secondDecoded
    } else {
      val utf8 = if (ISO8859 == upper) secondDecoded else firstDecoded
      utf8.replace(REPLECEMENT_CHAR, '?')
    }
  }
}

Follow

Get every new post delivered to your Inbox.