Feb 8, 2008

.Net Encoding Simplicity

Recently I was asked why this function does not work. OK, it does work when the browser encoding is Unicode, but it does not when it is Shift JIS.
Here is the code:

public static string Convert(string input, Encoding source, Encoding destination)
{
    if (String.IsNullOrEmpty(input))
    {
        return String.Empty;
    }

    return destination.GetString(
        source.GetBytes(input)
    );
}

"Well..." I thought, "it is not supposed to work." In fact, misunderstanding how the CLR manages strings is a popular mistake. The fact that is almost always missed is that all strings in .NET are Unicode (UTF-16) encoded (I suppose the same is true in Java).

I wonder why there are "source" and "destination" encodings at all...
When we write something like myEncoding.GetBytes(myStringHere), we are actually saying the following to myEncoding: "hey, give me the real bytes of this string as if it were encoded with myEncoding". What it actually does is try to map the string's Unicode characters to myEncoding bytes using the GetBytes(char[], charsQuantity, byte[], bytesQuantity, Encoder) method (this method is actually unsafe and looks different from what I have written here, but anyway...).
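To make this concrete, here is a small sketch of what GetBytes gives back for one and the same string. The class name GetBytesDemo and the Hex helper are made up for this illustration, and UTF-8 is used next to Unicode only because both encodings are built in.

```csharp
using System;
using System.Text;

public static class GetBytesDemo
{
    // Hypothetical helper: formats a byte array as hex, e.g. "C3-9F".
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }

    public static void Main()
    {
        string text = "\u00df"; // a single char, LATIN SMALL LETTER SHARP S

        // The string itself never changes; only the bytes we ask for differ.
        Console.WriteLine(Hex(Encoding.Unicode.GetBytes(text))); // DF-00 (UTF-16, little endian)
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes(text)));    // C3-9F
    }
}
```

So the encoding passed to GetBytes describes the bytes that come out, not the string that goes in.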

But the most interesting part here is the GetString method. When we call it (myAnotherEncoding.GetString), we are saying the following: "hey, create me a Unicode string from this byte array, assuming that these bytes are myAnotherEncoding bytes". In the function above that assumption is certainly not true: just look at the method, it takes two different (or possibly different) encodings. Here is what we are doing, for example, when we use the following scheme:

input string = "my input string with unicode \u00df or something"
source encoding = Unicode
destination encoding = Shift JIS (I know of two at least...)

We try to do the following:
Unicode string -> Unicode bytes -> Shift JIS produces a Unicode string from the Unicode bytes, assuming that these bytes are Shift JIS bytes...
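The effect is easy to reproduce. In this sketch UTF-8 stands in for Shift JIS as the destination (my substitution, since Shift JIS may need extra code page setup on some runtimes); the class name WrongConvertDemo is made up, and Convert is the function from above, unchanged.

```csharp
using System;
using System.Text;

public static class WrongConvertDemo
{
    // The function from the post, unchanged.
    public static string Convert(string input, Encoding source, Encoding destination)
    {
        if (String.IsNullOrEmpty(input))
        {
            return String.Empty;
        }

        return destination.GetString(source.GetBytes(input));
    }

    public static void Main()
    {
        // "ab" as UTF-16 bytes is 61 00 62 00; read back as UTF-8,
        // every 00 byte turns into a NUL character.
        string result = Convert("ab", Encoding.Unicode, Encoding.UTF8);
        Console.WriteLine(result.Length);      // 4
        Console.WriteLine(result == "a\0b\0"); // True

        // With the same encoding on both sides the bytes simply round-trip,
        // which is why the function seems to work in that case.
        Console.WriteLine(Convert("ab", Encoding.UTF8, Encoding.UTF8) == "ab"); // True
    }
}
```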

That is not very correct, no, it's wrong.
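What does make sense is converting bytes: decode them with the encoding they really are in, then encode the resulting string with the encoding you need. Below is a minimal sketch of that idea, not a drop-in fix for the function above. ByteConvertDemo and Reencode are names I made up; Encoding.Convert is the framework's own static method for the same job.

```csharp
using System;
using System.Text;

public static class ByteConvertDemo
{
    // Hypothetical helper: re-encodes bytes known to be in 'source' into 'destination'.
    public static byte[] Reencode(byte[] data, Encoding source, Encoding destination)
    {
        // Decode with the encoding the bytes really use, then encode again.
        string text = source.GetString(data);
        return destination.GetBytes(text);
    }

    public static void Main()
    {
        byte[] utf8 = { 0xC3, 0x9F }; // "\u00df" in UTF-8

        byte[] utf16 = Reencode(utf8, Encoding.UTF8, Encoding.Unicode);
        Console.WriteLine(BitConverter.ToString(utf16)); // DF-00

        // Encoding.Convert(srcEncoding, dstEncoding, bytes) works on byte arrays directly.
        byte[] viaFramework = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8);
        Console.WriteLine(BitConverter.ToString(viaFramework)); // DF-00
    }
}
```

Note that both the input and the output here are byte arrays. A string has no "source encoding" to convert from.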

For a better understanding of what actually happens when you use encodings and strings, I suggest looking at the following links (they do not give advice on what to do when, but they provide a solid background for working with fewer issues):
Have a nice day!