Unicode and .NET

Original author: Jon Skeet.
From the translator. Habr has already published articles both on Unicode and on strings in .NET, but there has not yet been an article about Unicode in .NET specifically, so I decided to translate this article by the recognized .NET guru Jon Skeet. It completes the promised cycle of three translated articles by Jon Skeet devoted to strings in .NET. As always, I will be glad to receive comments and corrections.

Introduction


The topic of this article is quite broad, so do not expect a detailed and deep analysis of every nuance. If you already have a good grasp of Unicode, encodings and so on, this article may be of little or no use to you. However, quite a lot of people do not understand the difference between binary and text data, or what a character encoding is. It is for such people that this article was written. Although the description is, on the whole, fairly superficial, it touches on some difficult points; these are mentioned mostly so that the reader is aware of their existence, rather than to give detailed explanations and instructions.

Resources


The links below are at least as useful as this article, and possibly more so. I used them myself while writing it. They contain a lot of useful, high-quality material, and if you notice any inaccuracies here, those resources are more likely to be correct.



Binary and text data are two different things.


Most modern programming languages (as well as some older ones) draw a clear line between binary content and character (or text) content. Although the difference is understood instinctively, I will nevertheless give a definition.

Binary data is a sequence of octets (an octet consists of 8 bits) with no natural meaning or interpretation attached to them. Even if there is an external "interpretation" of a particular set of octets as, say, an executable file or a graphic image, the data itself is just a set of octets. From here on I will use the term "byte" instead of "octet", although, strictly speaking, not every byte is an octet: there have been computer architectures with 9-bit bytes, for example. Such details are not needed in this context, so hereinafter the term "byte" means an 8-bit byte.

Character (text) data is a sequence of characters.

The Unicode Glossary defines a character as:
  1. The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (such as a glyph), although in code tables some form of visual representation is essential for the reader's understanding.
  2. A synonym for abstract character (see Definition D3 in Section 3.3, Characters and Coded Representations).
  3. The basic unit of encoding in the Unicode character encoding.
  4. The English name for the ideographic written elements of Chinese origin.

This definition may or may not be useful to you, but in most cases you can rely on an intuitive understanding of a character as something like the capital letter "A" or the digit "1". However, there are other characters that are far less intuitive. These include combining characters that are designed to modify other characters (for example, the acute accent), control characters (for example, a newline) and formatting characters (invisible, but affecting surrounding characters). The key point is that text data is a sequence of characters.

Unfortunately, in the not-so-distant past the difference between binary and text data was very blurry. For example, for C programmers the terms "byte" and "character" meant the same thing in most cases. On modern platforms such as .NET and Java, where the distinction between characters and bytes is clear and enforced by the input/output libraries, old habits can have unpleasant consequences (for example, people may try to copy a binary file by reading character strings from it, which will corrupt its contents).

So what is Unicode for?


The Unicode Consortium tries to standardize the handling of character data, including conversions from binary to text and back (called decoding and encoding, respectively). There is also a set of ISO standards (ISO 10646 in various versions) covering the same ground; Unicode and ISO 10646 can be considered the same thing, since they are almost completely compatible. (In theory ISO 10646 defines a wider potential character set, but this is unlikely ever to become a problem.) Most modern programming languages and platforms, including .NET and Java, use Unicode to represent characters.

Unicode defines, among other things:
  • an abstract character repertoire - the set of all characters supported by Unicode;
  • a coded character set - a mapping from each character in the repertoire to a non-negative integer called its code point;
  • several character encoding forms - mappings between code points and sequences of "code units" (simply put, between a code point expressed as a single integer and the group of code units that encodes that number);
  • several character encoding schemes - mappings between sequences of code units and serialized byte sequences.

The distinction between a character encoding form and a character encoding scheme is rather subtle, but the scheme is what takes byte order (endianness) into account. (For example, in UCS-2 the 16-bit code unit 0xC2A9 can be serialized as the bytes 0xC2 0xA9 or as 0xA9 0xC2; that is what the character encoding scheme decides.)
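
To make these layers concrete, here is a small C# sketch of my own (not part of the original article) that traces the copyright sign from code point to serialized bytes; it assumes using directives for System and System.Text:
char copyright = '\u00A9';                                      // the abstract character, code point U+00A9
Console.WriteLine((int)copyright);                              // 169, the code point as a plain integer
byte[] utf16le = Encoding.Unicode.GetBytes("\u00A9");           // encoding scheme: little-endian serialization
byte[] utf16be = Encoding.BigEndianUnicode.GetBytes("\u00A9");  // encoding scheme: big-endian serialization
Console.WriteLine(BitConverter.ToString(utf16le));              // A9-00
Console.WriteLine(BitConverter.ToString(utf16be));              // 00-A9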

The abstract Unicode character repertoire can, in theory, contain up to 1,114,112 characters, although some code points are reserved as unusable and many of the rest will probably never be assigned. Each character is encoded as a non-negative integer from 0 to 1,114,111 (0x10FFFF). For example, the capital letter A is encoded as the decimal number 65. A few years ago it was believed that all characters would fit into the range from 0 to 2^16 - 1, which meant that any character could be represented with two bytes. Unfortunately, more characters were needed over time, which led to the appearance of so-called surrogate pairs. These make everything considerably more complicated (at least for me), and therefore most of this article does not deal with them; I describe them briefly in the "Difficult moments" section.

So what does .NET provide?


Do not worry if all of the above looks strange. You should be aware of the distinctions described above, but in practice they do not often come to the fore. Most of your tasks will most likely revolve around converting a set of bytes into text and back. In such situations you will be working with the System.Char structure (aliased as char in C#), the System.String class (string in C#) and the System.Text.Encoding class.

The Char structure is the most basic character type in C#; one instance of Char represents one Unicode character and occupies 2 bytes of memory, which means it can take any value in the range 0-65535. Keep in mind that not every number in this range is a valid Unicode character.

The String class is essentially a sequence of characters. It is immutable, which means that once a string instance has been created it can no longer be changed: the various methods of the String class that look as though they modify its contents actually create and return a new string.

The System.Text.Encoding class provides the means to convert an array of bytes into an array of characters or a string, and vice versa. The class is abstract; various implementations ship with .NET and more can be written by users. (Needing to create your own Encoding implementation is quite rare; in most cases the classes that come with .NET are enough.) Encoding also lets you obtain separate encoders and decoders that maintain state between calls. This is necessary for multi-byte encoding schemes, where it is not always possible to decode all the bytes received from a stream into characters. For example, if a UTF-8 decoder receives the two bytes 0x41 0xC2 as input, it can return only the first character (the capital letter "A"); it needs a third byte to determine the second character.
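
Here is a minimal sketch of that scenario (my own example, assuming using directives for System and System.Text): a UTF-8 Decoder is fed the bytes 0x41 0xC2 first and keeps the incomplete trailing byte as internal state until the rest of the sequence arrives:
Decoder decoder = Encoding.UTF8.GetDecoder();
char[] chars = new char[4];

byte[] firstChunk = { 0x41, 0xC2 };                    // 'A' plus the first byte of a two-byte sequence
int produced = decoder.GetChars(firstChunk, 0, firstChunk.Length, chars, 0);
Console.WriteLine(produced);                           // 1: only 'A'; the 0xC2 is buffered inside the decoder

byte[] secondChunk = { 0xA9 };                         // the rest of the two-byte sequence for U+00A9
produced = decoder.GetChars(secondChunk, 0, secondChunk.Length, chars, 1);
Console.WriteLine(produced);                           // 1: the buffered 0xC2 plus 0xA9 yield the second character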

Built-in encoding schemes


The .NET class library contains various encoding schemes. The following is a description of these schemes and how to use them.

ASCII

ASCII is one of the most common and at the same time one of the most misunderstood character encodings. Contrary to popular belief, ASCII is a 7-bit encoding, not 8-bit: there are no characters with code points above 127. If someone says they are using, say, "ASCII 154", you can assume they do not quite understand what they are doing or saying. As an excuse they may say something about "extended ASCII". There is no scheme called "extended ASCII". There are many 8-bit encodings that are supersets of ASCII, and the term "extended ASCII" is sometimes used to refer to them, which is not really correct. The code point of each ASCII character matches the code point of the corresponding Unicode character: in other words, the ASCII lowercase Latin letter "x" and the same Unicode character are denoted by the same number, 120 (0x78 in hexadecimal). The .NET class ASCIIEncoding (an instance of which is easily obtained through the Encoding.ASCII property) is, in my opinion, a bit strange, since it appears to encode by simply discarding everything above the lowest 7 bits. This means that, for example, the Unicode character 0xB5 (the micro sign, µ), after being encoded to ASCII and decoded back to Unicode, turns into the character 0x35 (the digit "5"). (I would prefer that some special character were produced instead, indicating that the original character was absent from ASCII and has been lost.)
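
You can observe this lossy round trip yourself with the sketch below (mine, not from the original article); note that the exact substitution depends on the framework version, so older versions may show the bit-stripping behaviour described above while later versions substitute a fallback character such as "?":
string micro = "\u00B5";                               // the micro sign, not representable in ASCII
byte[] asciiBytes = Encoding.ASCII.GetBytes(micro);
string roundTripped = Encoding.ASCII.GetString(asciiBytes);
Console.WriteLine(roundTripped);                       // "5" if the high bits were stripped, "?" with a fallback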

UTF-8

UTF-8 is a good and widely used way to represent Unicode characters. Each character is encoded as a sequence of one to four bytes. (All characters with code points below 65536 are encoded in one, two or three bytes; I have not tested how .NET encodes surrogate pairs: as two sequences of 1-3 bytes or as one 4-byte sequence.) UTF-8 can represent every Unicode character and is ASCII-compatible, so any sequence of ASCII characters is transcoded to UTF-8 unchanged (that is, the sequence of bytes representing the characters in ASCII and the sequence of bytes representing the same characters in UTF-8 are identical). Moreover, the first byte of each encoded character is enough to determine how many more bytes, if any, encode the same character. UTF-8 itself does not require a byte order mark (BOM), although a BOM can be used as a way of indicating that the text is in UTF-8. UTF-8 text containing a BOM always starts with the three bytes 0xEF 0xBB 0xBF. To encode a string as UTF-8 in .NET, simply use the Encoding.UTF8 property. In fact, in most cases you do not even need to do that: many classes (including StreamWriter) use UTF-8 by default when no other encoding is explicitly specified. (Make no mistake, Encoding.Default does not apply here; it is something else entirely.) Nevertheless, I advise you always to specify the encoding explicitly in your code, if only for the sake of readability.
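
A short sketch of my own (assuming using directives for System and System.Text) showing the ASCII compatibility, a multi-byte sequence and the optional BOM:
byte[] plainAscii = Encoding.UTF8.GetBytes("ABC");     // 0x41 0x42 0x43, identical to ASCII
byte[] twoBytes = Encoding.UTF8.GetBytes("\u00A9");    // 0xC2 0xA9, the copyright sign needs two bytes
byte[] preamble = Encoding.UTF8.GetPreamble();         // 0xEF 0xBB 0xBF, the optional BOM
Console.WriteLine(BitConverter.ToString(plainAscii));  // 41-42-43
Console.WriteLine(BitConverter.ToString(twoBytes));    // C2-A9
Console.WriteLine(BitConverter.ToString(preamble));    // EF-BB-BF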

UTF-16 and UCS-2

UTF-16 is simply the encoding in which .NET works with characters internally. Each character is represented by a sequence of two bytes; accordingly, a surrogate pair takes 4 bytes. The ability to use surrogate pairs is the only difference between UTF-16 and UCS-2: UCS-2 (also known simply as "Unicode") does not allow surrogate pairs and can only represent characters in the range 0-65535 (0-0xFFFF). UTF-16 can have either byte order: big-endian, little-endian, or machine-dependent with an optional BOM (0xFF 0xFE for little-endian, 0xFE 0xFF for big-endian). In .NET itself, as far as I know, the problem of surrogate pairs is largely ignored, and each Char in a surrogate pair is treated as a character in its own right, which in effect levels the difference between UCS-2 and UTF-16. (A precise understanding of the difference between UCS-2 and UTF-16 requires a much deeper knowledge of surrogate pairs, and I am not competent in that area.) UTF-16 in big-endian form can be obtained using the Encoding.BigEndianUnicode property, and in little-endian form using Encoding.Unicode. Both properties return an instance of the System.Text.UnicodeEncoding class, which can also be created using various constructor overloads: you can specify whether or not to use a BOM and which byte order to use. I believe (although I have not tested this) that when decoding binary content, a BOM present in the content overrides the byte order configured in the encoding, so the programmer does not need to do anything special when decoding content even if its byte order and/or the presence of a BOM is unknown.
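
A small sketch of my own illustrating the constructor overloads and byte order (again assuming using directives for System and System.Text):
UnicodeEncoding littleEndianWithBom = new UnicodeEncoding(false, true);       // bigEndian: false, byteOrderMark: true
UnicodeEncoding bigEndianNoBom = new UnicodeEncoding(true, false);            // bigEndian: true, byteOrderMark: false
Console.WriteLine(BitConverter.ToString(littleEndianWithBom.GetPreamble()));  // FF-FE
Console.WriteLine(BitConverter.ToString(bigEndianNoBom.GetBytes("A")));       // 00-41
Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes("A")));     // 41-00 (little-endian)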

UTF-7

UTF-7 is, in my experience, rarely used, but it allows Unicode (probably only the first 65535 characters) to be transcoded into ASCII characters (not bytes!). This can be useful for email in situations where mail gateways support only ASCII characters, or even only a subset of ASCII (for example, EBCDIC encoding). My description is vague because I have never dug into the details of UTF-7 and do not intend to start now. If you need to use UTF-7, you probably already know enough about it, and if you are not absolutely forced to use it, I advise you not to. An encoding instance for UTF-7 can be obtained through the Encoding.UTF7 property.

Windows / ANSI Code Pages

Windows code pages are usually single-byte or double-byte character sets, encoding up to 256 or 65536 characters respectively. Each code page has its own number, and an encoding for a code page with a known number can be obtained using the static method Encoding.GetEncoding(Int32). Code pages are mostly useful for working with legacy data, which is often stored in the "default code page". The encoding for the default code page can be obtained through the Encoding.Default property. Again, avoid using code pages whenever possible. For more information, consult MSDN.
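
As a hedged sketch (code page 1252, Western European, is just an example of my choosing; on .NET Framework the legacy code pages are available out of the box, while on .NET Core and later you may first need to register the code pages encoding provider):
// Needed on .NET Core / modern .NET only, not on .NET Framework:
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding windows1252 = Encoding.GetEncoding(1252);
byte[] bytes = windows1252.GetBytes("caf\u00E9");      // the accented 'e' becomes the single byte 0xE9
Console.WriteLine(BitConverter.ToString(bytes));       // 63-61-66-E9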

ISO-8859-1 (Latin-1)

As with ASCII, every character in the Latin-1 code page has the same code as the corresponding character in Unicode. I have not bothered to find out whether Latin-1 has a "hole" of unassigned characters with codes from 128 to 159, or whether it contains the same control characters there as Unicode. (I had begun to lean toward the idea of a "hole", but Wikipedia disagrees with me, so I am still thinking. (The author's doubts are unclear, since the Wikipedia article clearly shows the presence of a gap; probably the contents of the article were different when the original was written. Translator's note.)) Latin-1 has code page number 28591, so use Encoding.GetEncoding(28591) to obtain the encoding.

Streams, readers and writers


Streams are binary in nature: they read and write bytes. Anything that accepts a string has to convert it to bytes in some way, and that conversion may or may not be the one you want. The text-oriented equivalents of streams for reading and writing are the abstract classes System.IO.TextReader and System.IO.TextWriter respectively. If you already have a stream, you can use the System.IO.StreamReader class (which inherits TextReader) for reading and the System.IO.StreamWriter class (which inherits TextWriter) for writing, passing the stream and the encoding you need to their constructors. If you do not specify an encoding explicitly, UTF-8 is used by default. Below is an example of code that converts a file from UTF-8 to UCS-2:
using System;
using System.IO;
using System.Text;

public class FileConverter
 {
     const int BufferSize = 8096;
     
     public static void Main(string[] args)
     {
         if (args.Length != 2)
         {
             Console.WriteLine 
                 ("Usage: FileConverter <input file> <output file>");
             return;
         }
         String inputFile = args[0];
         String outputFile = args[1];
         // Open a TextReader to read the existing input file
         using (TextReader input = new StreamReader 
                (new FileStream (inputFile, FileMode.Open),
                 Encoding.UTF8))
         {
             // Open a TextWriter to create and write to the new output file
             using (TextWriter output = new StreamWriter 
                    (new FileStream (outputFile, FileMode.Create),
                     Encoding.Unicode))
             {
                 // Create the buffer
                 char[] buffer = new char[BufferSize];
                 int len;
                 
                 // Copy the data in chunks until the end is reached
                 while ( (len = input.Read (buffer, 0, BufferSize)) > 0)
                 {
                     output.Write (buffer, 0, len);
                 }
             }
         }
     }
 }

Note that the code uses TextReader and TextWriter constructors that take streams. There are other constructor overloads that take file paths, so you do not have to open a FileStream manually; I did so here only as an illustration. There are further overloads that also take a buffer size and whether to emit a BOM; in general, take a look at the documentation. Finally, if you are using .NET 2.0 or later, it is worth looking at the static System.IO.File class, which contains many convenient methods for working with encodings.
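
For example, the whole conversion above collapses to two calls (a sketch with hypothetical file names; unlike the streaming version it reads the entire file into memory, so it is only suitable for reasonably small files):
string text = File.ReadAllText("input.txt", Encoding.UTF8);
File.WriteAllText("output.txt", text, Encoding.Unicode);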

Difficult moments


Okay, those were just the basics of Unicode. There are many other nuances, some of which I have already hinted at, and I believe people should know about them even if they think they will never run into them. I am not offering any general methodology or guidelines here; I am just trying to raise your awareness of possible problems. The list below is by no means exhaustive. It is important to understand that most of the problems and difficulties described are in no way the fault of the Unicode Consortium; as with dates, times and the other problems of internationalization, the "credit" goes to humanity, which has created many fundamentally hard problems for itself over time.

Culture-dependent search, sorting and so on.

These issues are described in my article on strings in .NET (original, translation).

Surrogate pairs

Unicode now contains more than 65536 characters, so they can no longer all fit in 2 bytes. This means that a single instance of the Char structure cannot represent every possible character. UTF-16 (and .NET) solves the problem with surrogate pairs: two 16-bit values, each in the range 0xD800 to 0xDFFF, which together form one "real" character. (UCS-4 and UTF-32 avoid the problem entirely by having a wider range of values available: each character takes 4 bytes, which is enough for everyone.) Surrogate pairs are a headache, because they mean that a string consisting of 10 Char values may actually contain anywhere from 5 to 10 "real" Unicode characters. Fortunately, most applications do not use scientific or mathematical notation or the rarer Han ideographs, so you usually do not need to worry about this much.
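
A small sketch of my own (requiring .NET 2.0 or later for the surrogate helpers on Char) that shows a single "real" character occupying two Char values:
string clef = "\U0001D11E";                            // MUSICAL SYMBOL G CLEF, code point U+1D11E
Console.WriteLine(clef.Length);                        // 2: two UTF-16 code units, one "real" character
Console.WriteLine(char.IsHighSurrogate(clef[0]));      // True
Console.WriteLine(char.IsLowSurrogate(clef[1]));       // True
Console.WriteLine(char.ConvertToUtf32(clef, 0).ToString("X"));  // 1D11E, the original code point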

Modifier characters

Not every Unicode character appears on screen or on paper as a single icon or picture. An accented character, for example, can be represented as two characters: the ordinary unaccented character followed by a combining (modifier) character for the accent. Some graphical interfaces support combining characters and some do not, and how your application should behave depends on what assumptions you make.

Normalization

Partly because of things like combining characters, there can be several ways to represent what is, in some sense, a single character. The accented letter "á", for example, can be represented as the character "a" followed by a combining acute accent, or as a single precomposed character representing the finished accented letter. Character sequences can be normalized to use combining characters wherever possible, or the other way round, to avoid them wherever they can be replaced by a single character. Should your application consider two strings that both contain the accented letter, one represented as two characters and the other as one, to be equal or different? What about sorting? Do the third-party components and libraries you use normalize strings, and do they take such nuances into account at all? These are questions you will have to answer for yourself.
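
String.Normalize (available from .NET 2.0 onwards) gives one possible answer to the equality question; a quick sketch of my own:
string precomposed = "\u00E1";                         // the accented letter as a single precomposed character
string decomposed = "a\u0301";                         // 'a' followed by COMBINING ACUTE ACCENT
Console.WriteLine(precomposed == decomposed);                           // False: ordinal comparison sees different code points
Console.WriteLine(precomposed.Normalize() == decomposed.Normalize());   // True: both normalize to the same form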

Debugging Unicode Issues


This section (in the original it is a separate article; translator's note) describes what to do in one very specific situation: you have some character data (simply put, text) in one place (usually a database), it passes through various steps, layers and components, and is then displayed to the user (usually on a web page). And, unfortunately for you, some characters are displayed incorrectly (mojibake). Given the many steps your text data passes through, the problem can arise in many places. This page will help you find out, simply and reliably, what is "broken" and where.

Step 1: Understand Unicode Basics

Simply put, read the main body of this article. You may also want to look at the links given at the beginning. The point is that without the basics you will have a hard time.

Step 2: try to determine which conversions could occur

If you can work out where things might be breaking, it will be much easier to isolate the problem. Bear in mind that the problem may not lie in extracting and converting text from the storage, but in the fact that already "corrupted" text was put into the storage earlier. (I have had problems like this in the past when, for example, one old application corrupted text both when writing it to and when reading it from the database. The funny thing was that the conversion errors overlapped and compensated for each other, so the output text came out correct. On the whole the application worked fine, but as soon as anything was touched, everything fell apart.) Actions that can "spoil" text include fetching from a database, reading from a file, transferring over a web connection and displaying text on the screen.

Step 3: check the data at each stage

The first rule: do not trust anything that logs character data as a sequence of glyphs (i.e. the characters as they are normally drawn). Instead, log the data as a sequence of character codes. For example, if I have a string containing the word "hello", I will display it as "0068 0065 006C 006C 006F". (Using hexadecimal codes makes it easy to check a character against the code charts.) To do this you need to walk through all the characters in the string and print the code of each one, as the method below does, writing the result to the console:
// Print each character in the string as a four-digit hexadecimal UTF-16 code unit
static void DumpString (string value)
 {
     foreach (char c in value)
     {
         Console.Write("{0:x4} ", (int)c);
     }
     Console.WriteLine();
 }

Your own logging method will differ depending on your environment, but its essence should be the same as shown above. I describe a more advanced way of debugging and logging character data in my article on strings.

The point of this approach is to take all the problems of encodings, fonts and the like out of the picture. The technique can be especially useful when working with specific Unicode characters. If you cannot correctly log the hexadecimal codes of even plain ASCII text, you have bigger problems.

The next step is to make sure you have a test case you can work with. Find a preferably small set of input data on which your application is guaranteed to "fail", make sure you know exactly what the correct result should be, and log the result at every problem area.

Once the problematic string has been logged, you need to check whether it contains what it should. The Unicode code charts web page will help you with this. You can either pick a character set you are sure about or search for characters alphabetically. Verify that each character in the string has the correct value. As soon as you find the place in your application where the character data becomes corrupted, examine it, find the cause of the error and fix it. Having fixed all the errors, make sure the application works correctly.

Conclusion

As with the vast majority of errors in software development, text problems are solved with the universal "divide and conquer" strategy. Once you are confident in every step, you can be confident in the whole. If, while fixing such errors, you run into particularly puzzling and strange manifestations of them, I strongly advise you to cover that section of code with unit tests once it is fixed; they will serve both as documentation of what can happen and as protection against future regressions.

Sources