The vast majority of code samples that work with strings in C / C++ (in any published source) operate on zero-terminated arrays (ASCIZ) of char elements (C style), or on the string container type (C++ style), which is built on top of such arrays.
All of this works wonderfully with strings of Latin (English) characters, but it can go wrong on strings containing characters of other alphabets (Russian, Chinese, Arabic, Hebrew). Things are not that simple there ... and they are very poorly described in the literature, which is understandable: English-speaking authors pay little attention to localization for other languages, and domestic authors, who for the most part rewrite and adapt English-language publications, do not pay attention to this aspect either.
C is a very old programming language, and C++ inherits its formats from it and is constrained by the requirement of syntactic compatibility with C. To avoid problems with such strings (called localized strings) in C / C++, you need to understand what is going on with these localizations ...
Historically, characters (char) were defined by the ASCII standard (1963) as the low-order 7 bits of one byte, while the high-order 8th bit was intended for detecting errors that occurred during data transfer.
This encoding allows only 128 different characters, and that number is barely enough for the letters of the English alphabet (upper and lower case), the digits (codes 0x30-0x39), the control characters (codes below 0x20) and the special characters. To represent national alphabets, an alternate character table was introduced, such as KOI-7 for the Russian language.
Stream input-output switched to the alternate table when a control character with code 0x12 (decimal 18, named Device Control 2) appeared in the stream, and returned to the main ASCII table on the character with code 0x11 (decimal 17, Device Control 1).
Later, from the mid-80s, as the IBM PC became widespread and displaced other families of computers, the ASCII standard was extended to use the 8th bit of the char byte, so that a byte could represent 256 characters: the lower 128 codes hold the original ASCII table (with the Latin script), and the upper 128 codes hold a national alphabet.
But since national alphabets are diverse, supporting each of them required introducing a code page. For the Russian language, for example, this might be the page CP-866 (in MS-DOS), CP-1251 (in Windows) or KOI8-R (in UNIX, Linux), and each of these pages arranges the Russian characters in its own order, different from the others. As a result, to display (or decode) any localized character string correctly, you need to know the code page in which it is represented.
To put an end to this Babel of code pages, the Unicode standard was proposed (1991), in which each character is assigned a code that fits in a 32-bit value (4 bytes, although not all 32-bit values are valid). This standard makes it possible to encode a huge number of characters from different writing systems.
A document encoded according to the Unicode standard may contain, in a single text, Japanese and Chinese characters, Latin letters, Cyrillic, Greek letters (α, ε, π, λ, φ, Ω ...), mathematical symbols, musical notation, and the scripts of extinct, rare or exotic peoples. There is no need to switch code pages. For example, here are some of the characters of the language designated as Sinhala:
ඣ, ඤ, ඥ
The first Unicode standard was released in 1991. The latest one at the time of writing was released in 2017 and describes 136755 different characters.
But Unicode is still only a standard that assigns a number to each character. To represent these characters in a particular operating system (or programming language), a character encoding scheme for Unicode is also needed.
The following encoding schemes are widely used:
- UTF-32 - each character is represented by 4 bytes holding the numerical value of its Unicode code directly.
- UTF-16 - the most commonly used characters (the first 65536 positions) are represented by 2 bytes, and the rest by so-called "surrogate pairs" (4 bytes). This encoding is used in Windows operating systems starting with Windows NT.
- UTF-8 - each character is represented by a variable-length sequence of bytes: from 1 byte for the characters of the main ASCII table up to 4 bytes for rarely used characters under the current standard (the original design allowed up to 6); characters of the Russian alphabet are encoded with 2 bytes. This encoding was created later than the others, in 1992, by Ken Thompson and Rob Pike with colleagues for the Plan 9 operating system (it was also used in Inferno), and it was adopted as the default text encoding in later programming languages such as Python and Go. Today this encoding is used everywhere in POSIX / UNIX operating systems and Linux.
Returning to the fact that C / C++ is an old family of programming languages: to represent localized characters in them, a new data type had to be introduced, the wide character wchar_t, used instead of char (the type itself appeared in the C89 standard, but full library support came later, with the 1995 amendment and then the C99 standard). In place of the C library string functions of the form str*(), full counterparts named wcs*() are offered for wide characters (the prefix str is replaced with wcs). The width of wchar_t may differ between systems (32 bits in Linux, 16 bits in Windows), but for code written against the wcs*() functions this usually makes no difference to the programmer.
```c
#include <stdio.h>
#include <wchar.h>

int main( void ) {
   printf( "size of wchar_t in your implementation = %d bytes\n", (int)sizeof( wchar_t ) );
}
```

```
$ ./0
size of wchar_t in your implementation = 4 bytes
```
For working with and converting multi-byte sequences written in UTF-8 encoding, C / C++ introduced the mb*() family of functions: mbtowc(), mblen(), mbstowcs(), wcstombs() and so on. This is the mechanism for converting in both directions between char[] arrays (which is how UTF-8 strings are also represented) and wchar_t[] arrays. If you never deal with UTF-8 encoding (which is likely on Windows), then you will not need this group of functions.
Similarly, alongside the C++ container class string, a similar container for wide characters, wstring, was introduced.
The specific techniques for working with wide localized strings will be discussed in the next article. In the meantime, here is a first elementary example ... without comment, as food for thought (notice, and try to explain, that the number of bytes returned by strlen() does not match the visually apparent number of letters in the string):
```c
#include <stdio.h>
#include <string.h>
int main( void ) {
   char str[] = "Привет, 世界";
   printf( "%s [%d bytes]\n", str, (int)strlen( str ) );
}
```

```
$ ./1
Привет, 世界 [20 bytes]
```
P.S. Those who want much more detail about localization in C / C++ and about localized strings can read about it here: Language localization of the C / C ++ - the explanation there runs to more than 22 pages in office document format.
Some peculiarities of strings with Russian text, which are not covered when char strings are discussed in general, will be examined in the following articles continuing this topic.
To be continued…
As I understand it, a typo has crept into the text, and it concerns the descriptions of the UTF-8, UTF-16 and UTF-32 sequences.
That is how I understand it too! OK!
Guys, thank you! Fixed.
As I understand it, the author of the article confused the descriptions of UTF-8 and UTF-32.