Working with localized strings in C++

First, you need to be able to write localized strings: string constants made of wide localized characters, as distinct from conventional char[] strings. To do this, the string literal is preceded by the qualifier L:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main( void ) {
   wchar_t w[] = L"русскоязычная строка";
   setlocale( LC_ALL, "" );
   // %zu is the portable conversion for size_t values
   printf( "%ls [%zu байт, %zu букв]\n", w, sizeof( w ), wcslen( w ) );
}


The result will be:

$ ./2
русскоязычная строка [84 байт, 20 букв]


Note that the string length (the number of characters) is clearly smaller here than the number of bytes allocated for the string: each wchar_t occupies 4 bytes on this system, so 20 characters plus the terminating null take 84 bytes. The ratio may differ on your operating system (the output shown is from Linux), but that does not affect the programming technique.

Such a string can hold, side by side, characters of very different kinds: letters of different languages, special mathematical symbols, the Greek letters common in mathematical notation (α, ε, ι, π, λ, φ, ω ...), musical notes, and so on. As you probably know, Latin characters (the basic ASCII table) can appear in wide character strings just as well, and each of them will also occupy 2 or 4 bytes (depending on the conventions of the operating system) rather than the usual 1 byte.

Let's perform a few operations on Russian strings, but write them (for now) in the traditional form of char arrays:

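#include <string.h>
#include <iostream>
using namespace std;
int main( void ) {
   char s1[] = "это ", s2[] = "фрагменты ",
        s3[] = "русскоязычной ", s4[] = "строки ",
        s5[ 120 ];
   // build the result in s5, the only buffer large enough;
   // appending to s1..s3 would write past the end of those arrays
   strcat( strcat( strcat( strcpy( s5, s1 ), s2 ), s3 ), s4 );
   // strlen() reports the length in bytes
   cout << s5 << " [" << strlen( s5 ) << "]" << endl;
}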


Run it:

$ ./3a 
это фрагменты русскоязычной строки  [66]


It would seem that (nearly) everything works exactly as in the textbook, so why do we need any wide localized strings? But this is a deceptive illusion! The point is that some traditional string functions (strcat(), strcpy(), strdup(), strstr(), etc.) do return correct results. This is because they operate on bytes, byte by byte, without looking into the internal structure of the characters they copy.

But other operations will not work properly (the misleading strlen() result already hints at this): strncpy(), strchr(), strsep(), strtok(), etc. They will produce very unexpected results that are hard to interpret. Look at how a byte-wise string reversal works, and how differently it behaves on an English and a Russian string:

#include <string.h>
#include <iostream>
using namespace std;
char* revers( char* s ) {
   for( int i = 0, j = strlen( s ) - 1; i < j; i++, j-- ) {
      char c = s[ i ];
      s[ i ] = s[ j ];
      s[ j ] = c;
   }
   return s;
}
int main( void ) {
   char se[] = "this is english string",
        sr[] = "это русскоязычная строка";
   cout << revers( se ) << endl << revers( sr ) << endl;
}


It runs, but the result is definitely not what you expected:

$ ./3b
gnirts hsilgne si siht
�коЀтс� �ѰнЇыѷЏѾкЁсур� �Ђэ�


This concludes our discussion of representing Russian strings as traditional char[] arrays and processing them with the traditional string functions. The conclusion of this examination: working with Russian strings as char arrays is possible only:

a) when we use string constants unchanged, only for input and output as they are;

b) or when they are processed by functions (library or our own) that do not take into account the internal structure of the characters, do not look into the content of the strings, and operate on them simply as a meaningless sequence of bytes.

In all other cases, correct work with Cyrillic is possible only with arrays of wide localized characters wchar_t (with the string terminated by the wide null character L'\0'). For working with the localized representation of strings, the C library provides a wide set of string functions fully analogous to the traditional string functions, but with the prefix wcs in their names instead of str: wcslen() instead of strlen(), wcsncpy() instead of strncpy(), etc.

Let's see how this works with an example:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
wchar_t* revers( wchar_t *w ) {
   wchar_t *sec, wb[ 80 ];
   if( ( sec = wcschr( w, L' ' ) ) != NULL ) {
      wcsncpy( wb, w, sec - w )[ sec - w ] = L'\0';
      while( L' ' == *sec ) sec++;
      revers( sec );
      wcscat( wcscat( wmemmove( w, sec, wcslen( sec ) + 1 ), L" " ), wb );
   }
   return w;
}

int main( void ) {
   setlocale( LC_ALL, "" );  // conversions work only after this call
   wchar_t ws[] = L"тестовая русскоязычная строка из нескольких слов    ";
   while( L' ' == ws[ wcslen( ws ) - 1 ] )
      ws[ wcslen( ws ) - 1 ] = L'\0';
   printf( "устранение завершающих пробелов: '%ls'\n", ws );
   printf( "1-е реверсирование слов: '%ls'\n", revers( ws ) );
   printf( "2-е реверсирование слов: '%ls'\n", revers( ws ) );
}


$ ./4
устранение завершающих пробелов: 'тестовая русскоязычная строка из нескольких слов'
1-е реверсирование слов: 'слов нескольких из строка русскоязычная тестовая'
2-е реверсирование слов: 'тестовая русскоязычная строка из нескольких слов'


The word-reversal example above is enough to see the direct analogy with the functions that manipulate wchar_t characters. Anyone with some experience of working with char strings will effortlessly extend it to wide strings. Setting the locale (the setlocale() call) for the output device (terminal) is mandatory, because a C/C++ program starts in the default locale "C" (for historical reasons), which allows output of only the 128 characters of the lower half of the 8-bit table, that is, ASCII.

In this listing the function sets the locale that the operating system uses by default; I am assuming that we are experimenting on a system installed with Russian as its language. The C99 language standard also introduces a new conversion for the string formatting functions (printf(), sprintf()): %ls, the format for wchar_t[] strings.

Just as with char arrays in the move from C to C++, the C++ library introduces a complete analogue of the string container class, but one that holds wide localized characters; it is known as the wstring class:

#include <locale>
#include <iostream>
using namespace std;
int main( void ) {
   locale::global( locale( "" ) );
   wstring ws = L"строка";
   wcout << ws << endl;
}


Here the string of localized characters (ws) must be written to the output stream wcout (similar in meaning to cout, but a distinct stream).

In this listing, locale::global( locale( "" ) ) sets the default locale in the C++ object-oriented way, analogous to what was shown earlier in the C manner.

Input and output of wide character strings (to a terminal or to a file) is a separate, complicated subject, so its consideration is deferred to a separate note.
