lianergoist
Posts: 33
Joined: Sat Nov 08, 2014 12:38 pm
Location: Horsens, Denmark

wchar functions

Tue Jun 12, 2018 5:16 pm

Does somebody here know how to use the wchar functions in C?

I have been fighting this all day, and I can't get anything to work. I have read the glibc docs and searched the net, but I'm still stuck...

Well, basically I have a list of words, and I want to sort it by length of words. In fact, I want to write words with 4 letters to '4.txt', words with 5 letters to '5.txt', etc.

First I used fgets and fputs to read and write the files, but that doesn't work, because words like 'æble' make strlen return 5. So I thought I could simply use fgetws, fputws and wcslen, but I get some weird results.

The following code demonstrates the problem. I create a new file:

~/c $ echo "æble" > in.txt
~/c $ ls -l in.txt
-rw-r--r-- 1 pi pi 6 jun 12 18:53 in.txt

Note file size 6 bytes...

I compile the following code:

Code:

#include <stdio.h>
#include <string.h>
#include <wchar.h>

#define MAX_WORD_LEN 30

int main() {

    wchar_t wstring[MAX_WORD_LEN];
   
    size_t wsl;

    FILE *infile, *outfile;

    if ((infile = fopen("in.txt", "r")) == NULL) 
    {
        puts("Error opening infile");   
        return 1;
    }
    
    fgetws(wstring, MAX_WORD_LEN-1, infile);
    wsl = wcsnlen(wstring, MAX_WORD_LEN);
        
    printf("%d\n", (int) wsl);

    fclose(infile);

/*
    if ((outfile = fopen("out.txt", "w")) == NULL) 
    {
        puts("Error opening outfile");   
        return 1;
    }
    
    fputws(wstring, outfile);
    wsl = wcsnlen(wstring, MAX_WORD_LEN);
    printf("%d\n", (int) wsl);

    fclose(outfile);
*/ 

    return 0;
}
- and when I run the program, it returns 2...

If I remove the comments and compile again, it returns 3 and 3... So something is clearly wrong here!

What am I missing?
Thomas Jensen

There are two types of people.
1) Those who can extrapolate from incomplete data

markkuk
Posts: 62
Joined: Thu Mar 22, 2018 1:02 pm

Re: wchar functions

Tue Jun 12, 2018 7:26 pm

lianergoist wrote:
Tue Jun 12, 2018 5:16 pm
The following code demonstrates the problem. I create a new file:

~/c $ echo "æble" > in.txt
~/c $ ls -l in.txt
-rw-r--r-- 1 pi pi 6 jun 12 18:53 in.txt

Note file size 6 bytes...
The text file is UTF-8 encoded: one 2-byte character followed by 4 single-byte characters (remember that "echo" adds a line feed). A C program starts in the "C" locale, and the wide-character I/O functions convert between the bytes in the file and wchar_t according to the current locale, so by default the two bytes of 'æ' are not decoded as one UTF-8 character. You need to use the locale mechanism to tell the program that you are using UTF-8.

Add the following include:

Code:

#include <locale.h>
and call the setlocale() function at the top of main(), before any file I/O calls:

Code:

setlocale(LC_ALL, "");
Now the code works correctly.
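As a minimal sketch of the fix in context, reading the first line of a file and counting its characters could look like the following. The helper names use_environment_locale and first_line_length are made up for this example, and error handling is kept basic.

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

#define MAX_WORD_LEN 30

/* Hypothetical helper: switch from the startup "C" locale to the one
   named by the environment (LANG, LC_ALL, ...). Call it once, before
   any wide-character I/O. Returns 1 on success, 0 on failure. */
int use_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper: returns the number of characters on the first
   line of `path`, not counting the trailing newline, or -1 on error. */
long first_line_length(const char *path)
{
    wchar_t line[MAX_WORD_LEN];
    size_t len;
    FILE *in = fopen(path, "r");

    if (in == NULL)
        return -1;

    /* fgetws stores at most n-1 wide characters plus the terminator,
       so the full array size can be passed. */
    if (fgetws(line, MAX_WORD_LEN, in) == NULL) {
        fclose(in);
        return -1;
    }
    fclose(in);

    len = wcslen(line);
    if (len > 0 && line[len - 1] == L'\n')
        len--;              /* do not count the line feed */

    return (long)len;
}
```

Calling use_environment_locale() at the top of main() and then first_line_length("in.txt") should give 4 for the file above, assuming the environment selects a UTF-8 locale.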

Heater
Posts: 10291
Joined: Tue Jul 17, 2012 3:02 pm

Re: wchar functions

Tue Jun 12, 2018 8:10 pm

markkuk,
The text file is UTF-8 encoded, one 2-byte character followed by 4 single byte characters...
This makes no sense. (Edit: Well OK it does in this particular case)

UTF-8 is a variable-length encoding. Any plain ASCII string, as in a normal C string, is one byte per character. Characters outside ASCII take two, three or four bytes.

Without parsing the whole thing there is no way to tell how many characters are in a given length of UTF-8 bytes.
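Parsing it is not much code, though. Here is a minimal sketch, assuming the input is valid UTF-8 (the function name utf8_codepoints is made up for this example). Every byte that is not a continuation byte of the form 10xxxxxx starts a new code point, so counting those bytes counts the code points.

```c
#include <stddef.h>

/* Hypothetical helper: counts the code points in a NUL-terminated
   string that is assumed to be valid UTF-8. No validation is done. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        /* Continuation bytes have the bit pattern 10xxxxxx;
           every other byte begins a new code point. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

This works without any locale setup, but it counts code points, not what a reader sees as letters: an 'e' followed by the combining acute accent U+0301 counts as two.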

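We might hope that wchar_t, with its 16-bit or 32-bit representation, would let us know how many characters are in a given number of bytes, or vice versa.

Sadly, Unicode is so mind-bendingly complex that even this is not certain: a 16-bit wchar_t needs two units (a surrogate pair) for characters outside the Basic Multilingual Plane, and what looks like one letter on screen can be several code points.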

lianergoist

Re: wchar functions

Wed Jun 13, 2018 4:52 am

markkuk wrote:
Tue Jun 12, 2018 7:26 pm

Code:

setlocale(LC_ALL, "");

Oh, thank you, man. I have seen someone on the net suggest "setlocale(LC_CTYPE, "en.UTF-8")", but that didn't work. This does! Thanks again!
