[bug-gawk] GAWK for Windows does not work properly with UTF-8

Discussion:

Marc de Bourget

2016-02-11 11:46:21 UTC

I use this version:
http://sourceforge.net/projects/ezwinports/files/gawk-4.1.3-w32-bin.zip/download

Problem: This GAWK for Windows version counts bytes instead of characters.
Céline has 6 characters but 7 bytes due to the multibyte character "é".

The length function for the string "Céline" should result in 6 but it is 7.
Using gawk for Windows with UTF-8 produces wrong results for at least the
functions length, substr, index, match, split("Céline", CHARS, ""), printf,
sprintf.

Setting the environment variable LC_ALL in a DOS batch file doesn't
help:
celine.bat:
SET LC_ALL=en_US.UTF-8
gawk -f celine.awk
Content of celine.awk:
BEGIN {
    test = "Céline"
    print length(test)        # should print 6, prints 7 with this build
    print substr(test,2,1)    # should print the second character, "é"
    print "|" sprintf("%-12.12s", test) "|"
}

Eli Zaretskii

2016-02-11 20:46:42 UTC

Date: Thu, 11 Feb 2016 12:46:21 +0100
http://sourceforge.net/projects/ezwinports/files/gawk-4.1.3-w32-bin.zip/download
Problem: This GAWK for Windows version counts bytes instead of characters.
Céline has 6 characters but 7 bytes due to the multibyte character "é".
The length function for the string "Céline" should result in 6 but it is 7.
Using gawk for Windows with UTF-8 produces wrong results for at least the functions length, substr, index,
match, split("Céline", CHARS, ""), printf, sprintf.

MS-Windows doesn't support UTF-8 as the locale's codeset, so sadly you
cannot do this, as long as Gawk uses libc functions such as mbrtowc,
and as long as it uses wchar_t as the type to hold Unicode codepoints
in scalar values.

In fact, because Gawk relies on the system's locale support, Gawk
programs that manipulate non-ASCII characters cannot be fully
portable, in the sense that they will produce the same output given
the same input on all supported platforms, even on those that do have
UTF-8 locales.

So the only way you can write a portable Gawk program that does TRT
with UTF-8 is to implement the functionality in Awk (or hack Gawk's C
implementation, if you want).

Sorry.
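
One way to read the "implement the functionality in Awk" suggestion is a character counter that ignores the locale entirely. The following is a minimal sketch, not something taken from Gawk itself: the names utf8_length, _utf8_cont and _utf8_cont_ready are hypothetical, and the input is assumed to be valid UTF-8. It counts every byte that is not a UTF-8 continuation byte (0x80..0xBF), and falls back to the built-in length() when the runtime already treats a UTF-8 sequence as one character.

```awk
# Sketch only: utf8_length() is a hypothetical helper, not part of gawk.
# Assumes s contains valid UTF-8.
function utf8_length(s,    i, n, count) {
    # "\303\251" is the UTF-8 encoding of é. If the runtime sees it as one
    # character (e.g. gawk in a UTF-8 locale), length() is already correct.
    if (length("\303\251") == 1)
        return length(s)

    # Otherwise substr(s, i, 1) is a single byte. Build, once, the set of
    # continuation bytes 0x80..0xBF.
    if (!_utf8_cont_ready) {
        for (i = 128; i <= 191; i++)
            _utf8_cont[sprintf("%c", i)] = 1
        _utf8_cont_ready = 1
    }

    # Every byte that is not a continuation byte starts a new character.
    n = length(s)
    count = 0
    for (i = 1; i <= n; i++)
        if (!(substr(s, i, 1) in _utf8_cont))
            count++
    return count
}

BEGIN {
    # Octal escapes keep the example independent of the file's encoding.
    print utf8_length("C\303\251line")    # prints 6
}
```

The octal escapes are used on purpose: a literal "é" in the script would itself be subject to whatever encoding the script file was saved in.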

Eli Zaretskii

2016-02-12 07:31:58 UTC

Date: Fri, 12 Feb 2016 07:31:18 +0100
Thank you, Eli. UTF-8 is the future

Please tell that to the Microsoft bunch, so that they include support for
that in Windows.

so there should be native UTF-8 support with GAWK in the near future.

Patches are welcome, of course. But I very much doubt that something
like that will happen any time soon, unless sweeping changes are made
in Gawk, as part of a larger effort of making Gawk programs that use
non-ASCII characters portable. Right now, Gawk only supports the
non-ASCII characters recognizable and supported by the current
locale's codeset.

People on the newsgroup say that the Cygwin version works correctly with UTF-8 for Windows, is it true?

Cygwin pretends to be living in the UTF-8 locale, that's true.

Eli Zaretskii

2016-02-15 03:34:38 UTC

Date: Sun, 14 Feb 2016 23:25:06 +0100
The issue can be solved by counting characters instead of bytes.

Gawk does that by using mbrlen, but that function uses the current
locale to determine how many bytes constitute a character.
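
In UTF-8 the number of bytes in a character can be read off its first byte, with no locale involved. The sketch below uses that to split a string into characters, roughly what split(s, CHARS, "") is expected to do. It is an illustration under assumptions, not Gawk's implementation: utf8_seq_len, utf8_chars and the _utf8_ord table are hypothetical names, the input is assumed to be valid UTF-8, and the byte-wise branch assumes a runtime where substr() and sprintf("%c") work on single bytes.

```awk
# Sketch only: helper names are hypothetical; assumes valid UTF-8 input.

# Length in bytes of the sequence that starts with lead byte value b.
function utf8_seq_len(b) {
    if (b < 128)  return 1    # 0xxxxxxx: ASCII
    if (b >= 240) return 4    # 11110xxx
    if (b >= 224) return 3    # 1110xxxx
    if (b >= 192) return 2    # 110xxxxx
    return 1                  # stray continuation byte: step over it
}

# Fill chars[1..n] with the characters of s and return n.
function utf8_chars(s, chars,    i, n, len, step, bytewise) {
    split("", chars)          # clear the output array

    # The two-byte sequence for é has length 2 in a byte-oriented runtime
    # and length 1 in a runtime that already understands UTF-8.
    bytewise = (length("\303\251") == 2)
    if (bytewise && !_utf8_ready) {
        # Map each single byte to its numeric value.
        for (i = 1; i < 256; i++)
            _utf8_ord[sprintf("%c", i)] = i
        _utf8_ready = 1
    }

    len = length(s)
    n = 0
    for (i = 1; i <= len; i += step) {
        step = bytewise ? utf8_seq_len(_utf8_ord[substr(s, i, 1)]) : 1
        chars[++n] = substr(s, i, step)
    }
    return n
}

BEGIN {
    n = utf8_chars("C\303\251line", ch)
    print n                   # prints 6
}
```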

BTW, what is TRT?

The Right Thing.

Eli Zaretskii

2016-02-17 19:52:46 UTC

Date: Tue, 16 Feb 2016 23:10:31 +0100
Hello Eli,

(Let's not make this a private conversation; please keep the list on
the CC.)

BEGIN {
    test = "Céline"
    if (test ~ /[àá]/)
        print "found"
    else
        print "not found"
    # => Wrong result: found
}
It seems multibyte characters can't be used in character classes correctly?
I am trying to understand why it doesn't work.

It doesn't work because the Windows library functions that Gawk uses
to support non-ASCII characters interpret the bytes in your program
assuming they are encoded in the current locale's codeset, which I'm
guessing is some Windows codepage. These functions don't know you
actually feed them with UTF-8 multibyte sequences. The UTF-8 encoding
of all the 3 letters you used begins with a 0xC3 byte, so Gawk
produces a false match.
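
To make the false match concrete: in UTF-8, à is the bytes 0xC3 0xA0, á is 0xC3 0xA1, and é is 0xC3 0xA9. The sketch below spells those bytes out as octal escapes and contrasts the bracket expression with an alternation that requires the two bytes in sequence. The helper name has_a_grave_or_acute is hypothetical. This is only an illustration for positive matches; it does not cover negated classes such as [^èé].

```awk
# Sketch only: has_a_grave_or_acute() is a hypothetical helper.
# Octal escapes spell out the UTF-8 bytes, so the example does not depend
# on how this file itself is encoded:
#   à = \303\240    á = \303\241    é = \303\251

# Match à or á as whole byte sequences rather than as a bracket expression.
function has_a_grave_or_acute(s) {
    return s ~ /\303\240|\303\241/
}

BEGIN {
    test = "C\303\251line"

    # Bracket expression: in a single-byte codeset this is the byte set
    # {0xC3, 0xA0, 0xA1}, so the 0xC3 that starts é is enough to match.
    # In a UTF-8 locale the same test reports no match.
    print "bracket:     " ((test ~ /[\303\240\303\241]/) ? "found" : "not found")

    # Alternation: needs 0xC3 0xA0 or 0xC3 0xA1 in that order, which
    # "Céline" does not contain in either kind of locale.
    print "alternation: " (has_a_grave_or_acute(test) ? "found" : "not found")
}
```

The alternation is safe at the byte level because a complete UTF-8 sequence never occurs inside the encoding of a different character.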

Can I solve this otherwise?
I have to use a lot of character classes with accents.

The only way to support UTF-8 encoded characters is to write your own
regexp matching code.

Eli Zaretskii

2016-02-17 20:07:14 UTC

Date: Wed, 17 Feb 2016 20:40:56 +0100
Hello Eli, do you have a little tip on how to use multibyte characters in
character classes correctly, also combined with negation, as in [^èé][a-z]?
I have always thought that pattern matching is no problem with UTF-8,
but character classes seem to be a problem. Are there workarounds?
Can you please help me? THANK YOU VERY MUCH!

I don't see how this could work on Windows, as long as you must encode
the files in UTF-8. One workaround is to recode the files in
something like codepage-1252, and if your system codepage is
different, then use the chcp command to switch to that codepage before
running Gawk. Then I expect the above matching to work as expected.

If you are lucky, and all of the characters you need to match can be
encoded in your system codepage, then that's what I would suggest
doing.

Eli Zaretskii

2016-02-20 08:57:12 UTC

Date: Fri, 19 Feb 2016 18:46:30 -0500
3. You could google how to change the Windows codepage to UTF-8.

This particular suggestion won't help to solve the problem at hand.
Windows doesn't support UTF-8 as the locale's codeset, so all the
library functions used to work with wchar_t "wide characters", such as
mbrtowc, wcslen, etc. will not change their behavior if the codepage
is switched to UTF-8.