Discussion:
Problem with substr() after match() with non-ASCII characters
Janis Papanagnou
2015-08-22 20:33:52 UTC
The issue was observed using GNU awk 4.1.2 and confirmed to show the
same behaviour in GNU awk 4.1.3.

With the attached program 'testprog' applied on the attached data 'testdata'
I do *not* get the expected result of four lines containing "2007" each, but
instead I get:

2007
0703
2007
0071

The problem is caused/triggered by non-ASCII characters in 'testdata'.

Note: I can run 'testprog' with LC_ALL=C and the output is as expected.

My understanding is, though, that the implicit results from the match()
function, RSTART and RLENGTH, should be consistently usable in substr(),
independent of the locale setting.

Thanks!

Janis
Stephane Chazelas
2015-08-23 21:32:12 UTC
2015-08-22 22:33:52 +0200, Janis Papanagnou:
> The issue was observed using GNU awk 4.1.2 and confirmed to show the
> same behaviour in GNU awk 4.1.3.
>
> With the attached program 'testprog' applied on the attached data 'testdata'
> I do *not* get the expected result of four lines containing "2007" each, but
> instead I get:
>
> 2007
> 0703
> 2007
> 0071
>
> The problem is caused/triggered by non-ASCII characters in 'testdata'.
>
> Note: I can run 'testprog' with LC_ALL=C and the output is as expected.
>
> My understanding is, though, that the implicit results from the match()
> function, RSTART and RLENGTH, should be consistently usable in substr(),
> independent of the locale setting.
[...]

Note that in a UTF-8 locale, that testdata is not valid text.
Those bytes don't form valid characters.

While the behaviour would be unspecified by POSIX, here I'd
agree gawk has some inconsistency in that those invalid byte
sequences are considered of length 0 for length, index and
substr but of length 1 for match.

To me, the best approach would be that they be of length 1 all
the time (and that they also match /./ (they don't in GNU tools
in general, they don't even match ? in GNU fnmatch, though they
do in the GNU shell's ?)).

Here though, you should use a locale where that data is valid
text. If you don't know the encoding but don't care and know it's
single-byte, the C locale is a good option.

--
Stephane
Aharon Robbins
2015-08-24 18:47:36 UTC
> To: bug-***@gnu.org
> From: Stephane Chazelas <***@gmail.com>
> Date: Sun, 23 Aug 2015 22:32:12 +0100
> Subject: Re: [bug-gawk] Problem with substr() after match() with non-ASCII
> characters
>
> Note that in a UTF-8 locale, that testdata is not valid text.
> Those bytes don't form valid characters.
>
> While the behaviour would be unspecified by POSIX, here I'd
> agree gawk has some inconsistency in that those invalid byte
> sequences are considered of length 0 for length, index and
> substr but of length 1 for match.

I think it's the other way around: they're 0 for match and 1 for
the others.

I think this patch, which is a bit of a hack, improves things.
It at least gets the "right" results for Janis's data and doesn't
break the test suite.

I will likely push this, or something like it with more comments.

Arnold
------------------------------------------------
diff --git a/node.c b/node.c
index 1741a13..b33a4f6 100644
--- a/node.c
+++ b/node.c
@@ -734,14 +734,20 @@ str2wstr(NODE *n, size_t **ptr)
                 warned = true;
                 lintwarn(_("Invalid multibyte data detected. There may be a mismatch between your data and your locale."));
             }
+            if (using_utf8()) {
+                count = 1;
+                wc = 0xFFFD; /* unicode replacement character */
+                goto got_wc;
+            }
             break;

         case 0:
             count = 1;
             /* fall through */
         default:
-            *wsp++ = wc;
             src_count -= count;
+        got_wc:
+            *wsp++ = wc;
             while (count--) {
                 if (ptr != NULL)
                     (*ptr)[sp - n->stptr] = i;
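
For readers following along, the effect of the two counting rules can be
sketched outside gawk. This is only an illustration, not gawk code: the names
utf8_seq_len and char_pos are hypothetical, the UTF-8 check is simplified
(it ignores overlong forms and surrogates), and invalid_width is an assumed
knob where 0 models skipping an undecodable byte and 1 models counting it as
one character (for example U+FFFD).

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of the UTF-8 sequence starting at s, or 0 if it is not
 * well formed. Simplified check: lead byte plus continuation bytes only. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t need, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;
    else if (s[0] >= 0xC2 && s[0] <= 0xDF)
        need = 2;
    else if (s[0] >= 0xE0 && s[0] <= 0xEF)
        need = 3;
    else if (s[0] >= 0xF0 && s[0] <= 0xF4)
        need = 4;
    else
        return 0;               /* stray continuation or illegal lead byte */
    if (need > n)
        return 0;               /* truncated sequence */
    for (i = 1; i < need; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;
    return need;
}

/* 1-based character position (as awk counts) of the byte at byte_off.
 * invalid_width: 0 = invalid bytes are skipped, 1 = they count as one char. */
size_t char_pos(const char *str, size_t len, size_t byte_off, int invalid_width)
{
    const unsigned char *s = (const unsigned char *) str;
    size_t i = 0, pos = 1;

    assert(invalid_width == 0 || invalid_width == 1);
    while (i < len && i < byte_off) {
        size_t k = utf8_seq_len(s + i, len - i);

        if (k == 0) {           /* undecodable byte: apply the chosen rule */
            pos += (size_t) invalid_width;
            i += 1;
        } else {                /* one well-formed character */
            pos += 1;
            i += k;
        }
    }
    return pos;
}
```

For the Latin-1 bytes "caf\xE9 2007", the byte offset 5 of "2007" maps to
position 6 when the stray byte counts as one character and to position 5 when
it is skipped. If one function uses the first rule and another the second,
their positions disagree by the number of invalid bytes seen so far. For valid
UTF-8 input both rules give the same answer.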
Hermann Peifer
2015-08-24 19:26:54 UTC
> lintwarn(_("Invalid multibyte data detected. There may be a mismatch between your data and your locale."));

Could it be an idea to promote this lintwarn to a real warning?

Hermann
Aharon Robbins
2015-09-01 03:16:28 UTC
> > lintwarn(_("Invalid multibyte data detected. There may be a mismatch between your data and your locale."));
>
> Could it be an idea to promote this lintwarn to a real warning ?

I did that.

Thanks,

Arnold
Aharon Robbins
2015-08-24 15:30:48 UTC
Hi.

> From: Janis Papanagnou <***@hotmail.com>
> To: "bug-***@gnu.org" <bug-***@gnu.org>
> Date: Sat, 22 Aug 2015 22:33:52 +0200
> Subject: [bug-gawk] Problem with substr() after match() with non-ASCII
> characters
>
> The issue was observed using GNU awk 4.1.2 and confirmed to show the
> same behaviour in GNU awk 4.1.3.
>
> With the attached program 'testprog' applied on the attached data 'testdata'
> I do *not* get the expected result of four lines containing "2007" each, but
> instead I get:
>
> 2007
> 0703
> 2007
> 0071
>
> The problem is caused/triggered by non-ASCII characters in 'testdata'.
>
> Note: I can run 'testprog' with LC_ALL=C and the output is as expected.

The problem is that you're feeding gawk invalid multibyte data for
the locale you're in. When gawk tries to figure out where, in terms of
characters, the match starts, it gets confused because of this invalid
data.

$ LC_ALL=en_US.UTF-8 gawk --lint -f testprog testdata
2007
gawk: testprog:2: (FILENAME=testdata FNR=2) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
0703
2007
0071

> My understanding is, though, that the implicit results from the match()
> function, RSTART and RLENGTH, should be consistently usable in substr(),
> independent of the locale setting.

*When the data is valid*, this is correct and things work as expected.
In your case, it's Garbage In, Garbage Out. :-(

If there's a way to set the locale to latin-whatever for where you
are, then things will probably work ok. Otherwise, you should use
LC_ALL=C or the -b option.

There really is no way around this; the underlying C library routines
depend on the value of the locale variables in order to interpret
the input data.
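
A small sketch of that dependency, assuming only the standard mbrtowc()
interface. count_chars is a hypothetical helper written for this
illustration, not something from the gawk sources. It counts characters
according to whatever locale is currently in effect and reports how many
bytes could not be decoded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the characters in s[0..n-1] as the current locale sees them.
 * Bytes that mbrtowc() rejects are counted as one character each and
 * tallied in *invalid. */
size_t count_chars(const char *s, size_t n, size_t *invalid)
{
    mbstate_t st;
    size_t i = 0, chars = 0;

    assert(s != NULL && invalid != NULL);
    memset(&st, 0, sizeof st);
    *invalid = 0;
    while (i < n) {
        wchar_t wc;
        size_t k = mbrtowc(&wc, s + i, n - i, &st);

        if (k == (size_t) -1 || k == (size_t) -2) {
            /* Invalid or truncated sequence in this locale. The
             * conversion state is undefined after an error, so reset it. */
            memset(&st, 0, sizeof st);
            (*invalid)++;
            k = 1;
        } else if (k == 0) {
            k = 1;              /* embedded NUL occupies one byte */
        }
        chars++;
        i += k;
    }
    return chars;
}
```

A caller would normally run setlocale(LC_ALL, "") first so that the
environment's locale takes effect. The same Latin-1 byte that decodes cleanly
under a Latin-1 locale would be expected to land in *invalid under a UTF-8
locale, which is why the result cannot be made independent of the locale
variables.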

HTH,

Arnold
Janis Papanagnou
2015-09-15 21:35:58 UTC
Hi Arnold!

Sorry for the late reply; I couldn't access my hotmail account from
abroad because of hotmail's security measures, but now I'm fully online again.

> The problem is that you're feeding gawk invalid multibyte data for
> the locale you're in. When gawk tries to figure out where, in terms of
> characters, the match starts, it gets confused because of this invalid
> data.

Obviously.

My view is that (a) I expect *consistency* in the functions, and (b) I should
be able to process any data (from unknown locales). I can achieve (b) by
the two means I posted, so *functionally* I'm fine now. I think that (a)
should be addressed (i.e. a consistent implementation that does not
"confuse" awk, and let awk's set of functions work with the same "metric").

YMMV; all I can do is report the issue (at your request).

Thanks.

Janis
Eli Zaretskii
2015-09-16 07:10:00 UTC
> From: Janis Papanagnou <***@hotmail.com>
> Date: Tue, 15 Sep 2015 23:35:58 +0200
>
> > The problem is that you're feeding gawk invalid multibyte data for
> > the locale you're in. When gawk tries to figure out where, in terms of
> > characters, the match starts, it gets confused because of this invalid
> > data.
>
> Obviously.
>
> My view is that (a) I expect *consistency* in the functions, and (b) I should
> be able to process any data (from unknown locales). I can achieve (b) by
> the two means I posted, so *functionally* I'm fine now. I think that (a)
> should be addressed (i.e. a consistent implementation that does not
> "confuse" awk, and let awk's set of functions work with the same "metric").

You cannot have locale-independent processing as long as Gawk relies
on locale-dependent functions such as mbrtowc, mbrlen, and strcoll.
If we want to be locale-independent, we need to have
locale-indifferent versions of those functions (and others like
them). And even then, some users will _want_ locale dependency,
e.g. when sorting text or displaying date/time values.

So you are asking for something that is (a) a lot of work, and (b) is
practically an unreachable goal, if you insist on 100% locale
independence.
Janis Papanagnou
2015-09-16 11:40:40 UTC
> Date: Wed, 16 Sep 2015 10:10:00 +0300
> From: ***@gnu.org
> Subject: Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters
> To: ***@hotmail.com
> CC: ***@skeeve.com; bug-***@gnu.org
>
> > From: Janis Papanagnou <***@hotmail.com>
> > Date: Tue, 15 Sep 2015 23:35:58 +0200
> >
> > > The problem is that you're feeding gawk invalid multibyte data for
> > > the locale you're in. When gawk tries to figure out where, in terms of
> > > characters, the match starts, it gets confused because of this invalid
> > > data.
> >
> > Obviously.
> >
> > My view is that (a) I expect *consistency* in the functions, and (b) I should
> > be able to process any data (from unknown locales). I can achieve (b) by
> > the two means I posted, so *functionally* I'm fine now. I think that (a)
> > should be addressed (i.e. a consistent implementation that does not
> > "confuse" awk, and let awk's set of functions work with the same "metric").
>
> You cannot have locale-independent processing as long as Gawk relies
> on locale-dependent functions such as mbrtowc, mbrlen, and strcoll.
> If we want to be locale-independent, we need to have
> locale-indifferent versions of those functions (and others like
> them). And even then, some users will _want_ locale dependency,
> e.g. when sorting text or displaying date/time values.
>
> So you are asking for something that is (a) a lot of work, and (b) is
> practically an unreachable goal, if you insist on 100% locale
> independence.

No. All I was asking for was to remove an inconsistency (or let gawk give
some hint that it has problems operating on the data).

The statement "is a lot of work" needs no reply in context of a bug report;
it's your decision, anyway, what you do and what you don't do (or ignore).

Thanks.

Janis
Hermann Peifer
2015-09-16 16:13:03 UTC
On 2015-09-16 13:40, Janis Papanagnou wrote:
>
> No. All I was asking for was to remove an inconsistency (or let gawk give
> some hint that it has problems operating on the data).
>

Both were done on 25 August, as far as I can see. Hermann

+2015-08-25 Arnold D. Robbins <***@skeeve.com>
+
+ * node.c (str2wstr): Upon finding an invalid character, if
+ using UTF-8, use the replacement character instead of skipping
+ it. Helps match() and other functions work better in the face
+ of unexpected data. Make the lint warning an unconditional
+ warning.

http://git.savannah.gnu.org/cgit/gawk.git/commit/?id=278fe876bb18938803ac1c36b028adb8cef6fe84
Aharon Robbins
2015-09-01 03:15:56 UTC
Just to close this off, I have pushed a patch that makes this behave
better. You should have seen it.

Thanks,

Arnold