Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data

From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: "René Scharfe" <l.s.r@web.de>
Cc: Junio C Hamano <gitster@pobox.com>,
	Hamza Mahfooz <someguy@effective-light.com>,
	git@vger.kernel.org
Subject: Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data
Date: Sat, 16 Oct 2021 19:12:47 +0200	[thread overview]
Message-ID: <87zgr8dg8j.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <eddcbe66-b172-90b7-309e-e9ce5b5b44a4@web.de>

On Sat, Oct 16 2021, René Scharfe wrote:

> Am 15.10.21 um 22:03 schrieb Junio C Hamano:
>> Hamza Mahfooz <someguy@effective-light.com> writes:
>>
>>> If we attempt to grep non-ascii log message text with an ascii pattern, we
>>
>> "with an ascii pattern, when Git is built with and told to use pcre2, we"
>>
>>> run into the following issue:
>>>
>>>     $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author
>>>     grep: (standard input): binary file matches
>
> I get no error message on macOS 11.6, but this result, with the underlined
> part in red:
>
>    Author: ??var Arnfjörð Bjarmason <avarab@gmail.com>
>             ^^^^^^^^^^^^^^^^^^
>
> So the pattern matches the second byte of a two-byte character, inserts a
> color code in the middle and thus produces invalid output in this case.

Thanks for digging into these edge cases...

>>>
>>> So, to fix this teach the grep code to use PCRE2_UTF, as long as the log
>>> output is encoded in UTF-8.
>>
>>> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>>> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>>> +	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
>>> +	    (!opt->ignore_locale && is_utf8_locale() &&
>>> +	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
>>> +					    (p->fixed || p->is_fixed))))
>>
>> That's a mouthful.  It is not obvious what new condition is being
>> added.  I had to flip the order to see the only difference is, that
>>
>>> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>>> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>>> +	if ((!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>>> +	    !(!opt->ignore_case && (p->fixed || p->is_fixed))) ||
>>> +	    (!opt->ignore_locale && !has_non_ascii(p->pattern)))
>>
>> ... in addition to the case where the original condition holds, if
>> we do not say "ignore locale" and the pattern is ascii-only, we
>> apply these two option flags.  And that matches what the proposed
>> log message explained as the condition the problem appears.
>>
>> So,... looks good, I guess.
>>
>> Thanks, will queue.
>>
>>
>> Addendum.
>>
>> If we were reordering pieces in the condition, I wonder if there is
>> a better way to reorganize it, though.  The original is already
>> barely explainable with words, and with this new condition added, I
>> am not sure if anybody can phrase the condition in simple words to
>> others after staring it for a few minutes.  I can't.
>>
>> But straightening it out is best left as a future clean-up patch,
>> separate from this series.
>>
>
> It can be written as:
>
> 	literal = !opt->ignore_case && (p->fixed || p->is_fixed);
> 	if (!opt->ignore_locale) {
> 		if (!has_non_ascii(p->pattern) ||
> 		    (is_utf8_locale() && !literal))
> 			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> 	}

Whatever we go from here I'm very much for untangling that condition,
but I guess it can be done as a follow-up too, I'll defer to Hamza
there...

> Literal patterns are those that don't use any wildcards or case-folding.
> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the
> pattern only consists of ASCII characters, or if the pattern is encoded
> in UTF-8 and is not just a literal pattern.
>
> Hmm.  Why enable PCRE2_UTF for literal patterns that consist of only
> ASCII chars?
>
> The old condition was (reformatted to better match the new one):
>
> 	if (!opt->ignore_locale) {
> 		if (is_utf8_locale() && has_non_ascii(p->pattern) && !literal)
> 			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> 	}
>
> Intuitively I'd say the condition should be:
>
> 	if (!opt->ignore_locale && is_utf8_locale()) {
> 		if (has_non_ascii(p->pattern) || !literal)
> 			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> 	}
>
> If both input and pattern are encoded in UTF-8, enable PCRE2_UTF if we
> have to match non-ASCII characters or do more than just literal
> matching.
>
> For literal patterns that consist only of ASCII characters we don't need
> the cost and complication of PCRE2_UTF.
>
> Makes sense?

    echo 'René Scharfe' >f &&
    $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $?
    1
    $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $?
    f:René Scharfe
    0

So it's a choose-your-own adventure where you can pick if you're
right. I.e. do you want the "." metacharacter to match your "é" or not?

These sorts of patterns demonstrate nicely that the relationship between your
pattern being ASCII and wanting or not wanting UTF-8 matching semantics
isn't what you might imagine it to be.

If you look at:

    git log --reverse -p -Gis_utf8_locale -- grep.c

And my earlier replies in this thread-at-large (not digging up a
reference now, sorry), you'll see that the current behavior in grep.c is
really a compromise based on some intersection of user patterns, us
potentially grepping arbitrary binary data and/or "valid" encodings, and
what PCRE does and doesn't puke on.

It's not "right" by any sense, but we sort of limp along with it, mostly
because unlike say Perl's regex engine which really is used for serious
production work where Unicode-correctness matters I don't think anyone
is using "git grep" for anything like that, it's mostly for people's
ad-hoc greps.

Right now I can't remember if the only reason we can't "fix this" is
because we'd be too aggressive with the PCRE version dependency, see
95ca1f987ed (grep/pcre2: better support invalid UTF-8 haystacks,
2021-01-24).