git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: "René Scharfe" <l.s.r@web.de>
Cc: Junio C Hamano <gitster@pobox.com>,
	Hamza Mahfooz <someguy@effective-light.com>,
	git@vger.kernel.org
Subject: Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data
Date: Sat, 16 Oct 2021 19:12:47 +0200	[thread overview]
Message-ID: <87zgr8dg8j.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <eddcbe66-b172-90b7-309e-e9ce5b5b44a4@web.de>


On Sat, Oct 16 2021, René Scharfe wrote:

> Am 15.10.21 um 22:03 schrieb Junio C Hamano:
>> Hamza Mahfooz <someguy@effective-light.com> writes:
>>
>>> If we attempt to grep non-ascii log message text with an ascii pattern, we
>>
>> "with an ascii pattern, when Git is built with and told to use pcre2, we"
>>
>>> run into the following issue:
>>>
>>>     $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author
>>>     grep: (standard input): binary file matches
>
> I get no error message on macOS 11.6, but this result, with the underlined
> part in red:
>
>    Author: ??var Arnfjörð Bjarmason <avarab@gmail.com>
>             ^^^^^^^^^^^^^^^^^^
>
> So the pattern matches the second byte of a two-byte character, inserts a
> color code in the middle and thus produces invalid output in this case.

Thanks for digging into these edge cases...

>>>
>>> So, to fix this teach the grep code to use PCRE2_UTF, as long as the log
>>> output is encoded in UTF-8.
>>
>>> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>>> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>>> +	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
>>> +	    (!opt->ignore_locale && is_utf8_locale() &&
>>> +	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
>>> +					    (p->fixed || p->is_fixed))))
>>
>> That's a mouthful.  It is not obvious what new condition is being
>> added.  I had to flip the order to see the only difference is, that
>>
>>> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>>> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>>> +	if ((!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>>> +	    !(!opt->ignore_case && (p->fixed || p->is_fixed))) ||
>>> +	    (!opt->ignore_locale && !has_non_ascii(p->pattern)))
>>
>> ... in addition to the case where the original condition holds, if
>> we do not say "ignore locale" and the pattern is ascii-only, we
>> apply these two option flags.  And that matches what the proposed
>> log message explained as the condition the problem appears.
>>
>> So,... looks good, I guess.
>>
>> Thanks, will queue.
>>
>>
>> Addendum.
>>
>> If we were reordering pieces in the condition, I wonder if there is
>> a better way to reorganize it, though.  The original is already
>> barely explainable with words, and with this new condition added, I
>> am not sure if anybody can phrase the condition in simple words to
>> others after staring it for a few minutes.  I can't.
>>
>> But straightening it out is best left as a future clean-up patch,
>> separate from this series.
>>
>
> It can be written as:
>
> 	literal = !opt->ignore_case && (p->fixed || p->is_fixed);
> 	if (!opt->ignore_locale) {
> 		if (!has_non_ascii(p->pattern) ||
> 		    (is_utf8_locale() && !literal))
> 			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> 	}

Whatever we go from here I'm very much for untangling that condition,
but I guess it can be done as a follow-up too, I'll defer to Hamza
there...

> Literal patterns are those that don't use any wildcards or case-folding.
> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the
> pattern only consists of ASCII characters, or if the pattern is encoded
> in UTF-8 and is not just a literal pattern.
>
> Hmm.  Why enable PCRE2_UTF for literal patterns that consist of only
> ASCII chars?
>
> The old condition was (reformatted to better match the new one):
>
> 	if (!opt->ignore_locale) {
> 		if (is_utf8_locale() && has_non_ascii(p->pattern) && !literal)
> 			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> 	}
>
> Intuitively I'd say the condition should be:
>
> 	if (!opt->ignore_locale && is_utf8_locale()) {
> 		if (has_non_ascii(p->pattern) || !literal)
> 			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> 	}
>
> If both input and pattern are encoded in UTF-8, enable PCRE2_UTF if we
> have to match non-ASCII characters or do more than just literal
> matching.
>
> For literal patterns that consist only of ASCII characters we don't need
> the cost and complication of PCRE2_UTF.
>
> Makes sense?

    echo 'René Scharfe' >f &&
    $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $?
    1
    $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $?
    f:René Scharfe
    0

So it's a choose-your-own adventure where you can pick if you're
right. I.e. do you want the "." metacharacter to match your "é" or not?

These sorts of patterns demonstrate nicely that the relationship between your
pattern being ASCII and wanting or not wanting UTF-8 matching semantics
isn't what you might imagine it to be.

If you look at:

    git log --reverse -p -Gis_utf8_locale -- grep.c

And my earlier replies in this thread-at-large (not digging up a
reference now, sorry), you'll see that the current behavior in grep.c is
really a compromise based on some intersection of user patterns, us
potentially grepping arbitrary binary data and/or "valid" encodings, and
what PCRE does and doesn't puke on.

It's not "right" by any sense, but we sort of limp along with it, mostly
because unlike say Perl's regex engine which really is used for serious
production work where Unicode-correctness matters I don't think anyone
is using "git grep" for anything like that, it's mostly for people's
ad-hoc greps.

Right now I can't remember if the only reason we can't "fix this" is
because we'd be too aggressive with the PCRE version dependency, see
95ca1f987ed (grep/pcre2: better support invalid UTF-8 haystacks,
2021-01-24).







  reply	other threads:[~2021-10-16 17:36 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-15 16:13 [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 2/3] pretty: colorize pattern matches in commit messages Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data Hamza Mahfooz
2021-10-15 20:03   ` Junio C Hamano
2021-10-16 16:25     ` René Scharfe
2021-10-16 17:12       ` Ævar Arnfjörð Bjarmason [this message]
2021-10-16 19:44         ` René Scharfe
2021-10-17  6:00           ` Junio C Hamano
2021-10-17  6:55             ` René Scharfe
2021-10-17  9:44               ` Ævar Arnfjörð Bjarmason
2021-11-15 20:43   ` Andreas Schwab
2021-11-15 22:41     ` Ævar Arnfjörð Bjarmason
2021-11-16  2:12       ` Carlo Arenas
2021-11-16  8:41       ` Andreas Schwab
2021-11-16  9:06         ` Carlo Arenas
2021-11-16  9:18           ` Andreas Schwab
2021-11-16  9:29           ` Andreas Schwab
2021-11-16  9:38             ` Carlo Arenas
2021-11-16  9:55               ` Andreas Schwab
2021-11-16 11:00                 ` [PATCH] grep: avoid setting UTF mode when not needed Carlo Marcelo Arenas Belón
2021-11-16 12:32                   ` Ævar Arnfjörð Bjarmason
2021-11-16 13:35                     ` Hamza Mahfooz
2021-11-17 14:31                       ` Ævar Arnfjörð Bjarmason
2021-11-16 18:22                     ` Carlo Arenas
2021-11-16 18:48                     ` Junio C Hamano
2021-11-17 10:23                   ` [PATCH v2] grep: avoid setting UTF mode when dangerous with PCRE Carlo Marcelo Arenas Belón
2021-11-18  7:29                     ` Junio C Hamano
2021-11-18 10:15                       ` Carlo Arenas
2021-11-18 12:49                         ` Hamza Mahfooz
2021-11-19  6:58                           ` Ævar Arnfjörð Bjarmason
2021-11-18 21:21                     ` Hamza Mahfooz
2021-11-17 18:46               ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data René Scharfe
2021-11-17 19:56                 ` Carlo Arenas
2021-11-17 20:59                 ` Ævar Arnfjörð Bjarmason
2021-11-17 21:53                   ` Carlo Arenas
2021-11-18  0:00                     ` Ævar Arnfjörð Bjarmason
2021-11-18 18:17                   ` Junio C Hamano
2021-11-18 20:57                     ` René Scharfe
2021-11-19  7:00                       ` Ævar Arnfjörð Bjarmason
2021-11-19 16:08                         ` René Scharfe
2021-11-19 17:33                           ` Carlo Arenas
2021-11-19 17:11                         ` Junio C Hamano
2021-10-15 18:05 ` [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Junio C Hamano
2021-10-15 18:24   ` Hamza Mahfooz
2021-10-15 19:28     ` Junio C Hamano
2021-10-15 19:40       ` Hamza Mahfooz
2021-10-15 19:49       ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87zgr8dg8j.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=l.s.r@web.de \
    --cc=someguy@effective-light.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).