All of lore.kernel.org
 help / color / mirror / Atom feed
From: "René Scharfe" <l.s.r@web.de>
To: Junio C Hamano <gitster@pobox.com>,
	Hamza Mahfooz <someguy@effective-light.com>
Cc: git@vger.kernel.org, "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data
Date: Sat, 16 Oct 2021 18:25:20 +0200	[thread overview]
Message-ID: <eddcbe66-b172-90b7-309e-e9ce5b5b44a4@web.de> (raw)
In-Reply-To: <xmqq4k9i5a4n.fsf@gitster.g>

Am 15.10.21 um 22:03 schrieb Junio C Hamano:
> Hamza Mahfooz <someguy@effective-light.com> writes:
>
>> If we attempt to grep non-ascii log message text with an ascii pattern, we
>
> "with an ascii pattern, when Git is built with and told to use pcre2, we"
>
>> run into the following issue:
>>
>>     $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author
>>     grep: (standard input): binary file matches

I get no error message on macOS 11.6, but this result, with the underlined
part in red:

   Author: ??var Arnfjörð Bjarmason <avarab@gmail.com>
            ^^^^^^^^^^^^^^^^^^

So the pattern matches the second byte of a two-byte character, inserts a
color code in the middle and thus produces invalid output in this case.

>>
>> So, to fix this teach the grep code to use PCRE2_UTF, as long as the log
>> output is encoded in UTF-8.
>
>> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>> +	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
>> +	    (!opt->ignore_locale && is_utf8_locale() &&
>> +	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
>> +					    (p->fixed || p->is_fixed))))
>
> That's a mouthful.  It is not obvious what new condition is being
> added.  I had to flip the order to see the only difference is, that
>
>> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>> +	if ((!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>> +	    !(!opt->ignore_case && (p->fixed || p->is_fixed))) ||
>> +	    (!opt->ignore_locale && !has_non_ascii(p->pattern)))
>
> ... in addition to the case where the original condition holds, if
> we do not say "ignore locale" and the pattern is ascii-only, we
> apply these two option flags.  And that matches what the proposed
> log message explained as the condition the problem appears.
>
> So,... looks good, I guess.
>
> Thanks, will queue.
>
>
> Addendum.
>
> If we were reordering pieces in the condition, I wonder if there is
> a better way to reorganize it, though.  The original is already
> barely explainable with words, and with this new condition added, I
> am not sure if anybody can phrase the condition in simple words to
> others after staring it for a few minutes.  I can't.
>
> But straightening it out is best left as a future clean-up patch,
> separate from this series.
>

It can be written as:

	literal = !opt->ignore_case && (p->fixed || p->is_fixed);
	if (!opt->ignore_locale) {
		if (!has_non_ascii(p->pattern) ||
		    (is_utf8_locale() && !literal))
			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
	}

Literal patterns are those that don't use any wildcards or case-folding.
If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the
pattern only consists of ASCII characters, or if the pattern is encoded
in UTF-8 and is not just a literal pattern.

Hmm.  Why enable PCRE2_UTF for literal patterns that consist of only
ASCII chars?

The old condition was (reformatted to better match the new one):

	if (!opt->ignore_locale) {
		if (is_utf8_locale() && has_non_ascii(p->pattern) && !literal)
			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
	}

Intuitively I'd say the condition should be:

	if (!opt->ignore_locale && is_utf8_locale()) {
		if (has_non_ascii(p->pattern) || !literal)
			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
	}

If both input and pattern are encoded in UTF-8, enable PCRE2_UTF if we
have to match non-ASCII characters or do more than just literal
matching.

For literal patterns that consist only of ASCII characters we don't need
the cost and complication of PCRE2_UTF.

Makes sense?

René

  reply	other threads:[~2021-10-16 16:25 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-15 16:13 [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 2/3] pretty: colorize pattern matches in commit messages Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data Hamza Mahfooz
2021-10-15 20:03   ` Junio C Hamano
2021-10-16 16:25     ` René Scharfe [this message]
2021-10-16 17:12       ` Ævar Arnfjörð Bjarmason
2021-10-16 19:44         ` René Scharfe
2021-10-17  6:00           ` Junio C Hamano
2021-10-17  6:55             ` René Scharfe
2021-10-17  9:44               ` Ævar Arnfjörð Bjarmason
2021-11-15 20:43   ` Andreas Schwab
2021-11-15 22:41     ` Ævar Arnfjörð Bjarmason
2021-11-16  2:12       ` Carlo Arenas
2021-11-16  8:41       ` Andreas Schwab
2021-11-16  9:06         ` Carlo Arenas
2021-11-16  9:18           ` Andreas Schwab
2021-11-16  9:29           ` Andreas Schwab
2021-11-16  9:38             ` Carlo Arenas
2021-11-16  9:55               ` Andreas Schwab
2021-11-16 11:00                 ` [PATCH] grep: avoid setting UTF mode when not needed Carlo Marcelo Arenas Belón
2021-11-16 12:32                   ` Ævar Arnfjörð Bjarmason
2021-11-16 13:35                     ` Hamza Mahfooz
2021-11-17 14:31                       ` Ævar Arnfjörð Bjarmason
2021-11-16 18:22                     ` Carlo Arenas
2021-11-16 18:48                     ` Junio C Hamano
2021-11-17 10:23                   ` [PATCH v2] grep: avoid setting UTF mode when dangerous with PCRE Carlo Marcelo Arenas Belón
2021-11-18  7:29                     ` Junio C Hamano
2021-11-18 10:15                       ` Carlo Arenas
2021-11-18 12:49                         ` Hamza Mahfooz
2021-11-19  6:58                           ` Ævar Arnfjörð Bjarmason
2021-11-18 21:21                     ` Hamza Mahfooz
2021-11-17 18:46               ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data René Scharfe
2021-11-17 19:56                 ` Carlo Arenas
2021-11-17 20:59                 ` Ævar Arnfjörð Bjarmason
2021-11-17 21:53                   ` Carlo Arenas
2021-11-18  0:00                     ` Ævar Arnfjörð Bjarmason
2021-11-18 18:17                   ` Junio C Hamano
2021-11-18 20:57                     ` René Scharfe
2021-11-19  7:00                       ` Ævar Arnfjörð Bjarmason
2021-11-19 16:08                         ` René Scharfe
2021-11-19 17:33                           ` Carlo Arenas
2021-11-19 17:11                         ` Junio C Hamano
2021-10-15 18:05 ` [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Junio C Hamano
2021-10-15 18:24   ` Hamza Mahfooz
2021-10-15 19:28     ` Junio C Hamano
2021-10-15 19:40       ` Hamza Mahfooz
2021-10-15 19:49       ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=eddcbe66-b172-90b7-309e-e9ce5b5b44a4@web.de \
    --to=l.s.r@web.de \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=someguy@effective-light.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.