All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hamza Mahfooz <someguy@effective-light.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: "Carlo Marcelo Arenas Belón" <carenas@gmail.com>,
	git@vger.kernel.org, gitster@pobox.com,
	"Andreas Schwab" <schwab@linux-m68k.org>
Subject: Re: [PATCH] grep: avoid setting UTF mode when not needed
Date: Tue, 16 Nov 2021 08:35:51 -0500	[thread overview]
Message-ID: <R33O2R.WYXS4AQP7W05@effective-light.com> (raw)
In-Reply-To: <211116.86tugcp8mg.gmgdl@evledraar.gmail.com>

Does anyone know if René's patch causes older PCRE versions to break? 
If it doesn't I think René's patch plus Ævar's fix is the way to go.

On Tue, Nov 16 2021 at 01:32:17 PM +0100, Ævar Arnfjörð Bjarmason 
<avarab@gmail.com> wrote:
> 
> On Tue, Nov 16 2021, Carlo Marcelo Arenas Belón wrote:
> 
>>  Since ae39ba431a (grep/pcre2: fix an edge case concerning ascii 
>> patterns
>>  and UTF-8 data, 2021-10-15), PCRE_UTF mode is enabled in cases 
>> where it
>>  will fail because of UTF-8 validation, which is needed for versions 
>> of
>>  PCRE2 older than 10.34.
>> 
>>  Revert the change on logic to avoid failures that were reported 
>> from the
>>  test cases, but that should also reflect in normal use when JIT is 
>> enabled
>>  and could result in crashes (or worse), as UTF-8 validation is 
>> skipped.
>> 
>>  Keeping the tests, as they pass even without the fix as replicated 
>> locally
>>  in Debian 10 and the CI.
>> 
>>  Reported-by: Andreas Schwab <schwab@linux-m68k.org>
>>  Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
>>  ---
>>   grep.c | 6 ++----
>>   1 file changed, 2 insertions(+), 4 deletions(-)
>> 
>>  diff --git a/grep.c b/grep.c
>>  index f6e113e9f0..fe847a0111 100644
>>  --- a/grep.c
>>  +++ b/grep.c
>>  @@ -382,10 +382,8 @@ static void compile_pcre2_pattern(struct 
>> grep_pat *p, const struct grep_opt *opt
>>   		}
>>   		options |= PCRE2_CASELESS;
>>   	}
>>  -	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
>>  -	    (!opt->ignore_locale && is_utf8_locale() &&
>>  -	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
>>  -					    (p->fixed || p->is_fixed))))
>>  +	if (!opt->ignore_locale && is_utf8_locale() && 
>> has_non_ascii(p->pattern) &&
>>  +	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>>   		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>> 
>>   #ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER
> 
> Hrm.
> 
> A few things:
> 
> First, if we've got a post-PCREv2 version whatever fix let's guard 
> that
> with an ifdef, see thep GIT_PCRE2_VERSION_*_HIGHER at the top of 
> grep.h.
> 
> It really helps to have those, both to know to test on those older
> versions, and also so that we can at some point in the future bump the
> required version and drop these workarounds entirely (as we do with
> git-curl-compat.h).
> 
> Secondly, whatever do here let's first fix the test added in 
> ae39ba431a,
> so we're not groping around in the dark even more.
> 
> I didn't spot this at the time but the test that Hamza added in that
> based on my initial report[1] is broken & doesn't test anything
> meaningful. It needs to have this applied:
> 
> 	diff --git a/t/t7812-grep-icase-non-ascii.sh 
> b/t/t7812-grep-icase-non-ascii.sh
> 	index 22487d90fdc..1da6b07a579 100755
> 	--- a/t/t7812-grep-icase-non-ascii.sh
> 	+++ b/t/t7812-grep-icase-non-ascii.sh
> 	@@ -60,7 +60,7 @@ test_expect_success GETTEXT_LOCALE,PCRE 'log 
> --author with an ascii pattern on U
> 	        test_write_lines "forth" >file4 &&
> 	        git add file4 &&
> 	        git commit --author="À Ú Thor <author@example.com>" -m 
> sécond &&
> 	-       git log -1 --color=always --perl-regexp --author=".*Thor" 
> >log &&
> 	+       git log -1 --color=always --perl-regexp --author=". . Thor" 
> >log &&
> 	        grep Author log >actual.raw &&
> 	        test_decode_color <actual.raw >actual &&
> 	        test_cmp expected actual
> 
> I.e. the whole point of using the color output to test this is to
> discover where PCRE2 is going to consider a character boundary to be,
> using .* means that it won't be tested at add, since .* will happily 
> eat
> up whatever arbitrary data it finds with or without UTF-8 mode.
> 
> Other tests added in that & adjacent (if any?) commits may have the 
> same
> issue, I haven't dug into it.
> 
> If we lead with that patch we'll get the test passing on master as
> before, but with your patch above it'll break. I.e. the "when not
> needed' in the $subject isn't true, it's just that the test is
> completely broken.
> 
> In the context of this being a pretty urgent post-release fix (but I
> don't know if Junio would consider a point-release, so perhaphs it's
> not) I'd be OK with either of:
> 
>  A. Let's back out this new log grep color thing entirely while we
>     reconsider this. The gitster/hm/paint-hits-in-log-grep topic
>     currently reverts cleanly.
> 
>  B. Don't break the new log grep color thing, and also fix the 
> 'grepping
>     binary' regression (which is much more important than having A)
> 
> But let's not go for some in-between where we break the new feature to
> the point of it being worse than the state of not having it at all in
> v2.33.0.
> 
> I.e. without the that log grep color feature we wouldn't screw up the
> display of non-ASCII characters in log output (yay), in v2.34.0 we
> don't, but also color the match (yay), but we broke grepping binary
> *files* (boo!).
> 
> I think the approach I suggested in [2] is a much more viable way
> forward, i.e. let's stop fiddling with this giant nested if statement
> that's mainly meant for the grep-a-file-case, revert to the
> pre-log-grep-color state, and have the log-grep-color mode pass in a
> "yes, I'd like the UTF-8 mode, please".
> 
> Having said that that's probably also broken, just in ways we're less
> likely to spot (or maybe some of the log encoding settings/reencoding
> saves us there?), we may need to have that ifdef'd to 10.34 and 
> higher,
> or have some opt-in setting for this.
> 
> 1. https://lore.kernel.org/git/87v92bju64.fsf@evledraar.gmail.com/
> 2. https://lore.kernel.org/git/87v92bju64.fsf@evledraar.gmail.com/



  reply	other threads:[~2021-11-16 13:36 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-15 16:13 [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 2/3] pretty: colorize pattern matches in commit messages Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data Hamza Mahfooz
2021-10-15 20:03   ` Junio C Hamano
2021-10-16 16:25     ` René Scharfe
2021-10-16 17:12       ` Ævar Arnfjörð Bjarmason
2021-10-16 19:44         ` René Scharfe
2021-10-17  6:00           ` Junio C Hamano
2021-10-17  6:55             ` René Scharfe
2021-10-17  9:44               ` Ævar Arnfjörð Bjarmason
2021-11-15 20:43   ` Andreas Schwab
2021-11-15 22:41     ` Ævar Arnfjörð Bjarmason
2021-11-16  2:12       ` Carlo Arenas
2021-11-16  8:41       ` Andreas Schwab
2021-11-16  9:06         ` Carlo Arenas
2021-11-16  9:18           ` Andreas Schwab
2021-11-16  9:29           ` Andreas Schwab
2021-11-16  9:38             ` Carlo Arenas
2021-11-16  9:55               ` Andreas Schwab
2021-11-16 11:00                 ` [PATCH] grep: avoid setting UTF mode when not needed Carlo Marcelo Arenas Belón
2021-11-16 12:32                   ` Ævar Arnfjörð Bjarmason
2021-11-16 13:35                     ` Hamza Mahfooz [this message]
2021-11-17 14:31                       ` Ævar Arnfjörð Bjarmason
2021-11-16 18:22                     ` Carlo Arenas
2021-11-16 18:48                     ` Junio C Hamano
2021-11-17 10:23                   ` [PATCH v2] grep: avoid setting UTF mode when dangerous with PCRE Carlo Marcelo Arenas Belón
2021-11-18  7:29                     ` Junio C Hamano
2021-11-18 10:15                       ` Carlo Arenas
2021-11-18 12:49                         ` Hamza Mahfooz
2021-11-19  6:58                           ` Ævar Arnfjörð Bjarmason
2021-11-18 21:21                     ` Hamza Mahfooz
2021-11-17 18:46               ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data René Scharfe
2021-11-17 19:56                 ` Carlo Arenas
2021-11-17 20:59                 ` Ævar Arnfjörð Bjarmason
2021-11-17 21:53                   ` Carlo Arenas
2021-11-18  0:00                     ` Ævar Arnfjörð Bjarmason
2021-11-18 18:17                   ` Junio C Hamano
2021-11-18 20:57                     ` René Scharfe
2021-11-19  7:00                       ` Ævar Arnfjörð Bjarmason
2021-11-19 16:08                         ` René Scharfe
2021-11-19 17:33                           ` Carlo Arenas
2021-11-19 17:11                         ` Junio C Hamano
2021-10-15 18:05 ` [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Junio C Hamano
2021-10-15 18:24   ` Hamza Mahfooz
2021-10-15 19:28     ` Junio C Hamano
2021-10-15 19:40       ` Hamza Mahfooz
2021-10-15 19:49       ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=R33O2R.WYXS4AQP7W05@effective-light.com \
    --to=someguy@effective-light.com \
    --cc=avarab@gmail.com \
    --cc=carenas@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=schwab@linux-m68k.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.