git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: "René Scharfe" <l.s.r@web.de>
Cc: Junio C Hamano <gitster@pobox.com>,
	Hamza Mahfooz <someguy@effective-light.com>,
	git@vger.kernel.org
Subject: Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data
Date: Sun, 17 Oct 2021 11:44:31 +0200	[thread overview]
Message-ID: <87r1ckc6gf.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <0ea73e7a-6d43-e223-ab2e-24c684102856@web.de>


On Sun, Oct 17 2021, René Scharfe wrote:

> Am 17.10.21 um 08:00 schrieb Junio C Hamano:
>> René Scharfe <l.s.r@web.de> writes:
>>
>>>>> Literal patterns are those that don't use any wildcards or case-folding.
>>>>> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the
>>>>> pattern only consists of ASCII characters, or if the pattern is encoded
>>>>> in UTF-8 and is not just a literal pattern.
>>>>>
>>>>> Hmm.  Why enable PCRE2_UTF for literal patterns that consist of only
>>>>> ASCII chars?
>>>>> ...
>>>>     echo 'René Scharfe' >f &&
>>>>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $?
>>>>     1
>>>>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $?
>>>>     f:René Scharfe
>>>>     0
>>>>
>>>> So it's a choose-your-own adventure where you can pick if you're
>>>> right. I.e. do you want the "." metacharacter to match your "é" or not?
>>>
>>> Yes, I do, and it's what Hamza's patch is fixing.
>>
>> That may be correct but is this discussion still about "Why enable
>> ... for literal patterns that consist of only ASCII"?  Calling "." a
>> "metacharacter" and wanting it to match anything other than a single
>> dot would mean the pattern we are discussing is no longer "literal",
>> isn't it?  I am puzzled.
>
> Right, Ævar's comment is not about my question, but highlights an
> inconsistency in master that is fixed by Hamza's patch.

Yes, sorry about that. Just generally about the messy semantics.

> I'll repeat and extend my question: Hamza's patch enables PCRE2_UTF for
> non-ASCII patterns even if they are literal or our locale is not UTF-8.
> The following change would fix the edge case mentioned in its commit
> message without these side-effects.  Am I correct?
>
> diff --git a/grep.c b/grep.c
> index fe847a0111..5badb6d851 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -382,7 +382,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  		}
>  		options |= PCRE2_CASELESS;
>  	}
> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
> +	if (!opt->ignore_locale && is_utf8_locale() &&
>  	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>  		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);

I haven't had time to carefully look into this, but one caveat to check
out is if it works with older PCRE and whether you need e.g. the
GIT_PCRE2_VERSION_10_36_OR_HIGHER.

I tried your suggestion on top of Hamza's series, compiled PCRE v2
10.23, tested, and also tried manually removing the
PCRE2_MATCH_INVALID_UTF flag and tested again.

We pass all tests with both, so maybe this is safe to do (or maybe we're
missing some test we haven't thought of yet...).

One thing that makes me nervous is that we had breakages in the past
once the patches escaped into the wild, particularly because the code
being modified here has is_utf8_locale(), and our tests run under
LANG=c LC_ALL=C.

I tried running all the tests with a non-C locale, there's a lot of
failures, but none new with this change. As an aside the below patch
makes all but one shortlog test pass for me. I wonder if we shouldn't do
this for real to smoke out any $LANG or $LC_ALL dependencies.

I.e. almost all of the failures were due to relying on the sort order of
sort(1), and in one case comm(1), the first hunk here is also redundant
to defining our own ls(1) wrapper....

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index bd17f308b38..738ca6ef587 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -693,12 +693,12 @@ test_expect_success 'expire removes unreferenced packs' '
 		^refs/heads/C
 		EOF
 		git multi-pack-index write &&
-		ls .git/objects/pack | grep -v -e pack-[AB] >expect &&
+		ls .git/objects/pack | sort | grep -v -e pack-[AB] >expect &&
 		git multi-pack-index expire &&
-		ls .git/objects/pack >actual &&
+		ls .git/objects/pack | sort >actual &&
 		test_cmp expect actual &&
-		ls .git/objects/pack/ | grep idx >expect-idx &&
-		test-tool read-midx .git/objects | grep idx >actual-midx &&
+		ls .git/objects/pack/ | sort | grep idx >expect-idx &&
+		test-tool read-midx .git/objects | sort | grep idx >actual-midx &&
 		test_cmp expect-idx actual-midx &&
 		git multi-pack-index verify &&
 		git fsck
@@ -802,7 +802,7 @@ test_expect_success 'expire works when adding new packs' '
 		refs/heads/E
 		EOF
 		git multi-pack-index expire &&
-		ls .git/objects/pack/ | grep idx >expect &&
+		ls .git/objects/pack/ | sort | grep idx >expect &&
 		test-tool read-midx .git/objects | grep idx >actual &&
 		test_cmp expect actual &&
 		git multi-pack-index verify
diff --git a/t/test-lib.sh b/t/test-lib.sh
index 8361b5c1c57..f4f9d231f28 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -417,14 +417,22 @@ test -n "$BASH_VERSION" && shopt -u checkwinsize 2>/dev/null
 
 # For repeatability, reset the environment to known value.
 # TERM is sanitized below, after saving color control sequences.
-LANG=C
-LC_ALL=C
+LANG=en_US.UTF-8
+LC_ALL=en_US.UTF-8
 PAGER=cat
 TZ=UTC
 COLUMNS=80
 export LANG LC_ALL PAGER TZ COLUMNS
 EDITOR=:
 
+sort () {
+	LC_ALL=C LANG=C command sort "$@"
+}
+
+comm () {
+	LC_ALL=C LANG=C command comm "$@"
+}
+
 # A call to "unset" with no arguments causes at least Solaris 10
 # /usr/xpg4/bin/sh and /bin/ksh to bail out.  So keep the unsets
 # deriving from the command substitution clustered with the other

  reply	other threads:[~2021-10-17 10:05 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-15 16:13 [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 2/3] pretty: colorize pattern matches in commit messages Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data Hamza Mahfooz
2021-10-15 20:03   ` Junio C Hamano
2021-10-16 16:25     ` René Scharfe
2021-10-16 17:12       ` Ævar Arnfjörð Bjarmason
2021-10-16 19:44         ` René Scharfe
2021-10-17  6:00           ` Junio C Hamano
2021-10-17  6:55             ` René Scharfe
2021-10-17  9:44               ` Ævar Arnfjörð Bjarmason [this message]
2021-11-15 20:43   ` Andreas Schwab
2021-11-15 22:41     ` Ævar Arnfjörð Bjarmason
2021-11-16  2:12       ` Carlo Arenas
2021-11-16  8:41       ` Andreas Schwab
2021-11-16  9:06         ` Carlo Arenas
2021-11-16  9:18           ` Andreas Schwab
2021-11-16  9:29           ` Andreas Schwab
2021-11-16  9:38             ` Carlo Arenas
2021-11-16  9:55               ` Andreas Schwab
2021-11-16 11:00                 ` [PATCH] grep: avoid setting UTF mode when not needed Carlo Marcelo Arenas Belón
2021-11-16 12:32                   ` Ævar Arnfjörð Bjarmason
2021-11-16 13:35                     ` Hamza Mahfooz
2021-11-17 14:31                       ` Ævar Arnfjörð Bjarmason
2021-11-16 18:22                     ` Carlo Arenas
2021-11-16 18:48                     ` Junio C Hamano
2021-11-17 10:23                   ` [PATCH v2] grep: avoid setting UTF mode when dangerous with PCRE Carlo Marcelo Arenas Belón
2021-11-18  7:29                     ` Junio C Hamano
2021-11-18 10:15                       ` Carlo Arenas
2021-11-18 12:49                         ` Hamza Mahfooz
2021-11-19  6:58                           ` Ævar Arnfjörð Bjarmason
2021-11-18 21:21                     ` Hamza Mahfooz
2021-11-17 18:46               ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data René Scharfe
2021-11-17 19:56                 ` Carlo Arenas
2021-11-17 20:59                 ` Ævar Arnfjörð Bjarmason
2021-11-17 21:53                   ` Carlo Arenas
2021-11-18  0:00                     ` Ævar Arnfjörð Bjarmason
2021-11-18 18:17                   ` Junio C Hamano
2021-11-18 20:57                     ` René Scharfe
2021-11-19  7:00                       ` Ævar Arnfjörð Bjarmason
2021-11-19 16:08                         ` René Scharfe
2021-11-19 17:33                           ` Carlo Arenas
2021-11-19 17:11                         ` Junio C Hamano
2021-10-15 18:05 ` [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Junio C Hamano
2021-10-15 18:24   ` Hamza Mahfooz
2021-10-15 19:28     ` Junio C Hamano
2021-10-15 19:40       ` Hamza Mahfooz
2021-10-15 19:49       ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87r1ckc6gf.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=l.s.r@web.de \
    --cc=someguy@effective-light.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).