All of lore.kernel.org
 help / color / mirror / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: "Carlo Arenas" <carenas@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: "René Scharfe" <l.s.r@web.de>,
	"Andreas Schwab" <schwab@linux-m68k.org>,
	"Hamza Mahfooz" <someguy@effective-light.com>,
	git@vger.kernel.org
Subject: Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data
Date: Thu, 18 Nov 2021 10:17:59 -0800	[thread overview]
Message-ID: <xmqqczmxxr8o.fsf@gitster.g> (raw)
In-Reply-To: <211117.86y25m5wez.gmgdl@evledraar.gmail.com> (=?utf-8?B?IsOG?= =?utf-8?B?dmFyIEFybmZqw7Zyw7A=?= Bjarmason"'s message of "Wed, 17 Nov 2021 21:59:45 +0100")

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> Let's have a look at the map.  Here are the differences between the
>> versions regarding use of PCRE2_UTF:
>>
>> o: opt->ignore_locale
>> h: has_non_ascii(p->pattern)
>> i: is_utf8_locale()
>> l: !opt->ignore_case && (p->fixed || p->is_fixed)
>>
>> o h i l master hamza rene2
>> 0 0 0 0      0     1     0
>> 0 0 0 1      0     1     0
>> 0 0 1 0      0     1     1
>> 0 0 1 1      0     1     0  <== 7812.13, confirmed using fprint() debugging
>>
>> So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/
>> should not have this breakage, because it doesn't enable PCRE2_UTF for
>> literal patterns.
>
> PCRE2_UTF will also matter for literal patterns. Try to peel apart the
> two bytes in "é" and match them under -i with/without PCRE_UTF.

Sorry for being late to the party, but doesn't "literal" in the
context of this thread mean the column "l" above, i.e. we are not
ignoring case and fixed or is_fixed member is set?  So "under -i"
disqualifies as an example for "will also matter for literal",
doesn't it?

In hindsight, I guess we could have pushed a bit harder when René's

-	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
+	if (!opt->ignore_locale && is_utf8_locale() &&
 	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);

in https://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/
(is that what is called 'rene2' above?) was raised on Oct 17th to
amend/fix Hamza's [v13 3/3]; that would have prevented 'master' from
having this breakage?

Carlo, in your [PATCH v2] in <20211117102329.95456-1-carenas@gmail.com>,
I see that the #else side for older PCREv2 users essentially reverts
what Hamza's [PATCH v13 3/3] did to this area.

+#else
+	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
+	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
+		options |= PCRE2_UTF;
+#endif

I guess this is a lot of change in the amount of text involved but
the least amount of actual change in the behaviour.  For those with
newer PCREv2, the behaviour would be the same as v2.34.0, and for
others, the behaviour would be the same as v2.33.0.

Having said all that, because the consensus seems to be that the
whole "when we should match in UTF mode" may need to be rethought, I
think reverting Hamza's [v13 3/3] would be the simplest way to clean
up the mess for v2.34.1 that will give us a cleaner slate to later
build on, than applying this patch.

So, I dunno.  Comments from those involved in the discussion?

Thanks.


  parent reply	other threads:[~2021-11-18 18:18 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-15 16:13 [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 2/3] pretty: colorize pattern matches in commit messages Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data Hamza Mahfooz
2021-10-15 20:03   ` Junio C Hamano
2021-10-16 16:25     ` René Scharfe
2021-10-16 17:12       ` Ævar Arnfjörð Bjarmason
2021-10-16 19:44         ` René Scharfe
2021-10-17  6:00           ` Junio C Hamano
2021-10-17  6:55             ` René Scharfe
2021-10-17  9:44               ` Ævar Arnfjörð Bjarmason
2021-11-15 20:43   ` Andreas Schwab
2021-11-15 22:41     ` Ævar Arnfjörð Bjarmason
2021-11-16  2:12       ` Carlo Arenas
2021-11-16  8:41       ` Andreas Schwab
2021-11-16  9:06         ` Carlo Arenas
2021-11-16  9:18           ` Andreas Schwab
2021-11-16  9:29           ` Andreas Schwab
2021-11-16  9:38             ` Carlo Arenas
2021-11-16  9:55               ` Andreas Schwab
2021-11-16 11:00                 ` [PATCH] grep: avoid setting UTF mode when not needed Carlo Marcelo Arenas Belón
2021-11-16 12:32                   ` Ævar Arnfjörð Bjarmason
2021-11-16 13:35                     ` Hamza Mahfooz
2021-11-17 14:31                       ` Ævar Arnfjörð Bjarmason
2021-11-16 18:22                     ` Carlo Arenas
2021-11-16 18:48                     ` Junio C Hamano
2021-11-17 10:23                   ` [PATCH v2] grep: avoid setting UTF mode when dangerous with PCRE Carlo Marcelo Arenas Belón
2021-11-18  7:29                     ` Junio C Hamano
2021-11-18 10:15                       ` Carlo Arenas
2021-11-18 12:49                         ` Hamza Mahfooz
2021-11-19  6:58                           ` Ævar Arnfjörð Bjarmason
2021-11-18 21:21                     ` Hamza Mahfooz
2021-11-17 18:46               ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data René Scharfe
2021-11-17 19:56                 ` Carlo Arenas
2021-11-17 20:59                 ` Ævar Arnfjörð Bjarmason
2021-11-17 21:53                   ` Carlo Arenas
2021-11-18  0:00                     ` Ævar Arnfjörð Bjarmason
2021-11-18 18:17                   ` Junio C Hamano [this message]
2021-11-18 20:57                     ` René Scharfe
2021-11-19  7:00                       ` Ævar Arnfjörð Bjarmason
2021-11-19 16:08                         ` René Scharfe
2021-11-19 17:33                           ` Carlo Arenas
2021-11-19 17:11                         ` Junio C Hamano
2021-10-15 18:05 ` [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Junio C Hamano
2021-10-15 18:24   ` Hamza Mahfooz
2021-10-15 19:28     ` Junio C Hamano
2021-10-15 19:40       ` Hamza Mahfooz
2021-10-15 19:49       ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqczmxxr8o.fsf@gitster.g \
    --to=gitster@pobox.com \
    --cc=avarab@gmail.com \
    --cc=carenas@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=l.s.r@web.de \
    --cc=schwab@linux-m68k.org \
    --cc=someguy@effective-light.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.