All of lore.kernel.org
 help / color / mirror / Atom feed
From: Carlo Arenas <carenas@gmail.com>
To: "René Scharfe" <l.s.r@web.de>
Cc: "Andreas Schwab" <schwab@linux-m68k.org>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Hamza Mahfooz" <someguy@effective-light.com>,
	git@vger.kernel.org, "Junio C Hamano" <gitster@pobox.com>
Subject: Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data
Date: Wed, 17 Nov 2021 11:56:25 -0800	[thread overview]
Message-ID: <CAPUEspg3Ox41H-dCKzqV5oS4UP5Pkdoq0wUaXtxfNWWRTO0k1w@mail.gmail.com> (raw)
In-Reply-To: <634c4237-325a-13e8-0a92-09d23bdfb111@web.de>

On Wed, Nov 17, 2021 at 10:46 AM René Scharfe <l.s.r@web.de> wrote:
>
> Am 16.11.21 um 10:38 schrieb Carlo Arenas:
> > On Tue, Nov 16, 2021 at 1:30 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
> >>
> >> expecting success of 7812.13 'PCRE v2: grep ASCII from invalid UTF-8 data':
> >>         git grep -h "var" invalid-0x80 >actual &&
> >>         test_cmp expected actual &&
> >>         git grep -h "(*NO_JIT)var" invalid-0x80 >actual &&
> >>         test_cmp expected actual
> >>
> >> ++ git grep -h var invalid-0x80
> >> ++ test_cmp expected actual
> >> ++ test 2 -ne 2
> >> ++ eval 'diff -u' '"$@"'
> >> +++ diff -u expected actual
> >> ++ git grep -h '(*NO_JIT)var' invalid-0x80
> >> fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set
> >
> > That is exactly what I was worried about, this is not failing one
> > test, but making `git grep` unusable in any repository that has any
> > binary files that might be reachable by it, and it is likely affecting
> > anyone using PCRE older than 10.34
>
> Let's have a look at the map.  Here are the differences between the
> versions regarding use of PCRE2_UTF:
>
> o: opt->ignore_locale
> h: has_non_ascii(p->pattern)
> i: is_utf8_locale()
> l: !opt->ignore_case && (p->fixed || p->is_fixed)
>
> o h i l master hamza rene2
> 0 0 0 0      0     1     0
> 0 0 0 1      0     1     0
> 0 0 1 0      0     1     1
> 0 0 1 1      0     1     0  <== 7812.13, confirmed using fprint() debugging
>
> So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/
> should not have this breakage, because it doesn't enable PCRE2_UTF for
> literal patterns.

Correct, and indeed I had your expression instead of the ugly one in
my draft for v2[1] but then I found the tests were even more broken
than reported originally and got worried it might introduce yet
another issue because of the brittleness of the rest of the code
around it.  I am hoping it will be introduced in a follow up series
though, and unless this code gets thrown away in the promised
refactoring by Ævar which I haven't seen yet and once this is in
maint.

IMHO and in line with the previous warnings you raised[2] during the
development of this feature, it might be worth looking at it more
deeply, but at least the proposed patch keeps the feature working in
practice (the only case that won't work as expected will be when the
expression includes "." with the intention of matching a UTF-8
character and while using PCRE2 older than 10.34) and more importantly
fixes the regression that it introduced.

Carlo

[1] https://lore.kernel.org/git/20211117102329.95456-1-carenas@gmail.com/
[2] https://lore.kernel.org/git/fc7eb9fc-9521-5484-b05f-9c20086fd485@web.de/

  reply	other threads:[~2021-11-17 19:56 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-15 16:13 [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 2/3] pretty: colorize pattern matches in commit messages Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data Hamza Mahfooz
2021-10-15 20:03   ` Junio C Hamano
2021-10-16 16:25     ` René Scharfe
2021-10-16 17:12       ` Ævar Arnfjörð Bjarmason
2021-10-16 19:44         ` René Scharfe
2021-10-17  6:00           ` Junio C Hamano
2021-10-17  6:55             ` René Scharfe
2021-10-17  9:44               ` Ævar Arnfjörð Bjarmason
2021-11-15 20:43   ` Andreas Schwab
2021-11-15 22:41     ` Ævar Arnfjörð Bjarmason
2021-11-16  2:12       ` Carlo Arenas
2021-11-16  8:41       ` Andreas Schwab
2021-11-16  9:06         ` Carlo Arenas
2021-11-16  9:18           ` Andreas Schwab
2021-11-16  9:29           ` Andreas Schwab
2021-11-16  9:38             ` Carlo Arenas
2021-11-16  9:55               ` Andreas Schwab
2021-11-16 11:00                 ` [PATCH] grep: avoid setting UTF mode when not needed Carlo Marcelo Arenas Belón
2021-11-16 12:32                   ` Ævar Arnfjörð Bjarmason
2021-11-16 13:35                     ` Hamza Mahfooz
2021-11-17 14:31                       ` Ævar Arnfjörð Bjarmason
2021-11-16 18:22                     ` Carlo Arenas
2021-11-16 18:48                     ` Junio C Hamano
2021-11-17 10:23                   ` [PATCH v2] grep: avoid setting UTF mode when dangerous with PCRE Carlo Marcelo Arenas Belón
2021-11-18  7:29                     ` Junio C Hamano
2021-11-18 10:15                       ` Carlo Arenas
2021-11-18 12:49                         ` Hamza Mahfooz
2021-11-19  6:58                           ` Ævar Arnfjörð Bjarmason
2021-11-18 21:21                     ` Hamza Mahfooz
2021-11-17 18:46               ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data René Scharfe
2021-11-17 19:56                 ` Carlo Arenas [this message]
2021-11-17 20:59                 ` Ævar Arnfjörð Bjarmason
2021-11-17 21:53                   ` Carlo Arenas
2021-11-18  0:00                     ` Ævar Arnfjörð Bjarmason
2021-11-18 18:17                   ` Junio C Hamano
2021-11-18 20:57                     ` René Scharfe
2021-11-19  7:00                       ` Ævar Arnfjörð Bjarmason
2021-11-19 16:08                         ` René Scharfe
2021-11-19 17:33                           ` Carlo Arenas
2021-11-19 17:11                         ` Junio C Hamano
2021-10-15 18:05 ` [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Junio C Hamano
2021-10-15 18:24   ` Hamza Mahfooz
2021-10-15 19:28     ` Junio C Hamano
2021-10-15 19:40       ` Hamza Mahfooz
2021-10-15 19:49       ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPUEspg3Ox41H-dCKzqV5oS4UP5Pkdoq0wUaXtxfNWWRTO0k1w@mail.gmail.com \
    --to=carenas@gmail.com \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=l.s.r@web.de \
    --cc=schwab@linux-m68k.org \
    --cc=someguy@effective-light.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.