* [PATCH 1/2] grep/pcre2: limit the instances in which UTF mode is enabled @ 2021-11-18 8:41 Hamza Mahfooz 2021-11-18 8:41 ` [PATCH 2/2] ci: add a job for PCRE2 Hamza Mahfooz 2021-11-18 10:04 ` [PATCH 1/2] grep/pcre2: limit the instances in which UTF mode is enabled Carlo Arenas 0 siblings, 2 replies; 8+ messages in thread From: Hamza Mahfooz @ 2021-11-18 8:41 UTC (permalink / raw) To: git Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, René Scharfe, Andreas Schwab, Hamza Mahfooz, Ævar Arnfjörð Bjarmason UTF mode is enabled for cases that cause older versions of PCRE to break. This is primarily due to the fact that we can't make as many assumptions on the kind of data that is fed to "git grep." So, limit when UTF mode can be enabled by introducing "is_log" to struct grep_opt, checking to see if it's a non-zero value in compile_pcre2_pattern() and only mutating it in cmd_log() so that we know "git log" was invoked if it's set to a non-zero value. Fixes: ae39ba431a (grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data, 2021-10-15) Suggested-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> --- builtin/log.c | 1 + grep.c | 2 +- grep.h | 1 + t/t7812-grep-icase-non-ascii.sh | 2 +- 4 files changed, 4 insertions(+), 2 deletions(-) diff --git a/builtin/log.c b/builtin/log.c index f75d87e8d7..040b0b533f 100644 --- a/builtin/log.c +++ b/builtin/log.c @@ -751,6 +751,7 @@ int cmd_log(int argc, const char **argv, const char *prefix) git_config(git_log_config, NULL); repo_init_revisions(the_repository, &rev, prefix); + rev.grep_filter.is_log = 1; rev.always_show_header = 1; memset(&opt, 0, sizeof(opt)); opt.def = "HEAD"; diff --git a/grep.c b/grep.c index f6e113e9f0..665d86f007 100644 --- a/grep.c +++ b/grep.c @@ -382,7 +382,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt } options |= PCRE2_CASELESS; } - if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) || + if ((opt->is_log && !opt->ignore_locale && !has_non_ascii(p->pattern)) || (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && !(!opt->ignore_case && (p->fixed || p->is_fixed)))) diff --git a/grep.h b/grep.h index 3e8815c347..64634c6a3f 100644 --- a/grep.h +++ b/grep.h @@ -167,6 +167,7 @@ struct grep_opt { int extended_regexp_option; int pattern_type_option; int ignore_locale; + int is_log; char colors[NR_GREP_COLORS][COLOR_MAXLEN]; unsigned pre_context; unsigned post_context; diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh index 22487d90fd..1da6b07a57 100755 --- a/t/t7812-grep-icase-non-ascii.sh +++ b/t/t7812-grep-icase-non-ascii.sh @@ -60,7 +60,7 @@ test_expect_success GETTEXT_LOCALE,PCRE 'log --author with an ascii pattern on U test_write_lines "forth" >file4 && git add file4 && git commit --author="À Ú Thor <author@example.com>" -m sécond && - git log -1 --color=always --perl-regexp --author=".*Thor" >log && + git log -1 --color=always --perl-regexp --author=". . Thor" >log && grep Author log >actual.raw && test_decode_color <actual.raw >actual && test_cmp expected actual -- 2.34.0 ^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH 2/2] ci: add a job for PCRE2 2021-11-18 8:41 [PATCH 1/2] grep/pcre2: limit the instances in which UTF mode is enabled Hamza Mahfooz @ 2021-11-18 8:41 ` Hamza Mahfooz 2021-11-18 9:53 ` [PATCH v2 " Hamza Mahfooz 2021-11-18 10:32 ` [PATCH " Ævar Arnfjörð Bjarmason 2021-11-18 10:04 ` [PATCH 1/2] grep/pcre2: limit the instances in which UTF mode is enabled Carlo Arenas 1 sibling, 2 replies; 8+ messages in thread From: Hamza Mahfooz @ 2021-11-18 8:41 UTC (permalink / raw) To: git Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, René Scharfe, Andreas Schwab, Hamza Mahfooz, Ævar Arnfjörð Bjarmason Since, git aspires to support many PCRE2 versions, it is only sensible to test changes to git against versions of PCRE2 that are deemed to be notable, to ensure those changes to git don't cause regressions when using the aforementioned PCRE2 versions. This is underscored by the fact that, commit ae39ba431a (grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data, 2021-10-15) was able to make it's way to master while causing an otherwise easy to catch regression when an older version of PCRE2 is used. So, to address this issue, add a job for PCRE2 to build all of the notable versions, compile all of them against git and only run the tests that can possibly be impacted by PCRE2. Suggested-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> --- .github/workflows/main.yml | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml index 6ed6a9e807..ae96fc0251 100644 --- a/.github/workflows/main.yml +++ b/.github/workflows/main.yml @@ -319,3 +319,29 @@ jobs: - uses: actions/checkout@v2 - run: ci/install-dependencies.sh - run: ci/test-documentation.sh + pcre2: + needs: ci-config + if: needs.ci-config.outputs.enabled == 'yes' + strategy: + fail-fast: false + matrix: + jit: ['--enable-jit', '--disable-jit'] + utf: ['--enable-utf', '--disable-utf'] + version: [31, 34, 36, 39] + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v2 + - uses: actions/checkout@v2 + with: + repository: 'PhilipHazel/pcre2' + path: 'compat/pcre2-repo' + - run: ci/install-dependencies.sh + - run: cd compat/pcre2-repo + - run: git checkout pcre2-10.${{matrix.version}} + - run: ./autogen.sh + - run: ./configure ${{matrix.jit}} ${{matrix.utf}} --prefix="$PWD/inst" + - run: make + - run: sudo make install + - run: cd ../.. + - run: make USE_LIBPCRE=Y CFLAGS=-O3 LIBPCREDIR="$PWD/compat/pcre2-repo/inst" + - run: cd t && make *{grep,log,pickaxe}* -- 2.34.0 ^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH v2 2/2] ci: add a job for PCRE2 2021-11-18 8:41 ` [PATCH 2/2] ci: add a job for PCRE2 Hamza Mahfooz @ 2021-11-18 9:53 ` Hamza Mahfooz 2021-11-18 10:32 ` [PATCH " Ævar Arnfjörð Bjarmason 1 sibling, 0 replies; 8+ messages in thread From: Hamza Mahfooz @ 2021-11-18 9:53 UTC (permalink / raw) To: git Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, René Scharfe, Andreas Schwab, Hamza Mahfooz, Ævar Arnfjörð Bjarmason Since, git aspires to support many PCRE2 versions, it is only sensible to test changes to git against versions of PCRE2 that are deemed to be notable, to ensure those changes to git don't cause regressions when using the aforementioned PCRE2 versions. This is underscored by the fact that, commit ae39ba431a (grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data, 2021-10-15) was able to make it's way to master while causing an otherwise easy to catch regression when an older version of PCRE2 is used. So, to address this issue, add a job for PCRE2 to build all of the notable versions, compile all of them against git and only run the tests that can possibly be impacted by PCRE2. Suggested-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> --- v2: use 'ref' instead of doing a manual checkout and add jobname to the env. --- .github/workflows/main.yml | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml index 6ed6a9e807..6d8906139d 100644 --- a/.github/workflows/main.yml +++ b/.github/workflows/main.yml @@ -319,3 +319,31 @@ jobs: - uses: actions/checkout@v2 - run: ci/install-dependencies.sh - run: ci/test-documentation.sh + pcre2: + needs: ci-config + if: needs.ci-config.outputs.enabled == 'yes' + strategy: + fail-fast: false + matrix: + jit: ['--enable-jit', '--disable-jit'] + utf: ['--enable-utf', '--disable-utf'] + version: [31, 34, 36, 39] + env: + jobname: linux-gcc-default + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v2 + - uses: actions/checkout@v2 + with: + repository: 'PhilipHazel/pcre2' + path: 'compat/pcre2-repo' + ref: pcre2-10.${{matrix.version}} + - run: ci/install-dependencies.sh + - run: cd compat/pcre2-repo + - run: ./autogen.sh + - run: ./configure ${{matrix.jit}} ${{matrix.utf}} --prefix="$PWD/inst" + - run: make + - run: sudo make install + - run: cd ../.. + - run: make USE_LIBPCRE=Y CFLAGS=-O3 LIBPCREDIR="$PWD/compat/pcre2-repo/inst" + - run: cd t && make *{grep,log,pickaxe}* -- 2.34.0 ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] ci: add a job for PCRE2 2021-11-18 8:41 ` [PATCH 2/2] ci: add a job for PCRE2 Hamza Mahfooz 2021-11-18 9:53 ` [PATCH v2 " Hamza Mahfooz @ 2021-11-18 10:32 ` Ævar Arnfjörð Bjarmason 2021-11-22 22:26 ` Hamza Mahfooz 1 sibling, 1 reply; 8+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2021-11-18 10:32 UTC (permalink / raw) To: Hamza Mahfooz Cc: git, Junio C Hamano, Carlo Marcelo Arenas Belón, René Scharfe, Andreas Schwab On Thu, Nov 18 2021, Hamza Mahfooz wrote: > Since, git aspires to support many PCRE2 versions, it is only sensible to > test changes to git against versions of PCRE2 that are deemed to be > notable, to ensure those changes to git don't cause regressions when using > the aforementioned PCRE2 versions. This is underscored by the fact that, > commit ae39ba431a (grep/pcre2: fix an edge case concerning ascii patterns > and UTF-8 data, 2021-10-15) was able to make it's way to master while > causing an otherwise easy to catch regression when an older version of > PCRE2 is used. So, to address this issue, add a job for PCRE2 to build all > of the notable versions, compile all of them against git and only run the > tests that can possibly be impacted by PCRE2. > > Suggested-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> > Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> > --- > .github/workflows/main.yml | 26 ++++++++++++++++++++++++++ > 1 file changed, 26 insertions(+) > > diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml > index 6ed6a9e807..ae96fc0251 100644 > --- a/.github/workflows/main.yml > +++ b/.github/workflows/main.yml > @@ -319,3 +319,29 @@ jobs: > - uses: actions/checkout@v2 > - run: ci/install-dependencies.sh > - run: ci/test-documentation.sh > + pcre2: > + needs: ci-config > + if: needs.ci-config.outputs.enabled == 'yes' > + strategy: > + fail-fast: false > + matrix: > + jit: ['--enable-jit', '--disable-jit'] > + utf: ['--enable-utf', '--disable-utf'] > + version: [31, 34, 36, 39] > + runs-on: ubuntu-latest > + steps: > + - uses: actions/checkout@v2 > + - uses: actions/checkout@v2 > + with: > + repository: 'PhilipHazel/pcre2' > + path: 'compat/pcre2-repo' > + - run: ci/install-dependencies.sh > + - run: cd compat/pcre2-repo > + - run: git checkout pcre2-10.${{matrix.version}} > + - run: ./autogen.sh > + - run: ./configure ${{matrix.jit}} ${{matrix.utf}} --prefix="$PWD/inst" Thanks a lot for following-up on this. Do you have a link to a sample run of this to see how it looks? I think that the --enable-utf here is a bug, my fault, I suggested using that option. But on closer inspection I should have said --{enable,disable}-unicode. Eyeballing the configure.ac in pcre2.git now and checking if/how it passes our tests I think it might be a noop unless --enable-ebcdic is also in play, which we don't need to test. Any reason for picking those specific versions? I think we do need to test on older than 10.31 (released in early 2018). On the other hand PCREv2 wasn't released at all until 2015, and got up to 10.20 all in that one year, and I think git may have been one of the first popular packages to use it. I (optionally at first) us from PCRE v1 to v2, and IIRC something like only 1-2 obscure packages in Debian depended on it at the time, with hundreds depending on PCREv1. We should also add a "latest" in there, and then just map that ourselves to: git tag -l --sort=-version:refname | head -n 1 I.e. it's probably overkill to test the pcre2.git bleeding edge, and it might produce undue noise, but testing the latest release in addition to select older versions seems like a good thing to do. > + - run: make > + - run: sudo make install I don't think "sudo" here is needed (and I'm surprised it works in CI, presumably a noop or it's always root). Just a: make install Should do, it'll also take care of doing "make" > + - run: cd ../.. > + - run: make USE_LIBPCRE=Y CFLAGS=-O3 LIBPCREDIR="$PWD/compat/pcre2-repo/inst" I was going to say: "We should probably leave CFLAGS undefined and use whatever the default is, or is there a good reason for -O3?". But I see this is another thing copied from my throw-away one-liner in [1], sorry. We can just skip that, I just happened to be testing with CFLAGS=-O3 locally at the time & copied that into my E-Mail. > + - run: cd t && make *{grep,log,pickaxe}* 1. https://lore.kernel.org/git/211117.867dd67spq.gmgdl@evledraar.gmail.com/ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] ci: add a job for PCRE2 2021-11-18 10:32 ` [PATCH " Ævar Arnfjörð Bjarmason @ 2021-11-22 22:26 ` Hamza Mahfooz 0 siblings, 0 replies; 8+ messages in thread From: Hamza Mahfooz @ 2021-11-22 22:26 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: git, Junio C Hamano, Carlo Marcelo Arenas Belón, René Scharfe, Andreas Schwab On Thu, Nov 18 2021 at 11:32:50 AM +0100, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: >> > Thanks a lot for following-up on this. Do you have a link to a sample > run of this to see how it looks? https://github.com/effective-light/git/actions/runs/1492352516 (it looks like the disable unicode case isn't worth considering, since it never runs through the tests successfully). > But on closer inspection I should have said > --{enable,disable}-unicode. Eyeballing the configure.ac in pcre2.git > now > and checking if/how it passes our tests I think it might be a noop > unless --enable-ebcdic is also in play, which we don't need to test. Looks like ebcdic and unicode can't be enabled at the same time. > Any reason for picking those specific versions? I think we do need to > test on older than 10.31 (released in early 2018). I chose them primarily because they were brought up on the other thread. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 1/2] grep/pcre2: limit the instances in which UTF mode is enabled 2021-11-18 8:41 [PATCH 1/2] grep/pcre2: limit the instances in which UTF mode is enabled Hamza Mahfooz 2021-11-18 8:41 ` [PATCH 2/2] ci: add a job for PCRE2 Hamza Mahfooz @ 2021-11-18 10:04 ` Carlo Arenas 2021-11-18 19:40 ` Carlo Marcelo Arenas Belón 1 sibling, 1 reply; 8+ messages in thread From: Carlo Arenas @ 2021-11-18 10:04 UTC (permalink / raw) To: Hamza Mahfooz Cc: git, Junio C Hamano, René Scharfe, Andreas Schwab, Ævar Arnfjörð Bjarmason On Thu, Nov 18, 2021 at 12:42 AM Hamza Mahfooz <someguy@effective-light.com> wrote: > > UTF mode is enabled for cases that cause older versions of PCRE to break. Not really; what is broken is our implementation of how PCRE gets called and that ignores the fact that giving it invalid UTF-8 (which might be valid LATIN-1 text for example) and telling it to do a match using UTF, will fail (if we are lucky even with an error) or might even crash (and obviously don't match) if we also tell it to not do the validation, and which is something we do when JIT is enabled. > This is primarily due to the fact that we can't make as many assumptions on > the kind of data that is fed to "git grep." So, limit when UTF mode can be > enabled by introducing "is_log" to struct grep_opt, checking to see if it's > a non-zero value in compile_pcre2_pattern() and only mutating it in > cmd_log() so that we know "git log" was invoked if it's set to a non-zero > value. I haven't tested it, but I think that for this to work with the log, we also need to make sure that all log entries that might not be UTF-8 get first iconv() which is why probably Æevar mentioned[1] i18n.commitEncoding in his old email. Of course doing that translation only makes sense if the log output is meant to be UTF-8 which is why there is all that logic about being in an UTF-8 locale or not which probably needs to be adjusted as well. Carlo [1] https://lore.kernel.org/git/87v92bju64.fsf@evledraar.gmail.com/ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 1/2] grep/pcre2: limit the instances in which UTF mode is enabled 2021-11-18 10:04 ` [PATCH 1/2] grep/pcre2: limit the instances in which UTF mode is enabled Carlo Arenas @ 2021-11-18 19:40 ` Carlo Marcelo Arenas Belón 2021-11-18 20:53 ` Junio C Hamano 0 siblings, 1 reply; 8+ messages in thread From: Carlo Marcelo Arenas Belón @ 2021-11-18 19:40 UTC (permalink / raw) To: Hamza Mahfooz Cc: git, Junio C Hamano, René Scharfe, Andreas Schwab, Ævar Arnfjörð Bjarmason On Thu, Nov 18, 2021 at 02:04:08AM -0800, Carlo Arenas wrote: > > I haven't tested it, but I think that for this to work with the log, > we also need to make sure that all log entries that might not be UTF-8 > get first iconv() which is why probably Æevar mentioned[1] > i18n.commitEncoding in his old email. Having tested it, it seems this might work, but needs at least the following to make the test for it meaningful: diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh index 1da6b07a57..e1fbdc0f80 100755 --- a/t/t7812-grep-icase-non-ascii.sh +++ b/t/t7812-grep-icase-non-ascii.sh @@ -47,6 +47,8 @@ test_expect_success REGEX_LOCALE 'grep string with regex, with -F' ' ' test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' ' + GIT_COMMITTER_NAME="C O Mitter" && + GIT_COMMITTER_EMAIL="committer@example.com" && git commit -m first && git log --format=%f -i -S"TILRAUN: HALLÓ HEIMUR!" >actual && echo first >expected && @@ -72,10 +74,10 @@ test_expect_success GETTEXT_LOCALE,PCRE 'log --committer with an ascii pattern o EOF test_write_lines "fifth" >file5 && git add file5 && - GIT_COMMITTER_NAME="Ç O Mîtter" && + GIT_COMMITTER_NAME=$(echo "Ç O Mîtter" | iconv -t ISO-8859-1) && GIT_COMMITTER_EMAIL="committer@example.com" && git -c i18n.commitEncoding=latin1 commit -m thïrd && - git -c i18n.logOutputEncoding=latin1 log -1 --pretty=fuller --color=always --perl-regexp --committer=" O.*" >log && + git log -1 --pretty=fuller --color=always --perl-regexp --committer=" O.*" >log && grep Commit: log >actual.raw && test_decode_color <actual.raw >actual && test_cmp expected actual The first hunk is a nice to have and mimics what is done in the previous test where a non UTF match will match instead an equivalent ASCII text, but is not strictly needed unless the expression is also tightened. Note the second hunk is incomplete as the message is still not really encoded as latin1 and will need an equivalent processing, but left it out so it can be done together with fixing the corresponding test. I have to admit that adding to a complex condition might be obscuring some other edge case, and indeed the fact the test passed even without this fix is concerning. From my reading of what Ævar suggested originally[1], it would seem that the new is_log condition should be on an branch of its own, and more code should be needed to make sure the contents we are going to use are indeed utf8 by the time it hits pcre2_*match() before setting it. Carlo [1] https://lore.kernel.org/git/211116.86tugcp8mg.gmgdl@evledraar.gmail.com/ ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH 1/2] grep/pcre2: limit the instances in which UTF mode is enabled 2021-11-18 19:40 ` Carlo Marcelo Arenas Belón @ 2021-11-18 20:53 ` Junio C Hamano 0 siblings, 0 replies; 8+ messages in thread From: Junio C Hamano @ 2021-11-18 20:53 UTC (permalink / raw) To: Carlo Marcelo Arenas Belón Cc: Hamza Mahfooz, git, René Scharfe, Andreas Schwab, Ævar Arnfjörð Bjarmason Carlo Marcelo Arenas Belón <carenas@gmail.com> writes: > I have to admit that adding to a complex condition might be obscuring > some other edge case, and indeed the fact the test passed even without > this fix is concerning. > > From my reading of what Ævar suggested originally[1], it would seem that > the new is_log condition should be on an branch of its own, and more code > should be needed to make sure the contents we are going to use are indeed > utf8 by the time it hits pcre2_*match() before setting it. At least, let's not consider pursuing this approach to pile on more iffy code on top of broken code in the release. Unless I hear otherwise, I am leaning to revert the whole three patches of the original series and call it a day for both 'maint' (for 2.34.1) and 'master'. Thanks. ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2021-11-22 22:27 UTC | newest] Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-11-18 8:41 [PATCH 1/2] grep/pcre2: limit the instances in which UTF mode is enabled Hamza Mahfooz 2021-11-18 8:41 ` [PATCH 2/2] ci: add a job for PCRE2 Hamza Mahfooz 2021-11-18 9:53 ` [PATCH v2 " Hamza Mahfooz 2021-11-18 10:32 ` [PATCH " Ævar Arnfjörð Bjarmason 2021-11-22 22:26 ` Hamza Mahfooz 2021-11-18 10:04 ` [PATCH 1/2] grep/pcre2: limit the instances in which UTF mode is enabled Carlo Arenas 2021-11-18 19:40 ` Carlo Marcelo Arenas Belón 2021-11-18 20:53 ` Junio C Hamano
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.