git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Carlo Marcelo Arenas Belón" <carenas@gmail.com>
To: git@vger.kernel.org
Cc: avarab@gmail.com, gitster@pobox.com,
	"Carlo Marcelo Arenas Belón" <carenas@gmail.com>
Subject: [PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P
Date: Tue, 17 Jan 2023 02:51:23 -0800	[thread overview]
Message-ID: <20230117105123.58328-1-carenas@gmail.com> (raw)
In-Reply-To: <20230108155217.2817-1-carenas@gmail.com>

When UTF is enabled for a PCRE match, the PCRE2_UTF flag is used
by the pcre2_compile() call, but that would only allow for the
use of Unicode character properties when caseless is required
but not to include the additional UTF characters for all other
class matches.

This would result in failed matches for expressions that rely
on those properties, for ex:

  $ git grep -P '\bÆvar'

Add a configuration that could be used to enable the PCRE2_UCP
flag to correctly match those cases, when required.

The use of this has an impact on performance that has been estimated
to be significant.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
Changes since v2:
* make setting UCP and opt-in as suggested by Ævar
* remove performance test and instead add a test

 Documentation/config/grep.txt |  6 ++++++
 grep.c                        | 11 ++++++++++-
 grep.h                        |  1 +
 t/t7810-grep.sh               | 13 +++++++++++++
 4 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index e521f20390..8848db7311 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -26,3 +26,9 @@ grep.fullName::
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
 	is executed outside of a git repository.  Defaults to false.
+
+pcre.ucp::
+	If set to true, will use all Unicode Character Properties when matching
+	`\w`, `\b`, `\d` or the POSIX classes (ex: `[:alnum:]`) and PCRE is used
+	as the underlying engine. If PCRE is not being used it is ignored.
+	Defaults to false
diff --git a/grep.c b/grep.c
index 06eed69493..ceafb8937d 100644
--- a/grep.c
+++ b/grep.c
@@ -102,6 +102,12 @@ int grep_config(const char *var, const char *value, void *cb)
 			return config_error_nonbool(var);
 		return color_parse(value, color);
 	}
+
+	if (!strcmp(var, "pcre.ucp")) {
+		opt->pcre_ucp = git_config_bool(var, value);
+		return 0;
+	}
+
 	return 0;
 }
 
@@ -292,8 +298,11 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (!opt->ignore_locale && is_utf8_locale() && !literal)
+	if (!opt->ignore_locale && is_utf8_locale() && !literal) {
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+		if (opt->pcre_ucp)
+			options |= PCRE2_UCP;
+	}
 
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
 	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
diff --git a/grep.h b/grep.h
index 6075f997e6..082bd3a0c7 100644
--- a/grep.h
+++ b/grep.h
@@ -171,6 +171,7 @@ struct grep_opt {
 	int file_break;
 	int heading;
 	int max_count;
+	int pcre_ucp;
 	void *priv;
 
 	void (*output)(struct grep_opt *opt, const void *data, size_t size);
diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
index 8eded6ab27..a99a967060 100755
--- a/t/t7810-grep.sh
+++ b/t/t7810-grep.sh
@@ -95,6 +95,7 @@ test_expect_success setup '
 	then
 		echo "¿" >reverse-question-mark
 	fi &&
+	echo "émotion" >ucp &&
 	git add . &&
 	test_tick &&
 	git commit -m initial
@@ -1474,6 +1475,18 @@ test_expect_success PCRE 'grep -P backreferences work (the PCRE NO_AUTO_CAPTURE
 	test_cmp hello_world actual
 '
 
+test_expect_success PCRE 'grep -c pcre.ucp -P fixes \b' '
+	cat >expected <<-\EOF &&
+	ucp:émotion
+	EOF
+	cat >pattern <<-\EOF &&
+	\bémotion
+	EOF
+	LC_ALL=en_US.UTF-8 git -c pcre.ucp=true grep -P -f pattern >actual &&
+	test_cmp expected actual &&
+	LC_ALL=en_US.UTF-8 test_must_fail git grep -P -f pattern
+'
+
 test_expect_success 'grep -G invalidpattern properly dies ' '
 	test_must_fail git grep -G "a["
 '
-- 
2.37.1 (Apple Git-137.1)


  parent reply	other threads:[~2023-01-17 10:52 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-08  6:23 [PATCH] grep: correctly identify utf-8 characters with \{b,w} in -P Carlo Marcelo Arenas Belón
2023-01-08  6:39 ` Junio C Hamano
2023-01-08 15:52 ` [PATCH v2] " Carlo Marcelo Arenas Belón
2023-01-09 11:35   ` Ævar Arnfjörð Bjarmason
2023-01-09 18:40     ` bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} " Paul Eggert
2023-01-09 19:51       ` Ævar Arnfjörð Bjarmason
2023-01-09 23:12         ` Paul Eggert
2023-01-10  4:49     ` [PATCH v2] grep: correctly identify utf-8 characters with \{b,w} " Carlo Arenas
2023-01-16 20:48     ` Junio C Hamano
2023-04-03 21:38     ` -P '\d' in GNU and git grep Paul Eggert
2023-04-04  3:30       ` bug#60690: " Jim Meyering
2023-04-04  6:46         ` Paul Eggert
2023-04-04 15:31           ` Jim Meyering
2023-04-04  6:56       ` Carlo Arenas
2023-04-04 18:25         ` bug#60690: " Paul Eggert
2023-04-04 19:31           ` Junio C Hamano
2023-04-05 18:32             ` Paul Eggert
2023-04-05 19:04               ` Paul Eggert
2023-04-05 19:37               ` Junio C Hamano
2023-04-05 19:40               ` Jim Meyering
2023-04-05 20:03                 ` Paul Eggert
2023-04-05 21:20                 ` Carlo Arenas
2023-04-06 15:45               ` demerphq
2023-04-07 16:48                 ` Paul Eggert
2023-04-06 13:39             ` demerphq
2023-04-07 19:00               ` Paul Eggert
2023-04-08  5:01                 ` Carlo Arenas
2023-04-08 22:45                   ` Paul Eggert
2023-01-17 10:51   ` Carlo Marcelo Arenas Belón [this message]
2023-01-17 12:38     ` [PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P Ævar Arnfjörð Bjarmason
2023-01-17 15:19       ` Junio C Hamano
2023-01-18  7:35         ` Carlo Arenas
2023-01-18 11:49           ` Ævar Arnfjörð Bjarmason
2023-01-18 16:20             ` Junio C Hamano
2023-01-18 23:06               ` Ævar Arnfjörð Bjarmason
2023-01-18 23:24                 ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230117105123.58328-1-carenas@gmail.com \
    --to=carenas@gmail.com \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).