All of lore.kernel.org
 help / color / mirror / Atom feed
* regex compilation error with --color-words
@ 2023-03-29 22:55 Eric Sunshine
  2023-03-30  7:55 ` Diomidis Spinellis
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Sunshine @ 2023-03-29 22:55 UTC (permalink / raw)
  To: Git List, Diomidis Spinellis
  Cc: René Scharfe, Junio C Hamano, demerphq, Mario Grgic

I'm encountering a failure on macOS High Sierra 10.13.6 when using
--color-words:

    % git show --color-words HEAD
    fatal: invalid regular expression:
[[:alpha:]_'][[:alnum:]_']*|0[xb]?[0-9a-fA-F_]*|[0-9a-fA-F_]+(\.[0-9a-fA-F_]+)?([eE][-+]?[0-9_]+)?|=>|-[rwxoRWXOezsfdlpSugkbctTBMAC>]|~~|::|&&=|\|\|=|//=|\*\*=|&&|\|\||//|\+\+|--|\*\*|\.\.\.?|[-+*/%.^&<>=!|]=|=~|!~|<<|<>|<=>|>>|[^[:space:]]|[<C0>-<FF>][<80>-<BF>]+

This crash happens when viewing the commit I sent to Peff today[1],
though it doesn't happen with all commits. The problem bisects to:

    Author: Diomidis Spinellis <dds@aueb.gr>
    Date:   Fri Aug 26 11:58:15 2022 +0300

    grep: fix multibyte regex handling under macOS

    The commit 29de20504e (Makefile: fix default regex settings on
    Darwin, 2013-05-11) fixed t0070-fundamental.sh under Darwin (macOS) by
    adopting Git's regex library.  However, this library is compiled with
    NO_MBSUPPORT, which causes git-grep to work incorrectly on multibyte
    (e.g. UTF-8) files.  Current macOS versions pass t0070-fundamental.sh
    with the native macOS regex library, which also supports multibyte
    characters.

    Adjust the Makefile to use the native regex library, and call
    setlocale(3) to set CTYPE according to the user's preference.
    The setlocale call is required on all platforms, but in platforms
    supporting gettext(3), setlocale was called as a side-effect of
    initializing gettext.  Therefore, move the CTYPE setlocale call from
    gettext.c to common-main.c and the corresponding locale.h include
    into git-compat-util.h.

    Thanks to the global initialization of CTYPE setlocale, the test-tool
    regex command now works correctly with supported multibyte regexes, and
    is used to set the MB_REGEX test prerequisite by assessing a platform's
    support for them.

    Signed-off-by: Diomidis Spinellis <dds@aueb.gr>
    Signed-off-by: Junio C Hamano <gitster@pobox.com>

I see that this same commit is also the subject of another bug report
currently being discussed[2], so I've Cc:'d the participants of that
thread, as well.

Any pointers aimed at getting this resolved would be appreciated.

[1]: https://lore.kernel.org/git/CAPig+cQiOGrDSUc34jHEBp87Rx-dnXNcPcF76bu0SJoOzD+1hw@mail.gmail.com/
[2]: https://lore.kernel.org/git/MW4PR20MB5517583CBEEF34B1E87CCF1290859@MW4PR20MB5517.namprd20.prod.outlook.com/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: regex compilation error with --color-words
  2023-03-29 22:55 regex compilation error with --color-words Eric Sunshine
@ 2023-03-30  7:55 ` Diomidis Spinellis
  2023-03-31 20:44   ` René Scharfe
  0 siblings, 1 reply; 15+ messages in thread
From: Diomidis Spinellis @ 2023-03-30  7:55 UTC (permalink / raw)
  To: Eric Sunshine, Git List
  Cc: René Scharfe, Junio C Hamano, demerphq, Mario Grgic

On 30-Mar-23 1:55, Eric Sunshine wrote:
> I'm encountering a failure on macOS High Sierra 10.13.6 when using
> --color-words:

The built-in word separation regular expression pattern for the Perl 
language fails to work with the macOS regex engine.  The same also 
happens with the FreeBSD one (tested on 14.0).

The issue can be replicated through the following sequence of commands.

git init color-words
cd color-words
echo '*.pl   diff=perl' >.gitattributes
echo 'print 42;' >t.pl
git add t.pl
git commit -am Add
git show --color-words

Strangely, I haven't been able to reproduce the failure with egrep on 
any of the two platforms.

egrep 
'[[:alpha:]_'\''][[:alnum:]_'\'']*|0[xb]?[0-9a-fA-F_]*|[0-9a-fA-F_]+(\.[0-9a-fA-F_]+)?([eE][-+]?[0-9_]+)?|=>|-[rwxoRWXOezsfdlpSugkbctTBMAC>]|~~|::|&&=|\|\|=|//=|\*\*=|&&|\|\||//|\+\+|--|\*\*|\.\.\.?|[-+*/%.^&<>=!|]=|=~|!~|<<|<>|<=>|>>|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+' 
/dev/null

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: regex compilation error with --color-words
  2023-03-30  7:55 ` Diomidis Spinellis
@ 2023-03-31 20:44   ` René Scharfe
  2023-04-02  9:44     ` René Scharfe
  0 siblings, 1 reply; 15+ messages in thread
From: René Scharfe @ 2023-03-31 20:44 UTC (permalink / raw)
  To: Diomidis Spinellis, Eric Sunshine, Git List
  Cc: Junio C Hamano, demerphq, Mario Grgic

Am 30.03.23 um 09:55 schrieb Diomidis Spinellis:
> On 30-Mar-23 1:55, Eric Sunshine wrote:
>> I'm encountering a failure on macOS High Sierra 10.13.6 when using
>> --color-words:
>
> The built-in word separation regular expression pattern for the Perl language fails to work with the macOS regex engine.  The same also happens with the FreeBSD one (tested on 14.0).
>
> The issue can be replicated through the following sequence of commands.
>
> git init color-words
> cd color-words
> echo '*.pl   diff=perl' >.gitattributes
> echo 'print 42;' >t.pl
> git add t.pl
> git commit -am Add
> git show --color-words

Or in Git's own repo:

   $ git log -p --color-words --no-merges '*.c'
   Schwerwiegend: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*|0[xXbB][0-9a-fA-F]+[lLuU]*|\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?|[-+*/<>%&^|=!]=|--|\+\+|<<=?|>>=?|&&|\|\||::|->\*?|\.\*|<=>|[^[:space:]]|[<C0>-<FF>][<80>-<BF>]+
   commit 14b9a044798ebb3858a1f1a1377309a3d6054ac8
   [...]

The error disappears when localization is turned off:

   $ LANG=C git log -p --color-words --no-merges '*.c' >/dev/null
   # just finishes without an error

The issue also vanishes when the "|[\xc0-\xff][\x80-\xbf]+" part is
removed that the macros PATTERNS and IPATTERN in userdiff.c append.

So it seems regcomp(1) on macOS doesn't like invalid Unicode characters
unless it's in ASCII mode (LANG=C).  664d44ee7f (userdiff: simplify
word-diff safeguard, 2011-01-11) explains that this part exists to match
a multi-byte UTF-8 character.  With a regcomp(1) that supports
multi-byte characters natively they need to be specified differently, I
guess, perhaps like this "[^\x00-\x7f]"?

> Strangely, I haven't been able to reproduce the failure with egrep on any of the two platforms.
>
> egrep '[[:alpha:]_'\''][[:alnum:]_'\'']*|0[xb]?[0-9a-fA-F_]*|[0-9a-fA-F_]+(\.[0-9a-fA-F_]+)?([eE][-+]?[0-9_]+)?|=>|-[rwxoRWXOezsfdlpSugkbctTBMAC>]|~~|::|&&=|\|\|=|//=|\*\*=|&&|\|\||//|\+\+|--|\*\*|\.\.\.?|[-+*/%.^&<>=!|]=|=~|!~|<<|<>|<=>|>>|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+' /dev/null

No idea how to specify non-ASCII bytes in shell or regex.  '\xNN' does
not seem to do the trick.  printf(1) interpretes octal numbers, though:

   $ echo ö | egrep $(printf "[\200-\377]")
   egrep: illegal byte sequence

(The regex contains "illegal bytes" -- UTF-8 multi-byte sequences cut
short; the "ö" is OK.)

René

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: regex compilation error with --color-words
  2023-03-31 20:44   ` René Scharfe
@ 2023-04-02  9:44     ` René Scharfe
  2023-04-03 16:29       ` Junio C Hamano
  0 siblings, 1 reply; 15+ messages in thread
From: René Scharfe @ 2023-04-02  9:44 UTC (permalink / raw)
  To: Diomidis Spinellis, Eric Sunshine, Git List
  Cc: Junio C Hamano, demerphq, Mario Grgic

Am 31.03.23 um 22:44 schrieb René Scharfe:
> Am 30.03.23 um 09:55 schrieb Diomidis Spinellis:
>> On 30-Mar-23 1:55, Eric Sunshine wrote:
>>> I'm encountering a failure on macOS High Sierra 10.13.6 when using
>>> --color-words:
>>
>> The built-in word separation regular expression pattern for the Perl language fails to work with the macOS regex engine.  The same also happens with the FreeBSD one (tested on 14.0).
>>
>> The issue can be replicated through the following sequence of commands.
>>
>> git init color-words
>> cd color-words
>> echo '*.pl   diff=perl' >.gitattributes
>> echo 'print 42;' >t.pl
>> git add t.pl
>> git commit -am Add
>> git show --color-words
>
> Or in Git's own repo:
>
>    $ git log -p --color-words --no-merges '*.c'
>    Schwerwiegend: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*|0[xXbB][0-9a-fA-F]+[lLuU]*|\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?|[-+*/<>%&^|=!]=|--|\+\+|<<=?|>>=?|&&|\|\||::|->\*?|\.\*|<=>|[^[:space:]]|[<C0>-<FF>][<80>-<BF>]+
>    commit 14b9a044798ebb3858a1f1a1377309a3d6054ac8
>    [...]
>
> The error disappears when localization is turned off:
>
>    $ LANG=C git log -p --color-words --no-merges '*.c' >/dev/null
>    # just finishes without an error
>
> The issue also vanishes when the "|[\xc0-\xff][\x80-\xbf]+" part is
> removed that the macros PATTERNS and IPATTERN in userdiff.c append.
>
> So it seems regcomp(1) on macOS doesn't like invalid Unicode characters
> unless it's in ASCII mode (LANG=C).  664d44ee7f (userdiff: simplify
> word-diff safeguard, 2011-01-11) explains that this part exists to match
> a multi-byte UTF-8 character.  With a regcomp(1) that supports
> multi-byte characters natively they need to be specified differently, I
> guess, perhaps like this "[^\x00-\x7f]"?

Actually we can drop the "|[\xc0-\xff][\x80-\xbf]+" part in that case
because the "[^[:space:]]" suffices.  And we probably need to do that at
runtime because it depends on the locale.  The rather elaborate patch
below does that.  It leaks the truncated word_regex, which isn't that
bad because it's done only once per run, but certainly untidy.

I suspect/hope this can be done simpler and cleaner after refactoring
the userdiff code to allow for runtime assembly of regular expressions.

And it's regcomp(3), or rather regexec(3), not regcomp(1).

---
 userdiff.c | 38 ++++++++++++++++++++++++++++++++++++--
 userdiff.h |  2 ++
 2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/userdiff.c b/userdiff.c
index 09203fbc35..aa2cd150ba 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -9,6 +9,8 @@ static struct userdiff_driver *drivers;
 static int ndrivers;
 static int drivers_alloc;

+#define OR_MULTI_BYTE_CHAR "|[\xc0-\xff][\x80-\xbf]+"
+
 #define PATTERNS(lang, rx, wrx) { \
 	.name = lang, \
 	.binary = -1, \
@@ -16,7 +18,9 @@ static int drivers_alloc;
 		.pattern = rx, \
 		.cflags = REG_EXTENDED, \
 	}, \
-	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
+	.word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
+	.is_builtin = 1, \
+	.has_multi_byte_char_fallback = 1, \
 }
 #define IPATTERN(lang, rx, wrx) { \
 	.name = lang, \
@@ -25,7 +29,9 @@ static int drivers_alloc;
 		.pattern = rx, \
 		.cflags = REG_EXTENDED | REG_ICASE, \
 	}, \
-	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
+	.word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
+	.is_builtin = 1, \
+	.has_multi_byte_char_fallback = 1, \
 }

 /*
@@ -330,6 +336,25 @@ static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver,
 	return 0;
 }

+static int regexec_support_multi_byte_chars(void)
+{
+	static const char not_space[] = "[^[:space:]]";
+	static const char utf8_multi_byte_char[] = "\xc2\xa3";
+	regex_t re;
+	regmatch_t match;
+	static int result = -1;
+
+	if (result != -1)
+		return result;
+	if (regcomp(&re, not_space, REG_EXTENDED))
+		BUG("invalid regular expression: %s", not_space);
+	result = !regexec(&re, utf8_multi_byte_char, 1, &match, 0) &&
+		match.rm_so == 0 &&
+		match.rm_eo == strlen(utf8_multi_byte_char);
+	regfree(&re);
+	return result;
+}
+
 static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t len)
 {
 	struct find_by_namelen_data udcbdata = {
@@ -337,6 +362,15 @@ static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t
 		.len = len,
 	};
 	for_each_userdiff_driver(userdiff_find_by_namelen_cb, &udcbdata);
+	if (udcbdata.driver &&
+	    udcbdata.driver->is_builtin &&
+	    udcbdata.driver->has_multi_byte_char_fallback &&
+	    regexec_support_multi_byte_chars()) {
+		const char *word_regex = udcbdata.driver->word_regex;
+		udcbdata.driver->word_regex = xmemdupz(word_regex,
+			strlen(word_regex) - strlen(OR_MULTI_BYTE_CHAR));
+		udcbdata.driver->has_multi_byte_char_fallback = 0;
+	}
 	return udcbdata.driver;
 }

diff --git a/userdiff.h b/userdiff.h
index 24419db697..83f5863d58 100644
--- a/userdiff.h
+++ b/userdiff.h
@@ -21,6 +21,8 @@ struct userdiff_driver {
 	const char *textconv;
 	struct notes_cache *textconv_cache;
 	int textconv_want_cache;
+	int is_builtin;
+	int has_multi_byte_char_fallback;
 };
 enum userdiff_driver_type {
 	USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0,
--
2.40.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: regex compilation error with --color-words
  2023-04-02  9:44     ` René Scharfe
@ 2023-04-03 16:29       ` Junio C Hamano
  2023-04-03 19:32         ` René Scharfe
  0 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2023-04-03 16:29 UTC (permalink / raw)
  To: René Scharfe
  Cc: Diomidis Spinellis, Eric Sunshine, Git List, demerphq, Mario Grgic

René Scharfe <l.s.r@web.de> writes:

> Actually we can drop the "|[\xc0-\xff][\x80-\xbf]+" part in that case
> because the "[^[:space:]]" suffices.  And we probably need to do that at
> runtime because it depends on the locale.  The rather elaborate patch
> below does that.  It leaks the truncated word_regex, which isn't that
> bad because it's done only once per run, but certainly untidy.

Small ugliness like what we see below is fine in a technology
demonostration.

> I suspect/hope this can be done simpler and cleaner after refactoring
> the userdiff code to allow for runtime assembly of regular expressions.

Do we expect "does the regcomp(3) and regexec(3) correctly match a
non-space multi-byte UTF-8 sequence as expected?" to be the only
choices, do we expect we will choose from only two, and do we expect
that the differences between the MB version and fallback version to
be the same "OR_MULTI_BYTE_CHAR may be omitted"?  For now I think
it would be reasonable to answer yes to all three.

How are .is_builtin and .has_multi_byte_char_fallback bits expected
to be used?  For what kind of files do we expect them to be set
differently?

In the simplest case, I would imagine that we could do this

 	...
 	const char *word_regex;
+	const char *word_regex_wo_mb;
 	const char *textconv;
 	...

in the definition of "struct userdifif_driver", use

 #define PATTERNS(lang, rx, wrx) { \
 	...
 	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
+	.word_regex_wo_mb = wrx "|[^[:space:]]", \
 }

and similar for IPATTERN, and make a non-NULL .word.regex_wo_mb
serve as the .has_multi_byte_char_fallback bit to trigger "does our
regex engine do a good job for multi-byte?"

Thanks.

> diff --git a/userdiff.c b/userdiff.c
> index 09203fbc35..aa2cd150ba 100644
> --- a/userdiff.c
> +++ b/userdiff.c
> @@ -9,6 +9,8 @@ static struct userdiff_driver *drivers;
>  static int ndrivers;
>  static int drivers_alloc;
>
> +#define OR_MULTI_BYTE_CHAR "|[\xc0-\xff][\x80-\xbf]+"
> +
>  #define PATTERNS(lang, rx, wrx) { \
>  	.name = lang, \
>  	.binary = -1, \
> @@ -16,7 +18,9 @@ static int drivers_alloc;
>  		.pattern = rx, \
>  		.cflags = REG_EXTENDED, \
>  	}, \
> -	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
> +	.word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
> +	.is_builtin = 1, \
> +	.has_multi_byte_char_fallback = 1, \
>  }
>  #define IPATTERN(lang, rx, wrx) { \
>  	.name = lang, \
> @@ -25,7 +29,9 @@ static int drivers_alloc;
>  		.pattern = rx, \
>  		.cflags = REG_EXTENDED | REG_ICASE, \
>  	}, \
> -	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
> +	.word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
> +	.is_builtin = 1, \
> +	.has_multi_byte_char_fallback = 1, \
>  }
>
>  /*
> @@ -330,6 +336,25 @@ static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver,
>  	return 0;
>  }
>
> +static int regexec_support_multi_byte_chars(void)
> +{
> +	static const char not_space[] = "[^[:space:]]";
> +	static const char utf8_multi_byte_char[] = "\xc2\xa3";
> +	regex_t re;
> +	regmatch_t match;
> +	static int result = -1;
> +
> +	if (result != -1)
> +		return result;
> +	if (regcomp(&re, not_space, REG_EXTENDED))
> +		BUG("invalid regular expression: %s", not_space);
> +	result = !regexec(&re, utf8_multi_byte_char, 1, &match, 0) &&
> +		match.rm_so == 0 &&
> +		match.rm_eo == strlen(utf8_multi_byte_char);
> +	regfree(&re);
> +	return result;
> +}
> +
>  static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t len)
>  {
>  	struct find_by_namelen_data udcbdata = {
> @@ -337,6 +362,15 @@ static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t
>  		.len = len,
>  	};
>  	for_each_userdiff_driver(userdiff_find_by_namelen_cb, &udcbdata);
> +	if (udcbdata.driver &&
> +	    udcbdata.driver->is_builtin &&
> +	    udcbdata.driver->has_multi_byte_char_fallback &&
> +	    regexec_support_multi_byte_chars()) {
> +		const char *word_regex = udcbdata.driver->word_regex;
> +		udcbdata.driver->word_regex = xmemdupz(word_regex,
> +			strlen(word_regex) - strlen(OR_MULTI_BYTE_CHAR));
> +		udcbdata.driver->has_multi_byte_char_fallback = 0;
> +	}
>  	return udcbdata.driver;
>  }
>
> diff --git a/userdiff.h b/userdiff.h
> index 24419db697..83f5863d58 100644
> --- a/userdiff.h
> +++ b/userdiff.h
> @@ -21,6 +21,8 @@ struct userdiff_driver {
>  	const char *textconv;
>  	struct notes_cache *textconv_cache;
>  	int textconv_want_cache;
> +	int is_builtin;
> +	int has_multi_byte_char_fallback;
>  };
>  enum userdiff_driver_type {
>  	USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0,
> --
> 2.40.0

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: regex compilation error with --color-words
  2023-04-03 16:29       ` Junio C Hamano
@ 2023-04-03 19:32         ` René Scharfe
  2023-04-06 20:19           ` [PATCH] userdiff: support regexec(3) with multi-byte support René Scharfe
  0 siblings, 1 reply; 15+ messages in thread
From: René Scharfe @ 2023-04-03 19:32 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Diomidis Spinellis, Eric Sunshine, Git List, demerphq, Mario Grgic

Am 03.04.23 um 18:29 schrieb Junio C Hamano:
> René Scharfe <l.s.r@web.de> writes:
>
>> Actually we can drop the "|[\xc0-\xff][\x80-\xbf]+" part in that case
>> because the "[^[:space:]]" suffices.  And we probably need to do that at
>> runtime because it depends on the locale.  The rather elaborate patch
>> below does that.  It leaks the truncated word_regex, which isn't that
>> bad because it's done only once per run, but certainly untidy.
>
> Small ugliness like what we see below is fine in a technology
> demonostration.
>
>> I suspect/hope this can be done simpler and cleaner after refactoring
>> the userdiff code to allow for runtime assembly of regular expressions.
>
> Do we expect "does the regcomp(3) and regexec(3) correctly match a
> non-space multi-byte UTF-8 sequence as expected?" to be the only
> choices, do we expect we will choose from only two, and do we expect
> that the differences between the MB version and fallback version to
> be the same "OR_MULTI_BYTE_CHAR may be omitted"?  For now I think
> it would be reasonable to answer yes to all three.
>
> How are .is_builtin and .has_multi_byte_char_fallback bits expected
> to be used?  For what kind of files do we expect them to be set
> differently?

.is_builtin is unnecessary here.  It is a remnant of me noticing that we
don't add "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" to user-defined
patterns, but .has_multi_byte_char_fallback alone suffices.

>
> In the simplest case, I would imagine that we could do this
>
>  	...
>  	const char *word_regex;
> +	const char *word_regex_wo_mb;
>  	const char *textconv;
>  	...
>
> in the definition of "struct userdifif_driver", use
>
>  #define PATTERNS(lang, rx, wrx) { \
>  	...
>  	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
> +	.word_regex_wo_mb = wrx "|[^[:space:]]", \

Ah, nice, no allocation or string manipulation at runtime at all, at
the small cost of having near-duplicate static strings.

>  }
>
> and similar for IPATTERN, and make a non-NULL .word.regex_wo_mb
> serve as the .has_multi_byte_char_fallback bit to trigger "does our
> regex engine do a good job for multi-byte?"
>
> Thanks.
>
>> diff --git a/userdiff.c b/userdiff.c
>> index 09203fbc35..aa2cd150ba 100644
>> --- a/userdiff.c
>> +++ b/userdiff.c
>> @@ -9,6 +9,8 @@ static struct userdiff_driver *drivers;
>>  static int ndrivers;
>>  static int drivers_alloc;
>>
>> +#define OR_MULTI_BYTE_CHAR "|[\xc0-\xff][\x80-\xbf]+"
>> +
>>  #define PATTERNS(lang, rx, wrx) { \
>>  	.name = lang, \
>>  	.binary = -1, \
>> @@ -16,7 +18,9 @@ static int drivers_alloc;
>>  		.pattern = rx, \
>>  		.cflags = REG_EXTENDED, \
>>  	}, \
>> -	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
>> +	.word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
>> +	.is_builtin = 1, \
>> +	.has_multi_byte_char_fallback = 1, \
>>  }
>>  #define IPATTERN(lang, rx, wrx) { \
>>  	.name = lang, \
>> @@ -25,7 +29,9 @@ static int drivers_alloc;
>>  		.pattern = rx, \
>>  		.cflags = REG_EXTENDED | REG_ICASE, \
>>  	}, \
>> -	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
>> +	.word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
>> +	.is_builtin = 1, \
>> +	.has_multi_byte_char_fallback = 1, \
>>  }
>>
>>  /*
>> @@ -330,6 +336,25 @@ static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver,
>>  	return 0;
>>  }
>>
>> +static int regexec_support_multi_byte_chars(void)
>> +{
>> +	static const char not_space[] = "[^[:space:]]";
>> +	static const char utf8_multi_byte_char[] = "\xc2\xa3";
>> +	regex_t re;
>> +	regmatch_t match;
>> +	static int result = -1;
>> +
>> +	if (result != -1)
>> +		return result;
>> +	if (regcomp(&re, not_space, REG_EXTENDED))
>> +		BUG("invalid regular expression: %s", not_space);
>> +	result = !regexec(&re, utf8_multi_byte_char, 1, &match, 0) &&
>> +		match.rm_so == 0 &&
>> +		match.rm_eo == strlen(utf8_multi_byte_char);
>> +	regfree(&re);
>> +	return result;
>> +}
>> +
>>  static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t len)
>>  {
>>  	struct find_by_namelen_data udcbdata = {
>> @@ -337,6 +362,15 @@ static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t
>>  		.len = len,
>>  	};
>>  	for_each_userdiff_driver(userdiff_find_by_namelen_cb, &udcbdata);
>> +	if (udcbdata.driver &&
>> +	    udcbdata.driver->is_builtin &&
>> +	    udcbdata.driver->has_multi_byte_char_fallback &&
>> +	    regexec_support_multi_byte_chars()) {
>> +		const char *word_regex = udcbdata.driver->word_regex;
>> +		udcbdata.driver->word_regex = xmemdupz(word_regex,
>> +			strlen(word_regex) - strlen(OR_MULTI_BYTE_CHAR));
>> +		udcbdata.driver->has_multi_byte_char_fallback = 0;
>> +	}
>>  	return udcbdata.driver;
>>  }
>>
>> diff --git a/userdiff.h b/userdiff.h
>> index 24419db697..83f5863d58 100644
>> --- a/userdiff.h
>> +++ b/userdiff.h
>> @@ -21,6 +21,8 @@ struct userdiff_driver {
>>  	const char *textconv;
>>  	struct notes_cache *textconv_cache;
>>  	int textconv_want_cache;
>> +	int is_builtin;
>> +	int has_multi_byte_char_fallback;
>>  };
>>  enum userdiff_driver_type {
>>  	USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0,
>> --
>> 2.40.0

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] userdiff: support regexec(3) with multi-byte support
  2023-04-03 19:32         ` René Scharfe
@ 2023-04-06 20:19           ` René Scharfe
  2023-04-06 22:35             ` Johannes Sixt
                               ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: René Scharfe @ 2023-04-06 20:19 UTC (permalink / raw)
  To: Git List
  Cc: Diomidis Spinellis, Eric Sunshine, demerphq, Mario Grgic,
	D. Ben Knoble, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Jeff King

Since 1819ad327b (grep: fix multibyte regex handling under macOS,
2022-08-26) we use the system library for all regular expression
matching on macOS, not just for git grep.  It supports multi-byte
strings and rejects invalid multi-byte characters.

This broke all built-in userdiff word regexes in UTF-8 locales because
they all include such invalid bytes in expressions that are intended to
match multi-byte characters without explicit support for that from the
regex engine.

"|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" is added to all built-in word
regexes to match a single non-space or multi-byte character.  The \xNN
characters are invalid if interpreted as UTF-8 because they have their
high bit set, which indicates they are part of a multi-byte character,
but they are surrounded by single-byte characters.

Replace that expression with "|[^[:space:]]" if the regex engine
supports multi-byte matching, as there is no need to have an explicit
range for multi-byte characters then.  Check for that capability at
runtime, because it depends on the locale and thus on environment
variables.  Construct the full replacement expression at build time
and just switch it in if necessary to avoid string manipulation and
allocations at runtime.

Additionally the word regex for tex contains the expression
"[a-zA-Z0-9\x80-\xff]+" with a similarly invalid range.  The best
replacement with only valid characters that I can come up with is
"([a-zA-Z0-9]|[^\x01-\x7f])+".  Unlike the original it matches NUL
characters, though.  Assuming that tex files usually don't contain NUL
this should be acceptable.

Reported-by: D. Ben Knoble <ben.knoble@gmail.com>
Reported-by: Eric Sunshine <sunshine@sunshineco.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 t/t4034-diff-words.sh |  4 ++++
 userdiff.c            | 31 +++++++++++++++++++++++++++++--
 userdiff.h            |  1 +
 3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/t/t4034-diff-words.sh b/t/t4034-diff-words.sh
index 15764ee9ac..74586f3813 100755
--- a/t/t4034-diff-words.sh
+++ b/t/t4034-diff-words.sh
@@ -69,6 +69,10 @@ test_language_driver () {
 		echo "* diff='"$lang"'" >.gitattributes &&
 		word_diff --color-words
 	'
+	test_expect_success "diff driver '$lang' in Islandic" '
+		LANG=is_IS.UTF-8 LANGUAGE=is LC_ALL="$is_IS_locale" \
+		word_diff --color-words
+	'
 }

 test_expect_success setup '
diff --git a/userdiff.c b/userdiff.c
index 09203fbc35..eaec6ebb5e 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -17,6 +17,7 @@ static int drivers_alloc;
 		.cflags = REG_EXTENDED, \
 	}, \
 	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
+	.word_regex_multi_byte = wrx "|[^[:space:]]", \
 }
 #define IPATTERN(lang, rx, wrx) { \
 	.name = lang, \
@@ -26,6 +27,7 @@ static int drivers_alloc;
 		.cflags = REG_EXTENDED | REG_ICASE, \
 	}, \
 	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
+	.word_regex_multi_byte = wrx "|[^[:space:]]", \
 }

 /*
@@ -294,7 +296,7 @@ PATTERNS("scheme",
 	 /* All other words should be delimited by spaces or parentheses */
 	 "|([^][)(}{[ \t])+"),
 PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
-	 "\\\\[a-zA-Z@]+|\\\\.|[a-zA-Z0-9\x80-\xff]+"),
+	 "\\\\[a-zA-Z@]+|\\\\.|([a-zA-Z0-9]|[^\x01-\x7f])+"),
 { "default", NULL, NULL, -1, { NULL, 0 } },
 };
 #undef PATTERNS
@@ -330,6 +332,25 @@ static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver,
 	return 0;
 }

+static int regexec_supports_multi_byte_chars(void)
+{
+	static const char not_space[] = "[^[:space:]]";
+	static const char utf8_multi_byte_char[] = "\xc2\xa3";
+	regex_t re;
+	regmatch_t match;
+	static int result = -1;
+
+	if (result != -1)
+		return result;
+	if (regcomp(&re, not_space, REG_EXTENDED))
+		BUG("invalid regular expression: %s", not_space);
+	result = !regexec(&re, utf8_multi_byte_char, 1, &match, 0) &&
+		match.rm_so == 0 &&
+		match.rm_eo == strlen(utf8_multi_byte_char);
+	regfree(&re);
+	return result;
+}
+
 static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t len)
 {
 	struct find_by_namelen_data udcbdata = {
@@ -405,7 +426,13 @@ int userdiff_config(const char *k, const char *v)
 struct userdiff_driver *userdiff_find_by_name(const char *name)
 {
 	int len = strlen(name);
-	return userdiff_find_by_namelen(name, len);
+	struct userdiff_driver *driver = userdiff_find_by_namelen(name, len);
+	if (driver && driver->word_regex_multi_byte) {
+		if (regexec_supports_multi_byte_chars())
+			driver->word_regex = driver->word_regex_multi_byte;
+		driver->word_regex_multi_byte = NULL;
+	}
+	return driver;
 }

 struct userdiff_driver *userdiff_find_by_path(struct index_state *istate,
diff --git a/userdiff.h b/userdiff.h
index 24419db697..d726804c3e 100644
--- a/userdiff.h
+++ b/userdiff.h
@@ -18,6 +18,7 @@ struct userdiff_driver {
 	int binary;
 	struct userdiff_funcname funcname;
 	const char *word_regex;
+	const char *word_regex_multi_byte;
 	const char *textconv;
 	struct notes_cache *textconv_cache;
 	int textconv_want_cache;
--
2.40.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] userdiff: support regexec(3) with multi-byte support
  2023-04-06 20:19           ` [PATCH] userdiff: support regexec(3) with multi-byte support René Scharfe
@ 2023-04-06 22:35             ` Johannes Sixt
  2023-04-07  7:49               ` René Scharfe
  2023-04-07 14:41             ` D. Ben Knoble
  2023-04-07 17:23             ` Eric Sunshine
  2 siblings, 1 reply; 15+ messages in thread
From: Johannes Sixt @ 2023-04-06 22:35 UTC (permalink / raw)
  To: René Scharfe
  Cc: Git List, Diomidis Spinellis, Eric Sunshine, demerphq,
	Mario Grgic, D. Ben Knoble,
	Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Jeff King

Am 06.04.23 um 22:19 schrieb René Scharfe:
> Since 1819ad327b (grep: fix multibyte regex handling under macOS,
> 2022-08-26) we use the system library for all regular expression
> matching on macOS, not just for git grep.  It supports multi-byte
> strings and rejects invalid multi-byte characters.
> 
> This broke all built-in userdiff word regexes in UTF-8 locales because
> they all include such invalid bytes in expressions that are intended to
> match multi-byte characters without explicit support for that from the
> regex engine.
> 
> "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" is added to all built-in word
> regexes to match a single non-space or multi-byte character.  The \xNN
> characters are invalid if interpreted as UTF-8 because they have their
> high bit set, which indicates they are part of a multi-byte character,
> but they are surrounded by single-byte characters.

Perhpas the expression should be "[\xc4\x80-\xf7\xbf\xbf\xbf]+", i.e.,
sequences of code points U+0080 to U+10FFFF?

> 
> Replace that expression with "|[^[:space:]]" if the regex engine
> supports multi-byte matching, as there is no need to have an explicit
> range for multi-byte characters then.

This is not equivalent. The original treated a sequence of non-ASCII
characters as a word. The new version treats each individual non-space
character (both ASCII and non-ASCII) as a word.

> Additionally the word regex for tex contains the expression
> "[a-zA-Z0-9\x80-\xff]+" with a similarly invalid range.  The best
> replacement with only valid characters that I can come up with is
> "([a-zA-Z0-9]|[^\x01-\x7f])+".  Unlike the original it matches NUL
> characters, though.  Assuming that tex files usually don't contain NUL
> this should be acceptable.

This is acceptable, of course. The replacement range looks sensible.

-- Hannes


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] userdiff: support regexec(3) with multi-byte support
  2023-04-06 22:35             ` Johannes Sixt
@ 2023-04-07  7:49               ` René Scharfe
  2023-04-07 10:56                 ` Johannes Sixt
  0 siblings, 1 reply; 15+ messages in thread
From: René Scharfe @ 2023-04-07  7:49 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Git List, Diomidis Spinellis, Eric Sunshine, demerphq,
	Mario Grgic, D. Ben Knoble,
	Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Jeff King

Am 07.04.23 um 00:35 schrieb Johannes Sixt:
> Am 06.04.23 um 22:19 schrieb René Scharfe:
>> Since 1819ad327b (grep: fix multibyte regex handling under macOS,
>> 2022-08-26) we use the system library for all regular expression
>> matching on macOS, not just for git grep.  It supports multi-byte
>> strings and rejects invalid multi-byte characters.
>>
>> This broke all built-in userdiff word regexes in UTF-8 locales because
>> they all include such invalid bytes in expressions that are intended to
>> match multi-byte characters without explicit support for that from the
>> regex engine.
>>
>> "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" is added to all built-in word
>> regexes to match a single non-space or multi-byte character.  The \xNN
>> characters are invalid if interpreted as UTF-8 because they have their
>> high bit set, which indicates they are part of a multi-byte character,
>> but they are surrounded by single-byte characters.
>
> Perhpas the expression should be "[\xc4\x80-\xf7\xbf\xbf\xbf]+", i.e.,
> sequences of code points U+0080 to U+10FFFF?

regcomp(3) on macOS doesn't like it:

fatal: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*|0[xXbB][0-9a-fA-F]+[lLuU]*|\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?|[-+*/<>%&^|=!]=|--|\+\+|<<=?|>>=?|&&|\|\||::|->\*?|\.\*|<=>|[^[:space:]]|[Ā-????]

Looks like it objects to U+10FFFF here; "[\xc4\x80-\xf3\xa0\x80\x80]" is
accepted for example.

\xc4\x80 is U+0100, by the way; U+0080 would be \xc2\x80.  And
regcomp(3) doesn't like that either ("[\xc2\x80-\xf3\xa0\x80\x80]"):

fatal: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*|0[xXbB][0-9a-fA-F]+[lLuU]*|\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?|[-+*/<>%&^|=!]=|--|\+\+|<<=?|>>=?|&&|\|\||::|->\*?|\.\*|<=>|[^[:space:]]|[<U+0080>-󠀀]

>> Replace that expression with "|[^[:space:]]" if the regex engine
>> supports multi-byte matching, as there is no need to have an explicit
>> range for multi-byte characters then.
>
> This is not equivalent. The original treated a sequence of non-ASCII
> characters as a word. The new version treats each individual non-space
> character (both ASCII and non-ASCII) as a word.

I assume you mean "The original treated [a single non-space as well as]
a sequence of non-ASCII characters [making up a single multi-byte
character] as a word.".  That works as intended by 664d44ee7f (userdiff:
simplify word-diff safeguard, 2011-01-11).

The new one doesn't match multi-byte whitespace anymore.  What other
differences do they have?

René

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] userdiff: support regexec(3) with multi-byte support
  2023-04-07  7:49               ` René Scharfe
@ 2023-04-07 10:56                 ` Johannes Sixt
  0 siblings, 0 replies; 15+ messages in thread
From: Johannes Sixt @ 2023-04-07 10:56 UTC (permalink / raw)
  To: René Scharfe
  Cc: Git List, Diomidis Spinellis, Eric Sunshine, demerphq,
	Mario Grgic, D. Ben Knoble,
	Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Jeff King

Am 07.04.23 um 09:49 schrieb René Scharfe:
> Am 07.04.23 um 00:35 schrieb Johannes Sixt:
>> This is not equivalent. The original treated a sequence of non-ASCII
>> characters as a word. The new version treats each individual non-space
>> character (both ASCII and non-ASCII) as a word.
> 
> I assume you mean "The original treated [a single non-space as well as]
> a sequence of non-ASCII characters [making up a single multi-byte
> character] as a word.".  That works as intended by 664d44ee7f (userdiff:
> simplify word-diff safeguard, 2011-01-11).

I misread the original RE. I thought it would lump multiple multi-byte
characters together into one word, but it does not; sorry for that. It
looks like your suggested replacement is behaviorally identical to the
original after all, except perhaps for this one:

> The new one doesn't match multi-byte whitespace anymore.

but I did not find a reference that confirms it. I don't think we need
to bend over backwards to keep this compatibility, though.

-- Hannes


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] userdiff: support regexec(3) with multi-byte support
  2023-04-06 20:19           ` [PATCH] userdiff: support regexec(3) with multi-byte support René Scharfe
  2023-04-06 22:35             ` Johannes Sixt
@ 2023-04-07 14:41             ` D. Ben Knoble
  2023-04-07 16:02               ` Junio C Hamano
  2023-04-07 17:23             ` Eric Sunshine
  2 siblings, 1 reply; 15+ messages in thread
From: D. Ben Knoble @ 2023-04-07 14:41 UTC (permalink / raw)
  To: René Scharfe
  Cc: Git List, Diomidis Spinellis, Eric Sunshine, demerphq,
	Mario Grgic, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Jeff King

On Thu, Apr 6, 2023 at 4:19 PM René Scharfe <l.s.r@web.de> wrote:
>
> Since 1819ad327b (grep: fix multibyte regex handling under macOS,
> 2022-08-26) we use the system library for all regular expression
> matching on macOS, not just for git grep.  It supports multi-byte
> strings and rejects invalid multi-byte characters.
>
> This broke all built-in userdiff word regexes in UTF-8 locales because
> they all include such invalid bytes in expressions that are intended to
> match multi-byte characters without explicit support for that from the
> regex engine.
>
> "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" is added to all built-in word
> regexes to match a single non-space or multi-byte character.  The \xNN
> characters are invalid if interpreted as UTF-8 because they have their
> high bit set, which indicates they are part of a multi-byte character,
> but they are surrounded by single-byte characters.
>
> Replace that expression with "|[^[:space:]]" if the regex engine
> supports multi-byte matching, as there is no need to have an explicit
> range for multi-byte characters then.  Check for that capability at
> runtime, because it depends on the locale and thus on environment
> variables.  Construct the full replacement expression at build time
> and just switch it in if necessary to avoid string manipulation and
> allocations at runtime.
>
> Additionally the word regex for tex contains the expression
> "[a-zA-Z0-9\x80-\xff]+" with a similarly invalid range.  The best
> replacement with only valid characters that I can come up with is
> "([a-zA-Z0-9]|[^\x01-\x7f])+".  Unlike the original it matches NUL
> characters, though.  Assuming that tex files usually don't contain NUL
> this should be acceptable.
>
> Reported-by: D. Ben Knoble <ben.knoble@gmail.com>
> Reported-by: Eric Sunshine <sunshine@sunshineco.com>
> Helped-by: Junio C Hamano <gitster@pobox.com>
> Signed-off-by: René Scharfe <l.s.r@web.de>

I tested the patch locally on top of ae73b2c8f1 and it solved my
problem. Seems like there's still some further discussion, though.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] userdiff: support regexec(3) with multi-byte support
  2023-04-07 14:41             ` D. Ben Knoble
@ 2023-04-07 16:02               ` Junio C Hamano
  0 siblings, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2023-04-07 16:02 UTC (permalink / raw)
  To: D. Ben Knoble
  Cc: René Scharfe, Git List, Diomidis Spinellis, Eric Sunshine,
	demerphq, Mario Grgic, Ævar Arnfjörð Bjarmason,
	Jeff King

"D. Ben Knoble" <ben.knoble@gmail.com> writes:

> On Thu, Apr 6, 2023 at 4:19 PM René Scharfe <l.s.r@web.de> wrote:
>>
>> Since 1819ad327b (grep: fix multibyte regex handling under macOS,
>> 2022-08-26) we use the system library for all regular expression
>> matching on macOS, not just for git grep.  It supports multi-byte
>> strings and rejects invalid multi-byte characters.
>> ...
>> Reported-by: D. Ben Knoble <ben.knoble@gmail.com>
>> Reported-by: Eric Sunshine <sunshine@sunshineco.com>
>> Helped-by: Junio C Hamano <gitster@pobox.com>
>> Signed-off-by: René Scharfe <l.s.r@web.de>
>
> I tested the patch locally on top of ae73b2c8f1 and it solved my
> problem. Seems like there's still some further discussion, though.

Thanks very much for reporting and testing.  Also, thanks all for
working on this solution.

Will queue.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] userdiff: support regexec(3) with multi-byte support
  2023-04-06 20:19           ` [PATCH] userdiff: support regexec(3) with multi-byte support René Scharfe
  2023-04-06 22:35             ` Johannes Sixt
  2023-04-07 14:41             ` D. Ben Knoble
@ 2023-04-07 17:23             ` Eric Sunshine
  2 siblings, 0 replies; 15+ messages in thread
From: Eric Sunshine @ 2023-04-07 17:23 UTC (permalink / raw)
  To: René Scharfe
  Cc: Git List, Diomidis Spinellis, demerphq, Mario Grgic,
	D. Ben Knoble, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Jeff King

On Thu, Apr 6, 2023 at 4:19 PM René Scharfe <l.s.r@web.de> wrote:
> Since 1819ad327b (grep: fix multibyte regex handling under macOS,
> 2022-08-26) we use the system library for all regular expression
> matching on macOS, not just for git grep.  It supports multi-byte
> strings and rejects invalid multi-byte characters.
>
> This broke all built-in userdiff word regexes in UTF-8 locales because
> they all include such invalid bytes in expressions that are intended to
> match multi-byte characters without explicit support for that from the
> regex engine.
>
> "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" is added to all built-in word
> regexes to match a single non-space or multi-byte character.  The \xNN
> characters are invalid if interpreted as UTF-8 because they have their
> high bit set, which indicates they are part of a multi-byte character,
> but they are surrounded by single-byte characters.
>
> Replace that expression with "|[^[:space:]]" if the regex engine
> supports multi-byte matching, as there is no need to have an explicit
> range for multi-byte characters then.  Check for that capability at
> runtime, because it depends on the locale and thus on environment
> variables.  Construct the full replacement expression at build time
> and just switch it in if necessary to avoid string manipulation and
> allocations at runtime.
>
> Reported-by: D. Ben Knoble <ben.knoble@gmail.com>
> Reported-by: Eric Sunshine <sunshine@sunshineco.com>
> Helped-by: Junio C Hamano <gitster@pobox.com>
> Signed-off-by: René Scharfe <l.s.r@web.de>

Thank you, René! This patch resolves the problem I was experiencing[1].

I'm happy to have --color-words working again.

[1]: https://lore.kernel.org/git/CAPig+cSNmws2b7f7aRA2C56kvQYG3w_g+KhYdqhtmf+XhtAMhQ@mail.gmail.com/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: regex compilation error with --color-words
@ 2023-04-06 16:03 Benjamin
  0 siblings, 0 replies; 15+ messages in thread
From: Benjamin @ 2023-04-06 16:03 UTC (permalink / raw)
  To: Ben Knoble; +Cc: dds, demerphq, git, gitster, l.s.r, mario_grgic, sunshine

Ack, forgot the link apparently:

[1]: https://lore.kernel.org/git/CALnO6CAZtwfGY4SYeOuKqdP9+e_0EYNf4F703DRQB7UUfd_bUg@mail.gmail.com/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: regex compilation error with --color-words
@ 2023-04-06 13:39 Benjamin
  0 siblings, 0 replies; 15+ messages in thread
From: Benjamin @ 2023-04-06 13:39 UTC (permalink / raw)
  To: sunshine; +Cc: dds, demerphq, git, gitster, l.s.r, mario_grgic

This was also discussed in February[1]. Giving --word-diff is to git-diff is enough to trigger the error in non-C locales.

I am pleasantly surprised to learn that LC_ALL=C temporarily avoids the problem; that makes it useable again, for now. But a fix is much welcome!

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-04-07 17:24 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-29 22:55 regex compilation error with --color-words Eric Sunshine
2023-03-30  7:55 ` Diomidis Spinellis
2023-03-31 20:44   ` René Scharfe
2023-04-02  9:44     ` René Scharfe
2023-04-03 16:29       ` Junio C Hamano
2023-04-03 19:32         ` René Scharfe
2023-04-06 20:19           ` [PATCH] userdiff: support regexec(3) with multi-byte support René Scharfe
2023-04-06 22:35             ` Johannes Sixt
2023-04-07  7:49               ` René Scharfe
2023-04-07 10:56                 ` Johannes Sixt
2023-04-07 14:41             ` D. Ben Knoble
2023-04-07 16:02               ` Junio C Hamano
2023-04-07 17:23             ` Eric Sunshine
2023-04-06 13:39 regex compilation error with --color-words Benjamin
2023-04-06 16:03 Benjamin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.