All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 00/10] Fix icase grep on non-ascii
@ 2016-01-28 11:56 Nguyễn Thái Ngọc Duy
  2016-01-28 11:56 ` [PATCH v5 01/10] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
                   ` (11 more replies)
  0 siblings, 12 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-01-28 11:56 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

The series fixes grep choosing fast path that only works correctly for
ascii. This is a resend of v4 [1] after rebase. I think v4 just went
off radar for some reason, nothing was wrong with it (or nobody told
me about it).

[1] http://thread.gmane.org/gmane.comp.version-control.git/273381/focus=276288

Nguyễn Thái Ngọc Duy (10):
  grep: allow -F -i combination
  grep: break down an "if" stmt in preparation for next changes
  test-regex: expose full regcomp() to the command line
  grep/icase: avoid kwsset on literal non-ascii strings
  grep/icase: avoid kwsset when -F is specified
  grep/pcre: prepare locale-dependent tables for icase matching
  gettext: add is_utf8_locale()
  grep/pcre: support utf-8
  diffcore-pickaxe: "share" regex error handling code
  diffcore-pickaxe: support case insensitive match on non-ascii

 builtin/grep.c                           |  2 +-
 diffcore-pickaxe.c                       | 27 ++++++++----
 gettext.c                                | 24 ++++++++++-
 gettext.h                                |  1 +
 grep.c                                   | 44 ++++++++++++++++++--
 grep.h                                   |  1 +
 quote.c                                  | 37 +++++++++++++++++
 quote.h                                  |  1 +
 t/t7812-grep-icase-non-ascii.sh (new +x) | 71 ++++++++++++++++++++++++++++++++
 t/t7813-grep-icase-iso.sh (new +x)       | 19 +++++++++
 test-regex.c                             | 56 ++++++++++++++++++++++---
 11 files changed, 262 insertions(+), 21 deletions(-)
 create mode 100755 t/t7812-grep-icase-non-ascii.sh
 create mode 100755 t/t7813-grep-icase-iso.sh

-- 
2.7.0.288.g1d8ad15

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v5 01/10] grep: allow -F -i combination
  2016-01-28 11:56 [PATCH v5 00/10] Fix icase grep on non-ascii Nguyễn Thái Ngọc Duy
@ 2016-01-28 11:56 ` Nguyễn Thái Ngọc Duy
  2016-01-28 11:56 ` [PATCH v5 02/10] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-01-28 11:56 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

-F means "no regex", not "case sensitive" so it should not override -i

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/grep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 5526fd7..4be0df5 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -809,7 +809,7 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 
 	if (!opt.pattern_list)
 		die(_("no pattern given."));
-	if (!opt.fixed && opt.ignore_case)
+	if (opt.ignore_case)
 		opt.regflags |= REG_ICASE;
 
 	compile_grep_patterns(&opt);
-- 
2.7.0.288.g1d8ad15

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v5 02/10] grep: break down an "if" stmt in preparation for next changes
  2016-01-28 11:56 [PATCH v5 00/10] Fix icase grep on non-ascii Nguyễn Thái Ngọc Duy
  2016-01-28 11:56 ` [PATCH v5 01/10] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
@ 2016-01-28 11:56 ` Nguyễn Thái Ngọc Duy
  2016-01-28 11:56 ` [PATCH v5 03/10] test-regex: expose full regcomp() to the command line Nguyễn Thái Ngọc Duy
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-01-28 11:56 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index 7b2b96a..e739d20 100644
--- a/grep.c
+++ b/grep.c
@@ -403,9 +403,11 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
 
-	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
+	if (is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
-	else
+	else if (opt->fixed) {
+		p->fixed = 1;
+	} else
 		p->fixed = 0;
 
 	if (p->fixed) {
-- 
2.7.0.288.g1d8ad15

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v5 03/10] test-regex: expose full regcomp() to the command line
  2016-01-28 11:56 [PATCH v5 00/10] Fix icase grep on non-ascii Nguyễn Thái Ngọc Duy
  2016-01-28 11:56 ` [PATCH v5 01/10] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
  2016-01-28 11:56 ` [PATCH v5 02/10] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
@ 2016-01-28 11:56 ` Nguyễn Thái Ngọc Duy
  2016-01-29  5:31   ` Eric Sunshine
  2016-01-28 11:56 ` [PATCH v5 04/10] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-01-28 11:56 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 test-regex.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 50 insertions(+), 6 deletions(-)

diff --git a/test-regex.c b/test-regex.c
index 0dc598e..3b5641c 100644
--- a/test-regex.c
+++ b/test-regex.c
@@ -1,19 +1,63 @@
 #include "git-compat-util.h"
+#include "gettext.h"
+
+struct reg_flag {
+	const char *name;
+	int flag;
+};
+
+static struct reg_flag reg_flags[] = {
+	{ "EXTENDED",	 REG_EXTENDED	},
+	{ "NEWLINE",	 REG_NEWLINE	},
+	{ "ICASE",	 REG_ICASE	},
+	{ "NOTBOL",	 REG_NOTBOL	},
+#ifdef REG_STARTEND
+	{ "STARTEND",	 REG_STARTEND	},
+#endif
+	{ NULL, 0 }
+};
 
 int main(int argc, char **argv)
 {
-	char *pat = "[^={} \t]+";
-	char *str = "={}\nfred";
+	const char *pat;
+	const char *str;
+	int flags = 0;
 	regex_t r;
 	regmatch_t m[1];
 
-	if (regcomp(&r, pat, REG_EXTENDED | REG_NEWLINE))
+	if (argc == 1) {
+		/* special case, bug check */
+		pat = "[^={} \t]+";
+		str = "={}\nfred";
+		flags = REG_EXTENDED | REG_NEWLINE;
+	} else {
+		argv++;
+		pat = *argv++;
+		str = *argv++;
+		while (*argv) {
+			struct reg_flag *rf;
+			for (rf = reg_flags; rf->name; rf++)
+				if (!strcmp(*argv, rf->name)) {
+					flags |= rf->flag;
+					break;
+				}
+			if (!rf->name)
+				die("do not recognize %s", *argv);
+			argv++;
+		}
+		git_setup_gettext();
+	}
+
+	if (regcomp(&r, pat, flags))
 		die("failed regcomp() for pattern '%s'", pat);
-	if (regexec(&r, str, 1, m, 0))
-		die("no match of pattern '%s' to string '%s'", pat, str);
+	if (regexec(&r, str, 1, m, 0)) {
+		if (argc == 1)
+			die("no match of pattern '%s' to string '%s'", pat, str);
+		return 1;
+	}
 
 	/* http://sourceware.org/bugzilla/show_bug.cgi?id=3957  */
-	if (m[0].rm_so == 3) /* matches '\n' when it should not */
+	if (argc == 1 && m[0].rm_so == 3) /* matches '\n' when it should not */
 		die("regex bug confirmed: re-build git with NO_REGEX=1");
 
 	exit(0);
-- 
2.7.0.288.g1d8ad15

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v5 04/10] grep/icase: avoid kwsset on literal non-ascii strings
  2016-01-28 11:56 [PATCH v5 00/10] Fix icase grep on non-ascii Nguyễn Thái Ngọc Duy
                   ` (2 preceding siblings ...)
  2016-01-28 11:56 ` [PATCH v5 03/10] test-regex: expose full regcomp() to the command line Nguyễn Thái Ngọc Duy
@ 2016-01-28 11:56 ` Nguyễn Thái Ngọc Duy
  2016-01-29  6:18   ` Eric Sunshine
  2016-01-28 11:56 ` [PATCH v5 05/10] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-01-28 11:56 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

When we detect the pattern is just a literal string, we avoid heavy
regex engine and use fast substring search implemented in kwsset.c.
But kws uses git-ctype which is locale-independent so it does not know
how to fold case properly outside ascii range. Let regcomp or pcre
take care of this case instead. Slower, but accurate.

Helped-by: René Scharfe <l.s.r@web.de>
Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                                   |  7 ++++++-
 t/t7812-grep-icase-non-ascii.sh (new +x) | 23 +++++++++++++++++++++++
 2 files changed, 29 insertions(+), 1 deletion(-)
 create mode 100755 t/t7812-grep-icase-non-ascii.sh

diff --git a/grep.c b/grep.c
index e739d20..f2180a2 100644
--- a/grep.c
+++ b/grep.c
@@ -4,6 +4,7 @@
 #include "xdiff-interface.h"
 #include "diff.h"
 #include "diffcore.h"
+#include "commit.h"
 
 static int grep_source_load(struct grep_source *gs);
 static int grep_source_is_binary(struct grep_source *gs);
@@ -398,12 +399,16 @@ static int is_fixed(const char *s, size_t len)
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
+	int icase_non_ascii;
 	int err;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
+	icase_non_ascii =
+		(opt->regflags & REG_ICASE || p->ignore_case) &&
+		has_non_ascii(p->pattern);
 
-	if (is_fixed(p->pattern, p->patternlen))
+	if (!icase_non_ascii && is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
 	else if (opt->fixed) {
 		p->fixed = 1;
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
new file mode 100755
index 0000000..6eff490
--- /dev/null
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -0,0 +1,23 @@
+#!/bin/sh
+
+test_description='grep icase on non-English locales'
+
+. ./lib-gettext.sh
+
+test_expect_success GETTEXT_LOCALE 'setup' '
+	printf "TILRAUN: Halló Heimur!" >file &&
+	git add file &&
+	LC_ALL="$is_IS_locale" &&
+	export LC_ALL
+'
+
+test_have_prereq GETTEXT_LOCALE &&
+test-regex "HALLÓ" "Halló" ICASE &&
+test_set_prereq REGEX_LOCALE
+
+test_expect_success REGEX_LOCALE 'grep literal string, no -F' '
+	git grep -i "TILRAUN: Halló Heimur!" &&
+	git grep -i "TILRAUN: HALLÓ HEIMUR!"
+'
+
+test_done
-- 
2.7.0.288.g1d8ad15

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v5 05/10] grep/icase: avoid kwsset when -F is specified
  2016-01-28 11:56 [PATCH v5 00/10] Fix icase grep on non-ascii Nguyễn Thái Ngọc Duy
                   ` (3 preceding siblings ...)
  2016-01-28 11:56 ` [PATCH v5 04/10] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
@ 2016-01-28 11:56 ` Nguyễn Thái Ngọc Duy
  2016-01-29  6:23   ` Eric Sunshine
  2016-01-28 11:56 ` [PATCH v5 06/10] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-01-28 11:56 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

Similar to the previous commit, we can't use kws on icase search
outside ascii range. But we can't simply pass the pattern to
regcomp/pcre like the previous commit because it may contain regex
special characters, so we need to quote the regex first.

To avoid misquote traps that could lead to undefined behavior, we
always stick to basic regex engine in this case. We don't need fancy
features for grepping a literal string anyway.

basic_regex_quote_buf() assumes that if the pattern is in a multibyte
encoding, ascii chars must be unambiguously encoded as single
bytes. This is true at least for UTF-8. For others, let's wait until
people yell up. Chances are nobody uses multibyte, non utf-8 charsets
any more..

Helped-by: René Scharfe <l.s.r@web.de>
Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                          | 25 ++++++++++++++++++++++++-
 quote.c                         | 37 +++++++++++++++++++++++++++++++++++++
 quote.h                         |  1 +
 t/t7812-grep-icase-non-ascii.sh | 26 ++++++++++++++++++++++++++
 4 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/grep.c b/grep.c
index f2180a2..fa96a29 100644
--- a/grep.c
+++ b/grep.c
@@ -5,6 +5,7 @@
 #include "diff.h"
 #include "diffcore.h"
 #include "commit.h"
+#include "quote.h"
 
 static int grep_source_load(struct grep_source *gs);
 static int grep_source_is_binary(struct grep_source *gs);
@@ -397,6 +398,24 @@ static int is_fixed(const char *s, size_t len)
 	return 1;
 }
 
+static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
+{
+	struct strbuf sb = STRBUF_INIT;
+	int err;
+
+	basic_regex_quote_buf(&sb, p->pattern);
+	err = regcomp(&p->regexp, sb.buf, opt->regflags & ~REG_EXTENDED);
+	if (opt->debug)
+		fprintf(stderr, "fixed%s\n", sb.buf);
+	strbuf_release(&sb);
+	if (err) {
+		char errbuf[1024];
+		regerror(err, &p->regexp, errbuf, 1024);
+		regfree(&p->regexp);
+		compile_regexp_failed(p, errbuf);
+	}
+}
+
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
 	int icase_non_ascii;
@@ -411,7 +430,11 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	if (!icase_non_ascii && is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
 	else if (opt->fixed) {
-		p->fixed = 1;
+		p->fixed = !icase_non_ascii;
+		if (!p->fixed) {
+			compile_fixed_regexp(p, opt);
+			return;
+		}
 	} else
 		p->fixed = 0;
 
diff --git a/quote.c b/quote.c
index fe884d2..c67adb7 100644
--- a/quote.c
+++ b/quote.c
@@ -440,3 +440,40 @@ void tcl_quote_buf(struct strbuf *sb, const char *src)
 	}
 	strbuf_addch(sb, '"');
 }
+
+void basic_regex_quote_buf(struct strbuf *sb, const char *src)
+{
+	char c;
+
+	if (*src == '^') {
+		/* only beginning '^' is special and needs quoting */
+		strbuf_addch(sb, '\\');
+		strbuf_addch(sb, *src++);
+	}
+	if (*src == '*')
+		/* beginning '*' is not special, no quoting */
+		strbuf_addch(sb, *src++);
+
+	while ((c = *src++)) {
+		switch (c) {
+		case '[':
+		case '.':
+		case '\\':
+		case '*':
+			strbuf_addch(sb, '\\');
+			strbuf_addch(sb, c);
+			break;
+
+		case '$':
+			/* only the end '$' is special and needs quoting */
+			if (*src == '\0')
+				strbuf_addch(sb, '\\');
+			strbuf_addch(sb, c);
+			break;
+
+		default:
+			strbuf_addch(sb, c);
+			break;
+		}
+	}
+}
diff --git a/quote.h b/quote.h
index 99e04d3..362d315 100644
--- a/quote.h
+++ b/quote.h
@@ -67,5 +67,6 @@ extern char *quote_path_relative(const char *in, const char *prefix,
 extern void perl_quote_buf(struct strbuf *sb, const char *src);
 extern void python_quote_buf(struct strbuf *sb, const char *src);
 extern void tcl_quote_buf(struct strbuf *sb, const char *src);
+extern void basic_regex_quote_buf(struct strbuf *sb, const char *src);
 
 #endif
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 6eff490..aba6b15 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -20,4 +20,30 @@ test_expect_success REGEX_LOCALE 'grep literal string, no -F' '
 	git grep -i "TILRAUN: HALLÓ HEIMUR!"
 '
 
+test_expect_success REGEX_LOCALE 'grep literal string, with -F' '
+	git grep --debug -i -F "TILRAUN: Halló Heimur!"  2>&1 >/dev/null |
+		 grep fixed >debug1 &&
+	echo "fixedTILRAUN: Halló Heimur!" >expect1 &&
+	test_cmp expect1 debug1 &&
+
+	git grep --debug -i -F "TILRAUN: HALLÓ HEIMUR!"  2>&1 >/dev/null |
+		 grep fixed >debug2 &&
+	echo "fixedTILRAUN: HALLÓ HEIMUR!" >expect2 &&
+	test_cmp expect2 debug2
+'
+
+test_expect_success REGEX_LOCALE 'grep string with regex, with -F' '
+	printf "^*TILR^AUN:.* \\Halló \$He[]imur!\$" >file &&
+
+	git grep --debug -i -F "^*TILR^AUN:.* \\Halló \$He[]imur!\$" 2>&1 >/dev/null |
+		 grep fixed >debug1 &&
+	echo "fixed\\^*TILR^AUN:\\.\\* \\\\Halló \$He\\[]imur!\\\$" >expect1 &&
+	test_cmp expect1 debug1 &&
+
+	git grep --debug -i -F "^*TILR^AUN:.* \\HALLÓ \$HE[]IMUR!\$"  2>&1 >/dev/null |
+		 grep fixed >debug2 &&
+	echo "fixed\\^*TILR^AUN:\\.\\* \\\\HALLÓ \$HE\\[]IMUR!\\\$" >expect2 &&
+	test_cmp expect2 debug2
+'
+
 test_done
-- 
2.7.0.288.g1d8ad15

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v5 06/10] grep/pcre: prepare locale-dependent tables for icase matching
  2016-01-28 11:56 [PATCH v5 00/10] Fix icase grep on non-ascii Nguyễn Thái Ngọc Duy
                   ` (4 preceding siblings ...)
  2016-01-28 11:56 ` [PATCH v5 05/10] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
@ 2016-01-28 11:56 ` Nguyễn Thái Ngọc Duy
  2016-01-28 11:56 ` [PATCH v5 07/10] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-01-28 11:56 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 2362 bytes --]

The default tables are usually built with C locale and only suitable
for LANG=C or similar.  This should make case insensitive search work
correctly for all single-byte charsets.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                             |  8 ++++++--
 grep.h                             |  1 +
 t/t7813-grep-icase-iso.sh (new +x) | 19 +++++++++++++++++++
 3 files changed, 26 insertions(+), 2 deletions(-)
 create mode 100755 t/t7813-grep-icase-iso.sh

diff --git a/grep.c b/grep.c
index fa96a29..921339a 100644
--- a/grep.c
+++ b/grep.c
@@ -324,11 +324,14 @@ static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
 	int erroffset;
 	int options = PCRE_MULTILINE;
 
-	if (opt->ignore_case)
+	if (opt->ignore_case) {
+		if (has_non_ascii(p->pattern))
+			p->pcre_tables = pcre_maketables();
 		options |= PCRE_CASELESS;
+	}
 
 	p->pcre_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
-			NULL);
+				      p->pcre_tables);
 	if (!p->pcre_regexp)
 		compile_regexp_failed(p, error);
 
@@ -362,6 +365,7 @@ static void free_pcre_regexp(struct grep_pat *p)
 {
 	pcre_free(p->pcre_regexp);
 	pcre_free(p->pcre_extra_info);
+	pcre_free((void *)p->pcre_tables);
 }
 #else /* !USE_LIBPCRE */
 static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
diff --git a/grep.h b/grep.h
index 95f197a..cee4357 100644
--- a/grep.h
+++ b/grep.h
@@ -48,6 +48,7 @@ struct grep_pat {
 	regex_t regexp;
 	pcre *pcre_regexp;
 	pcre_extra *pcre_extra_info;
+	const unsigned char *pcre_tables;
 	kwset_t kws;
 	unsigned fixed:1;
 	unsigned ignore_case:1;
diff --git a/t/t7813-grep-icase-iso.sh b/t/t7813-grep-icase-iso.sh
new file mode 100755
index 0000000..efef7fb
--- /dev/null
+++ b/t/t7813-grep-icase-iso.sh
@@ -0,0 +1,19 @@
+#!/bin/sh
+
+test_description='grep icase on non-English locales'
+
+. ./lib-gettext.sh
+
+test_expect_success GETTEXT_ISO_LOCALE 'setup' '
+	printf "TILRAUN: Halló Heimur!" >file &&
+	git add file &&
+	LC_ALL="$is_IS_iso_locale" &&
+	export LC_ALL
+'
+
+test_expect_success GETTEXT_ISO_LOCALE,LIBPCRE 'grep pcre string' '
+	git grep --perl-regexp -i "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.LLÓ HEIMUR!"
+'
+
+test_done
-- 
2.7.0.288.g1d8ad15

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v5 07/10] gettext: add is_utf8_locale()
  2016-01-28 11:56 [PATCH v5 00/10] Fix icase grep on non-ascii Nguyễn Thái Ngọc Duy
                   ` (5 preceding siblings ...)
  2016-01-28 11:56 ` [PATCH v5 06/10] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
@ 2016-01-28 11:56 ` Nguyễn Thái Ngọc Duy
  2016-01-28 11:56 ` [PATCH v5 08/10] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-01-28 11:56 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

This function returns true if git is running under an UTF-8
locale. pcre in the next patch will need this.

is_encoding_utf8() is used instead of strcmp() to catch both "utf-8"
and "utf8" suffixes.

When built with no gettext support, we peek in several env variables
to detect UTF-8. pcre library might support utf-8 even if libc is
built without locale support.. The peeking code is a copy from
compat/regex/regcomp.c

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 gettext.c | 24 ++++++++++++++++++++++--
 gettext.h |  1 +
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/gettext.c b/gettext.c
index a268a2c..db727ea 100644
--- a/gettext.c
+++ b/gettext.c
@@ -18,6 +18,8 @@
 #	endif
 #endif
 
+static const char *charset;
+
 /*
  * Guess the user's preferred languages from the value in LANGUAGE environment
  * variable and LC_MESSAGES locale category if NO_GETTEXT is not defined.
@@ -65,7 +67,6 @@ static int test_vsnprintf(const char *fmt, ...)
 	return ret;
 }
 
-static const char *charset;
 static void init_gettext_charset(const char *domain)
 {
 	/*
@@ -172,8 +173,27 @@ int gettext_width(const char *s)
 {
 	static int is_utf8 = -1;
 	if (is_utf8 == -1)
-		is_utf8 = !strcmp(charset, "UTF-8");
+		is_utf8 = is_utf8_locale();
 
 	return is_utf8 ? utf8_strwidth(s) : strlen(s);
 }
 #endif
+
+int is_utf8_locale(void)
+{
+#ifdef NO_GETTEXT
+	if (!charset) {
+		const char *env = getenv("LC_ALL");
+		if (!env || !*env)
+			env = getenv("LC_CTYPE");
+		if (!env || !*env)
+			env = getenv("LANG");
+		if (!env)
+			env = "";
+		if (strchr(env, '.'))
+			env = strchr(env, '.') + 1;
+		charset = xstrdup(env);
+	}
+#endif
+	return is_encoding_utf8(charset);
+}
diff --git a/gettext.h b/gettext.h
index 33696a4..7eee64a 100644
--- a/gettext.h
+++ b/gettext.h
@@ -90,5 +90,6 @@ const char *Q_(const char *msgid, const char *plu, unsigned long n)
 #endif
 
 const char *get_preferred_languages(void);
+extern int is_utf8_locale(void);
 
 #endif
-- 
2.7.0.288.g1d8ad15

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v5 08/10] grep/pcre: support utf-8
  2016-01-28 11:56 [PATCH v5 00/10] Fix icase grep on non-ascii Nguyễn Thái Ngọc Duy
                   ` (6 preceding siblings ...)
  2016-01-28 11:56 ` [PATCH v5 07/10] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
@ 2016-01-28 11:56 ` Nguyễn Thái Ngọc Duy
  2016-01-28 11:56 ` [PATCH v5 09/10] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-01-28 11:56 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

In the previous change in this function, we add locale support for
single-byte encodings only. It looks like pcre only supports utf-* as
multibyte encodings, the others are left in the cold (which is
fine).

We need to enable PCRE_UTF8 so pcre can find character boundary
correctly. It's needed for case folding (when --ignore-case is used)
or '*', '+' or similar syntax is used.

The "has_non_ascii()" check is to be on the conservative side. If
there's non-ascii in the pattern, the searched content could still be
in utf-8, but we can treat it just like a byte stream and everything
should work. If we force utf-8 based on locale only and pcre validates
utf-8 and the file content is in non-utf8 encoding, things break.

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Helped-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                          |  2 ++
 t/t7812-grep-icase-non-ascii.sh | 15 +++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/grep.c b/grep.c
index 921339a..2e4f71d 100644
--- a/grep.c
+++ b/grep.c
@@ -329,6 +329,8 @@ static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
 			p->pcre_tables = pcre_maketables();
 		options |= PCRE_CASELESS;
 	}
+	if (is_utf8_locale() && has_non_ascii(p->pattern))
+		options |= PCRE_UTF8;
 
 	p->pcre_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
 				      p->pcre_tables);
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index aba6b15..8896410 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -20,6 +20,21 @@ test_expect_success REGEX_LOCALE 'grep literal string, no -F' '
 	git grep -i "TILRAUN: HALLÓ HEIMUR!"
 '
 
+test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep pcre utf-8 icase' '
+	git grep --perl-regexp    "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.LLÓ HEIMUR!"
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep pcre utf-8 string with "+"' '
+	printf "TILRAUN: Hallóó Heimur!" >file2 &&
+	git add file2 &&
+	git grep -l --perl-regexp "TILRAUN: H.lló+ Heimur!" >actual &&
+	echo file >expected &&
+	echo file2 >>expected &&
+	test_cmp expected actual
+'
+
 test_expect_success REGEX_LOCALE 'grep literal string, with -F' '
 	git grep --debug -i -F "TILRAUN: Halló Heimur!"  2>&1 >/dev/null |
 		 grep fixed >debug1 &&
-- 
2.7.0.288.g1d8ad15

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v5 09/10] diffcore-pickaxe: "share" regex error handling code
  2016-01-28 11:56 [PATCH v5 00/10] Fix icase grep on non-ascii Nguyễn Thái Ngọc Duy
                   ` (7 preceding siblings ...)
  2016-01-28 11:56 ` [PATCH v5 08/10] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
@ 2016-01-28 11:56 ` Nguyễn Thái Ngọc Duy
  2016-01-28 11:56 ` [PATCH v5 10/10] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-01-28 11:56 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

There's another regcomp code block coming in this function. By moving
the error handling code out of this block, we don't have to add the
same error handling code in the new block.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 diffcore-pickaxe.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 7715c13..69c6567 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -204,20 +204,13 @@ void diffcore_pickaxe(struct diff_options *o)
 	int opts = o->pickaxe_opts;
 	regex_t regex, *regexp = NULL;
 	kwset_t kws = NULL;
+	int err = 0;
 
 	if (opts & (DIFF_PICKAXE_REGEX | DIFF_PICKAXE_KIND_G)) {
-		int err;
 		int cflags = REG_EXTENDED | REG_NEWLINE;
 		if (DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE))
 			cflags |= REG_ICASE;
 		err = regcomp(&regex, needle, cflags);
-		if (err) {
-			/* The POSIX.2 people are surely sick */
-			char errbuf[1024];
-			regerror(err, &regex, errbuf, 1024);
-			regfree(&regex);
-			die("invalid regex: %s", errbuf);
-		}
 		regexp = &regex;
 	} else {
 		kws = kwsalloc(DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE)
@@ -225,6 +218,13 @@ void diffcore_pickaxe(struct diff_options *o)
 		kwsincr(kws, needle, strlen(needle));
 		kwsprep(kws);
 	}
+	if (err) {
+		/* The POSIX.2 people are surely sick */
+		char errbuf[1024];
+		regerror(err, &regex, errbuf, 1024);
+		regfree(&regex);
+		die("invalid regex: %s", errbuf);
+	}
 
 	/* Might want to warn when both S and G are on; I don't care... */
 	pickaxe(&diff_queued_diff, o, regexp, kws,
-- 
2.7.0.288.g1d8ad15

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v5 10/10] diffcore-pickaxe: support case insensitive match on non-ascii
  2016-01-28 11:56 [PATCH v5 00/10] Fix icase grep on non-ascii Nguyễn Thái Ngọc Duy
                   ` (8 preceding siblings ...)
  2016-01-28 11:56 ` [PATCH v5 09/10] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
@ 2016-01-28 11:56 ` Nguyễn Thái Ngọc Duy
  2016-01-29  6:38   ` Eric Sunshine
  2016-01-28 23:54 ` [PATCH v5 00/10] Fix icase grep " Junio C Hamano
  2016-02-06  2:02 ` [PATCH v6 00/11] " Nguyễn Thái Ngọc Duy
  11 siblings, 1 reply; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-01-28 11:56 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

Similar to the "grep -F -i" case, we can't use kws on icase search
outside ascii range, so we quote the string and pass it to regcomp as
a basic regexp and let regex engine deal with case sensitivity.

The new test is put in t7812 instead of t4209-log-pickaxe because
lib-gettext.sh might cause problems elsewhere, probably..

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 diffcore-pickaxe.c              | 11 +++++++++++
 t/t7812-grep-icase-non-ascii.sh |  7 +++++++
 2 files changed, 18 insertions(+)

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 69c6567..0a5f877 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -7,6 +7,8 @@
 #include "diffcore.h"
 #include "xdiff-interface.h"
 #include "kwset.h"
+#include "commit.h"
+#include "quote.h"
 
 typedef int (*pickaxe_fn)(mmfile_t *one, mmfile_t *two,
 			  struct diff_options *o,
@@ -212,6 +214,15 @@ void diffcore_pickaxe(struct diff_options *o)
 			cflags |= REG_ICASE;
 		err = regcomp(&regex, needle, cflags);
 		regexp = &regex;
+	} else if (DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE) &&
+		   has_non_ascii(needle)) {
+		struct strbuf sb = STRBUF_INIT;
+		int cflags = REG_NEWLINE | REG_ICASE;
+
+		basic_regex_quote_buf(&sb, needle);
+		err = regcomp(&regex, sb.buf, cflags);
+		strbuf_release(&sb);
+		regexp = &regex;
 	} else {
 		kws = kwsalloc(DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE)
 			       ? tolower_trans_tbl : NULL);
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 8896410..a5475bb 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -61,4 +61,11 @@ test_expect_success REGEX_LOCALE 'grep string with regex, with -F' '
 	test_cmp expect2 debug2
 '
 
+test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' '
+	git commit -m first &&
+	git log --format=%f -i -S"TILRAUN: HALLÓ HEIMUR!" >actual &&
+	echo first >expected &&
+	test_cmp expected actual
+'
+
 test_done
-- 
2.7.0.288.g1d8ad15

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 00/10] Fix icase grep on non-ascii
  2016-01-28 11:56 [PATCH v5 00/10] Fix icase grep on non-ascii Nguyễn Thái Ngọc Duy
                   ` (9 preceding siblings ...)
  2016-01-28 11:56 ` [PATCH v5 10/10] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
@ 2016-01-28 23:54 ` Junio C Hamano
  2016-02-06  2:02 ` [PATCH v6 00/11] " Nguyễn Thái Ngọc Duy
  11 siblings, 0 replies; 56+ messages in thread
From: Junio C Hamano @ 2016-01-28 23:54 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> The series fixes grep choosing fast path that only works correctly for
> ascii. This is a resend of v4 [1] after rebase. I think v4 just went
> off radar for some reason, nothing was wrong with it (or nobody told
> me about it).

Or nobody found it interesting, perhaps, in which case we cannot say
"nothing was wrong" or "nothing was good" with it ;-)

Will find time to take a look if nothing more urgent and interesting
(read: regression fixes) comes up in the meantime.

Thanks.

>
> [1] http://thread.gmane.org/gmane.comp.version-control.git/273381/focus=276288
>
> Nguyễn Thái Ngọc Duy (10):
>   grep: allow -F -i combination
>   grep: break down an "if" stmt in preparation for next changes
>   test-regex: expose full regcomp() to the command line
>   grep/icase: avoid kwsset on literal non-ascii strings
>   grep/icase: avoid kwsset when -F is specified
>   grep/pcre: prepare locale-dependent tables for icase matching
>   gettext: add is_utf8_locale()
>   grep/pcre: support utf-8
>   diffcore-pickaxe: "share" regex error handling code
>   diffcore-pickaxe: support case insensitive match on non-ascii
>
>  builtin/grep.c                           |  2 +-
>  diffcore-pickaxe.c                       | 27 ++++++++----
>  gettext.c                                | 24 ++++++++++-
>  gettext.h                                |  1 +
>  grep.c                                   | 44 ++++++++++++++++++--
>  grep.h                                   |  1 +
>  quote.c                                  | 37 +++++++++++++++++
>  quote.h                                  |  1 +
>  t/t7812-grep-icase-non-ascii.sh (new +x) | 71 ++++++++++++++++++++++++++++++++
>  t/t7813-grep-icase-iso.sh (new +x)       | 19 +++++++++
>  test-regex.c                             | 56 ++++++++++++++++++++++---
>  11 files changed, 262 insertions(+), 21 deletions(-)
>  create mode 100755 t/t7812-grep-icase-non-ascii.sh
>  create mode 100755 t/t7813-grep-icase-iso.sh

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 03/10] test-regex: expose full regcomp() to the command line
  2016-01-28 11:56 ` [PATCH v5 03/10] test-regex: expose full regcomp() to the command line Nguyễn Thái Ngọc Duy
@ 2016-01-29  5:31   ` Eric Sunshine
  2016-01-29 14:29     ` Ramsay Jones
  0 siblings, 1 reply; 56+ messages in thread
From: Eric Sunshine @ 2016-01-29  5:31 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git

On Thu, Jan 28, 2016 at 06:56:16PM +0700, Nguyễn Thái Ngọc Duy wrote:
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
> diff --git a/test-regex.c b/test-regex.c
> @@ -1,19 +1,63 @@
>  int main(int argc, char **argv)
>  {
> -	char *pat = "[^={} \t]+";
> -	char *str = "={}\nfred";
> +	const char *pat;
> +	const char *str;
> +	int flags = 0;
>  	regex_t r;
>  	regmatch_t m[1];
>  
> -	if (regcomp(&r, pat, REG_EXTENDED | REG_NEWLINE))
> +	if (argc == 1) {
> +		/* special case, bug check */
> +		pat = "[^={} \t]+";
> +		str = "={}\nfred";
> +		flags = REG_EXTENDED | REG_NEWLINE;
> +	} else {
> +		argv++;
> +		pat = *argv++;
> +		str = *argv++;

I realize that this is just a test program, but it might be a good
idea to insert:

    if (argc < 3)
        die("usage: ...");

prior to the *argv++ dereferences to give a controlled failure rather
than an outright crash when an incorrect number of arguments is
given.

More below...

> +		while (*argv) {
> +			struct reg_flag *rf;
> +			for (rf = reg_flags; rf->name; rf++)
> +				if (!strcmp(*argv, rf->name)) {
> +					flags |= rf->flag;
> +					break;
> +				}
> +			if (!rf->name)
> +				die("do not recognize %s", *argv);
> +			argv++;
> +		}
> +		git_setup_gettext();
> +	}
> +
> +	if (regcomp(&r, pat, flags))
>  		die("failed regcomp() for pattern '%s'", pat);
> -	if (regexec(&r, str, 1, m, 0))
> -		die("no match of pattern '%s' to string '%s'", pat, str);
> +	if (regexec(&r, str, 1, m, 0)) {
> +		if (argc == 1)
> +			die("no match of pattern '%s' to string '%s'", pat, str);
> +		return 1;
> +	}
>  
>  	/* http://sourceware.org/bugzilla/show_bug.cgi?id=3957  */
> -	if (m[0].rm_so == 3) /* matches '\n' when it should not */
> +	if (argc == 1 && m[0].rm_so == 3) /* matches '\n' when it should not */
>  		die("regex bug confirmed: re-build git with NO_REGEX=1");

Again, I realize that this is just a test program, but sprinkling
this 'argc == 1' special case throughout the code makes it
unnecessarily difficult to follow. Some alternatives:

1. Rename the existing test-regex to test-regex-bug (or
   test-regex-bugs), and then name the new general purpose program
   test-regex.

2. Drop the special case altogether and have the program emit the
   matched text on stdout (in addition to the exit code indicating
   success/failure). Most callers will care only about the exit
   status, but the one special case in t0070 which wants to check for
   the glibc bug can do so itself:

    test_expect_success 'check for a bug in the regex routines' '
        # if this test fails, re-build git with NO_REGEX=1
        printf "fred" >expect &&
        test-regex "[^={} \t]+" "={}\nfred" EXTENDED NEWLINE >actual &&
        test_cmp expect actual
    '

   Of course, that doesn't actually work because "\n" in the 'str'
   argument isn't really a newline, so test-regex would have to do a
   bit of preprocessing of 'str' first (which might be as simple as
   calling unquote_c_style() or something).

3. [less desirable] Move the 'argc == 1' special case to its own
   function, which will result in a bit of duplicated code, but the
   result should at least be easier to follow.

>  	exit(0);
> -- 
> 2.7.0.288.g1d8ad15

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 04/10] grep/icase: avoid kwsset on literal non-ascii strings
  2016-01-28 11:56 ` [PATCH v5 04/10] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
@ 2016-01-29  6:18   ` Eric Sunshine
  2016-01-29  6:41     ` Eric Sunshine
  0 siblings, 1 reply; 56+ messages in thread
From: Eric Sunshine @ 2016-01-29  6:18 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: Git List

On Thu, Jan 28, 2016 at 6:56 AM, Nguyễn Thái Ngọc Duy <pclouds@gmail.com> wrote:
> When we detect the pattern is just a literal string, we avoid heavy
> regex engine and use fast substring search implemented in kwsset.c.
> But kws uses git-ctype which is locale-independent so it does not know
> how to fold case properly outside ascii range. Let regcomp or pcre
> take care of this case instead. Slower, but accurate.
>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
> diff --git a/grep.c b/grep.c
> @@ -398,12 +399,16 @@ static int is_fixed(const char *s, size_t len)
>  static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>  {
> +       int icase_non_ascii;
>         int err;
>
>         p->word_regexp = opt->word_regexp;
>         p->ignore_case = opt->ignore_case;
> +       icase_non_ascii =
> +               (opt->regflags & REG_ICASE || p->ignore_case) &&
> +               has_non_ascii(p->pattern);
>
> -       if (is_fixed(p->pattern, p->patternlen))
> +       if (!icase_non_ascii && is_fixed(p->pattern, p->patternlen))

These double negatives (!non_ascii) here and in patch 5/10 are
difficult to grok. I wonder if a different name, such as
icase_ascii_only would help.

>                 p->fixed = 1;
>         else if (opt->fixed) {
>                 p->fixed = 1;

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 05/10] grep/icase: avoid kwsset when -F is specified
  2016-01-28 11:56 ` [PATCH v5 05/10] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
@ 2016-01-29  6:23   ` Eric Sunshine
  0 siblings, 0 replies; 56+ messages in thread
From: Eric Sunshine @ 2016-01-29  6:23 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: Git List

On Thu, Jan 28, 2016 at 6:56 AM, Nguyễn Thái Ngọc Duy <pclouds@gmail.com> wrote:
> Similar to the previous commit, we can't use kws on icase search
> outside ascii range. But we can't simply pass the pattern to
> regcomp/pcre like the previous commit because it may contain regex
> special characters, so we need to quote the regex first.
>
> To avoid misquote traps that could lead to undefined behavior, we
> always stick to basic regex engine in this case. We don't need fancy
> features for grepping a literal string anyway.
>
> basic_regex_quote_buf() assumes that if the pattern is in a multibyte
> encoding, ascii chars must be unambiguously encoded as single
> bytes. This is true at least for UTF-8. For others, let's wait until
> people yell up. Chances are nobody uses multibyte, non utf-8 charsets
> any more..

s/any more../anymore./

> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
> diff --git a/grep.c b/grep.c
> @@ -397,6 +398,24 @@ static int is_fixed(const char *s, size_t len)
> +static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
> +{
> +       struct strbuf sb = STRBUF_INIT;
> +       int err;
> +
> +       basic_regex_quote_buf(&sb, p->pattern);
> +       err = regcomp(&p->regexp, sb.buf, opt->regflags & ~REG_EXTENDED);
> +       if (opt->debug)
> +               fprintf(stderr, "fixed%s\n", sb.buf);

Did you want a space or colon or something after "fixed" for human
consumption? (I realize that the test case doesn't care.)

> +       strbuf_release(&sb);
> +       if (err) {
> +               char errbuf[1024];
> +               regerror(err, &p->regexp, errbuf, 1024);

I guess this was copy/pasted from compile_regexp(), but for this new
code, perhaps: s/1024/sizeof(errbuf)/

> +               regfree(&p->regexp);
> +               compile_regexp_failed(p, errbuf);
> +       }
> +}

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 10/10] diffcore-pickaxe: support case insensitive match on non-ascii
  2016-01-28 11:56 ` [PATCH v5 10/10] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
@ 2016-01-29  6:38   ` Eric Sunshine
  0 siblings, 0 replies; 56+ messages in thread
From: Eric Sunshine @ 2016-01-29  6:38 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: Git List

On Thu, Jan 28, 2016 at 6:56 AM, Nguyễn Thái Ngọc Duy <pclouds@gmail.com> wrote:
> Similar to the "grep -F -i" case, we can't use kws on icase search
> outside ascii range, so we quote the string and pass it to regcomp as
> a basic regexp and let regex engine deal with case sensitivity.
>
> The new test is put in t7812 instead of t4209-log-pickaxe because
> lib-gettext.sh might cause problems elsewhere, probably..

s/\.\.$/./

> Noticed-by: Plamen Totev <plamen.totev@abv.bg>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 04/10] grep/icase: avoid kwsset on literal non-ascii strings
  2016-01-29  6:18   ` Eric Sunshine
@ 2016-01-29  6:41     ` Eric Sunshine
  0 siblings, 0 replies; 56+ messages in thread
From: Eric Sunshine @ 2016-01-29  6:41 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: Git List

On Fri, Jan 29, 2016 at 1:18 AM, Eric Sunshine <sunshine@sunshineco.com> wrote:
> On Thu, Jan 28, 2016 at 6:56 AM, Nguyễn Thái Ngọc Duy <pclouds@gmail.com> wrote:
>> When we detect the pattern is just a literal string, we avoid heavy
>> regex engine and use fast substring search implemented in kwsset.c.
>> But kws uses git-ctype which is locale-independent so it does not know
>> how to fold case properly outside ascii range. Let regcomp or pcre
>> take care of this case instead. Slower, but accurate.
>>
>> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
>> ---
>> diff --git a/grep.c b/grep.c
>> @@ -398,12 +399,16 @@ static int is_fixed(const char *s, size_t len)
>>  static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>>  {
>> +       int icase_non_ascii;
>>         int err;
>>
>>         p->word_regexp = opt->word_regexp;
>>         p->ignore_case = opt->ignore_case;
>> +       icase_non_ascii =
>> +               (opt->regflags & REG_ICASE || p->ignore_case) &&
>> +               has_non_ascii(p->pattern);
>>
>> -       if (is_fixed(p->pattern, p->patternlen))
>> +       if (!icase_non_ascii && is_fixed(p->pattern, p->patternlen))
>
> These double negatives (!non_ascii) here and in patch 5/10 are
> difficult to grok. I wonder if a different name, such as
> icase_ascii_only would help.

By the way, this is such a minor issue, and since there are only two
cases (in patches 4/10 and 5/10), it's not worth worrying about, and
certainly not worth a reroll.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 03/10] test-regex: expose full regcomp() to the command line
  2016-01-29  5:31   ` Eric Sunshine
@ 2016-01-29 14:29     ` Ramsay Jones
  0 siblings, 0 replies; 56+ messages in thread
From: Ramsay Jones @ 2016-01-29 14:29 UTC (permalink / raw)
  To: Eric Sunshine, Nguyễn Thái Ngọc Duy; +Cc: git



On 29/01/16 05:31, Eric Sunshine wrote:
> On Thu, Jan 28, 2016 at 06:56:16PM +0700, Nguyễn Thái Ngọc Duy wrote:
>> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
>> ---
>> diff --git a/test-regex.c b/test-regex.c
>> @@ -1,19 +1,63 @@
>>  int main(int argc, char **argv)
>>  {
>> -	char *pat = "[^={} \t]+";
>> -	char *str = "={}\nfred";
>> +	const char *pat;
>> +	const char *str;
>> +	int flags = 0;
>>  	regex_t r;
>>  	regmatch_t m[1];
>>  
>> -	if (regcomp(&r, pat, REG_EXTENDED | REG_NEWLINE))
>> +	if (argc == 1) {
>> +		/* special case, bug check */
>> +		pat = "[^={} \t]+";
>> +		str = "={}\nfred";
>> +		flags = REG_EXTENDED | REG_NEWLINE;
>> +	} else {
>> +		argv++;
>> +		pat = *argv++;
>> +		str = *argv++;
> 
> I realize that this is just a test program, but it might be a good
> idea to insert:
> 
>     if (argc < 3)
>         die("usage: ...");
> 
> prior to the *argv++ dereferences to give a controlled failure rather
> than an outright crash when an incorrect number of arguments is
> given.
> 
> More below...
> 
>> +		while (*argv) {
>> +			struct reg_flag *rf;
>> +			for (rf = reg_flags; rf->name; rf++)
>> +				if (!strcmp(*argv, rf->name)) {
>> +					flags |= rf->flag;
>> +					break;
>> +				}
>> +			if (!rf->name)
>> +				die("do not recognize %s", *argv);
>> +			argv++;
>> +		}
>> +		git_setup_gettext();
>> +	}
>> +
>> +	if (regcomp(&r, pat, flags))
>>  		die("failed regcomp() for pattern '%s'", pat);
>> -	if (regexec(&r, str, 1, m, 0))
>> -		die("no match of pattern '%s' to string '%s'", pat, str);
>> +	if (regexec(&r, str, 1, m, 0)) {
>> +		if (argc == 1)
>> +			die("no match of pattern '%s' to string '%s'", pat, str);
>> +		return 1;
>> +	}
>>  
>>  	/* http://sourceware.org/bugzilla/show_bug.cgi?id=3957  */
>> -	if (m[0].rm_so == 3) /* matches '\n' when it should not */
>> +	if (argc == 1 && m[0].rm_so == 3) /* matches '\n' when it should not */
>>  		die("regex bug confirmed: re-build git with NO_REGEX=1");
> 
> Again, I realize that this is just a test program, but sprinkling
> this 'argc == 1' special case throughout the code makes it
> unnecessarily difficult to follow. 

I completely agree!

>                                     Some alternatives:
> 
> 1. Rename the existing test-regex to test-regex-bug (or
>    test-regex-bugs), and then name the new general purpose program
>    test-regex.
> 
> 2. Drop the special case altogether and have the program emit the
>    matched text on stdout (in addition to the exit code indicating
>    success/failure). Most callers will care only about the exit
>    status, but the one special case in t0070 which wants to check for
>    the glibc bug can do so itself:
> 
>     test_expect_success 'check for a bug in the regex routines' '
>         # if this test fails, re-build git with NO_REGEX=1
>         printf "fred" >expect &&
>         test-regex "[^={} \t]+" "={}\nfred" EXTENDED NEWLINE >actual &&
>         test_cmp expect actual
>     '
> 
>    Of course, that doesn't actually work because "\n" in the 'str'
>    argument isn't really a newline, so test-regex would have to do a
>    bit of preprocessing of 'str' first (which might be as simple as
>    calling unquote_c_style() or something).
> 
> 3. [less desirable] Move the 'argc == 1' special case to its own
>    function, which will result in a bit of duplicated code, but the
>    result should at least be easier to follow.

I think this is the most desirable (and was going to be my first
suggestion); the duplication is minimal and it makes the code _much_
easier to follow. [I suppose separate test programs (ie. point 1) would
be my second choice.]

ATB,
Ramsay Jones

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v6 00/11] Fix icase grep on non-ascii
  2016-01-28 11:56 [PATCH v5 00/10] Fix icase grep on non-ascii Nguyễn Thái Ngọc Duy
                   ` (10 preceding siblings ...)
  2016-01-28 23:54 ` [PATCH v5 00/10] Fix icase grep " Junio C Hamano
@ 2016-02-06  2:02 ` Nguyễn Thái Ngọc Duy
  2016-02-06  2:03   ` [PATCH v6 01/11] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
                     ` (13 more replies)
  11 siblings, 14 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-06  2:02 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Ramsay Jones, Nguyễn Thái Ngọc Duy

v6 fixes comments from Ramsay and Eric. Interdiff below. The only
thing to add is, I decided not to replace !icase_non_ascii with
icase_ascii_only. I went with spelling out "!icase || ascii_only". I
think it expresses the intention better.

diff --git a/grep.c b/grep.c
index 2e4f71d..aed4fe0 100644
--- a/grep.c
+++ b/grep.c
@@ -412,11 +412,11 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 	basic_regex_quote_buf(&sb, p->pattern);
 	err = regcomp(&p->regexp, sb.buf, opt->regflags & ~REG_EXTENDED);
 	if (opt->debug)
-		fprintf(stderr, "fixed%s\n", sb.buf);
+		fprintf(stderr, "fixed %s\n", sb.buf);
 	strbuf_release(&sb);
 	if (err) {
 		char errbuf[1024];
-		regerror(err, &p->regexp, errbuf, 1024);
+		regerror(err, &p->regexp, errbuf, sizeof(errbuf));
 		regfree(&p->regexp);
 		compile_regexp_failed(p, errbuf);
 	}
@@ -424,19 +424,18 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
-	int icase_non_ascii;
+	int icase, ascii_only;
 	int err;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
-	icase_non_ascii =
-		(opt->regflags & REG_ICASE || p->ignore_case) &&
-		has_non_ascii(p->pattern);
+	icase	       = opt->regflags & REG_ICASE || p->ignore_case;
+	ascii_only     = !has_non_ascii(p->pattern);
 
-	if (!icase_non_ascii && is_fixed(p->pattern, p->patternlen))
+	if ((!icase || ascii_only) && is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
 	else if (opt->fixed) {
-		p->fixed = !icase_non_ascii;
+		p->fixed = !icase || ascii_only;
 		if (!p->fixed) {
 			compile_fixed_regexp(p, opt);
 			return;
diff --git a/t/t0070-fundamental.sh b/t/t0070-fundamental.sh
index 5ed69a6..991ed2a 100755
--- a/t/t0070-fundamental.sh
+++ b/t/t0070-fundamental.sh
@@ -31,7 +31,7 @@ test_expect_success 'git_mkstemps_mode does not fail if fd 0 is not open' '
 
 test_expect_success 'check for a bug in the regex routines' '
 	# if this test fails, re-build git with NO_REGEX=1
-	test-regex
+	test-regex --bug
 '
 
 test_done
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index a5475bb..4176625 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -38,12 +38,12 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep pcre utf-8 string with "+"' '
 test_expect_success REGEX_LOCALE 'grep literal string, with -F' '
 	git grep --debug -i -F "TILRAUN: Halló Heimur!"  2>&1 >/dev/null |
 		 grep fixed >debug1 &&
-	echo "fixedTILRAUN: Halló Heimur!" >expect1 &&
+	echo "fixed TILRAUN: Halló Heimur!" >expect1 &&
 	test_cmp expect1 debug1 &&
 
 	git grep --debug -i -F "TILRAUN: HALLÓ HEIMUR!"  2>&1 >/dev/null |
 		 grep fixed >debug2 &&
-	echo "fixedTILRAUN: HALLÓ HEIMUR!" >expect2 &&
+	echo "fixed TILRAUN: HALLÓ HEIMUR!" >expect2 &&
 	test_cmp expect2 debug2
 '
 
@@ -52,12 +52,12 @@ test_expect_success REGEX_LOCALE 'grep string with regex, with -F' '
 
 	git grep --debug -i -F "^*TILR^AUN:.* \\Halló \$He[]imur!\$" 2>&1 >/dev/null |
 		 grep fixed >debug1 &&
-	echo "fixed\\^*TILR^AUN:\\.\\* \\\\Halló \$He\\[]imur!\\\$" >expect1 &&
+	echo "fixed \\^*TILR^AUN:\\.\\* \\\\Halló \$He\\[]imur!\\\$" >expect1 &&
 	test_cmp expect1 debug1 &&
 
 	git grep --debug -i -F "^*TILR^AUN:.* \\HALLÓ \$HE[]IMUR!\$"  2>&1 >/dev/null |
 		 grep fixed >debug2 &&
-	echo "fixed\\^*TILR^AUN:\\.\\* \\\\HALLÓ \$HE\\[]IMUR!\\\$" >expect2 &&
+	echo "fixed \\^*TILR^AUN:\\.\\* \\\\HALLÓ \$HE\\[]IMUR!\\\$" >expect2 &&
 	test_cmp expect2 debug2
 '
 
diff --git a/test-regex.c b/test-regex.c
index 3b5641c..d1a952c 100644
--- a/test-regex.c
+++ b/test-regex.c
@@ -17,6 +17,25 @@ static struct reg_flag reg_flags[] = {
 	{ NULL, 0 }
 };
 
+static int test_regex_bug(void)
+{
+	char *pat = "[^={} \t]+";
+	char *str = "={}\nfred";
+	regex_t r;
+	regmatch_t m[1];
+
+	if (regcomp(&r, pat, REG_EXTENDED | REG_NEWLINE))
+		die("failed regcomp() for pattern '%s'", pat);
+	if (regexec(&r, str, 1, m, 0))
+		die("no match of pattern '%s' to string '%s'", pat, str);
+
+	/* http://sourceware.org/bugzilla/show_bug.cgi?id=3957  */
+	if (m[0].rm_so == 3) /* matches '\n' when it should not */
+		die("regex bug confirmed: re-build git with NO_REGEX=1");
+
+	return 0;
+}
+
 int main(int argc, char **argv)
 {
 	const char *pat;
@@ -25,40 +44,32 @@ int main(int argc, char **argv)
 	regex_t r;
 	regmatch_t m[1];
 
-	if (argc == 1) {
-		/* special case, bug check */
-		pat = "[^={} \t]+";
-		str = "={}\nfred";
-		flags = REG_EXTENDED | REG_NEWLINE;
-	} else {
+	if (argc == 2 && !strcmp(argv[1], "--bug"))
+		return test_regex_bug();
+	else if (argc < 3)
+		die("usage: test-regex --bug\n"
+		    "       test-regex <pattern> <string> [<options>]");
+
+	argv++;
+	pat = *argv++;
+	str = *argv++;
+	while (*argv) {
+		struct reg_flag *rf;
+		for (rf = reg_flags; rf->name; rf++)
+			if (!strcmp(*argv, rf->name)) {
+				flags |= rf->flag;
+				break;
+			}
+		if (!rf->name)
+			die("do not recognize %s", *argv);
 		argv++;
-		pat = *argv++;
-		str = *argv++;
-		while (*argv) {
-			struct reg_flag *rf;
-			for (rf = reg_flags; rf->name; rf++)
-				if (!strcmp(*argv, rf->name)) {
-					flags |= rf->flag;
-					break;
-				}
-			if (!rf->name)
-				die("do not recognize %s", *argv);
-			argv++;
-		}
-		git_setup_gettext();
 	}
+	git_setup_gettext();
 
 	if (regcomp(&r, pat, flags))
 		die("failed regcomp() for pattern '%s'", pat);
-	if (regexec(&r, str, 1, m, 0)) {
-		if (argc == 1)
-			die("no match of pattern '%s' to string '%s'", pat, str);
+	if (regexec(&r, str, 1, m, 0))
 		return 1;
-	}
-
-	/* http://sourceware.org/bugzilla/show_bug.cgi?id=3957  */
-	if (argc == 1 && m[0].rm_so == 3) /* matches '\n' when it should not */
-		die("regex bug confirmed: re-build git with NO_REGEX=1");
 
-	exit(0);
+	return 0;
 }
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v6 01/11] grep: allow -F -i combination
  2016-02-06  2:02 ` [PATCH v6 00/11] " Nguyễn Thái Ngọc Duy
@ 2016-02-06  2:03   ` Nguyễn Thái Ngọc Duy
  2016-06-17 21:54     ` Junio C Hamano
  2016-02-06  2:03   ` [PATCH v6 02/11] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
                     ` (12 subsequent siblings)
  13 siblings, 1 reply; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-06  2:03 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Ramsay Jones, Nguyễn Thái Ngọc Duy

-F means "no regex", not "case sensitive" so it should not override -i

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/grep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 5526fd7..4be0df5 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -809,7 +809,7 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 
 	if (!opt.pattern_list)
 		die(_("no pattern given."));
-	if (!opt.fixed && opt.ignore_case)
+	if (opt.ignore_case)
 		opt.regflags |= REG_ICASE;
 
 	compile_grep_patterns(&opt);
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v6 02/11] grep: break down an "if" stmt in preparation for next changes
  2016-02-06  2:02 ` [PATCH v6 00/11] " Nguyễn Thái Ngọc Duy
  2016-02-06  2:03   ` [PATCH v6 01/11] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
@ 2016-02-06  2:03   ` Nguyễn Thái Ngọc Duy
  2016-02-09 18:20     ` Junio C Hamano
  2016-02-06  2:03   ` [PATCH v6 03/11] test-regex: isolate the bug test code Nguyễn Thái Ngọc Duy
                     ` (11 subsequent siblings)
  13 siblings, 1 reply; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-06  2:03 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Ramsay Jones, Nguyễn Thái Ngọc Duy

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index 7b2b96a..e739d20 100644
--- a/grep.c
+++ b/grep.c
@@ -403,9 +403,11 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
 
-	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
+	if (is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
-	else
+	else if (opt->fixed) {
+		p->fixed = 1;
+	} else
 		p->fixed = 0;
 
 	if (p->fixed) {
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v6 03/11] test-regex: isolate the bug test code
  2016-02-06  2:02 ` [PATCH v6 00/11] " Nguyễn Thái Ngọc Duy
  2016-02-06  2:03   ` [PATCH v6 01/11] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
  2016-02-06  2:03   ` [PATCH v6 02/11] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
@ 2016-02-06  2:03   ` Nguyễn Thái Ngọc Duy
  2016-02-06  2:03   ` [PATCH v6 04/11] test-regex: expose full regcomp() to the command line Nguyễn Thái Ngọc Duy
                     ` (10 subsequent siblings)
  13 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-06  2:03 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Ramsay Jones, Nguyễn Thái Ngọc Duy

This is in preparation to turn test-regex into some generic regex
testing command.

Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Helped-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 t/t0070-fundamental.sh |  2 +-
 test-regex.c           | 12 ++++++++++--
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/t/t0070-fundamental.sh b/t/t0070-fundamental.sh
index 5ed69a6..991ed2a 100755
--- a/t/t0070-fundamental.sh
+++ b/t/t0070-fundamental.sh
@@ -31,7 +31,7 @@ test_expect_success 'git_mkstemps_mode does not fail if fd 0 is not open' '
 
 test_expect_success 'check for a bug in the regex routines' '
 	# if this test fails, re-build git with NO_REGEX=1
-	test-regex
+	test-regex --bug
 '
 
 test_done
diff --git a/test-regex.c b/test-regex.c
index 0dc598e..8c51f5a 100644
--- a/test-regex.c
+++ b/test-regex.c
@@ -1,6 +1,6 @@
 #include "git-compat-util.h"
 
-int main(int argc, char **argv)
+static int test_regex_bug(void)
 {
 	char *pat = "[^={} \t]+";
 	char *str = "={}\nfred";
@@ -16,5 +16,13 @@ int main(int argc, char **argv)
 	if (m[0].rm_so == 3) /* matches '\n' when it should not */
 		die("regex bug confirmed: re-build git with NO_REGEX=1");
 
-	exit(0);
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	if (argc == 2 && !strcmp(argv[1], "--bug"))
+		return test_regex_bug();
+	else
+		die("must be used with --bug");
 }
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v6 04/11] test-regex: expose full regcomp() to the command line
  2016-02-06  2:02 ` [PATCH v6 00/11] " Nguyễn Thái Ngọc Duy
                     ` (2 preceding siblings ...)
  2016-02-06  2:03   ` [PATCH v6 03/11] test-regex: isolate the bug test code Nguyễn Thái Ngọc Duy
@ 2016-02-06  2:03   ` Nguyễn Thái Ngọc Duy
  2016-02-07  8:44     ` Eric Sunshine
  2016-02-06  2:03   ` [PATCH v6 05/11] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
                     ` (9 subsequent siblings)
  13 siblings, 1 reply; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-06  2:03 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Ramsay Jones, Nguyễn Thái Ngọc Duy

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 test-regex.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/test-regex.c b/test-regex.c
index 8c51f5a..d1a952c 100644
--- a/test-regex.c
+++ b/test-regex.c
@@ -1,4 +1,21 @@
 #include "git-compat-util.h"
+#include "gettext.h"
+
+struct reg_flag {
+	const char *name;
+	int flag;
+};
+
+static struct reg_flag reg_flags[] = {
+	{ "EXTENDED",	 REG_EXTENDED	},
+	{ "NEWLINE",	 REG_NEWLINE	},
+	{ "ICASE",	 REG_ICASE	},
+	{ "NOTBOL",	 REG_NOTBOL	},
+#ifdef REG_STARTEND
+	{ "STARTEND",	 REG_STARTEND	},
+#endif
+	{ NULL, 0 }
+};
 
 static int test_regex_bug(void)
 {
@@ -21,8 +38,38 @@ static int test_regex_bug(void)
 
 int main(int argc, char **argv)
 {
+	const char *pat;
+	const char *str;
+	int flags = 0;
+	regex_t r;
+	regmatch_t m[1];
+
 	if (argc == 2 && !strcmp(argv[1], "--bug"))
 		return test_regex_bug();
-	else
-		die("must be used with --bug");
+	else if (argc < 3)
+		die("usage: test-regex --bug\n"
+		    "       test-regex <pattern> <string> [<options>]");
+
+	argv++;
+	pat = *argv++;
+	str = *argv++;
+	while (*argv) {
+		struct reg_flag *rf;
+		for (rf = reg_flags; rf->name; rf++)
+			if (!strcmp(*argv, rf->name)) {
+				flags |= rf->flag;
+				break;
+			}
+		if (!rf->name)
+			die("do not recognize %s", *argv);
+		argv++;
+	}
+	git_setup_gettext();
+
+	if (regcomp(&r, pat, flags))
+		die("failed regcomp() for pattern '%s'", pat);
+	if (regexec(&r, str, 1, m, 0))
+		return 1;
+
+	return 0;
 }
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v6 05/11] grep/icase: avoid kwsset on literal non-ascii strings
  2016-02-06  2:02 ` [PATCH v6 00/11] " Nguyễn Thái Ngọc Duy
                     ` (3 preceding siblings ...)
  2016-02-06  2:03   ` [PATCH v6 04/11] test-regex: expose full regcomp() to the command line Nguyễn Thái Ngọc Duy
@ 2016-02-06  2:03   ` Nguyễn Thái Ngọc Duy
  2016-02-06  2:03   ` [PATCH v6 06/11] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
                     ` (8 subsequent siblings)
  13 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-06  2:03 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Ramsay Jones, Nguyễn Thái Ngọc Duy

When we detect the pattern is just a literal string, we avoid heavy
regex engine and use fast substring search implemented in kwsset.c.
But kws uses git-ctype which is locale-independent so it does not know
how to fold case properly outside ascii range. Let regcomp or pcre
take care of this case instead. Slower, but accurate.

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Helped-by: René Scharfe <l.s.r@web.de>
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                                   |  6 +++++-
 t/t7812-grep-icase-non-ascii.sh (new +x) | 23 +++++++++++++++++++++++
 2 files changed, 28 insertions(+), 1 deletion(-)
 create mode 100755 t/t7812-grep-icase-non-ascii.sh

diff --git a/grep.c b/grep.c
index e739d20..44817ce 100644
--- a/grep.c
+++ b/grep.c
@@ -4,6 +4,7 @@
 #include "xdiff-interface.h"
 #include "diff.h"
 #include "diffcore.h"
+#include "commit.h"
 
 static int grep_source_load(struct grep_source *gs);
 static int grep_source_is_binary(struct grep_source *gs);
@@ -398,12 +399,15 @@ static int is_fixed(const char *s, size_t len)
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
+	int icase, ascii_only;
 	int err;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
+	icase	       = opt->regflags & REG_ICASE || p->ignore_case;
+	ascii_only     = !has_non_ascii(p->pattern);
 
-	if (is_fixed(p->pattern, p->patternlen))
+	if ((!icase || ascii_only) && is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
 	else if (opt->fixed) {
 		p->fixed = 1;
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
new file mode 100755
index 0000000..6eff490
--- /dev/null
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -0,0 +1,23 @@
+#!/bin/sh
+
+test_description='grep icase on non-English locales'
+
+. ./lib-gettext.sh
+
+test_expect_success GETTEXT_LOCALE 'setup' '
+	printf "TILRAUN: Halló Heimur!" >file &&
+	git add file &&
+	LC_ALL="$is_IS_locale" &&
+	export LC_ALL
+'
+
+test_have_prereq GETTEXT_LOCALE &&
+test-regex "HALLÓ" "Halló" ICASE &&
+test_set_prereq REGEX_LOCALE
+
+test_expect_success REGEX_LOCALE 'grep literal string, no -F' '
+	git grep -i "TILRAUN: Halló Heimur!" &&
+	git grep -i "TILRAUN: HALLÓ HEIMUR!"
+'
+
+test_done
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v6 06/11] grep/icase: avoid kwsset when -F is specified
  2016-02-06  2:02 ` [PATCH v6 00/11] " Nguyễn Thái Ngọc Duy
                     ` (4 preceding siblings ...)
  2016-02-06  2:03   ` [PATCH v6 05/11] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
@ 2016-02-06  2:03   ` Nguyễn Thái Ngọc Duy
  2016-02-06  2:03   ` [PATCH v6 07/11] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
                     ` (7 subsequent siblings)
  13 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-06  2:03 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Ramsay Jones, Nguyễn Thái Ngọc Duy

Similar to the previous commit, we can't use kws on icase search
outside ascii range. But we can't simply pass the pattern to
regcomp/pcre like the previous commit because it may contain regex
special characters, so we need to quote the regex first.

To avoid misquote traps that could lead to undefined behavior, we
always stick to basic regex engine in this case. We don't need fancy
features for grepping a literal string anyway.

basic_regex_quote_buf() assumes that if the pattern is in a multibyte
encoding, ascii chars must be unambiguously encoded as single
bytes. This is true at least for UTF-8. For others, let's wait until
people yell up. Chances are nobody uses multibyte, non utf-8 charsets
anymore.

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Helped-by: René Scharfe <l.s.r@web.de>
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                          | 25 ++++++++++++++++++++++++-
 quote.c                         | 37 +++++++++++++++++++++++++++++++++++++
 quote.h                         |  1 +
 t/t7812-grep-icase-non-ascii.sh | 26 ++++++++++++++++++++++++++
 4 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/grep.c b/grep.c
index 44817ce..6d34479 100644
--- a/grep.c
+++ b/grep.c
@@ -5,6 +5,7 @@
 #include "diff.h"
 #include "diffcore.h"
 #include "commit.h"
+#include "quote.h"
 
 static int grep_source_load(struct grep_source *gs);
 static int grep_source_is_binary(struct grep_source *gs);
@@ -397,6 +398,24 @@ static int is_fixed(const char *s, size_t len)
 	return 1;
 }
 
+static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
+{
+	struct strbuf sb = STRBUF_INIT;
+	int err;
+
+	basic_regex_quote_buf(&sb, p->pattern);
+	err = regcomp(&p->regexp, sb.buf, opt->regflags & ~REG_EXTENDED);
+	if (opt->debug)
+		fprintf(stderr, "fixed %s\n", sb.buf);
+	strbuf_release(&sb);
+	if (err) {
+		char errbuf[1024];
+		regerror(err, &p->regexp, errbuf, sizeof(errbuf));
+		regfree(&p->regexp);
+		compile_regexp_failed(p, errbuf);
+	}
+}
+
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
 	int icase, ascii_only;
@@ -410,7 +429,11 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	if ((!icase || ascii_only) && is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
 	else if (opt->fixed) {
-		p->fixed = 1;
+		p->fixed = !icase || ascii_only;
+		if (!p->fixed) {
+			compile_fixed_regexp(p, opt);
+			return;
+		}
 	} else
 		p->fixed = 0;
 
diff --git a/quote.c b/quote.c
index fe884d2..c67adb7 100644
--- a/quote.c
+++ b/quote.c
@@ -440,3 +440,40 @@ void tcl_quote_buf(struct strbuf *sb, const char *src)
 	}
 	strbuf_addch(sb, '"');
 }
+
+void basic_regex_quote_buf(struct strbuf *sb, const char *src)
+{
+	char c;
+
+	if (*src == '^') {
+		/* only beginning '^' is special and needs quoting */
+		strbuf_addch(sb, '\\');
+		strbuf_addch(sb, *src++);
+	}
+	if (*src == '*')
+		/* beginning '*' is not special, no quoting */
+		strbuf_addch(sb, *src++);
+
+	while ((c = *src++)) {
+		switch (c) {
+		case '[':
+		case '.':
+		case '\\':
+		case '*':
+			strbuf_addch(sb, '\\');
+			strbuf_addch(sb, c);
+			break;
+
+		case '$':
+			/* only the end '$' is special and needs quoting */
+			if (*src == '\0')
+				strbuf_addch(sb, '\\');
+			strbuf_addch(sb, c);
+			break;
+
+		default:
+			strbuf_addch(sb, c);
+			break;
+		}
+	}
+}
diff --git a/quote.h b/quote.h
index 99e04d3..362d315 100644
--- a/quote.h
+++ b/quote.h
@@ -67,5 +67,6 @@ extern char *quote_path_relative(const char *in, const char *prefix,
 extern void perl_quote_buf(struct strbuf *sb, const char *src);
 extern void python_quote_buf(struct strbuf *sb, const char *src);
 extern void tcl_quote_buf(struct strbuf *sb, const char *src);
+extern void basic_regex_quote_buf(struct strbuf *sb, const char *src);
 
 #endif
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 6eff490..5832684 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -20,4 +20,30 @@ test_expect_success REGEX_LOCALE 'grep literal string, no -F' '
 	git grep -i "TILRAUN: HALLÓ HEIMUR!"
 '
 
+test_expect_success REGEX_LOCALE 'grep literal string, with -F' '
+	git grep --debug -i -F "TILRAUN: Halló Heimur!"  2>&1 >/dev/null |
+		 grep fixed >debug1 &&
+	echo "fixed TILRAUN: Halló Heimur!" >expect1 &&
+	test_cmp expect1 debug1 &&
+
+	git grep --debug -i -F "TILRAUN: HALLÓ HEIMUR!"  2>&1 >/dev/null |
+		 grep fixed >debug2 &&
+	echo "fixed TILRAUN: HALLÓ HEIMUR!" >expect2 &&
+	test_cmp expect2 debug2
+'
+
+test_expect_success REGEX_LOCALE 'grep string with regex, with -F' '
+	printf "^*TILR^AUN:.* \\Halló \$He[]imur!\$" >file &&
+
+	git grep --debug -i -F "^*TILR^AUN:.* \\Halló \$He[]imur!\$" 2>&1 >/dev/null |
+		 grep fixed >debug1 &&
+	echo "fixed \\^*TILR^AUN:\\.\\* \\\\Halló \$He\\[]imur!\\\$" >expect1 &&
+	test_cmp expect1 debug1 &&
+
+	git grep --debug -i -F "^*TILR^AUN:.* \\HALLÓ \$HE[]IMUR!\$"  2>&1 >/dev/null |
+		 grep fixed >debug2 &&
+	echo "fixed \\^*TILR^AUN:\\.\\* \\\\HALLÓ \$HE\\[]IMUR!\\\$" >expect2 &&
+	test_cmp expect2 debug2
+'
+
 test_done
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v6 07/11] grep/pcre: prepare locale-dependent tables for icase matching
  2016-02-06  2:02 ` [PATCH v6 00/11] " Nguyễn Thái Ngọc Duy
                     ` (5 preceding siblings ...)
  2016-02-06  2:03   ` [PATCH v6 06/11] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
@ 2016-02-06  2:03   ` Nguyễn Thái Ngọc Duy
  2016-02-06  2:03   ` [PATCH v6 08/11] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
                     ` (6 subsequent siblings)
  13 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-06  2:03 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Ramsay Jones, Nguyễn Thái Ngọc Duy

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 2362 bytes --]

The default tables are usually built with C locale and only suitable
for LANG=C or similar.  This should make case insensitive search work
correctly for all single-byte charsets.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                             |  8 ++++++--
 grep.h                             |  1 +
 t/t7813-grep-icase-iso.sh (new +x) | 19 +++++++++++++++++++
 3 files changed, 26 insertions(+), 2 deletions(-)
 create mode 100755 t/t7813-grep-icase-iso.sh

diff --git a/grep.c b/grep.c
index 6d34479..843e180 100644
--- a/grep.c
+++ b/grep.c
@@ -324,11 +324,14 @@ static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
 	int erroffset;
 	int options = PCRE_MULTILINE;
 
-	if (opt->ignore_case)
+	if (opt->ignore_case) {
+		if (has_non_ascii(p->pattern))
+			p->pcre_tables = pcre_maketables();
 		options |= PCRE_CASELESS;
+	}
 
 	p->pcre_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
-			NULL);
+				      p->pcre_tables);
 	if (!p->pcre_regexp)
 		compile_regexp_failed(p, error);
 
@@ -362,6 +365,7 @@ static void free_pcre_regexp(struct grep_pat *p)
 {
 	pcre_free(p->pcre_regexp);
 	pcre_free(p->pcre_extra_info);
+	pcre_free((void *)p->pcre_tables);
 }
 #else /* !USE_LIBPCRE */
 static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
diff --git a/grep.h b/grep.h
index 95f197a..cee4357 100644
--- a/grep.h
+++ b/grep.h
@@ -48,6 +48,7 @@ struct grep_pat {
 	regex_t regexp;
 	pcre *pcre_regexp;
 	pcre_extra *pcre_extra_info;
+	const unsigned char *pcre_tables;
 	kwset_t kws;
 	unsigned fixed:1;
 	unsigned ignore_case:1;
diff --git a/t/t7813-grep-icase-iso.sh b/t/t7813-grep-icase-iso.sh
new file mode 100755
index 0000000..efef7fb
--- /dev/null
+++ b/t/t7813-grep-icase-iso.sh
@@ -0,0 +1,19 @@
+#!/bin/sh
+
+test_description='grep icase on non-English locales'
+
+. ./lib-gettext.sh
+
+test_expect_success GETTEXT_ISO_LOCALE 'setup' '
+	printf "TILRAUN: Halló Heimur!" >file &&
+	git add file &&
+	LC_ALL="$is_IS_iso_locale" &&
+	export LC_ALL
+'
+
+test_expect_success GETTEXT_ISO_LOCALE,LIBPCRE 'grep pcre string' '
+	git grep --perl-regexp -i "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.LLÓ HEIMUR!"
+'
+
+test_done
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v6 08/11] gettext: add is_utf8_locale()
  2016-02-06  2:02 ` [PATCH v6 00/11] " Nguyễn Thái Ngọc Duy
                     ` (6 preceding siblings ...)
  2016-02-06  2:03   ` [PATCH v6 07/11] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
@ 2016-02-06  2:03   ` Nguyễn Thái Ngọc Duy
  2016-02-06  2:03   ` [PATCH v6 09/11] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
                     ` (5 subsequent siblings)
  13 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-06  2:03 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Ramsay Jones, Nguyễn Thái Ngọc Duy

This function returns true if git is running under an UTF-8
locale. pcre in the next patch will need this.

is_encoding_utf8() is used instead of strcmp() to catch both "utf-8"
and "utf8" suffixes.

When built with no gettext support, we peek in several env variables
to detect UTF-8. pcre library might support utf-8 even if libc is
built without locale support.. The peeking code is a copy from
compat/regex/regcomp.c

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 gettext.c | 24 ++++++++++++++++++++++--
 gettext.h |  1 +
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/gettext.c b/gettext.c
index a268a2c..db727ea 100644
--- a/gettext.c
+++ b/gettext.c
@@ -18,6 +18,8 @@
 #	endif
 #endif
 
+static const char *charset;
+
 /*
  * Guess the user's preferred languages from the value in LANGUAGE environment
  * variable and LC_MESSAGES locale category if NO_GETTEXT is not defined.
@@ -65,7 +67,6 @@ static int test_vsnprintf(const char *fmt, ...)
 	return ret;
 }
 
-static const char *charset;
 static void init_gettext_charset(const char *domain)
 {
 	/*
@@ -172,8 +173,27 @@ int gettext_width(const char *s)
 {
 	static int is_utf8 = -1;
 	if (is_utf8 == -1)
-		is_utf8 = !strcmp(charset, "UTF-8");
+		is_utf8 = is_utf8_locale();
 
 	return is_utf8 ? utf8_strwidth(s) : strlen(s);
 }
 #endif
+
+int is_utf8_locale(void)
+{
+#ifdef NO_GETTEXT
+	if (!charset) {
+		const char *env = getenv("LC_ALL");
+		if (!env || !*env)
+			env = getenv("LC_CTYPE");
+		if (!env || !*env)
+			env = getenv("LANG");
+		if (!env)
+			env = "";
+		if (strchr(env, '.'))
+			env = strchr(env, '.') + 1;
+		charset = xstrdup(env);
+	}
+#endif
+	return is_encoding_utf8(charset);
+}
diff --git a/gettext.h b/gettext.h
index 33696a4..7eee64a 100644
--- a/gettext.h
+++ b/gettext.h
@@ -90,5 +90,6 @@ const char *Q_(const char *msgid, const char *plu, unsigned long n)
 #endif
 
 const char *get_preferred_languages(void);
+extern int is_utf8_locale(void);
 
 #endif
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v6 09/11] grep/pcre: support utf-8
  2016-02-06  2:02 ` [PATCH v6 00/11] " Nguyễn Thái Ngọc Duy
                     ` (7 preceding siblings ...)
  2016-02-06  2:03   ` [PATCH v6 08/11] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
@ 2016-02-06  2:03   ` Nguyễn Thái Ngọc Duy
  2016-02-06  2:03   ` [PATCH v6 10/11] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
                     ` (4 subsequent siblings)
  13 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-06  2:03 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Ramsay Jones, Nguyễn Thái Ngọc Duy

In the previous change in this function, we add locale support for
single-byte encodings only. It looks like pcre only supports utf-* as
multibyte encodings, the others are left in the cold (which is
fine).

We need to enable PCRE_UTF8 so pcre can find character boundary
correctly. It's needed for case folding (when --ignore-case is used)
or '*', '+' or similar syntax is used.

The "has_non_ascii()" check is to be on the conservative side. If
there's non-ascii in the pattern, the searched content could still be
in utf-8, but we can treat it just like a byte stream and everything
should work. If we force utf-8 based on locale only and pcre validates
utf-8 and the file content is in non-utf8 encoding, things break.

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Helped-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                          |  2 ++
 t/t7812-grep-icase-non-ascii.sh | 15 +++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/grep.c b/grep.c
index 843e180..aed4fe0 100644
--- a/grep.c
+++ b/grep.c
@@ -329,6 +329,8 @@ static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
 			p->pcre_tables = pcre_maketables();
 		options |= PCRE_CASELESS;
 	}
+	if (is_utf8_locale() && has_non_ascii(p->pattern))
+		options |= PCRE_UTF8;
 
 	p->pcre_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
 				      p->pcre_tables);
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 5832684..842b26a 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -20,6 +20,21 @@ test_expect_success REGEX_LOCALE 'grep literal string, no -F' '
 	git grep -i "TILRAUN: HALLÓ HEIMUR!"
 '
 
+test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep pcre utf-8 icase' '
+	git grep --perl-regexp    "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.LLÓ HEIMUR!"
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep pcre utf-8 string with "+"' '
+	printf "TILRAUN: Hallóó Heimur!" >file2 &&
+	git add file2 &&
+	git grep -l --perl-regexp "TILRAUN: H.lló+ Heimur!" >actual &&
+	echo file >expected &&
+	echo file2 >>expected &&
+	test_cmp expected actual
+'
+
 test_expect_success REGEX_LOCALE 'grep literal string, with -F' '
 	git grep --debug -i -F "TILRAUN: Halló Heimur!"  2>&1 >/dev/null |
 		 grep fixed >debug1 &&
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v6 10/11] diffcore-pickaxe: "share" regex error handling code
  2016-02-06  2:02 ` [PATCH v6 00/11] " Nguyễn Thái Ngọc Duy
                     ` (8 preceding siblings ...)
  2016-02-06  2:03   ` [PATCH v6 09/11] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
@ 2016-02-06  2:03   ` Nguyễn Thái Ngọc Duy
  2016-02-06  2:03   ` [PATCH v6 11/11] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
                     ` (3 subsequent siblings)
  13 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-06  2:03 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Ramsay Jones, Nguyễn Thái Ngọc Duy

There's another regcomp code block coming in this function. By moving
the error handling code out of this block, we don't have to add the
same error handling code in the new block.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 diffcore-pickaxe.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 7715c13..69c6567 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -204,20 +204,13 @@ void diffcore_pickaxe(struct diff_options *o)
 	int opts = o->pickaxe_opts;
 	regex_t regex, *regexp = NULL;
 	kwset_t kws = NULL;
+	int err = 0;
 
 	if (opts & (DIFF_PICKAXE_REGEX | DIFF_PICKAXE_KIND_G)) {
-		int err;
 		int cflags = REG_EXTENDED | REG_NEWLINE;
 		if (DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE))
 			cflags |= REG_ICASE;
 		err = regcomp(&regex, needle, cflags);
-		if (err) {
-			/* The POSIX.2 people are surely sick */
-			char errbuf[1024];
-			regerror(err, &regex, errbuf, 1024);
-			regfree(&regex);
-			die("invalid regex: %s", errbuf);
-		}
 		regexp = &regex;
 	} else {
 		kws = kwsalloc(DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE)
@@ -225,6 +218,13 @@ void diffcore_pickaxe(struct diff_options *o)
 		kwsincr(kws, needle, strlen(needle));
 		kwsprep(kws);
 	}
+	if (err) {
+		/* The POSIX.2 people are surely sick */
+		char errbuf[1024];
+		regerror(err, &regex, errbuf, 1024);
+		regfree(&regex);
+		die("invalid regex: %s", errbuf);
+	}
 
 	/* Might want to warn when both S and G are on; I don't care... */
 	pickaxe(&diff_queued_diff, o, regexp, kws,
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v6 11/11] diffcore-pickaxe: support case insensitive match on non-ascii
  2016-02-06  2:02 ` [PATCH v6 00/11] " Nguyễn Thái Ngọc Duy
                     ` (9 preceding siblings ...)
  2016-02-06  2:03   ` [PATCH v6 10/11] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
@ 2016-02-06  2:03   ` Nguyễn Thái Ngọc Duy
  2016-02-07  8:48   ` [PATCH v6 00/11] Fix icase grep " Eric Sunshine
                     ` (2 subsequent siblings)
  13 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-06  2:03 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Ramsay Jones, Nguyễn Thái Ngọc Duy

Similar to the "grep -F -i" case, we can't use kws on icase search
outside ascii range, so we quote the string and pass it to regcomp as
a basic regexp and let regex engine deal with case sensitivity.

The new test is put in t7812 instead of t4209-log-pickaxe because
lib-gettext.sh might cause problems elsewhere, probably.

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 diffcore-pickaxe.c              | 11 +++++++++++
 t/t7812-grep-icase-non-ascii.sh |  7 +++++++
 2 files changed, 18 insertions(+)

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 69c6567..0a5f877 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -7,6 +7,8 @@
 #include "diffcore.h"
 #include "xdiff-interface.h"
 #include "kwset.h"
+#include "commit.h"
+#include "quote.h"
 
 typedef int (*pickaxe_fn)(mmfile_t *one, mmfile_t *two,
 			  struct diff_options *o,
@@ -212,6 +214,15 @@ void diffcore_pickaxe(struct diff_options *o)
 			cflags |= REG_ICASE;
 		err = regcomp(&regex, needle, cflags);
 		regexp = &regex;
+	} else if (DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE) &&
+		   has_non_ascii(needle)) {
+		struct strbuf sb = STRBUF_INIT;
+		int cflags = REG_NEWLINE | REG_ICASE;
+
+		basic_regex_quote_buf(&sb, needle);
+		err = regcomp(&regex, sb.buf, cflags);
+		strbuf_release(&sb);
+		regexp = &regex;
 	} else {
 		kws = kwsalloc(DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE)
 			       ? tolower_trans_tbl : NULL);
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 842b26a..4176625 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -61,4 +61,11 @@ test_expect_success REGEX_LOCALE 'grep string with regex, with -F' '
 	test_cmp expect2 debug2
 '
 
+test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' '
+	git commit -m first &&
+	git log --format=%f -i -S"TILRAUN: HALLÓ HEIMUR!" >actual &&
+	echo first >expected &&
+	test_cmp expected actual
+'
+
 test_done
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v6 04/11] test-regex: expose full regcomp() to the command line
  2016-02-06  2:03   ` [PATCH v6 04/11] test-regex: expose full regcomp() to the command line Nguyễn Thái Ngọc Duy
@ 2016-02-07  8:44     ` Eric Sunshine
  2016-02-09 18:21       ` Junio C Hamano
  0 siblings, 1 reply; 56+ messages in thread
From: Eric Sunshine @ 2016-02-07  8:44 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: Git List, Ramsay Jones

On Fri, Feb 5, 2016 at 9:03 PM, Nguyễn Thái Ngọc Duy <pclouds@gmail.com> wrote:
> diff --git a/test-regex.c b/test-regex.c
> @@ -21,8 +38,38 @@ static int test_regex_bug(void)
>  int main(int argc, char **argv)
>  {
> +       const char *pat;
> +       const char *str;
> +       int flags = 0;
> +       regex_t r;
> +       regmatch_t m[1];
> +
>         if (argc == 2 && !strcmp(argv[1], "--bug"))
>                 return test_regex_bug();
> -       else
> -               die("must be used with --bug");
> +       else if (argc < 3)
> +               die("usage: test-regex --bug\n"
> +                   "       test-regex <pattern> <string> [<options>]");

This is just a test program, so it probably isn't that important, but
die() automatically prepends "fatal: " which means the alignment of
the second line will be wrong. Perhaps you want to use usage() instead
which automatically prepends "usage: " (and drop the literal "usage: "
from the above string).

> +       argv++;
> +       pat = *argv++;
> +       str = *argv++;
> +       while (*argv) {
> +               struct reg_flag *rf;
> +               for (rf = reg_flags; rf->name; rf++)
> +                       if (!strcmp(*argv, rf->name)) {
> +                               flags |= rf->flag;
> +                               break;
> +                       }
> +               if (!rf->name)
> +                       die("do not recognize %s", *argv);
> +               argv++;
> +       }
> +       git_setup_gettext();
> +
> +       if (regcomp(&r, pat, flags))
> +               die("failed regcomp() for pattern '%s'", pat);
> +       if (regexec(&r, str, 1, m, 0))
> +               return 1;
> +
> +       return 0;
>  }

This version is much easier to read without the "bug" special case
spread throughout the code. Thanks.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v6 00/11] Fix icase grep on non-ascii
  2016-02-06  2:02 ` [PATCH v6 00/11] " Nguyễn Thái Ngọc Duy
                     ` (10 preceding siblings ...)
  2016-02-06  2:03   ` [PATCH v6 11/11] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
@ 2016-02-07  8:48   ` Eric Sunshine
  2016-02-14 11:49   ` [PATCH v7 00/12] nd/icase updates Nguyễn Thái Ngọc Duy
  2016-06-17 23:17   ` [PATCH v6 00/11] Fix icase grep on non-ascii Junio C Hamano
  13 siblings, 0 replies; 56+ messages in thread
From: Eric Sunshine @ 2016-02-07  8:48 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: Git List, Ramsay Jones

On Fri, Feb 5, 2016 at 9:02 PM, Nguyễn Thái Ngọc Duy <pclouds@gmail.com> wrote:
> v6 fixes comments from Ramsay and Eric. Interdiff below. The only
> thing to add is, I decided not to replace !icase_non_ascii with
> icase_ascii_only. I went with spelling out "!icase || ascii_only". I
> think it expresses the intention better.

v6 appears to address all the comments raised in my v5 review. Thanks.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v6 02/11] grep: break down an "if" stmt in preparation for next changes
  2016-02-06  2:03   ` [PATCH v6 02/11] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
@ 2016-02-09 18:20     ` Junio C Hamano
  0 siblings, 0 replies; 56+ messages in thread
From: Junio C Hamano @ 2016-02-09 18:20 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git, Eric Sunshine, Ramsay Jones

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  grep.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/grep.c b/grep.c
> index 7b2b96a..e739d20 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -403,9 +403,11 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>  	p->word_regexp = opt->word_regexp;
>  	p->ignore_case = opt->ignore_case;
>  
> -	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
> +	if (is_fixed(p->pattern, p->patternlen))
>  		p->fixed = 1;
> -	else
> +	else if (opt->fixed) {
> +		p->fixed = 1;
> +	} else
>  		p->fixed = 0;

The original used to check inexpensive one first and then more
expensive one.  The updated one gets it backwards.

Is that intended?


>  
>  	if (p->fixed) {

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v6 04/11] test-regex: expose full regcomp() to the command line
  2016-02-07  8:44     ` Eric Sunshine
@ 2016-02-09 18:21       ` Junio C Hamano
  0 siblings, 0 replies; 56+ messages in thread
From: Junio C Hamano @ 2016-02-09 18:21 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Nguyễn Thái Ngọc Duy, Git List, Ramsay Jones

Eric Sunshine <sunshine@sunshineco.com> writes:

> On Fri, Feb 5, 2016 at 9:03 PM, Nguyễn Thái Ngọc Duy <pclouds@gmail.com> wrote:
>> diff --git a/test-regex.c b/test-regex.c
>> @@ -21,8 +38,38 @@ static int test_regex_bug(void)
>>  int main(int argc, char **argv)
>>  {
>> +       const char *pat;
>> +       const char *str;
>> +       int flags = 0;
>> +       regex_t r;
>> +       regmatch_t m[1];
>> +
>>         if (argc == 2 && !strcmp(argv[1], "--bug"))
>>                 return test_regex_bug();
>> -       else
>> -               die("must be used with --bug");
>> +       else if (argc < 3)
>> +               die("usage: test-regex --bug\n"
>> +                   "       test-regex <pattern> <string> [<options>]");
>
> This is just a test program, so it probably isn't that important, but
> die() automatically prepends "fatal: " which means the alignment of
> the second line will be wrong. Perhaps you want to use usage() instead
> which automatically prepends "usage: " (and drop the literal "usage: "
> from the above string).

So true.

>> +       argv++;
>> +       pat = *argv++;
>> +       str = *argv++;
>> +       while (*argv) {
>> +               struct reg_flag *rf;
>> +               for (rf = reg_flags; rf->name; rf++)
>> +                       if (!strcmp(*argv, rf->name)) {
>> +                               flags |= rf->flag;
>> +                               break;
>> +                       }
>> +               if (!rf->name)
>> +                       die("do not recognize %s", *argv);
>> +               argv++;
>> +       }
>> +       git_setup_gettext();
>> +
>> +       if (regcomp(&r, pat, flags))
>> +               die("failed regcomp() for pattern '%s'", pat);
>> +       if (regexec(&r, str, 1, m, 0))
>> +               return 1;
>> +
>> +       return 0;
>>  }
>
> This version is much easier to read without the "bug" special case
> spread throughout the code. Thanks.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v7 00/12] nd/icase updates
  2016-02-06  2:02 ` [PATCH v6 00/11] " Nguyễn Thái Ngọc Duy
                     ` (11 preceding siblings ...)
  2016-02-07  8:48   ` [PATCH v6 00/11] Fix icase grep " Eric Sunshine
@ 2016-02-14 11:49   ` Nguyễn Thái Ngọc Duy
  2016-02-14 11:49     ` [PATCH v7 01/12] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
                       ` (11 more replies)
  2016-06-17 23:17   ` [PATCH v6 00/11] Fix icase grep on non-ascii Junio C Hamano
  13 siblings, 12 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-14 11:49 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Nguyễn Thái Ngọc Duy

v7 addresses two comments from Junio and Eric in v6 and adds an extra
patch, 12/12, which reuses "icase" variable and avoids recalculating
the same thing (which can't be done before v6). Interdiff

diff --git a/grep.c b/grep.c
index aed4fe0..cb058a5 100644
--- a/grep.c
+++ b/grep.c
@@ -432,22 +432,20 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	icase	       = opt->regflags & REG_ICASE || p->ignore_case;
 	ascii_only     = !has_non_ascii(p->pattern);
 
-	if ((!icase || ascii_only) && is_fixed(p->pattern, p->patternlen))
-		p->fixed = 1;
-	else if (opt->fixed) {
+	if (opt->fixed) {
 		p->fixed = !icase || ascii_only;
 		if (!p->fixed) {
 			compile_fixed_regexp(p, opt);
 			return;
 		}
-	} else
+	} else if ((!icase || ascii_only) &&
+		   is_fixed(p->pattern, p->patternlen))
+		p->fixed = 1;
+	else
 		p->fixed = 0;
 
 	if (p->fixed) {
-		if (opt->regflags & REG_ICASE || p->ignore_case)
-			p->kws = kwsalloc(tolower_trans_tbl);
-		else
-			p->kws = kwsalloc(NULL);
+		p->kws = kwsalloc(icase ? tolower_trans_tbl : NULL);
 		kwsincr(p->kws, p->pattern, p->patternlen);
 		kwsprep(p->kws);
 		return;
diff --git a/test-regex.c b/test-regex.c
index d1a952c..eff26f5 100644
--- a/test-regex.c
+++ b/test-regex.c
@@ -47,8 +47,8 @@ int main(int argc, char **argv)
 	if (argc == 2 && !strcmp(argv[1], "--bug"))
 		return test_regex_bug();
 	else if (argc < 3)
-		die("usage: test-regex --bug\n"
-		    "       test-regex <pattern> <string> [<options>]");
+		usage("test-regex --bug\n"
+		      "test-regex <pattern> <string> [<options>]");
 
 	argv++;
 	pat = *argv++;

Nguyễn Thái Ngọc Duy (12):
  grep: allow -F -i combination
  grep: break down an "if" stmt in preparation for next changes
  test-regex: isolate the bug test code
  test-regex: expose full regcomp() to the command line
  grep/icase: avoid kwsset on literal non-ascii strings
  grep/icase: avoid kwsset when -F is specified
  grep/pcre: prepare locale-dependent tables for icase matching
  gettext: add is_utf8_locale()
  grep/pcre: support utf-8
  diffcore-pickaxe: "share" regex error handling code
  diffcore-pickaxe: support case insensitive match on non-ascii
  grep.c: reuse "icase" variable

 builtin/grep.c                           |  2 +-
 diffcore-pickaxe.c                       | 27 ++++++++----
 gettext.c                                | 24 ++++++++++-
 gettext.h                                |  1 +
 grep.c                                   | 47 +++++++++++++++++----
 grep.h                                   |  1 +
 quote.c                                  | 37 +++++++++++++++++
 quote.h                                  |  1 +
 t/t0070-fundamental.sh                   |  2 +-
 t/t7812-grep-icase-non-ascii.sh (new +x) | 71 ++++++++++++++++++++++++++++++++
 t/t7813-grep-icase-iso.sh (new +x)       | 19 +++++++++
 test-regex.c                             | 59 +++++++++++++++++++++++++-
 12 files changed, 270 insertions(+), 21 deletions(-)
 create mode 100755 t/t7812-grep-icase-non-ascii.sh
 create mode 100755 t/t7813-grep-icase-iso.sh

-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v7 01/12] grep: allow -F -i combination
  2016-02-14 11:49   ` [PATCH v7 00/12] nd/icase updates Nguyễn Thái Ngọc Duy
@ 2016-02-14 11:49     ` Nguyễn Thái Ngọc Duy
  2016-02-14 11:49     ` [PATCH v7 02/12] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
                       ` (10 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-14 11:49 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Nguyễn Thái Ngọc Duy

-F means "no regex", not "case sensitive" so it should not override -i

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/grep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 8c516a9..46c5ba1 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -809,7 +809,7 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 
 	if (!opt.pattern_list)
 		die(_("no pattern given."));
-	if (!opt.fixed && opt.ignore_case)
+	if (opt.ignore_case)
 		opt.regflags |= REG_ICASE;
 
 	compile_grep_patterns(&opt);
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v7 02/12] grep: break down an "if" stmt in preparation for next changes
  2016-02-14 11:49   ` [PATCH v7 00/12] nd/icase updates Nguyễn Thái Ngọc Duy
  2016-02-14 11:49     ` [PATCH v7 01/12] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
@ 2016-02-14 11:49     ` Nguyễn Thái Ngọc Duy
  2016-02-14 11:49     ` [PATCH v7 03/12] test-regex: isolate the bug test code Nguyễn Thái Ngọc Duy
                       ` (9 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-14 11:49 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Nguyễn Thái Ngọc Duy

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/grep.c b/grep.c
index 7b2b96a..609f218 100644
--- a/grep.c
+++ b/grep.c
@@ -403,7 +403,9 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
 
-	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
+	if (opt->fixed) {
+		p->fixed = 1;
+	} else if (is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
 	else
 		p->fixed = 0;
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v7 03/12] test-regex: isolate the bug test code
  2016-02-14 11:49   ` [PATCH v7 00/12] nd/icase updates Nguyễn Thái Ngọc Duy
  2016-02-14 11:49     ` [PATCH v7 01/12] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
  2016-02-14 11:49     ` [PATCH v7 02/12] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
@ 2016-02-14 11:49     ` Nguyễn Thái Ngọc Duy
  2016-02-14 11:49     ` [PATCH v7 04/12] test-regex: expose full regcomp() to the command line Nguyễn Thái Ngọc Duy
                       ` (8 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-14 11:49 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Nguyễn Thái Ngọc Duy

This is in preparation to turn test-regex into some generic regex
testing command.

Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Helped-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 t/t0070-fundamental.sh |  2 +-
 test-regex.c           | 12 ++++++++++--
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/t/t0070-fundamental.sh b/t/t0070-fundamental.sh
index 5ed69a6..991ed2a 100755
--- a/t/t0070-fundamental.sh
+++ b/t/t0070-fundamental.sh
@@ -31,7 +31,7 @@ test_expect_success 'git_mkstemps_mode does not fail if fd 0 is not open' '
 
 test_expect_success 'check for a bug in the regex routines' '
 	# if this test fails, re-build git with NO_REGEX=1
-	test-regex
+	test-regex --bug
 '
 
 test_done
diff --git a/test-regex.c b/test-regex.c
index 0dc598e..67a1a65 100644
--- a/test-regex.c
+++ b/test-regex.c
@@ -1,6 +1,6 @@
 #include "git-compat-util.h"
 
-int main(int argc, char **argv)
+static int test_regex_bug(void)
 {
 	char *pat = "[^={} \t]+";
 	char *str = "={}\nfred";
@@ -16,5 +16,13 @@ int main(int argc, char **argv)
 	if (m[0].rm_so == 3) /* matches '\n' when it should not */
 		die("regex bug confirmed: re-build git with NO_REGEX=1");
 
-	exit(0);
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	if (argc == 2 && !strcmp(argv[1], "--bug"))
+		return test_regex_bug();
+	else
+		usage("test-regex --bug");
 }
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v7 04/12] test-regex: expose full regcomp() to the command line
  2016-02-14 11:49   ` [PATCH v7 00/12] nd/icase updates Nguyễn Thái Ngọc Duy
                       ` (2 preceding siblings ...)
  2016-02-14 11:49     ` [PATCH v7 03/12] test-regex: isolate the bug test code Nguyễn Thái Ngọc Duy
@ 2016-02-14 11:49     ` Nguyễn Thái Ngọc Duy
  2016-02-14 11:49     ` [PATCH v7 05/12] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
                       ` (7 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-14 11:49 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Nguyễn Thái Ngọc Duy

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 test-regex.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/test-regex.c b/test-regex.c
index 67a1a65..eff26f5 100644
--- a/test-regex.c
+++ b/test-regex.c
@@ -1,4 +1,21 @@
 #include "git-compat-util.h"
+#include "gettext.h"
+
+struct reg_flag {
+	const char *name;
+	int flag;
+};
+
+static struct reg_flag reg_flags[] = {
+	{ "EXTENDED",	 REG_EXTENDED	},
+	{ "NEWLINE",	 REG_NEWLINE	},
+	{ "ICASE",	 REG_ICASE	},
+	{ "NOTBOL",	 REG_NOTBOL	},
+#ifdef REG_STARTEND
+	{ "STARTEND",	 REG_STARTEND	},
+#endif
+	{ NULL, 0 }
+};
 
 static int test_regex_bug(void)
 {
@@ -21,8 +38,38 @@ static int test_regex_bug(void)
 
 int main(int argc, char **argv)
 {
+	const char *pat;
+	const char *str;
+	int flags = 0;
+	regex_t r;
+	regmatch_t m[1];
+
 	if (argc == 2 && !strcmp(argv[1], "--bug"))
 		return test_regex_bug();
-	else
-		usage("test-regex --bug");
+	else if (argc < 3)
+		usage("test-regex --bug\n"
+		      "test-regex <pattern> <string> [<options>]");
+
+	argv++;
+	pat = *argv++;
+	str = *argv++;
+	while (*argv) {
+		struct reg_flag *rf;
+		for (rf = reg_flags; rf->name; rf++)
+			if (!strcmp(*argv, rf->name)) {
+				flags |= rf->flag;
+				break;
+			}
+		if (!rf->name)
+			die("do not recognize %s", *argv);
+		argv++;
+	}
+	git_setup_gettext();
+
+	if (regcomp(&r, pat, flags))
+		die("failed regcomp() for pattern '%s'", pat);
+	if (regexec(&r, str, 1, m, 0))
+		return 1;
+
+	return 0;
 }
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v7 05/12] grep/icase: avoid kwsset on literal non-ascii strings
  2016-02-14 11:49   ` [PATCH v7 00/12] nd/icase updates Nguyễn Thái Ngọc Duy
                       ` (3 preceding siblings ...)
  2016-02-14 11:49     ` [PATCH v7 04/12] test-regex: expose full regcomp() to the command line Nguyễn Thái Ngọc Duy
@ 2016-02-14 11:49     ` Nguyễn Thái Ngọc Duy
  2016-02-14 11:49     ` [PATCH v7 06/12] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
                       ` (6 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-14 11:49 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Nguyễn Thái Ngọc Duy

When we detect the pattern is just a literal string, we avoid heavy
regex engine and use fast substring search implemented in kwsset.c.
But kws uses git-ctype which is locale-independent so it does not know
how to fold case properly outside ascii range. Let regcomp or pcre
take care of this case instead. Slower, but accurate.

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Helped-by: René Scharfe <l.s.r@web.de>
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                                   |  7 ++++++-
 t/t7812-grep-icase-non-ascii.sh (new +x) | 23 +++++++++++++++++++++++
 2 files changed, 29 insertions(+), 1 deletion(-)
 create mode 100755 t/t7812-grep-icase-non-ascii.sh

diff --git a/grep.c b/grep.c
index 609f218..f192727 100644
--- a/grep.c
+++ b/grep.c
@@ -4,6 +4,7 @@
 #include "xdiff-interface.h"
 #include "diff.h"
 #include "diffcore.h"
+#include "commit.h"
 
 static int grep_source_load(struct grep_source *gs);
 static int grep_source_is_binary(struct grep_source *gs);
@@ -398,14 +399,18 @@ static int is_fixed(const char *s, size_t len)
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
+	int icase, ascii_only;
 	int err;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
+	icase	       = opt->regflags & REG_ICASE || p->ignore_case;
+	ascii_only     = !has_non_ascii(p->pattern);
 
 	if (opt->fixed) {
 		p->fixed = 1;
-	} else if (is_fixed(p->pattern, p->patternlen))
+	} else if ((!icase || ascii_only) &&
+		   is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
 	else
 		p->fixed = 0;
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
new file mode 100755
index 0000000..6eff490
--- /dev/null
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -0,0 +1,23 @@
+#!/bin/sh
+
+test_description='grep icase on non-English locales'
+
+. ./lib-gettext.sh
+
+test_expect_success GETTEXT_LOCALE 'setup' '
+	printf "TILRAUN: Halló Heimur!" >file &&
+	git add file &&
+	LC_ALL="$is_IS_locale" &&
+	export LC_ALL
+'
+
+test_have_prereq GETTEXT_LOCALE &&
+test-regex "HALLÓ" "Halló" ICASE &&
+test_set_prereq REGEX_LOCALE
+
+test_expect_success REGEX_LOCALE 'grep literal string, no -F' '
+	git grep -i "TILRAUN: Halló Heimur!" &&
+	git grep -i "TILRAUN: HALLÓ HEIMUR!"
+'
+
+test_done
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v7 06/12] grep/icase: avoid kwsset when -F is specified
  2016-02-14 11:49   ` [PATCH v7 00/12] nd/icase updates Nguyễn Thái Ngọc Duy
                       ` (4 preceding siblings ...)
  2016-02-14 11:49     ` [PATCH v7 05/12] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
@ 2016-02-14 11:49     ` Nguyễn Thái Ngọc Duy
  2016-02-14 11:49     ` [PATCH v7 07/12] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
                       ` (5 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-14 11:49 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Nguyễn Thái Ngọc Duy

Similar to the previous commit, we can't use kws on icase search
outside ascii range. But we can't simply pass the pattern to
regcomp/pcre like the previous commit because it may contain regex
special characters, so we need to quote the regex first.

To avoid misquote traps that could lead to undefined behavior, we
always stick to basic regex engine in this case. We don't need fancy
features for grepping a literal string anyway.

basic_regex_quote_buf() assumes that if the pattern is in a multibyte
encoding, ascii chars must be unambiguously encoded as single
bytes. This is true at least for UTF-8. For others, let's wait until
people yell up. Chances are nobody uses multibyte, non utf-8 charsets
anymore.

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Helped-by: René Scharfe <l.s.r@web.de>
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                          | 25 ++++++++++++++++++++++++-
 quote.c                         | 37 +++++++++++++++++++++++++++++++++++++
 quote.h                         |  1 +
 t/t7812-grep-icase-non-ascii.sh | 26 ++++++++++++++++++++++++++
 4 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/grep.c b/grep.c
index f192727..6de58f3 100644
--- a/grep.c
+++ b/grep.c
@@ -5,6 +5,7 @@
 #include "diff.h"
 #include "diffcore.h"
 #include "commit.h"
+#include "quote.h"
 
 static int grep_source_load(struct grep_source *gs);
 static int grep_source_is_binary(struct grep_source *gs);
@@ -397,6 +398,24 @@ static int is_fixed(const char *s, size_t len)
 	return 1;
 }
 
+static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
+{
+	struct strbuf sb = STRBUF_INIT;
+	int err;
+
+	basic_regex_quote_buf(&sb, p->pattern);
+	err = regcomp(&p->regexp, sb.buf, opt->regflags & ~REG_EXTENDED);
+	if (opt->debug)
+		fprintf(stderr, "fixed %s\n", sb.buf);
+	strbuf_release(&sb);
+	if (err) {
+		char errbuf[1024];
+		regerror(err, &p->regexp, errbuf, sizeof(errbuf));
+		regfree(&p->regexp);
+		compile_regexp_failed(p, errbuf);
+	}
+}
+
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
 	int icase, ascii_only;
@@ -408,7 +427,11 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	ascii_only     = !has_non_ascii(p->pattern);
 
 	if (opt->fixed) {
-		p->fixed = 1;
+		p->fixed = !icase || ascii_only;
+		if (!p->fixed) {
+			compile_fixed_regexp(p, opt);
+			return;
+		}
 	} else if ((!icase || ascii_only) &&
 		   is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
diff --git a/quote.c b/quote.c
index fe884d2..c67adb7 100644
--- a/quote.c
+++ b/quote.c
@@ -440,3 +440,40 @@ void tcl_quote_buf(struct strbuf *sb, const char *src)
 	}
 	strbuf_addch(sb, '"');
 }
+
+void basic_regex_quote_buf(struct strbuf *sb, const char *src)
+{
+	char c;
+
+	if (*src == '^') {
+		/* only beginning '^' is special and needs quoting */
+		strbuf_addch(sb, '\\');
+		strbuf_addch(sb, *src++);
+	}
+	if (*src == '*')
+		/* beginning '*' is not special, no quoting */
+		strbuf_addch(sb, *src++);
+
+	while ((c = *src++)) {
+		switch (c) {
+		case '[':
+		case '.':
+		case '\\':
+		case '*':
+			strbuf_addch(sb, '\\');
+			strbuf_addch(sb, c);
+			break;
+
+		case '$':
+			/* only the end '$' is special and needs quoting */
+			if (*src == '\0')
+				strbuf_addch(sb, '\\');
+			strbuf_addch(sb, c);
+			break;
+
+		default:
+			strbuf_addch(sb, c);
+			break;
+		}
+	}
+}
diff --git a/quote.h b/quote.h
index 99e04d3..362d315 100644
--- a/quote.h
+++ b/quote.h
@@ -67,5 +67,6 @@ extern char *quote_path_relative(const char *in, const char *prefix,
 extern void perl_quote_buf(struct strbuf *sb, const char *src);
 extern void python_quote_buf(struct strbuf *sb, const char *src);
 extern void tcl_quote_buf(struct strbuf *sb, const char *src);
+extern void basic_regex_quote_buf(struct strbuf *sb, const char *src);
 
 #endif
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 6eff490..5832684 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -20,4 +20,30 @@ test_expect_success REGEX_LOCALE 'grep literal string, no -F' '
 	git grep -i "TILRAUN: HALLÓ HEIMUR!"
 '
 
+test_expect_success REGEX_LOCALE 'grep literal string, with -F' '
+	git grep --debug -i -F "TILRAUN: Halló Heimur!"  2>&1 >/dev/null |
+		 grep fixed >debug1 &&
+	echo "fixed TILRAUN: Halló Heimur!" >expect1 &&
+	test_cmp expect1 debug1 &&
+
+	git grep --debug -i -F "TILRAUN: HALLÓ HEIMUR!"  2>&1 >/dev/null |
+		 grep fixed >debug2 &&
+	echo "fixed TILRAUN: HALLÓ HEIMUR!" >expect2 &&
+	test_cmp expect2 debug2
+'
+
+test_expect_success REGEX_LOCALE 'grep string with regex, with -F' '
+	printf "^*TILR^AUN:.* \\Halló \$He[]imur!\$" >file &&
+
+	git grep --debug -i -F "^*TILR^AUN:.* \\Halló \$He[]imur!\$" 2>&1 >/dev/null |
+		 grep fixed >debug1 &&
+	echo "fixed \\^*TILR^AUN:\\.\\* \\\\Halló \$He\\[]imur!\\\$" >expect1 &&
+	test_cmp expect1 debug1 &&
+
+	git grep --debug -i -F "^*TILR^AUN:.* \\HALLÓ \$HE[]IMUR!\$"  2>&1 >/dev/null |
+		 grep fixed >debug2 &&
+	echo "fixed \\^*TILR^AUN:\\.\\* \\\\HALLÓ \$HE\\[]IMUR!\\\$" >expect2 &&
+	test_cmp expect2 debug2
+'
+
 test_done
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v7 07/12] grep/pcre: prepare locale-dependent tables for icase matching
  2016-02-14 11:49   ` [PATCH v7 00/12] nd/icase updates Nguyễn Thái Ngọc Duy
                       ` (5 preceding siblings ...)
  2016-02-14 11:49     ` [PATCH v7 06/12] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
@ 2016-02-14 11:49     ` Nguyễn Thái Ngọc Duy
  2016-02-14 11:49     ` [PATCH v7 08/12] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
                       ` (4 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-14 11:49 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Nguyễn Thái Ngọc Duy

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 2362 bytes --]

The default tables are usually built with C locale and only suitable
for LANG=C or similar.  This should make case insensitive search work
correctly for all single-byte charsets.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                             |  8 ++++++--
 grep.h                             |  1 +
 t/t7813-grep-icase-iso.sh (new +x) | 19 +++++++++++++++++++
 3 files changed, 26 insertions(+), 2 deletions(-)
 create mode 100755 t/t7813-grep-icase-iso.sh

diff --git a/grep.c b/grep.c
index 6de58f3..22f4d99 100644
--- a/grep.c
+++ b/grep.c
@@ -324,11 +324,14 @@ static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
 	int erroffset;
 	int options = PCRE_MULTILINE;
 
-	if (opt->ignore_case)
+	if (opt->ignore_case) {
+		if (has_non_ascii(p->pattern))
+			p->pcre_tables = pcre_maketables();
 		options |= PCRE_CASELESS;
+	}
 
 	p->pcre_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
-			NULL);
+				      p->pcre_tables);
 	if (!p->pcre_regexp)
 		compile_regexp_failed(p, error);
 
@@ -362,6 +365,7 @@ static void free_pcre_regexp(struct grep_pat *p)
 {
 	pcre_free(p->pcre_regexp);
 	pcre_free(p->pcre_extra_info);
+	pcre_free((void *)p->pcre_tables);
 }
 #else /* !USE_LIBPCRE */
 static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
diff --git a/grep.h b/grep.h
index 95f197a..cee4357 100644
--- a/grep.h
+++ b/grep.h
@@ -48,6 +48,7 @@ struct grep_pat {
 	regex_t regexp;
 	pcre *pcre_regexp;
 	pcre_extra *pcre_extra_info;
+	const unsigned char *pcre_tables;
 	kwset_t kws;
 	unsigned fixed:1;
 	unsigned ignore_case:1;
diff --git a/t/t7813-grep-icase-iso.sh b/t/t7813-grep-icase-iso.sh
new file mode 100755
index 0000000..efef7fb
--- /dev/null
+++ b/t/t7813-grep-icase-iso.sh
@@ -0,0 +1,19 @@
+#!/bin/sh
+
+test_description='grep icase on non-English locales'
+
+. ./lib-gettext.sh
+
+test_expect_success GETTEXT_ISO_LOCALE 'setup' '
+	printf "TILRAUN: Halló Heimur!" >file &&
+	git add file &&
+	LC_ALL="$is_IS_iso_locale" &&
+	export LC_ALL
+'
+
+test_expect_success GETTEXT_ISO_LOCALE,LIBPCRE 'grep pcre string' '
+	git grep --perl-regexp -i "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.LLÓ HEIMUR!"
+'
+
+test_done
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v7 08/12] gettext: add is_utf8_locale()
  2016-02-14 11:49   ` [PATCH v7 00/12] nd/icase updates Nguyễn Thái Ngọc Duy
                       ` (6 preceding siblings ...)
  2016-02-14 11:49     ` [PATCH v7 07/12] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
@ 2016-02-14 11:49     ` Nguyễn Thái Ngọc Duy
  2016-02-14 11:49     ` [PATCH v7 09/12] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
                       ` (3 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-14 11:49 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Nguyễn Thái Ngọc Duy

This function returns true if git is running under an UTF-8
locale. pcre in the next patch will need this.

is_encoding_utf8() is used instead of strcmp() to catch both "utf-8"
and "utf8" suffixes.

When built with no gettext support, we peek in several env variables
to detect UTF-8. pcre library might support utf-8 even if libc is
built without locale support.. The peeking code is a copy from
compat/regex/regcomp.c

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 gettext.c | 24 ++++++++++++++++++++++--
 gettext.h |  1 +
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/gettext.c b/gettext.c
index a268a2c..db727ea 100644
--- a/gettext.c
+++ b/gettext.c
@@ -18,6 +18,8 @@
 #	endif
 #endif
 
+static const char *charset;
+
 /*
  * Guess the user's preferred languages from the value in LANGUAGE environment
  * variable and LC_MESSAGES locale category if NO_GETTEXT is not defined.
@@ -65,7 +67,6 @@ static int test_vsnprintf(const char *fmt, ...)
 	return ret;
 }
 
-static const char *charset;
 static void init_gettext_charset(const char *domain)
 {
 	/*
@@ -172,8 +173,27 @@ int gettext_width(const char *s)
 {
 	static int is_utf8 = -1;
 	if (is_utf8 == -1)
-		is_utf8 = !strcmp(charset, "UTF-8");
+		is_utf8 = is_utf8_locale();
 
 	return is_utf8 ? utf8_strwidth(s) : strlen(s);
 }
 #endif
+
+int is_utf8_locale(void)
+{
+#ifdef NO_GETTEXT
+	if (!charset) {
+		const char *env = getenv("LC_ALL");
+		if (!env || !*env)
+			env = getenv("LC_CTYPE");
+		if (!env || !*env)
+			env = getenv("LANG");
+		if (!env)
+			env = "";
+		if (strchr(env, '.'))
+			env = strchr(env, '.') + 1;
+		charset = xstrdup(env);
+	}
+#endif
+	return is_encoding_utf8(charset);
+}
diff --git a/gettext.h b/gettext.h
index 33696a4..7eee64a 100644
--- a/gettext.h
+++ b/gettext.h
@@ -90,5 +90,6 @@ const char *Q_(const char *msgid, const char *plu, unsigned long n)
 #endif
 
 const char *get_preferred_languages(void);
+extern int is_utf8_locale(void);
 
 #endif
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v7 09/12] grep/pcre: support utf-8
  2016-02-14 11:49   ` [PATCH v7 00/12] nd/icase updates Nguyễn Thái Ngọc Duy
                       ` (7 preceding siblings ...)
  2016-02-14 11:49     ` [PATCH v7 08/12] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
@ 2016-02-14 11:49     ` Nguyễn Thái Ngọc Duy
  2016-02-14 11:49     ` [PATCH v7 10/12] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
                       ` (2 subsequent siblings)
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-14 11:49 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Nguyễn Thái Ngọc Duy

In the previous change in this function, we add locale support for
single-byte encodings only. It looks like pcre only supports utf-* as
multibyte encodings, the others are left in the cold (which is
fine).

We need to enable PCRE_UTF8 so pcre can find character boundary
correctly. It's needed for case folding (when --ignore-case is used)
or '*', '+' or similar syntax is used.

The "has_non_ascii()" check is to be on the conservative side. If
there's non-ascii in the pattern, the searched content could still be
in utf-8, but we can treat it just like a byte stream and everything
should work. If we force utf-8 based on locale only and pcre validates
utf-8 and the file content is in non-utf8 encoding, things break.

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Helped-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                          |  2 ++
 t/t7812-grep-icase-non-ascii.sh | 15 +++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/grep.c b/grep.c
index 22f4d99..6e99b01 100644
--- a/grep.c
+++ b/grep.c
@@ -329,6 +329,8 @@ static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
 			p->pcre_tables = pcre_maketables();
 		options |= PCRE_CASELESS;
 	}
+	if (is_utf8_locale() && has_non_ascii(p->pattern))
+		options |= PCRE_UTF8;
 
 	p->pcre_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
 				      p->pcre_tables);
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 5832684..842b26a 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -20,6 +20,21 @@ test_expect_success REGEX_LOCALE 'grep literal string, no -F' '
 	git grep -i "TILRAUN: HALLÓ HEIMUR!"
 '
 
+test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep pcre utf-8 icase' '
+	git grep --perl-regexp    "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.LLÓ HEIMUR!"
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep pcre utf-8 string with "+"' '
+	printf "TILRAUN: Hallóó Heimur!" >file2 &&
+	git add file2 &&
+	git grep -l --perl-regexp "TILRAUN: H.lló+ Heimur!" >actual &&
+	echo file >expected &&
+	echo file2 >>expected &&
+	test_cmp expected actual
+'
+
 test_expect_success REGEX_LOCALE 'grep literal string, with -F' '
 	git grep --debug -i -F "TILRAUN: Halló Heimur!"  2>&1 >/dev/null |
 		 grep fixed >debug1 &&
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v7 10/12] diffcore-pickaxe: "share" regex error handling code
  2016-02-14 11:49   ` [PATCH v7 00/12] nd/icase updates Nguyễn Thái Ngọc Duy
                       ` (8 preceding siblings ...)
  2016-02-14 11:49     ` [PATCH v7 09/12] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
@ 2016-02-14 11:49     ` Nguyễn Thái Ngọc Duy
  2016-02-14 11:49     ` [PATCH v7 11/12] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
  2016-02-14 11:49     ` [PATCH v7 12/12] grep.c: reuse "icase" variable Nguyễn Thái Ngọc Duy
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-14 11:49 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Nguyễn Thái Ngọc Duy

There's another regcomp code block coming in this function. By moving
the error handling code out of this block, we don't have to add the
same error handling code in the new block.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 diffcore-pickaxe.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 7715c13..69c6567 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -204,20 +204,13 @@ void diffcore_pickaxe(struct diff_options *o)
 	int opts = o->pickaxe_opts;
 	regex_t regex, *regexp = NULL;
 	kwset_t kws = NULL;
+	int err = 0;
 
 	if (opts & (DIFF_PICKAXE_REGEX | DIFF_PICKAXE_KIND_G)) {
-		int err;
 		int cflags = REG_EXTENDED | REG_NEWLINE;
 		if (DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE))
 			cflags |= REG_ICASE;
 		err = regcomp(&regex, needle, cflags);
-		if (err) {
-			/* The POSIX.2 people are surely sick */
-			char errbuf[1024];
-			regerror(err, &regex, errbuf, 1024);
-			regfree(&regex);
-			die("invalid regex: %s", errbuf);
-		}
 		regexp = &regex;
 	} else {
 		kws = kwsalloc(DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE)
@@ -225,6 +218,13 @@ void diffcore_pickaxe(struct diff_options *o)
 		kwsincr(kws, needle, strlen(needle));
 		kwsprep(kws);
 	}
+	if (err) {
+		/* The POSIX.2 people are surely sick */
+		char errbuf[1024];
+		regerror(err, &regex, errbuf, 1024);
+		regfree(&regex);
+		die("invalid regex: %s", errbuf);
+	}
 
 	/* Might want to warn when both S and G are on; I don't care... */
 	pickaxe(&diff_queued_diff, o, regexp, kws,
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v7 11/12] diffcore-pickaxe: support case insensitive match on non-ascii
  2016-02-14 11:49   ` [PATCH v7 00/12] nd/icase updates Nguyễn Thái Ngọc Duy
                       ` (9 preceding siblings ...)
  2016-02-14 11:49     ` [PATCH v7 10/12] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
@ 2016-02-14 11:49     ` Nguyễn Thái Ngọc Duy
  2016-02-14 11:49     ` [PATCH v7 12/12] grep.c: reuse "icase" variable Nguyễn Thái Ngọc Duy
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-14 11:49 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Nguyễn Thái Ngọc Duy

Similar to the "grep -F -i" case, we can't use kws on icase search
outside ascii range, so we quote the string and pass it to regcomp as
a basic regexp and let regex engine deal with case sensitivity.

The new test is put in t7812 instead of t4209-log-pickaxe because
lib-gettext.sh might cause problems elsewhere, probably.

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 diffcore-pickaxe.c              | 11 +++++++++++
 t/t7812-grep-icase-non-ascii.sh |  7 +++++++
 2 files changed, 18 insertions(+)

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 69c6567..0a5f877 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -7,6 +7,8 @@
 #include "diffcore.h"
 #include "xdiff-interface.h"
 #include "kwset.h"
+#include "commit.h"
+#include "quote.h"
 
 typedef int (*pickaxe_fn)(mmfile_t *one, mmfile_t *two,
 			  struct diff_options *o,
@@ -212,6 +214,15 @@ void diffcore_pickaxe(struct diff_options *o)
 			cflags |= REG_ICASE;
 		err = regcomp(&regex, needle, cflags);
 		regexp = &regex;
+	} else if (DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE) &&
+		   has_non_ascii(needle)) {
+		struct strbuf sb = STRBUF_INIT;
+		int cflags = REG_NEWLINE | REG_ICASE;
+
+		basic_regex_quote_buf(&sb, needle);
+		err = regcomp(&regex, sb.buf, cflags);
+		strbuf_release(&sb);
+		regexp = &regex;
 	} else {
 		kws = kwsalloc(DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE)
 			       ? tolower_trans_tbl : NULL);
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 842b26a..4176625 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -61,4 +61,11 @@ test_expect_success REGEX_LOCALE 'grep string with regex, with -F' '
 	test_cmp expect2 debug2
 '
 
+test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' '
+	git commit -m first &&
+	git log --format=%f -i -S"TILRAUN: HALLÓ HEIMUR!" >actual &&
+	echo first >expected &&
+	test_cmp expected actual
+'
+
 test_done
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v7 12/12] grep.c: reuse "icase" variable
  2016-02-14 11:49   ` [PATCH v7 00/12] nd/icase updates Nguyễn Thái Ngọc Duy
                       ` (10 preceding siblings ...)
  2016-02-14 11:49     ` [PATCH v7 11/12] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
@ 2016-02-14 11:49     ` Nguyễn Thái Ngọc Duy
  11 siblings, 0 replies; 56+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-02-14 11:49 UTC (permalink / raw)
  To: git; +Cc: Eric Sunshine, Junio C Hamano, Nguyễn Thái Ngọc Duy

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/grep.c b/grep.c
index 6e99b01..cb058a5 100644
--- a/grep.c
+++ b/grep.c
@@ -445,10 +445,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 		p->fixed = 0;
 
 	if (p->fixed) {
-		if (opt->regflags & REG_ICASE || p->ignore_case)
-			p->kws = kwsalloc(tolower_trans_tbl);
-		else
-			p->kws = kwsalloc(NULL);
+		p->kws = kwsalloc(icase ? tolower_trans_tbl : NULL);
 		kwsincr(p->kws, p->pattern, p->patternlen);
 		kwsprep(p->kws);
 		return;
-- 
2.7.0.377.g4cd97dd

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v6 01/11] grep: allow -F -i combination
  2016-02-06  2:03   ` [PATCH v6 01/11] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
@ 2016-06-17 21:54     ` Junio C Hamano
  2016-06-18  0:07       ` Duy Nguyen
  0 siblings, 1 reply; 56+ messages in thread
From: Junio C Hamano @ 2016-06-17 21:54 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git, Eric Sunshine, Ramsay Jones

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> -F means "no regex", not "case sensitive" so it should not override -i

That logic is flawed, isn't it?

"-F" means "no regex", so it should not touch opt.regflags at all.

> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  builtin/grep.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 5526fd7..4be0df5 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -809,7 +809,7 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>  
>  	if (!opt.pattern_list)
>  		die(_("no pattern given."));
> -	if (!opt.fixed && opt.ignore_case)
> +	if (opt.ignore_case)
>  		opt.regflags |= REG_ICASE;
>  
>  	compile_grep_patterns(&opt);

In grep.c, we do this:

static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
{
	int err;

	p->word_regexp = opt->word_regexp;
	p->ignore_case = opt->ignore_case;

	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
		p->fixed = 1;
	else
		p->fixed = 0;

	if (p->fixed) {
		if (opt->regflags & REG_ICASE || p->ignore_case)
			p->kws = kwsalloc(tolower_trans_tbl);
		else

It is possible that your later changes _depend_ on having REG_ICASE
set in opt->regflags, but if that is why this commit is needed, then
you are going in a wrong direction.

I liked the overall objective of making "-i" work better on
non-ASCII, and I wanted to like this whole series, but at least this
change needs to be dropped (and the remainder of the series fixed if
they depend on this change).

Thanks.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v6 00/11] Fix icase grep on non-ascii
  2016-02-06  2:02 ` [PATCH v6 00/11] " Nguyễn Thái Ngọc Duy
                     ` (12 preceding siblings ...)
  2016-02-14 11:49   ` [PATCH v7 00/12] nd/icase updates Nguyễn Thái Ngọc Duy
@ 2016-06-17 23:17   ` Junio C Hamano
  2016-06-18  0:26     ` Duy Nguyen
  13 siblings, 1 reply; 56+ messages in thread
From: Junio C Hamano @ 2016-06-17 23:17 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git, Eric Sunshine, Ramsay Jones

Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:

> v6 fixes comments from Ramsay and Eric. Interdiff below.

Another thing I noticed with this is that the non-ascii test breaks
when run under dash (but passes under bash).  You need to have is_IS
locale on the system to see the breakage, it seems (which is why I
didn't see it so far).


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v6 01/11] grep: allow -F -i combination
  2016-06-17 21:54     ` Junio C Hamano
@ 2016-06-18  0:07       ` Duy Nguyen
  0 siblings, 0 replies; 56+ messages in thread
From: Duy Nguyen @ 2016-06-18  0:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List, Eric Sunshine, Ramsay Jones

On Sat, Jun 18, 2016 at 4:54 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:
>
>> -F means "no regex", not "case sensitive" so it should not override -i
>
> That logic is flawed, isn't it?
>
> "-F" means "no regex", so it should not touch opt.regflags at all.
>
>> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
>> ---
>>  builtin/grep.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/builtin/grep.c b/builtin/grep.c
>> index 5526fd7..4be0df5 100644
>> --- a/builtin/grep.c
>> +++ b/builtin/grep.c
>> @@ -809,7 +809,7 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>>
>>       if (!opt.pattern_list)
>>               die(_("no pattern given."));
>> -     if (!opt.fixed && opt.ignore_case)
>> +     if (opt.ignore_case)
>>               opt.regflags |= REG_ICASE;
>>
>>       compile_grep_patterns(&opt);
>
> In grep.c, we do this:
>
> static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
> {
>         int err;
>
>         p->word_regexp = opt->word_regexp;
>         p->ignore_case = opt->ignore_case;
>
>         if (opt->fixed || is_fixed(p->pattern, p->patternlen))
>                 p->fixed = 1;
>         else
>                 p->fixed = 0;
>
>         if (p->fixed) {
>                 if (opt->regflags & REG_ICASE || p->ignore_case)
>                         p->kws = kwsalloc(tolower_trans_tbl);
>                 else
>
> It is possible that your later changes _depend_ on having REG_ICASE
> set in opt->regflags, but if that is why this commit is needed, then
> you are going in a wrong direction.

Yeah.. some commits down the line, we need to avoid kws (because it
can't deal with non-ascii) and fall back to regex after all special
chars are quoted. Will fix.
-- 
Duy

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v6 00/11] Fix icase grep on non-ascii
  2016-06-17 23:17   ` [PATCH v6 00/11] Fix icase grep on non-ascii Junio C Hamano
@ 2016-06-18  0:26     ` Duy Nguyen
  2016-06-22 18:29       ` Duy Nguyen
  0 siblings, 1 reply; 56+ messages in thread
From: Duy Nguyen @ 2016-06-18  0:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List, Eric Sunshine

On Sat, Jun 18, 2016 at 6:17 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:
>
>> v6 fixes comments from Ramsay and Eric. Interdiff below.
>
> Another thing I noticed with this is that the non-ascii test breaks
> when run under dash (but passes under bash).  You need to have is_IS
> locale on the system to see the breakage, it seems (which is why I
> didn't see it so far).

Is it a special version, maybe from debian? It works for me on gentoo.

> ~/w/git/temp/t $ equery  --quiet list dash
app-shells/dash-0.5.8.2
> ~/w/git/temp/t $ dash ./t7812-grep-icase-non-ascii.sh
# lib-gettext: Found 'is_IS.utf8' as an is_IS UTF-8 locale
# lib-gettext: Found 'is_IS.iso88591' as an is_IS ISO-8859-1 locale
ok 1 - setup
ok 2 - grep literal string, no -F
ok 3 - grep pcre utf-8 icase
ok 4 - grep pcre utf-8 string with "+"
ok 5 - grep literal string, with -F
ok 6 - grep string with regex, with -F
ok 7 - pickaxe -i on non-ascii
# passed all 7 test(s)
1..7
-- 
Duy

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v6 00/11] Fix icase grep on non-ascii
  2016-06-18  0:26     ` Duy Nguyen
@ 2016-06-22 18:29       ` Duy Nguyen
  2016-06-22 18:36         ` Junio C Hamano
  0 siblings, 1 reply; 56+ messages in thread
From: Duy Nguyen @ 2016-06-22 18:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List, Eric Sunshine

On Sat, Jun 18, 2016 at 2:26 AM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Sat, Jun 18, 2016 at 6:17 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> Nguyễn Thái Ngọc Duy  <pclouds@gmail.com> writes:
>>
>>> v6 fixes comments from Ramsay and Eric. Interdiff below.
>>
>> Another thing I noticed with this is that the non-ascii test breaks
>> when run under dash (but passes under bash).  You need to have is_IS
>> locale on the system to see the breakage, it seems (which is why I
>> didn't see it so far).
>
> Is it a special version, maybe from debian? It works for me on gentoo.
>
>> ~/w/git/temp/t $ equery  --quiet list dash
> app-shells/dash-0.5.8.2
>> ~/w/git/temp/t $ dash ./t7812-grep-icase-non-ascii.sh
> # lib-gettext: Found 'is_IS.utf8' as an is_IS UTF-8 locale
> # lib-gettext: Found 'is_IS.iso88591' as an is_IS ISO-8859-1 locale
> ok 1 - setup
> ok 2 - grep literal string, no -F
> ok 3 - grep pcre utf-8 icase
> ok 4 - grep pcre utf-8 string with "+"
> ok 5 - grep literal string, with -F
> ok 6 - grep string with regex, with -F
> ok 7 - pickaxe -i on non-ascii
> # passed all 7 test(s)
> 1..7

Can any shell wizards explain this to me? With this code

BS=\\
echo ${BS}${BS}

Debian's dash returns a single backslash while bash returns two
backslashes. Section 2.2.1 [1] does not say anything about one
backslash (hidden behind a variable!) after escaping the following one
and still eats the one after that..

[1] http://pubs.opengroup.org/onlinepubs/009604499/utilities/xcu_chap02.html
-- 
Duy

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v6 00/11] Fix icase grep on non-ascii
  2016-06-22 18:29       ` Duy Nguyen
@ 2016-06-22 18:36         ` Junio C Hamano
  2016-06-22 18:41           ` Duy Nguyen
  0 siblings, 1 reply; 56+ messages in thread
From: Junio C Hamano @ 2016-06-22 18:36 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List, Eric Sunshine

On Wed, Jun 22, 2016 at 11:29 AM, Duy Nguyen <pclouds@gmail.com> wrote:
>
> Can any shell wizards explain this to me? With this code
>
> BS=\\
> echo ${BS}${BS}
>
> Debian's dash returns a single backslash while bash returns two
> backslashes. Section 2.2.1 [1] does not say anything about one
> backslash (hidden behind a variable!) after escaping the following one
> and still eats the one after that..
>
> [1] http://pubs.opengroup.org/onlinepubs/009604499/utilities/xcu_chap02.html

I am not a wizard, but is the difference between the shell syntax, or just their
implementation of builtin-echo?  IOW, how do these three compare?

printf "%s\n" "${BS}${BS}"
echo "${BS}${BS}"
echo ${BS}$BS}

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v6 00/11] Fix icase grep on non-ascii
  2016-06-22 18:36         ` Junio C Hamano
@ 2016-06-22 18:41           ` Duy Nguyen
  2016-06-22 18:59             ` Junio C Hamano
  0 siblings, 1 reply; 56+ messages in thread
From: Duy Nguyen @ 2016-06-22 18:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List, Eric Sunshine

On Wed, Jun 22, 2016 at 8:36 PM, Junio C Hamano <gitster@pobox.com> wrote:
> On Wed, Jun 22, 2016 at 11:29 AM, Duy Nguyen <pclouds@gmail.com> wrote:
>>
>> Can any shell wizards explain this to me? With this code
>>
>> BS=\\
>> echo ${BS}${BS}
>>
>> Debian's dash returns a single backslash while bash returns two
>> backslashes. Section 2.2.1 [1] does not say anything about one
>> backslash (hidden behind a variable!) after escaping the following one
>> and still eats the one after that..
>>
>> [1] http://pubs.opengroup.org/onlinepubs/009604499/utilities/xcu_chap02.html
>
> I am not a wizard, but is the difference between the shell syntax, or just their
> implementation of builtin-echo?  IOW, how do these three compare?
>
> printf "%s\n" "${BS}${BS}"
> echo "${BS}${BS}"
> echo ${BS}$BS}

Great! printf shows two backslashes while both echo'es show one.
printf "\\\\" behaves like echo though. Doesn't matter, at least I
should be able to make the tests work on Debian dash.
-- 
Duy

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v6 00/11] Fix icase grep on non-ascii
  2016-06-22 18:41           ` Duy Nguyen
@ 2016-06-22 18:59             ` Junio C Hamano
  2016-06-22 19:32               ` Junio C Hamano
  0 siblings, 1 reply; 56+ messages in thread
From: Junio C Hamano @ 2016-06-22 18:59 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List, Eric Sunshine

On Wed, Jun 22, 2016 at 11:41 AM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Wed, Jun 22, 2016 at 8:36 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> On Wed, Jun 22, 2016 at 11:29 AM, Duy Nguyen <pclouds@gmail.com> wrote:
>>>
>>> Can any shell wizards explain this to me? With this code
>>>
>>> BS=\\
>>> echo ${BS}${BS}
>>>
>>> Debian's dash returns a single backslash while bash returns two
>>> backslashes. Section 2.2.1 [1] does not say anything about one
>>> backslash (hidden behind a variable!) after escaping the following one
>>> and still eats the one after that..
>>>
>>> [1] http://pubs.opengroup.org/onlinepubs/009604499/utilities/xcu_chap02.html
>>
>> I am not a wizard, but is the difference between the shell syntax, or just their
>> implementation of builtin-echo?  IOW, how do these three compare?
>>
>> printf "%s\n" "${BS}${BS}"
>> echo "${BS}${BS}"
>> echo ${BS}$BS}
>
> Great! printf shows two backslashes while both echo'es show one.
> printf "\\\\" behaves like echo though. Doesn't matter, at least I
> should be able to make the tests work on Debian dash.

I think somebody's implementation of "echo" is not POSIX kosher. According to

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/echo.html

you should expect a single backslash. If a script depends on seeing two, the
script is buggy.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v6 00/11] Fix icase grep on non-ascii
  2016-06-22 18:59             ` Junio C Hamano
@ 2016-06-22 19:32               ` Junio C Hamano
  0 siblings, 0 replies; 56+ messages in thread
From: Junio C Hamano @ 2016-06-22 19:32 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List, Eric Sunshine

Junio C Hamano <gitster@pobox.com> writes:

> I think somebody's implementation of "echo" is not POSIX kosher.

Oops, I misspoke.  s/POSIX/XSI/; obviously.

But the conclusion is the same.  echo '\\\tA' may say \\\tA or
backslash HT A, depending on the system, so we cannot rely on that
in tests that are meant to be portable.


^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2016-06-22 19:32 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-28 11:56 [PATCH v5 00/10] Fix icase grep on non-ascii Nguyễn Thái Ngọc Duy
2016-01-28 11:56 ` [PATCH v5 01/10] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
2016-01-28 11:56 ` [PATCH v5 02/10] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
2016-01-28 11:56 ` [PATCH v5 03/10] test-regex: expose full regcomp() to the command line Nguyễn Thái Ngọc Duy
2016-01-29  5:31   ` Eric Sunshine
2016-01-29 14:29     ` Ramsay Jones
2016-01-28 11:56 ` [PATCH v5 04/10] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
2016-01-29  6:18   ` Eric Sunshine
2016-01-29  6:41     ` Eric Sunshine
2016-01-28 11:56 ` [PATCH v5 05/10] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
2016-01-29  6:23   ` Eric Sunshine
2016-01-28 11:56 ` [PATCH v5 06/10] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
2016-01-28 11:56 ` [PATCH v5 07/10] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
2016-01-28 11:56 ` [PATCH v5 08/10] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
2016-01-28 11:56 ` [PATCH v5 09/10] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
2016-01-28 11:56 ` [PATCH v5 10/10] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
2016-01-29  6:38   ` Eric Sunshine
2016-01-28 23:54 ` [PATCH v5 00/10] Fix icase grep " Junio C Hamano
2016-02-06  2:02 ` [PATCH v6 00/11] " Nguyễn Thái Ngọc Duy
2016-02-06  2:03   ` [PATCH v6 01/11] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
2016-06-17 21:54     ` Junio C Hamano
2016-06-18  0:07       ` Duy Nguyen
2016-02-06  2:03   ` [PATCH v6 02/11] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
2016-02-09 18:20     ` Junio C Hamano
2016-02-06  2:03   ` [PATCH v6 03/11] test-regex: isolate the bug test code Nguyễn Thái Ngọc Duy
2016-02-06  2:03   ` [PATCH v6 04/11] test-regex: expose full regcomp() to the command line Nguyễn Thái Ngọc Duy
2016-02-07  8:44     ` Eric Sunshine
2016-02-09 18:21       ` Junio C Hamano
2016-02-06  2:03   ` [PATCH v6 05/11] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
2016-02-06  2:03   ` [PATCH v6 06/11] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
2016-02-06  2:03   ` [PATCH v6 07/11] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
2016-02-06  2:03   ` [PATCH v6 08/11] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
2016-02-06  2:03   ` [PATCH v6 09/11] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
2016-02-06  2:03   ` [PATCH v6 10/11] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
2016-02-06  2:03   ` [PATCH v6 11/11] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
2016-02-07  8:48   ` [PATCH v6 00/11] Fix icase grep " Eric Sunshine
2016-02-14 11:49   ` [PATCH v7 00/12] nd/icase updates Nguyễn Thái Ngọc Duy
2016-02-14 11:49     ` [PATCH v7 01/12] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
2016-02-14 11:49     ` [PATCH v7 02/12] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
2016-02-14 11:49     ` [PATCH v7 03/12] test-regex: isolate the bug test code Nguyễn Thái Ngọc Duy
2016-02-14 11:49     ` [PATCH v7 04/12] test-regex: expose full regcomp() to the command line Nguyễn Thái Ngọc Duy
2016-02-14 11:49     ` [PATCH v7 05/12] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
2016-02-14 11:49     ` [PATCH v7 06/12] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
2016-02-14 11:49     ` [PATCH v7 07/12] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
2016-02-14 11:49     ` [PATCH v7 08/12] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
2016-02-14 11:49     ` [PATCH v7 09/12] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
2016-02-14 11:49     ` [PATCH v7 10/12] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
2016-02-14 11:49     ` [PATCH v7 11/12] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
2016-02-14 11:49     ` [PATCH v7 12/12] grep.c: reuse "icase" variable Nguyễn Thái Ngọc Duy
2016-06-17 23:17   ` [PATCH v6 00/11] Fix icase grep on non-ascii Junio C Hamano
2016-06-18  0:26     ` Duy Nguyen
2016-06-22 18:29       ` Duy Nguyen
2016-06-22 18:36         ` Junio C Hamano
2016-06-22 18:41           ` Duy Nguyen
2016-06-22 18:59             ` Junio C Hamano
2016-06-22 19:32               ` Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.