From: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
To: git@vger.kernel.org
Cc: "Eric Sunshine" <sunshine@sunshineco.com>,
	"Ramsay Jones" <ramsay@ramsay1.demon.co.uk>,
	"Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
Subject: [PATCH v6 09/11] grep/pcre: support utf-8
Date: Sat,  6 Feb 2016 09:03:08 +0700
Message-ID: <1454724190-14063-10-git-send-email-pclouds@gmail.com>
In-Reply-To: <1454724190-14063-1-git-send-email-pclouds@gmail.com>

The previous change to this function added locale support for
single-byte encodings only. It looks like pcre supports only utf-* as
multibyte encodings; the others are left out in the cold (which is
fine).

We need to enable PCRE_UTF8 so that pcre can find character boundaries
correctly. This is needed for case folding (when --ignore-case is used)
or when '*', '+' or similar syntax is applied to a multibyte character.
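
To illustrate the boundary problem, here is a minimal sketch in plain C.
It is not part of this patch and does not use pcre; both helper names
are made up for illustration. It only shows that a character such as
"ó" occupies two bytes in utf-8 (0xC3 0xB3), so a matcher that works
byte by byte would apply '+' to the trailing 0xB3 byte alone rather
than to the whole character.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from git or pcre): length in bytes of the
 * utf-8 sequence introduced by lead byte `c`, or 0 if `c` cannot
 * start a sequence (continuation byte or invalid lead byte).
 */
static size_t utf8_seq_len(unsigned char c)
{
	if (c < 0x80)
		return 1;		/* plain ASCII */
	if ((c & 0xE0) == 0xC0)
		return 2;
	if ((c & 0xF0) == 0xE0)
		return 3;
	if ((c & 0xF8) == 0xF0)
		return 4;
	return 0;
}

/*
 * Hypothetical helper: count characters by stepping from one lead
 * byte to the next. Returns (size_t)-1 on a bad lead byte or a
 * truncated sequence. Continuation bytes are not validated, so this
 * is deliberately weaker than a real utf-8 validity check.
 */
static size_t utf8_count_chars(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	while (*s) {
		size_t i, len = utf8_seq_len((unsigned char)*s);

		if (!len)
			return (size_t)-1;
		for (i = 1; i < len; i++)
			if (!s[i])
				return (size_t)-1;	/* truncated */
		s += len;
		n++;
	}
	return n;
}
```

For "Hall\xc3\xb3\xc3\xb3" (that is, "Hallóó"), strlen() reports 8
bytes while utf8_count_chars() reports 6 characters. That gap is
exactly what a byte-oriented matcher cannot see.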

The has_non_ascii() check is there to be on the conservative side. If
the pattern contains no non-ascii characters, the searched content could
still be in utf-8, but we can treat it just like a byte stream and
everything should work. If we forced utf-8 based on the locale alone,
pcre would validate the content as utf-8, and things would break when
the file content is in a non-utf8 encoding.
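
The decision can be sketched in isolation as follows. This is only an
illustration under assumptions: pattern_has_non_ascii() and
want_utf8_mode() are hypothetical names, not git's has_non_ascii() or
any pcre API, and the caller is assumed to pass in the result of the
locale check.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a "does this string contain any byte
 * outside the 7-bit ASCII range" check. A NULL pattern is treated
 * as containing no such byte.
 */
static int pattern_has_non_ascii(const char *s)
{
	if (!s)
		return 0;
	for (; *s; s++)
		if ((unsigned char)*s & 0x80)
			return 1;
	return 0;
}

/*
 * Hypothetical helper: ask for utf-8 mode only when both conditions
 * hold. A pure-ASCII pattern stays in byte mode, so content in a
 * non-utf8 encoding is never subjected to utf-8 validation.
 */
static int want_utf8_mode(int locale_is_utf8, const char *pattern)
{
	assert(locale_is_utf8 == 0 || locale_is_utf8 == 1);
	return locale_is_utf8 && pattern_has_non_ascii(pattern);
}
```

With a utf-8 locale, a pattern such as "H.ll\xc3\xb3+" would select
utf-8 mode, while "H.llo+" would not. Without a utf-8 locale, neither
would.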

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Helped-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                          |  2 ++
 t/t7812-grep-icase-non-ascii.sh | 15 +++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/grep.c b/grep.c
index 843e180..aed4fe0 100644
--- a/grep.c
+++ b/grep.c
@@ -329,6 +329,8 @@ static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
 			p->pcre_tables = pcre_maketables();
 		options |= PCRE_CASELESS;
 	}
+	if (is_utf8_locale() && has_non_ascii(p->pattern))
+		options |= PCRE_UTF8;
 
 	p->pcre_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
 				      p->pcre_tables);
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 5832684..842b26a 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -20,6 +20,21 @@ test_expect_success REGEX_LOCALE 'grep literal string, no -F' '
 	git grep -i "TILRAUN: HALLÓ HEIMUR!"
 '
 
+test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep pcre utf-8 icase' '
+	git grep --perl-regexp    "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.LLÓ HEIMUR!"
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep pcre utf-8 string with "+"' '
+	printf "TILRAUN: Hallóó Heimur!" >file2 &&
+	git add file2 &&
+	git grep -l --perl-regexp "TILRAUN: H.lló+ Heimur!" >actual &&
+	echo file >expected &&
+	echo file2 >>expected &&
+	test_cmp expected actual
+'
+
 test_expect_success REGEX_LOCALE 'grep literal string, with -F' '
 	git grep --debug -i -F "TILRAUN: Halló Heimur!"  2>&1 >/dev/null |
 		 grep fixed >debug1 &&
-- 
2.7.0.377.g4cd97dd

