All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/2] commit: improve UTF-8 validation
@ 2013-07-04 17:17 brian m. carlson
  2013-07-04 17:19 ` [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints brian m. carlson
  2013-07-04 17:20 ` [PATCH v2 2/2] commit: reject overlong UTF-8 sequences brian m. carlson
  0 siblings, 2 replies; 11+ messages in thread
From: brian m. carlson @ 2013-07-04 17:17 UTC (permalink / raw)
  To: git; +Cc: gitster

This series contains a pair of patches that improve the validation of
the UTF-8 used in commit messages.  Invalid codepoints, such as
surrogates and guaranteed non-characters, are rejected, along with
overlong UTF-8 sequences.

Changes from v1:

* Improved comments to aid those less familiar with Unicode.
* Generated test files using printf as part of the test.
* Removed FIXME comments for things that have been fixed.
* Use a shorter form for detecting surrogate pairs.

brian m. carlson (2):
  commit: reject invalid UTF-8 codepoints
  commit: reject overlong UTF-8 sequences

 commit.c               | 34 ++++++++++++++++++++++++++++------
 t/t3900-i18n-commit.sh | 23 +++++++++++++++++++++++
 2 files changed, 51 insertions(+), 6 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints
  2013-07-04 17:17 [PATCH v2 0/2] commit: improve UTF-8 validation brian m. carlson
@ 2013-07-04 17:19 ` brian m. carlson
  2013-07-04 19:58   ` Torsten Bögershausen
  2013-07-05 12:51   ` Peter Krefting
  2013-07-04 17:20 ` [PATCH v2 2/2] commit: reject overlong UTF-8 sequences brian m. carlson
  1 sibling, 2 replies; 11+ messages in thread
From: brian m. carlson @ 2013-07-04 17:19 UTC (permalink / raw)
  To: git; +Cc: gitster

The commit code already contains code for validating UTF-8, but it does not
check for invalid values, such as guaranteed non-characters and surrogates.  Fix
this by explicitly checking for and rejecting such characters.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 commit.c               | 27 ++++++++++++++++++++++-----
 t/t3900-i18n-commit.sh | 12 ++++++++++++
 2 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/commit.c b/commit.c
index 888e02a..2264106 100644
--- a/commit.c
+++ b/commit.c
@@ -1244,6 +1244,7 @@ static int find_invalid_utf8(const char *buf, int len)
 	while (len) {
 		unsigned char c = *buf++;
 		int bytes, bad_offset;
+		unsigned int codepoint;
 
 		len--;
 		offset++;
@@ -1264,24 +1265,40 @@ static int find_invalid_utf8(const char *buf, int len)
 			bytes++;
 		}
 
-		/* Must be between 1 and 5 more bytes */
-		if (bytes < 1 || bytes > 5)
+		/*
+		 * Must be between 1 and 3 more bytes.  Longer sequences result in
+		 * codepoints beyond U+10FFFF, which are guaranteed never to exist.
+		 */
+		if (bytes < 1 || bytes > 3)
 			return bad_offset;
 
 		/* Do we *have* that many bytes? */
 		if (len < bytes)
 			return bad_offset;
 
+		/* Place the encoded bits at the bottom of the value. */
+		codepoint = (c & 0x7f) >> bytes;
+
 		offset += bytes;
 		len -= bytes;
 
 		/* And verify that they are good continuation bytes */
 		do {
+			codepoint <<= 6;
+			codepoint |= *buf & 0x3f;
 			if ((*buf++ & 0xc0) != 0x80)
 				return bad_offset;
 		} while (--bytes);
 
-		/* We could/should check the value and length here too */
+		/* No codepoints can ever be allocated beyond U+10FFFF. */
+		if (codepoint > 0x10ffff)
+			return bad_offset;
+		/* Surrogates are only for UTF-16 and cannot be encoded in UTF-8. */
+		if ((codepoint & 0x1ff800) == 0xd800)
+			return bad_offset;
+		/* U+FFFE and U+FFFF are guaranteed non-characters. */
+		if ((codepoint & 0x1ffffe) == 0xfffe)
+			return bad_offset;
 	}
 	return -1;
 }
@@ -1292,8 +1309,8 @@ static int find_invalid_utf8(const char *buf, int len)
  * If it isn't, it assumes any non-utf8 characters are Latin1,
  * and does the conversion.
  *
- * Fixme: we should probably also disallow overlong forms and
- * invalid characters. But we don't do that currently.
+ * Fixme: we should probably also disallow overlong forms.
+ * But we don't do that currently.
  */
 static int verify_utf8(struct strbuf *buf)
 {
diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index 37ddabb..ee8ba6c 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -39,6 +39,18 @@ test_expect_failure 'UTF-16 refused because of NULs' '
 	git commit -a -F "$TEST_DIRECTORY"/t3900/UTF-16.txt
 '
 
+test_expect_success 'UTF-8 invalid characters refused' '
+	test_when_finished "rm -f $HOME/stderr $HOME/invalid" && 
+	rm -f "$HOME/stderr" &&
+	echo "UTF-8 characters" >F &&
+	printf "Commit message\n\nInvalid surrogate:\355\240\200\n" \
+		>"$HOME/invalid" &&
+	git commit -a -F "$HOME/invalid" \
+		2>"$HOME"/stderr &&
+	grep "did not conform" "$HOME"/stderr
+'
+
+rm -f "$HOME/stderr"
 
 for H in ISO8859-1 eucJP ISO-2022-JP
 do
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 2/2] commit: reject overlong UTF-8 sequences
  2013-07-04 17:17 [PATCH v2 0/2] commit: improve UTF-8 validation brian m. carlson
  2013-07-04 17:19 ` [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints brian m. carlson
@ 2013-07-04 17:20 ` brian m. carlson
  1 sibling, 0 replies; 11+ messages in thread
From: brian m. carlson @ 2013-07-04 17:20 UTC (permalink / raw)
  To: git; +Cc: gitster

The commit code accepts pseudo-UTF-8 sequences that encode a character with more
bytes than necessary.  Reject such sequences, since they are not valid UTF-8.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 commit.c               | 17 +++++++++++------
 t/t3900-i18n-commit.sh | 11 +++++++++++
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/commit.c b/commit.c
index 2264106..b59c187 100644
--- a/commit.c
+++ b/commit.c
@@ -1240,11 +1240,15 @@ int commit_tree(const struct strbuf *msg, unsigned char *tree,
 static int find_invalid_utf8(const char *buf, int len)
 {
 	int offset = 0;
+	static const unsigned int max_codepoint[] = {
+		0x7f, 0x7ff, 0xffff, 0x10ffff
+	};
 
 	while (len) {
 		unsigned char c = *buf++;
 		int bytes, bad_offset;
 		unsigned int codepoint;
+		unsigned int min_val, max_val;
 
 		len--;
 		offset++;
@@ -1276,8 +1280,12 @@ static int find_invalid_utf8(const char *buf, int len)
 		if (len < bytes)
 			return bad_offset;
 
-		/* Place the encoded bits at the bottom of the value. */
+		/* Place the encoded bits at the bottom of the value and compute the
+		 * valid range.
+		 */
 		codepoint = (c & 0x7f) >> bytes;
+		min_val = max_codepoint[bytes-1] + 1;
+		max_val = max_codepoint[bytes];
 
 		offset += bytes;
 		len -= bytes;
@@ -1290,8 +1298,8 @@ static int find_invalid_utf8(const char *buf, int len)
 				return bad_offset;
 		} while (--bytes);
 
-		/* No codepoints can ever be allocated beyond U+10FFFF. */
-		if (codepoint > 0x10ffff)
+		/* Reject codepoints that are out of range for the sequence length. */
+		if (codepoint < min_val || codepoint > max_val)
 			return bad_offset;
 		/* Surrogates are only for UTF-16 and cannot be encoded in UTF-8. */
 		if ((codepoint & 0x1ff800) == 0xd800)
@@ -1308,9 +1316,6 @@ static int find_invalid_utf8(const char *buf, int len)
  *
  * If it isn't, it assumes any non-utf8 characters are Latin1,
  * and does the conversion.
- *
- * Fixme: we should probably also disallow overlong forms.
- * But we don't do that currently.
  */
 static int verify_utf8(struct strbuf *buf)
 {
diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index ee8ba6c..94fa1e8 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -50,6 +50,17 @@ test_expect_success 'UTF-8 invalid characters refused' '
 	grep "did not conform" "$HOME"/stderr
 '
 
+test_expect_success 'UTF-8 overlong sequences rejected' '
+	test_when_finished "rm -f $HOME/stderr $HOME/invalid" &&
+	rm -f "$HOME/stderr" "$HOME/invalid" &&
+	echo "UTF-8 overlong" >F &&
+	printf "\340\202\251ommit message\n\nThis is not a space:\300\240\n" \
+		>"$HOME/invalid" &&
+	git commit -a -F "$HOME/invalid" \
+		2>"$HOME"/stderr &&
+	grep "did not conform" "$HOME"/stderr
+'
+
 rm -f "$HOME/stderr"
 
 for H in ISO8859-1 eucJP ISO-2022-JP
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints
  2013-07-04 17:19 ` [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints brian m. carlson
@ 2013-07-04 19:58   ` Torsten Bögershausen
  2013-07-04 20:39     ` brian m. carlson
  2013-07-05 12:51   ` Peter Krefting
  1 sibling, 1 reply; 11+ messages in thread
From: Torsten Bögershausen @ 2013-07-04 19:58 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, gitster

On 2013-07-04 19.19, brian m. carlson wrote:
> The commit code already contains code for validating UTF-8, but it does not
> check for invalid values, such as guaranteed non-characters and surrogates.  Fix
s/guaranteed non-characters/code points out of range/
> this by explicitly checking for and rejecting such characters.
Do we really reject them, or do we (only) warn about them ? 

Other question:
Now that we have a check for codepoints out of range, beyond U+10FFFF,
do we want to have an additional testcase ?


> +test_expect_success 'UTF-8 invalid characters refused' '
May be:
 test_expect_success 'UTF-8 invalid surrogate' '


> +	test_when_finished "rm -f $HOME/stderr $HOME/invalid" && 
> +	rm -f "$HOME/stderr" &&
> +	echo "UTF-8 characters" >F &&
> +	printf "Commit message\n\nInvalid surrogate:\355\240\200\n" \
> +		>"$HOME/invalid" &&
good
> +	git commit -a -F "$HOME/invalid" \
> +		2>"$HOME"/stderr &&
> +	grep "did not conform" "$HOME"/stderr
> +'
> +
> +rm -f "$HOME/stderr"
Does it make sense to "grep on the fly", like this:
git commit -a -F "$HOME/invalid" 2>&1  | grep "did not conform"

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints
  2013-07-04 19:58   ` Torsten Bögershausen
@ 2013-07-04 20:39     ` brian m. carlson
  0 siblings, 0 replies; 11+ messages in thread
From: brian m. carlson @ 2013-07-04 20:39 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git, gitster

[-- Attachment #1: Type: text/plain, Size: 1921 bytes --]

On Thu, Jul 04, 2013 at 09:58:08PM +0200, Torsten Bögershausen wrote:
> On 2013-07-04 19.19, brian m. carlson wrote:
> > The commit code already contains code for validating UTF-8, but it does not
> > check for invalid values, such as guaranteed non-characters and surrogates.  Fix
> s/guaranteed non-characters/code points out of range/

The "such as" is meant to be illustrative, not all-inclusive, and my
patch does check for U+FFFE and U+FFFF, which are guaranteed
non-characters.

> > this by explicitly checking for and rejecting such characters.
> Do we really reject them, or do we (only) warn about them ? 

Well, find_invalid_utf8 rejects them as invalid, and verify_utf8 fixes
them up as if they were Latin-1, and commit_tree_extended warns about
them.  My interpretation was from the point of view of the code that I
touched (find_invalid_utf8), not the binary.  It would be nice if the
binary actually rejected it, too, but that isn't within the scope of
this patch.

> Other question:
> Now that we have a check for codepoints out of range, beyond U+10FFFF,
> do we want to have an additional testcase ?

Sure, why not?

> > +test_expect_success 'UTF-8 invalid characters refused' '
> May be:
>  test_expect_success 'UTF-8 invalid surrogate' '

Since I'll be adding at least one more unit test, as you requested, I'll
change the name.  I suppose I might as well add a test for the
non-characters as well.

> Does it make sense to "grep on the fly", like this:
> git commit -a -F "$HOME/invalid" 2>&1  | grep "did not conform"

I am interested in making sure that git commit succeeds, and using a
pipe will cause any failure of git commit to be ignored.

-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints
  2013-07-04 17:19 ` [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints brian m. carlson
  2013-07-04 19:58   ` Torsten Bögershausen
@ 2013-07-05 12:51   ` Peter Krefting
  2013-07-08 19:36     ` Junio C Hamano
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Krefting @ 2013-07-05 12:51 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Git Mailing List, gitster

brian m. carlson:

> +		/* U+FFFE and U+FFFF are guaranteed non-characters. */
> +		if ((codepoint & 0x1ffffe) == 0xfffe)
> +			return bad_offset;

I missed this the first time around: All Unicode characters whose 
lower 16-bits are FFFE or FFFF are non-characters, so you can re-write 
that to:

   /* U+xxFFFE and U+xxFFFF are guaranteed non-characters. */
   if ((codepoint & 0xfffe) == 0xfffe)
    return bad_offset;

Also, the range U+FDD0--U+FDEF are also non-characters, if you wish to 
be really pedantic.

$ grep '^[0-9A-F].*<not a' NamesList.txt
FDD0	<not a character>
FDD1	<not a character>
FDD2	<not a character>
FDD3	<not a character>
FDD4	<not a character>
FDD5	<not a character>
FDD6	<not a character>
FDD7	<not a character>
FDD8	<not a character>
FDD9	<not a character>
FDDA	<not a character>
FDDB	<not a character>
FDDC	<not a character>
FDDD	<not a character>
FDDE	<not a character>
FDDF	<not a character>
FDE0	<not a character>
FDE1	<not a character>
FDE2	<not a character>
FDE3	<not a character>
FDE4	<not a character>
FDE5	<not a character>
FDE6	<not a character>
FDE7	<not a character>
FDE8	<not a character>
FDE9	<not a character>
FDEA	<not a character>
FDEB	<not a character>
FDEC	<not a character>
FDED	<not a character>
FDEE	<not a character>
FDEF	<not a character>
FFFE	<not a character>
FFFF	<not a character>
1FFFE	<not a character>
1FFFF	<not a character>
2FFFE	<not a character>
2FFFF	<not a character>
3FFFE	<not a character>
3FFFF	<not a character>
4FFFE	<not a character>
4FFFF	<not a character>
5FFFE	<not a character>
5FFFF	<not a character>
6FFFE	<not a character>
6FFFF	<not a character>
7FFFE	<not a character>
7FFFF	<not a character>
8FFFE	<not a character>
8FFFF	<not a character>
9FFFE	<not a character>
9FFFF	<not a character>
AFFFE	<not a character>
AFFFF	<not a character>
BFFFE	<not a character>
BFFFF	<not a character>
CFFFE	<not a character>
CFFFF	<not a character>
DFFFE	<not a character>
DFFFF	<not a character>
EFFFE	<not a character>
EFFFF	<not a character>
FFFFE	<not a character>
FFFFF	<not a character>
10FFFE	<not a character>
10FFFF	<not a character>

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints
  2013-07-05 12:51   ` Peter Krefting
@ 2013-07-08 19:36     ` Junio C Hamano
  2013-07-09 11:16       ` [PATCH] commit: reject non-characters Peter Krefting
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2013-07-08 19:36 UTC (permalink / raw)
  To: Peter Krefting; +Cc: brian m. carlson, Git Mailing List

Peter Krefting <peter@softwolves.pp.se> writes:

> brian m. carlson:
>
>> +		/* U+FFFE and U+FFFF are guaranteed non-characters. */
>> +		if ((codepoint & 0x1ffffe) == 0xfffe)
>> +			return bad_offset;
>
> I missed this the first time around: All Unicode characters whose
> lower 16-bits are FFFE or FFFF are non-characters, so you can re-write
> that to:
>
>   /* U+xxFFFE and U+xxFFFF are guaranteed non-characters. */
>   if ((codepoint & 0xfffe) == 0xfffe)
>    return bad_offset;
>
> Also, the range U+FDD0--U+FDEF are also non-characters, if you wish to
> be really pedantic.

Yeah, while we are at it, doing this may not hurt.  I think Brian's
two patches are in fairly good shape otherwise, so perhaps you can
do this as a follow-up patch on top of the tip of the topic,
e82bd6cc (commit: reject overlong UTF-8 sequences, 2013-07-04)?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] commit: reject non-characters
  2013-07-08 19:36     ` Junio C Hamano
@ 2013-07-09 11:16       ` Peter Krefting
  2013-08-05 12:48         ` Peter Krefting
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Krefting @ 2013-07-09 11:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: brian m. carlson, Git Mailing List

Unicode clause D14 defines all characters U+nFFFE and U+nFFFF (where
0 <= n <= 10h) as well as the range U+FDD0..U+FDEF as non-characters,
reserved for internal use only.  Disallow these characters in commit
messages as they are normally not recommended for interchange.

Signed-off-by: Peter Krefting <peter@softwolves.pp.se>
---
Junio C Hamano:

> Yeah, while we are at it, doing this may not hurt.  I think Brian's
> two patches are in fairly good shape otherwise, so perhaps you can
> do this as a follow-up patch on top of the tip of the topic,
> e82bd6cc (commit: reject overlong UTF-8 sequences, 2013-07-04)?

OK, here you are. Enjoy :)

  commit.c               |  7 +++++--
  t/t3900-i18n-commit.sh | 18 ++++++++++++++++++
  2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/commit.c b/commit.c
index 5097dba..0587732 100644
--- a/commit.c
+++ b/commit.c
@@ -1305,8 +1305,11 @@ static int find_invalid_utf8(const char *buf, int len)
  		/* Surrogates are only for UTF-16 and cannot be encoded in UTF-8. */
  		if ((codepoint & 0x1ff800) == 0xd800)
  			return bad_offset;
-		/* U+FFFE and U+FFFF are guaranteed non-characters. */
-		if ((codepoint & 0x1ffffe) == 0xfffe)
+		/* U+xxFFFE and U+xxFFFF are guaranteed non-characters. */
+		if ((codepoint & 0xffffe) == 0xfffe)
+			return bad_offset;
+		/* So are anything in the range U+FDD0..U+FDEF. */
+		if (codepoint >= 0xfdd0 && codepoint <= 0xfdef)
  			return bad_offset;
  	}
  	return -1;
diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index 051ea9d..38b00c3 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -58,6 +58,24 @@ test_expect_success 'UTF-8 overlong sequences rejected' '
  	grep "did not conform" "$HOME"/stderr
  '

+test_expect_success 'UTF-8 non-characters refused' '
+	test_when_finished "rm -f $HOME/stderr $HOME/invalid" &&
+	echo "UTF-8 non-character 1" >F &&
+	printf "Commit message\n\nNon-character:\364\217\277\276\n" \
+		>"$HOME/invalid" &&
+	git commit -a -F "$HOME/invalid" 2>"$HOME"/stderr &&
+	grep "did not conform" "$HOME"/stderr
+'
+
+test_expect_success 'UTF-8 non-characters refused' '
+	test_when_finished "rm -f $HOME/stderr $HOME/invalid" &&
+	echo "UTF-8 non-character 2." >F &&
+	printf "Commit message\n\nNon-character:\357\267\220\n" \
+		>"$HOME/invalid" &&
+	git commit -a -F "$HOME/invalid" 2>"$HOME"/stderr &&
+	grep "did not conform" "$HOME"/stderr
+'
+
  for H in ISO8859-1 eucJP ISO-2022-JP
  do
  	test_expect_success "$H setup" '
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] commit: reject non-characters
  2013-07-09 11:16       ` [PATCH] commit: reject non-characters Peter Krefting
@ 2013-08-05 12:48         ` Peter Krefting
  2013-08-05 16:54           ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Krefting @ 2013-08-05 12:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: brian m. carlson, Git Mailing List

Peter Krefting:

> -		/* U+FFFE and U+FFFF are guaranteed non-characters. */
> -		if ((codepoint & 0x1ffffe) == 0xfffe)
> +		/* U+xxFFFE and U+xxFFFF are guaranteed non-characters. */
> +		if ((codepoint & 0xffffe) == 0xfffe)
> +			return bad_offset;

Drats, there is an F too many in the bitmask, it should be:

  +		if ((codepoint & 0xfffe) == 0xfffe)

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] commit: reject non-characters
  2013-08-05 12:48         ` Peter Krefting
@ 2013-08-05 16:54           ` Junio C Hamano
  2013-08-06  7:03             ` Peter Krefting
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2013-08-05 16:54 UTC (permalink / raw)
  To: Peter Krefting; +Cc: brian m. carlson, Git Mailing List

Peter Krefting <peter@softwolves.pp.se> writes:

> Peter Krefting:
>
>> -		/* U+FFFE and U+FFFF are guaranteed non-characters. */
>> -		if ((codepoint & 0x1ffffe) == 0xfffe)
>> +		/* U+xxFFFE and U+xxFFFF are guaranteed non-characters. */
>> +		if ((codepoint & 0xffffe) == 0xfffe)
>> +			return bad_offset;
>
> Drats, there is an F too many in the bitmask, it should be:
>
>  +		if ((codepoint & 0xfffe) == 0xfffe)

Indeed.

-- >8 --
Subject: [PATCH] commit: typofix for xxFFF[EF] check

We wanted to catch all codepoints that ends with FFFE and FFFF,
not with 0FFFE and 0FFFF.

Noticed and corrected by Peter Krefting.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 commit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 7dcfeea..38d8979 100644
--- a/commit.c
+++ b/commit.c
@@ -1306,7 +1306,7 @@ static int find_invalid_utf8(const char *buf, int len)
 		if ((codepoint & 0x1ff800) == 0xd800)
 			return bad_offset;
 		/* U+xxFFFE and U+xxFFFF are guaranteed non-characters. */
-		if ((codepoint & 0xffffe) == 0xfffe)
+		if ((codepoint & 0xfffe) == 0xfffe)
 			return bad_offset;
 		/* So are anything in the range U+FDD0..U+FDEF. */
 		if (codepoint >= 0xfdd0 && codepoint <= 0xfdef)
-- 
1.8.4-rc1-129-g1f3472b

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] commit: reject non-characters
  2013-08-05 16:54           ` Junio C Hamano
@ 2013-08-06  7:03             ` Peter Krefting
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Krefting @ 2013-08-06  7:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: brian m. carlson, Git Mailing List

Junio C Hamano:

> Indeed.

Thanks. Testcases are good, but not if they don't actually catch the 
bug one has just introduced :-)

> -- >8 --
> Subject: [PATCH] commit: typofix for xxFFF[EF] check
>
> We wanted to catch all codepoints that ends with FFFE and FFFF,
> not with 0FFFE and 0FFFF.
>
> Noticed and corrected by Peter Krefting.
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Signed-off-by: Peter Krefting <peter@softwolves.pp.se>

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2013-08-06  7:04 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-04 17:17 [PATCH v2 0/2] commit: improve UTF-8 validation brian m. carlson
2013-07-04 17:19 ` [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints brian m. carlson
2013-07-04 19:58   ` Torsten Bögershausen
2013-07-04 20:39     ` brian m. carlson
2013-07-05 12:51   ` Peter Krefting
2013-07-08 19:36     ` Junio C Hamano
2013-07-09 11:16       ` [PATCH] commit: reject non-characters Peter Krefting
2013-08-05 12:48         ` Peter Krefting
2013-08-05 16:54           ` Junio C Hamano
2013-08-06  7:03             ` Peter Krefting
2013-07-04 17:20 ` [PATCH v2 2/2] commit: reject overlong UTF-8 sequences brian m. carlson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.