All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] dir: allow a BOM at the beginning of exclude files
@ 2015-04-16 14:05 Carlos Martín Nieto
  2015-04-16 15:03 ` Johannes Schindelin
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Carlos Martín Nieto @ 2015-04-16 14:05 UTC (permalink / raw)
  To: git

Some text editors like Notepad or LibreOffice write an UTF-8 BOM in
order to indicate that the file is Unicode text rather than whatever the
current locale would indicate.

If someone uses such an editor to edit a gitignore file, we are left
with those three bytes at the beginning of the file. If we do not skip
them, we will attempt to match a filename with the BOM as prefix, which
won't match the files the user is expecting.

---

If you're wondering how I came up with LibreOffice, I was doing a
workshop recently and one of the participants was not content with the
choice of vim or nano, so he opened LibreOffice to edit the gitignore
file with confusing consequences.

This codepath doesn't go as far as the config code in validating that
we do not have a partial BOM which would mean there's some invalid
content, but we don't really have invalid content any other way, as
we're just dealing with a list of paths in the file.

 dir.c                      | 8 +++++++-
 t/t7061-wtstatus-ignore.sh | 2 ++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/dir.c b/dir.c
index 0943a81..6368247 100644
--- a/dir.c
+++ b/dir.c
@@ -581,6 +581,7 @@ int add_excludes_from_file_to_list(const char *fname,
 	struct stat st;
 	int fd, i, lineno = 1;
 	size_t size = 0;
+	static const unsigned char *utf8_bom = (unsigned char *) "\xef\xbb\xbf";
 	char *buf, *entry;
 
 	fd = open(fname, O_RDONLY);
@@ -617,7 +618,12 @@ int add_excludes_from_file_to_list(const char *fname,
 	}
 
 	el->filebuf = buf;
-	entry = buf;
+
+	if (size >= 3 && !memcmp(buf, utf8_bom, 3))
+		entry = buf + 3;
+	else
+		entry = buf;
+
 	for (i = 0; i < size; i++) {
 		if (buf[i] == '\n') {
 			if (entry != buf + i && entry[0] != '#') {
diff --git a/t/t7061-wtstatus-ignore.sh b/t/t7061-wtstatus-ignore.sh
index 460789b..0a06fbf 100755
--- a/t/t7061-wtstatus-ignore.sh
+++ b/t/t7061-wtstatus-ignore.sh
@@ -13,6 +13,8 @@ EOF
 
 test_expect_success 'status untracked directory with --ignored' '
 	echo "ignored" >.gitignore &&
+	sed -e "s/^/\xef\xbb\xbf/" .gitignore >.gitignore.new &&
+	mv .gitignore.new .gitignore &&
 	mkdir untracked &&
 	: >untracked/ignored &&
 	: >untracked/uncommitted &&
-- 
2.0.0.5.gbce14aa

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] dir: allow a BOM at the beginning of exclude files
  2015-04-16 14:05 [PATCH] dir: allow a BOM at the beginning of exclude files Carlos Martín Nieto
@ 2015-04-16 15:03 ` Johannes Schindelin
  2015-04-16 15:09   ` Carlos Martín Nieto
  2015-04-16 15:10 ` Carlos Martín Nieto
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Johannes Schindelin @ 2015-04-16 15:03 UTC (permalink / raw)
  To: Carlos Martín Nieto; +Cc: git, git-owner

Hi Carlos,

On 2015-04-16 16:05, Carlos Martín Nieto wrote:
> Some text editors like Notepad or LibreOffice write an UTF-8 BOM in
> order to indicate that the file is Unicode text rather than whatever the
> current locale would indicate.
> 
> If someone uses such an editor to edit a gitignore file, we are left
> with those three bytes at the beginning of the file. If we do not skip
> them, we will attempt to match a filename with the BOM as prefix, which
> won't match the files the user is expecting.
> 
> ---
> 
> If you're wondering how I came up with LibreOffice, I was doing a
> workshop recently and one of the participants was not content with the
> choice of vim or nano, so he opened LibreOffice to edit the gitignore
> file with confusing consequences.
> 
> This codepath doesn't go as far as the config code in validating that
> we do not have a partial BOM which would mean there's some invalid
> content, but we don't really have invalid content any other way, as
> we're just dealing with a list of paths in the file.

Yeah, users are entertaining!

I agree that this is a good patch. *Maybe* we would need the same handling in more places, in which case it might make sense to refactor the test into its own function.

In any case, though, the Git project requires a [Developer's Certificate of Origin](https://github.com/git/git/blob/v2.3.5/Documentation/SubmittingPatches#L234-L277); Would you mind adding that?

Thanks,
Dscho

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] dir: allow a BOM at the beginning of exclude files
  2015-04-16 15:03 ` Johannes Schindelin
@ 2015-04-16 15:09   ` Carlos Martín Nieto
  0 siblings, 0 replies; 24+ messages in thread
From: Carlos Martín Nieto @ 2015-04-16 15:09 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

On Thu, 2015-04-16 at 17:03 +0200, Johannes Schindelin wrote:
> Hi Carlos,
> 
> On 2015-04-16 16:05, Carlos Martín Nieto wrote:
> > Some text editors like Notepad or LibreOffice write an UTF-8 BOM in
> > order to indicate that the file is Unicode text rather than whatever the
> > current locale would indicate.
> > 
> > If someone uses such an editor to edit a gitignore file, we are left
> > with those three bytes at the beginning of the file. If we do not skip
> > them, we will attempt to match a filename with the BOM as prefix, which
> > won't match the files the user is expecting.
> > 
> > ---
> > 
> > If you're wondering how I came up with LibreOffice, I was doing a
> > workshop recently and one of the participants was not content with the
> > choice of vim or nano, so he opened LibreOffice to edit the gitignore
> > file with confusing consequences.
> > 
> > This codepath doesn't go as far as the config code in validating that
> > we do not have a partial BOM which would mean there's some invalid
> > content, but we don't really have invalid content any other way, as
> > we're just dealing with a list of paths in the file.
> 
> Yeah, users are entertaining!
> 
> I agree that this is a good patch. *Maybe* we would need the same handling in more places, in which case it might make sense to refactor the test into its own function.

Yes, this was my train of thought. If we (discover that) need it in a
third place, then we can unify the test and skip. For two places I
reckoned it was fine if we duplicated a bit.

> 
> In any case, though, the Git project requires a [Developer's Certificate of Origin](https://github.com/git/git/blob/v2.3.5/Documentation/SubmittingPatches#L234-L277); Would you mind adding that?
> 

Yeah, sorry. I keep forgetting to do that. I'll reply to my original
e-mail with it.

Cheers,
   cmn

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] dir: allow a BOM at the beginning of exclude files
  2015-04-16 14:05 [PATCH] dir: allow a BOM at the beginning of exclude files Carlos Martín Nieto
  2015-04-16 15:03 ` Johannes Schindelin
@ 2015-04-16 15:10 ` Carlos Martín Nieto
  2015-04-16 15:39 ` Junio C Hamano
  2015-04-16 16:10 ` Torsten Bögershausen
  3 siblings, 0 replies; 24+ messages in thread
From: Carlos Martín Nieto @ 2015-04-16 15:10 UTC (permalink / raw)
  To: git

On Thu, 2015-04-16 at 16:05 +0200, Carlos Martín Nieto wrote:
> Some text editors like Notepad or LibreOffice write an UTF-8 BOM in
> order to indicate that the file is Unicode text rather than whatever the
> current locale would indicate.
> 
> If someone uses such an editor to edit a gitignore file, we are left
> with those three bytes at the beginning of the file. If we do not skip
> them, we will attempt to match a filename with the BOM as prefix, which
> won't match the files the user is expecting.

Signed-off-by: Carlos Martín Nieto <cmn@elego.de>

which I keep forgetting.

> 
> ---
> 
> If you're wondering how I came up with LibreOffice, I was doing a
> workshop recently and one of the participants was not content with the
> choice of vim or nano, so he opened LibreOffice to edit the gitignore
> file with confusing consequences.
> 
> This codepath doesn't go as far as the config code in validating that
> we do not have a partial BOM which would mean there's some invalid
> content, but we don't really have invalid content any other way, as
> we're just dealing with a list of paths in the file.
> 
>  dir.c                      | 8 +++++++-
>  t/t7061-wtstatus-ignore.sh | 2 ++
>  2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/dir.c b/dir.c
> index 0943a81..6368247 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -581,6 +581,7 @@ int add_excludes_from_file_to_list(const char *fname,
>  	struct stat st;
>  	int fd, i, lineno = 1;
>  	size_t size = 0;
> +	static const unsigned char *utf8_bom = (unsigned char *) "\xef\xbb\xbf";
>  	char *buf, *entry;
>  
>  	fd = open(fname, O_RDONLY);
> @@ -617,7 +618,12 @@ int add_excludes_from_file_to_list(const char *fname,
>  	}
>  
>  	el->filebuf = buf;
> -	entry = buf;
> +
> +	if (size >= 3 && !memcmp(buf, utf8_bom, 3))
> +		entry = buf + 3;
> +	else
> +		entry = buf;
> +
>  	for (i = 0; i < size; i++) {
>  		if (buf[i] == '\n') {
>  			if (entry != buf + i && entry[0] != '#') {
> diff --git a/t/t7061-wtstatus-ignore.sh b/t/t7061-wtstatus-ignore.sh
> index 460789b..0a06fbf 100755
> --- a/t/t7061-wtstatus-ignore.sh
> +++ b/t/t7061-wtstatus-ignore.sh
> @@ -13,6 +13,8 @@ EOF
>  
>  test_expect_success 'status untracked directory with --ignored' '
>  	echo "ignored" >.gitignore &&
> +	sed -e "s/^/\xef\xbb\xbf/" .gitignore >.gitignore.new &&
> +	mv .gitignore.new .gitignore &&
>  	mkdir untracked &&
>  	: >untracked/ignored &&
>  	: >untracked/uncommitted &&

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] dir: allow a BOM at the beginning of exclude files
  2015-04-16 14:05 [PATCH] dir: allow a BOM at the beginning of exclude files Carlos Martín Nieto
  2015-04-16 15:03 ` Johannes Schindelin
  2015-04-16 15:10 ` Carlos Martín Nieto
@ 2015-04-16 15:39 ` Junio C Hamano
  2015-04-16 15:55   ` Jeff King
  2015-04-16 16:08   ` [PATCH] dir: allow a BOM at the beginning of exclude files Johannes Schindelin
  2015-04-16 16:10 ` Torsten Bögershausen
  3 siblings, 2 replies; 24+ messages in thread
From: Junio C Hamano @ 2015-04-16 15:39 UTC (permalink / raw)
  To: Carlos Martín Nieto; +Cc: git

Carlos Martín Nieto <cmn@elego.de> writes:

> diff --git a/dir.c b/dir.c
> index 0943a81..6368247 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -581,6 +581,7 @@ int add_excludes_from_file_to_list(const char *fname,
>  	struct stat st;
>  	int fd, i, lineno = 1;
>  	size_t size = 0;
> +	static const unsigned char *utf8_bom = (unsigned char *) "\xef\xbb\xbf";
>  	char *buf, *entry;
>  
>  	fd = open(fname, O_RDONLY);
> @@ -617,7 +618,12 @@ int add_excludes_from_file_to_list(const char *fname,
>  	}
>  
>  	el->filebuf = buf;
> -	entry = buf;
> +
> +	if (size >= 3 && !memcmp(buf, utf8_bom, 3))

OK.

> +		entry = buf + 3;
> +	else
> +		entry = buf;
> +
>  	for (i = 0; i < size; i++) {
>  		if (buf[i] == '\n') {
>  			if (entry != buf + i && entry[0] != '#') {
> diff --git a/t/t7061-wtstatus-ignore.sh b/t/t7061-wtstatus-ignore.sh
> index 460789b..0a06fbf 100755
> --- a/t/t7061-wtstatus-ignore.sh
> +++ b/t/t7061-wtstatus-ignore.sh
> @@ -13,6 +13,8 @@ EOF
>  
>  test_expect_success 'status untracked directory with --ignored' '
>  	echo "ignored" >.gitignore &&
> +	sed -e "s/^/\xef\xbb\xbf/" .gitignore >.gitignore.new &&
> +	mv .gitignore.new .gitignore &&

Is this "write literal in \xHEX on the replacement side of sed
substitution" potable?  In any case, replacing the above three with
something like:

	printf "<bom>ignored\n" >.gitignore

may be more sensible, no?

Also do we need a similar change to the attribute side, or are we
already covered and we forgot to do the same for the ignore files?


>  	mkdir untracked &&
>  	: >untracked/ignored &&
>  	: >untracked/uncommitted &&

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] dir: allow a BOM at the beginning of exclude files
  2015-04-16 15:39 ` Junio C Hamano
@ 2015-04-16 15:55   ` Jeff King
  2015-04-16 17:16     ` Junio C Hamano
  2015-04-16 16:08   ` [PATCH] dir: allow a BOM at the beginning of exclude files Johannes Schindelin
  1 sibling, 1 reply; 24+ messages in thread
From: Jeff King @ 2015-04-16 15:55 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Carlos Martín Nieto, git

On Thu, Apr 16, 2015 at 08:39:55AM -0700, Junio C Hamano wrote:

> >  test_expect_success 'status untracked directory with --ignored' '
> >  	echo "ignored" >.gitignore &&
> > +	sed -e "s/^/\xef\xbb\xbf/" .gitignore >.gitignore.new &&
> > +	mv .gitignore.new .gitignore &&
> 
> Is this "write literal in \xHEX on the replacement side of sed
> substitution" potable?  In any case, replacing the above three with
> something like:
> 
> 	printf "<bom>ignored\n" >.gitignore
> 
> may be more sensible, no?

I'm not sure about sed, but I agree it is suspect. And note that printf
with hex codes is not portable, either You have to use octal:

  printf '\357\273\277ignored\n' >.gitignore

Also, as a nit, I'd much rather see this in its own test rather than
crammed into another test_expect_success. It's much easier to diagnose
failures if the test description mentions the goal, and it is not tied
up with testing other parts that might fail.

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] dir: allow a BOM at the beginning of exclude files
  2015-04-16 15:39 ` Junio C Hamano
  2015-04-16 15:55   ` Jeff King
@ 2015-04-16 16:08   ` Johannes Schindelin
  1 sibling, 0 replies; 24+ messages in thread
From: Johannes Schindelin @ 2015-04-16 16:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Carlos Martín Nieto, git, git-owner

Hi,

On 2015-04-16 17:39, Junio C Hamano wrote:

> Also do we need a similar change to the attribute side, or are we
> already covered and we forgot to do the same for the ignore files?

I fear so: https://github.com/git/git/blob/v2.3.5/attr.c#L359-L376

As for the config, we are safe: https://github.com/git/git/blob/v2.3.5/config.c#L419-L439

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] dir: allow a BOM at the beginning of exclude files
  2015-04-16 14:05 [PATCH] dir: allow a BOM at the beginning of exclude files Carlos Martín Nieto
                   ` (2 preceding siblings ...)
  2015-04-16 15:39 ` Junio C Hamano
@ 2015-04-16 16:10 ` Torsten Bögershausen
  3 siblings, 0 replies; 24+ messages in thread
From: Torsten Bögershausen @ 2015-04-16 16:10 UTC (permalink / raw)
  To: Carlos Martín Nieto, git

On 2015-04-16 16.05, Carlos Martín Nieto wrote:
[]
May be it is easier to move this into an own function, like remove_utf8_bom() ?

>  dir.c                      | 8 +++++++-
>  t/t7061-wtstatus-ignore.sh | 2 ++
>  2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/dir.c b/dir.c
> index 0943a81..6368247 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -581,6 +581,7 @@ int add_excludes_from_file_to_list(const char *fname,
>  	struct stat st;
>  	int fd, i, lineno = 1;
>  	size_t size = 0;
> +	static const unsigned char *utf8_bom = (unsigned char *) "\xef\xbb\xbf";
Do we really need to cast here (and if, is the cast dropping the "const" ?)

Another suggestion, see below:
either:
	static const size_t bom_len = 3;
or
	static const size_t bom_len = strlen(utf8_bom);

>  	char *buf, *entry;
>  
>  	fd = open(fname, O_RDONLY);
> @@ -617,7 +618,12 @@ int add_excludes_from_file_to_list(const char *fname,
>  	}
>  
>  	el->filebuf = buf;
> -	entry = buf;
> +
And now we can avoid magic numbers:
	if (size >= bom_len && !memcmp(buf, utf8_bom, bom_len))
		entry = buf + bom_len;
	else
		entry = buf;
[]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] dir: allow a BOM at the beginning of exclude files
  2015-04-16 15:55   ` Jeff King
@ 2015-04-16 17:16     ` Junio C Hamano
  2015-04-16 17:52       ` [PATCH 0/3] UTF8 BOM follow-up Junio C Hamano
                         ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Junio C Hamano @ 2015-04-16 17:16 UTC (permalink / raw)
  To: Jeff King; +Cc: Carlos Martín Nieto, git

Jeff King <peff@peff.net> writes:

> On Thu, Apr 16, 2015 at 08:39:55AM -0700, Junio C Hamano wrote:
>
>> >  test_expect_success 'status untracked directory with --ignored' '
>> >  	echo "ignored" >.gitignore &&
>> > +	sed -e "s/^/\xef\xbb\xbf/" .gitignore >.gitignore.new &&
>> > +	mv .gitignore.new .gitignore &&
>> 
>> Is this "write literal in \xHEX on the replacement side of sed
>> substitution" potable?  In any case, replacing the above three with
>> something like:
>> 
>> 	printf "<bom>ignored\n" >.gitignore
>> 
>> may be more sensible, no?
>
> I'm not sure about sed, but I agree it is suspect. And note that printf
> with hex codes is not portable, either You have to use octal:
>
>   printf '\357\273\277ignored\n' >.gitignore
>
> Also, as a nit, I'd much rather see this in its own test rather than
> crammed into another test_expect_success. It's much easier to diagnose
> failures if the test description mentions the goal, and it is not tied
> up with testing other parts that might fail.

Yeah, I totally agree.

Carlos, something like this squashed in, perhaps?

 t/t7061-wtstatus-ignore.sh | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/t/t7061-wtstatus-ignore.sh b/t/t7061-wtstatus-ignore.sh
index 0a06fbf..cdc0747 100755
--- a/t/t7061-wtstatus-ignore.sh
+++ b/t/t7061-wtstatus-ignore.sh
@@ -13,8 +13,6 @@ EOF
 
 test_expect_success 'status untracked directory with --ignored' '
 	echo "ignored" >.gitignore &&
-	sed -e "s/^/\xef\xbb\xbf/" .gitignore >.gitignore.new &&
-	mv .gitignore.new .gitignore &&
 	mkdir untracked &&
 	: >untracked/ignored &&
 	: >untracked/uncommitted &&
@@ -22,6 +20,15 @@ test_expect_success 'status untracked directory with --ignored' '
 	test_cmp expected actual
 '
 
+test_expect_success 'same with gitignore starting with BOM' '
+	printf "\357\273\277ignored\n" >.gitignore &&
+	mkdir -p untracked &&
+	: >untracked/ignored &&
+	: >untracked/uncommitted &&
+	git status --porcelain --ignored >actual &&
+	test_cmp expected actual
+'
+
 cat >expected <<\EOF
 ?? .gitignore
 ?? actual

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 0/3] UTF8 BOM follow-up
  2015-04-16 17:16     ` Junio C Hamano
@ 2015-04-16 17:52       ` Junio C Hamano
  2015-04-16 17:52         ` [PATCH 1/3] utf8-bom: introduce skip_utf8_bom() helper Junio C Hamano
                           ` (2 more replies)
  2015-04-16 18:27       ` [PATCH] dir: allow a BOM at the beginning of exclude files Carlos Martín Nieto
  2015-04-16 18:39       ` [PATCH v2 0/4] UTF8 BOM follow-up Junio C Hamano
  2 siblings, 3 replies; 24+ messages in thread
From: Junio C Hamano @ 2015-04-16 17:52 UTC (permalink / raw)
  To: git; +Cc: Carlos Martín Nieto, Jeff King

This is on top of the ".gitignore can start with UTF8 BOM" patch
from Carlos.

Junio C Hamano (3):
  utf8-bom: introduce skip_utf8_bom() helper
  config: use utf8_bom[] from utf.[ch] in git_parse_source()
  attr: skip UTF8 BOM at the beginning of the input file

 attr.c   |  9 +++++++--
 config.c |  6 +++---
 dir.c    |  8 +++-----
 utf8.c   | 11 +++++++++++
 utf8.h   |  3 +++
 5 files changed, 27 insertions(+), 10 deletions(-)

-- 
2.4.0-rc2-171-g98ddf7f

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/3] utf8-bom: introduce skip_utf8_bom() helper
  2015-04-16 17:52       ` [PATCH 0/3] UTF8 BOM follow-up Junio C Hamano
@ 2015-04-16 17:52         ` Junio C Hamano
  2015-04-16 18:14           ` Jeff King
  2015-04-16 17:52         ` [PATCH 2/3] config: use utf8_bom[] from utf.[ch] in git_parse_source() Junio C Hamano
  2015-04-16 17:52         ` [PATCH 3/3] attr: skip UTF8 BOM at the beginning of the input file Junio C Hamano
  2 siblings, 1 reply; 24+ messages in thread
From: Junio C Hamano @ 2015-04-16 17:52 UTC (permalink / raw)
  To: git; +Cc: Carlos Martín Nieto, Jeff King

With the recent change to ignore the UTF8 BOM at the beginning of
.gitignore files, we now have two codepaths that do such a skipping
(the other one is for reading the configuration files).

Introduce utf8_bom[] constant string and skip_utf8_bom() helper
and teach .gitignore code how to use it.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 dir.c  |  8 +++-----
 utf8.c | 11 +++++++++++
 utf8.h |  3 +++
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/dir.c b/dir.c
index 10c1f90..9e04a23 100644
--- a/dir.c
+++ b/dir.c
@@ -12,6 +12,7 @@
 #include "refs.h"
 #include "wildmatch.h"
 #include "pathspec.h"
+#include "utf8.h"
 
 struct path_simplify {
 	int len;
@@ -538,7 +539,6 @@ int add_excludes_from_file_to_list(const char *fname,
 	struct stat st;
 	int fd, i, lineno = 1;
 	size_t size = 0;
-	static const unsigned char *utf8_bom = (unsigned char *) "\xef\xbb\xbf";
 	char *buf, *entry;
 
 	fd = open(fname, O_RDONLY);
@@ -576,10 +576,8 @@ int add_excludes_from_file_to_list(const char *fname,
 
 	el->filebuf = buf;
 
-	if (size >= 3 && !memcmp(buf, utf8_bom, 3))
-		entry = buf + 3;
-	else
-		entry = buf;
+	entry = buf;
+	skip_utf8_bom(&entry, size);
 
 	for (i = 0; i < size; i++) {
 		if (buf[i] == '\n') {
diff --git a/utf8.c b/utf8.c
index 520fbb4..28e6d76 100644
--- a/utf8.c
+++ b/utf8.c
@@ -633,3 +633,14 @@ int is_hfs_dotgit(const char *path)
 
 	return 1;
 }
+
+const char utf8_bom[] = "\357\273\277";
+
+int skip_utf8_bom(char **text, size_t len)
+{
+	if (len < strlen(utf8_bom) ||
+	    memcmp(*text, utf8_bom, strlen(utf8_bom)))
+		return 0;
+	*text += strlen(utf8_bom);
+	return 1;
+}
diff --git a/utf8.h b/utf8.h
index e4d9183..e7b2aa4 100644
--- a/utf8.h
+++ b/utf8.h
@@ -13,6 +13,9 @@ int same_encoding(const char *, const char *);
 __attribute__((format (printf, 2, 3)))
 int utf8_fprintf(FILE *, const char *, ...);
 
+extern const char utf8_bom[];
+extern int skip_utf8_bom(char **, size_t);
+
 void strbuf_add_wrapped_text(struct strbuf *buf,
 		const char *text, int indent, int indent2, int width);
 void strbuf_add_wrapped_bytes(struct strbuf *buf, const char *data, int len,
-- 
2.4.0-rc2-171-g98ddf7f

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/3] config: use utf8_bom[] from utf.[ch] in git_parse_source()
  2015-04-16 17:52       ` [PATCH 0/3] UTF8 BOM follow-up Junio C Hamano
  2015-04-16 17:52         ` [PATCH 1/3] utf8-bom: introduce skip_utf8_bom() helper Junio C Hamano
@ 2015-04-16 17:52         ` Junio C Hamano
  2015-04-16 17:52         ` [PATCH 3/3] attr: skip UTF8 BOM at the beginning of the input file Junio C Hamano
  2 siblings, 0 replies; 24+ messages in thread
From: Junio C Hamano @ 2015-04-16 17:52 UTC (permalink / raw)
  To: git; +Cc: Carlos Martín Nieto, Jeff King

Because the function reads one character at the time, unfortunately
we cannot use the easier skip_utf8_bom() helper, but at least we do
not have to duplicate the constant string this way.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 config.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/config.c b/config.c
index 752e2e2..9618aa4 100644
--- a/config.c
+++ b/config.c
@@ -12,6 +12,7 @@
 #include "quote.h"
 #include "hashmap.h"
 #include "string-list.h"
+#include "utf8.h"
 
 struct config_source {
 	struct config_source *prev;
@@ -412,8 +413,7 @@ static int git_parse_source(config_fn_t fn, void *data)
 	struct strbuf *var = &cf->var;
 
 	/* U+FEFF Byte Order Mark in UTF8 */
-	static const unsigned char *utf8_bom = (unsigned char *) "\xef\xbb\xbf";
-	const unsigned char *bomptr = utf8_bom;
+	const char *bomptr = utf8_bom;
 
 	for (;;) {
 		int c = get_next_char();
@@ -421,7 +421,7 @@ static int git_parse_source(config_fn_t fn, void *data)
 			/* We are at the file beginning; skip UTF8-encoded BOM
 			 * if present. Sane editors won't put this in on their
 			 * own, but e.g. Windows Notepad will do it happily. */
-			if ((unsigned char) c == *bomptr) {
+			if (c == (*bomptr & 0377)) {
 				bomptr++;
 				continue;
 			} else {
-- 
2.4.0-rc2-171-g98ddf7f

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/3] attr: skip UTF8 BOM at the beginning of the input file
  2015-04-16 17:52       ` [PATCH 0/3] UTF8 BOM follow-up Junio C Hamano
  2015-04-16 17:52         ` [PATCH 1/3] utf8-bom: introduce skip_utf8_bom() helper Junio C Hamano
  2015-04-16 17:52         ` [PATCH 2/3] config: use utf8_bom[] from utf.[ch] in git_parse_source() Junio C Hamano
@ 2015-04-16 17:52         ` Junio C Hamano
  2 siblings, 0 replies; 24+ messages in thread
From: Junio C Hamano @ 2015-04-16 17:52 UTC (permalink / raw)
  To: git; +Cc: Carlos Martín Nieto, Jeff King

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 attr.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/attr.c b/attr.c
index cd54697..7c530f4 100644
--- a/attr.c
+++ b/attr.c
@@ -12,6 +12,7 @@
 #include "exec_cmd.h"
 #include "attr.h"
 #include "dir.h"
+#include "utf8.h"
 
 const char git_attr__true[] = "(builtin)true";
 const char git_attr__false[] = "\0(builtin)false";
@@ -369,8 +370,12 @@ static struct attr_stack *read_attr_from_file(const char *path, int macro_ok)
 		return NULL;
 	}
 	res = xcalloc(1, sizeof(*res));
-	while (fgets(buf, sizeof(buf), fp))
-		handle_attr_line(res, buf, path, ++lineno, macro_ok);
+	while (fgets(buf, sizeof(buf), fp)) {
+		char *bufp = buf;
+		if (!lineno)
+			skip_utf8_bom(&bufp, strlen(bufp));
+		handle_attr_line(res, bufp, path, ++lineno, macro_ok);
+	}
 	fclose(fp);
 	return res;
 }
-- 
2.4.0-rc2-171-g98ddf7f

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] utf8-bom: introduce skip_utf8_bom() helper
  2015-04-16 17:52         ` [PATCH 1/3] utf8-bom: introduce skip_utf8_bom() helper Junio C Hamano
@ 2015-04-16 18:14           ` Jeff King
  2015-04-16 18:23             ` Junio C Hamano
  0 siblings, 1 reply; 24+ messages in thread
From: Jeff King @ 2015-04-16 18:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Carlos Martín Nieto

On Thu, Apr 16, 2015 at 10:52:52AM -0700, Junio C Hamano wrote:

> @@ -576,10 +576,8 @@ int add_excludes_from_file_to_list(const char *fname,
>  
>  	el->filebuf = buf;
>  
> -	if (size >= 3 && !memcmp(buf, utf8_bom, 3))
> -		entry = buf + 3;
> -	else
> -		entry = buf;
> +	entry = buf;
> +	skip_utf8_bom(&entry, size);
>  
>  	for (i = 0; i < size; i++) {
>  		if (buf[i] == '\n') {

I'm surprised that in both yours and the original that we do not need to
subtract 3 from "size".

It looks like we advance "entry" here, not "buf", and then iterate over
"buf". But I think that makes the later logic weird:

   if (entry != buf + i && entry[0] != '#')

because if there is a BOM, we end up with "entry > buf + i", which I
think this code isn't expecting. I'm not sure it does anything bad, but
I think it might be simpler as just:

  /* save away the "real" copy for later, as we do now */
  el->filebuf = buf;

  /*
   * now pretend as if the BOM was not there at all by advancing
   * the pointer and shrinking the size
   */
  skip_utf8_bom(&buf, &size);

  /*
   * and now we do our usual magic with "entry"
   */
  entry = buf;
  for (i = 0; i < size; i++)
     ...

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] utf8-bom: introduce skip_utf8_bom() helper
  2015-04-16 18:14           ` Jeff King
@ 2015-04-16 18:23             ` Junio C Hamano
  0 siblings, 0 replies; 24+ messages in thread
From: Junio C Hamano @ 2015-04-16 18:23 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Carlos Martín Nieto

Jeff King <peff@peff.net> writes:

> On Thu, Apr 16, 2015 at 10:52:52AM -0700, Junio C Hamano wrote:
>
>> @@ -576,10 +576,8 @@ int add_excludes_from_file_to_list(const char *fname,
>>  
>>  	el->filebuf = buf;
>>  
>> -	if (size >= 3 && !memcmp(buf, utf8_bom, 3))
>> -		entry = buf + 3;
>> -	else
>> -		entry = buf;
>> +	entry = buf;
>> +	skip_utf8_bom(&entry, size);
>>  
>>  	for (i = 0; i < size; i++) {
>>  		if (buf[i] == '\n') {
>
> I'm surprised that in both yours and the original that we do not need to
> subtract 3 from "size".

Or we start scanning from the beginning of "buf", i.e.

	for (i = 0; i < size; i++)

After you pointed it out, I wondered why we do not adjust the
initial value of "i" (without futzing with "size").  But...

> It looks like we advance "entry" here, not "buf", and then iterate over
> "buf". But I think that makes the later logic weird:
>
>    if (entry != buf + i && entry[0] != '#')
>
> because if there is a BOM, we end up with "entry > buf + i", which I
> think this code isn't expecting. I'm not sure it does anything bad, but
> I think it might be simpler as just:
>
>   /* save away the "real" copy for later, as we do now */
>   el->filebuf = buf;
>
>   /*
>    * now pretend as if the BOM was not there at all by advancing
>    * the pointer and shrinking the size
>    */
>   skip_utf8_bom(&buf, &size);
>
>   /*
>    * and now we do our usual magic with "entry"
>    */
>   entry = buf;
>   for (i = 0; i < size; i++)
>      ...

... this would work much better for this caller.

Thanks.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] dir: allow a BOM at the beginning of exclude files
  2015-04-16 17:16     ` Junio C Hamano
  2015-04-16 17:52       ` [PATCH 0/3] UTF8 BOM follow-up Junio C Hamano
@ 2015-04-16 18:27       ` Carlos Martín Nieto
  2015-04-16 18:39       ` [PATCH v2 0/4] UTF8 BOM follow-up Junio C Hamano
  2 siblings, 0 replies; 24+ messages in thread
From: Carlos Martín Nieto @ 2015-04-16 18:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

On Thu, 2015-04-16 at 10:16 -0700, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
> > On Thu, Apr 16, 2015 at 08:39:55AM -0700, Junio C Hamano wrote:
> >
> >> >  test_expect_success 'status untracked directory with --ignored' '
> >> >  	echo "ignored" >.gitignore &&
> >> > +	sed -e "s/^/\xef\xbb\xbf/" .gitignore >.gitignore.new &&
> >> > +	mv .gitignore.new .gitignore &&
> >> 
> >> Is this "write literal in \xHEX on the replacement side of sed
> >> substitution" potable?  In any case, replacing the above three with
> >> something like:
> >> 
> >> 	printf "<bom>ignored\n" >.gitignore
> >> 
> >> may be more sensible, no?
> >
> > I'm not sure about sed, but I agree it is suspect. And note that printf
> > with hex codes is not portable, either You have to use octal:
> >
> >   printf '\357\273\277ignored\n' >.gitignore
> >
> > Also, as a nit, I'd much rather see this in its own test rather than
> > crammed into another test_expect_success. It's much easier to diagnose
> > failures if the test description mentions the goal, and it is not tied
> > up with testing other parts that might fail.
> 
> Yeah, I totally agree.
> 
> Carlos, something like this squashed in, perhaps?
> 
>  t/t7061-wtstatus-ignore.sh | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/t/t7061-wtstatus-ignore.sh b/t/t7061-wtstatus-ignore.sh
> index 0a06fbf..cdc0747 100755
> --- a/t/t7061-wtstatus-ignore.sh
> +++ b/t/t7061-wtstatus-ignore.sh
> @@ -13,8 +13,6 @@ EOF
>  
>  test_expect_success 'status untracked directory with --ignored' '
>  	echo "ignored" >.gitignore &&
> -	sed -e "s/^/\xef\xbb\xbf/" .gitignore >.gitignore.new &&
> -	mv .gitignore.new .gitignore &&
>  	mkdir untracked &&
>  	: >untracked/ignored &&
>  	: >untracked/uncommitted &&
> @@ -22,6 +20,15 @@ test_expect_success 'status untracked directory with --ignored' '
>  	test_cmp expected actual
>  '
>  
> +test_expect_success 'same with gitignore starting with BOM' '
> +	printf "\357\273\277ignored\n" >.gitignore &&
> +	mkdir -p untracked &&
> +	: >untracked/ignored &&
> +	: >untracked/uncommitted &&
> +	git status --porcelain --ignored >actual &&
> +	test_cmp expected actual
> +'
> +
>  cat >expected <<\EOF
>  ?? .gitignore
>  ?? actual
> 

Yeah, that makes sense. I had something similar in my patch at one point
before going with modifying the current one.

   cmn

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 0/4] UTF8 BOM follow-up
  2015-04-16 17:16     ` Junio C Hamano
  2015-04-16 17:52       ` [PATCH 0/3] UTF8 BOM follow-up Junio C Hamano
  2015-04-16 18:27       ` [PATCH] dir: allow a BOM at the beginning of exclude files Carlos Martín Nieto
@ 2015-04-16 18:39       ` Junio C Hamano
  2015-04-16 18:39         ` [PATCH v2 1/4] add_excludes_from_file: clarify the bom skipping logic Junio C Hamano
                           ` (5 more replies)
  2 siblings, 6 replies; 24+ messages in thread
From: Junio C Hamano @ 2015-04-16 18:39 UTC (permalink / raw)
  To: git; +Cc: Carlos Martín Nieto, Jeff King

This is on top of the ".gitignore can start with UTF8 BOM" patch
from Carlos.

Second try; the first patch is new to clarify the logic in the
codeflow after Carlos's patch, and the second one has been adjusted
accordingly.

Junio C Hamano (4):
  add_excludes_from_file: clarify the bom skipping logic
  utf8-bom: introduce skip_utf8_bom() helper
  config: use utf8_bom[] from utf.[ch] in git_parse_source()
  attr: skip UTF8 BOM at the beginning of the input file

 attr.c   |  9 +++++++--
 config.c |  6 +++---
 dir.c    | 10 +++++-----
 utf8.c   | 11 +++++++++++
 utf8.h   |  3 +++
 5 files changed, 29 insertions(+), 10 deletions(-)

-- 
2.4.0-rc2-171-g98ddf7f

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 1/4] add_excludes_from_file: clarify the bom skipping logic
  2015-04-16 18:39       ` [PATCH v2 0/4] UTF8 BOM follow-up Junio C Hamano
@ 2015-04-16 18:39         ` Junio C Hamano
  2015-04-16 18:39         ` [PATCH v2 2/4] utf8-bom: introduce skip_utf8_bom() helper Junio C Hamano
                           ` (4 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Junio C Hamano @ 2015-04-16 18:39 UTC (permalink / raw)
  To: git

Even though the previous step shifts where the "entry" begins, we
still iterate over the original buf[], which may begin with the
UTF-8 BOM we are supposed to be skipping.  At the end of the first
line, the code grabs the contents of it starting at "entry", so
there is nothing wrong per-se, but the logic looks really confused.

Instead, move the buf pointer and shrink its size, to truly
pretend that UTF-8 BOM did not exist in the input.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 dir.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/dir.c b/dir.c
index 10c1f90..b5bb389 100644
--- a/dir.c
+++ b/dir.c
@@ -576,10 +576,11 @@ int add_excludes_from_file_to_list(const char *fname,
 
 	el->filebuf = buf;
 
-	if (size >= 3 && !memcmp(buf, utf8_bom, 3))
-		entry = buf + 3;
-	else
-		entry = buf;
+	if (size >= 3 && !memcmp(buf, utf8_bom, 3)) {
+		buf += 3;
+		size -= 3;
+	}
+	entry = buf;
 
 	for (i = 0; i < size; i++) {
 		if (buf[i] == '\n') {
-- 
2.4.0-rc2-171-g98ddf7f

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 2/4] utf8-bom: introduce skip_utf8_bom() helper
  2015-04-16 18:39       ` [PATCH v2 0/4] UTF8 BOM follow-up Junio C Hamano
  2015-04-16 18:39         ` [PATCH v2 1/4] add_excludes_from_file: clarify the bom skipping logic Junio C Hamano
@ 2015-04-16 18:39         ` Junio C Hamano
  2015-04-16 18:39         ` [PATCH v2 3/4] config: use utf8_bom[] from utf.[ch] in git_parse_source() Junio C Hamano
                           ` (3 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Junio C Hamano @ 2015-04-16 18:39 UTC (permalink / raw)
  To: git

With the recent change to ignore the UTF8 BOM at the beginning of
.gitignore files, we now have two codepaths that do such a skipping
(the other one is for reading the configuration files).

Introduce utf8_bom[] constant string and skip_utf8_bom() helper
and teach .gitignore code how to use it.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 dir.c  |  9 ++++-----
 utf8.c | 11 +++++++++++
 utf8.h |  3 +++
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/dir.c b/dir.c
index b5bb389..4c4bf91 100644
--- a/dir.c
+++ b/dir.c
@@ -12,6 +12,7 @@
 #include "refs.h"
 #include "wildmatch.h"
 #include "pathspec.h"
+#include "utf8.h"
 
 struct path_simplify {
 	int len;
@@ -538,7 +539,6 @@ int add_excludes_from_file_to_list(const char *fname,
 	struct stat st;
 	int fd, i, lineno = 1;
 	size_t size = 0;
-	static const unsigned char *utf8_bom = (unsigned char *) "\xef\xbb\xbf";
 	char *buf, *entry;
 
 	fd = open(fname, O_RDONLY);
@@ -576,10 +576,9 @@ int add_excludes_from_file_to_list(const char *fname,
 
 	el->filebuf = buf;
 
-	if (size >= 3 && !memcmp(buf, utf8_bom, 3)) {
-		buf += 3;
-		size -= 3;
-	}
+	if (skip_utf8_bom(&buf, size))
+		size -= buf - el->filebuf;
+
 	entry = buf;
 
 	for (i = 0; i < size; i++) {
diff --git a/utf8.c b/utf8.c
index 520fbb4..28e6d76 100644
--- a/utf8.c
+++ b/utf8.c
@@ -633,3 +633,14 @@ int is_hfs_dotgit(const char *path)
 
 	return 1;
 }
+
+const char utf8_bom[] = "\357\273\277";
+
+int skip_utf8_bom(char **text, size_t len)
+{
+	if (len < strlen(utf8_bom) ||
+	    memcmp(*text, utf8_bom, strlen(utf8_bom)))
+		return 0;
+	*text += strlen(utf8_bom);
+	return 1;
+}
diff --git a/utf8.h b/utf8.h
index e4d9183..e7b2aa4 100644
--- a/utf8.h
+++ b/utf8.h
@@ -13,6 +13,9 @@ int same_encoding(const char *, const char *);
 __attribute__((format (printf, 2, 3)))
 int utf8_fprintf(FILE *, const char *, ...);
 
+extern const char utf8_bom[];
+extern int skip_utf8_bom(char **, size_t);
+
 void strbuf_add_wrapped_text(struct strbuf *buf,
 		const char *text, int indent, int indent2, int width);
 void strbuf_add_wrapped_bytes(struct strbuf *buf, const char *data, int len,
-- 
2.4.0-rc2-171-g98ddf7f

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 3/4] config: use utf8_bom[] from utf.[ch] in git_parse_source()
  2015-04-16 18:39       ` [PATCH v2 0/4] UTF8 BOM follow-up Junio C Hamano
  2015-04-16 18:39         ` [PATCH v2 1/4] add_excludes_from_file: clarify the bom skipping logic Junio C Hamano
  2015-04-16 18:39         ` [PATCH v2 2/4] utf8-bom: introduce skip_utf8_bom() helper Junio C Hamano
@ 2015-04-16 18:39         ` Junio C Hamano
  2015-04-16 18:39         ` [PATCH v2 4/4] attr: skip UTF8 BOM at the beginning of the input file Junio C Hamano
                           ` (2 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Junio C Hamano @ 2015-04-16 18:39 UTC (permalink / raw)
  To: git

Because the function reads one character at the time, unfortunately
we cannot use the easier skip_utf8_bom() helper, but at least we do
not have to duplicate the constant string this way.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 config.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/config.c b/config.c
index 752e2e2..9618aa4 100644
--- a/config.c
+++ b/config.c
@@ -12,6 +12,7 @@
 #include "quote.h"
 #include "hashmap.h"
 #include "string-list.h"
+#include "utf8.h"
 
 struct config_source {
 	struct config_source *prev;
@@ -412,8 +413,7 @@ static int git_parse_source(config_fn_t fn, void *data)
 	struct strbuf *var = &cf->var;
 
 	/* U+FEFF Byte Order Mark in UTF8 */
-	static const unsigned char *utf8_bom = (unsigned char *) "\xef\xbb\xbf";
-	const unsigned char *bomptr = utf8_bom;
+	const char *bomptr = utf8_bom;
 
 	for (;;) {
 		int c = get_next_char();
@@ -421,7 +421,7 @@ static int git_parse_source(config_fn_t fn, void *data)
 			/* We are at the file beginning; skip UTF8-encoded BOM
 			 * if present. Sane editors won't put this in on their
 			 * own, but e.g. Windows Notepad will do it happily. */
-			if ((unsigned char) c == *bomptr) {
+			if (c == (*bomptr & 0377)) {
 				bomptr++;
 				continue;
 			} else {
-- 
2.4.0-rc2-171-g98ddf7f

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 4/4] attr: skip UTF8 BOM at the beginning of the input file
  2015-04-16 18:39       ` [PATCH v2 0/4] UTF8 BOM follow-up Junio C Hamano
                           ` (2 preceding siblings ...)
  2015-04-16 18:39         ` [PATCH v2 3/4] config: use utf8_bom[] from utf.[ch] in git_parse_source() Junio C Hamano
@ 2015-04-16 18:39         ` Junio C Hamano
  2015-04-16 19:26         ` [PATCH v2 0/4] UTF8 BOM follow-up Jeff King
  2015-04-17 22:44         ` Karsten Blees
  5 siblings, 0 replies; 24+ messages in thread
From: Junio C Hamano @ 2015-04-16 18:39 UTC (permalink / raw)
  To: git

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 attr.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/attr.c b/attr.c
index cd54697..7c530f4 100644
--- a/attr.c
+++ b/attr.c
@@ -12,6 +12,7 @@
 #include "exec_cmd.h"
 #include "attr.h"
 #include "dir.h"
+#include "utf8.h"
 
 const char git_attr__true[] = "(builtin)true";
 const char git_attr__false[] = "\0(builtin)false";
@@ -369,8 +370,12 @@ static struct attr_stack *read_attr_from_file(const char *path, int macro_ok)
 		return NULL;
 	}
 	res = xcalloc(1, sizeof(*res));
-	while (fgets(buf, sizeof(buf), fp))
-		handle_attr_line(res, buf, path, ++lineno, macro_ok);
+	while (fgets(buf, sizeof(buf), fp)) {
+		char *bufp = buf;
+		if (!lineno)
+			skip_utf8_bom(&bufp, strlen(bufp));
+		handle_attr_line(res, bufp, path, ++lineno, macro_ok);
+	}
 	fclose(fp);
 	return res;
 }
-- 
2.4.0-rc2-171-g98ddf7f

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/4] UTF8 BOM follow-up
  2015-04-16 18:39       ` [PATCH v2 0/4] UTF8 BOM follow-up Junio C Hamano
                           ` (3 preceding siblings ...)
  2015-04-16 18:39         ` [PATCH v2 4/4] attr: skip UTF8 BOM at the beginning of the input file Junio C Hamano
@ 2015-04-16 19:26         ` Jeff King
  2015-04-17 22:44         ` Karsten Blees
  5 siblings, 0 replies; 24+ messages in thread
From: Jeff King @ 2015-04-16 19:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Carlos Martín Nieto

On Thu, Apr 16, 2015 at 11:39:04AM -0700, Junio C Hamano wrote:

> This is on top of the ".gitignore can start with UTF8 BOM" patch
> from Carlos.
> 
> Second try; the first patch is new to clarify the logic in the
> codeflow after Carlos's patch, and the second one has been adjusted
> accordingly.
> 
> Junio C Hamano (4):
>   add_excludes_from_file: clarify the bom skipping logic
>   utf8-bom: introduce skip_utf8_bom() helper
>   config: use utf8_bom[] from utf.[ch] in git_parse_source()
>   attr: skip UTF8 BOM at the beginning of the input file

This one looks OK to me. The manual adjustment of "size" in the second
one is a little funny, but I guess avoiding a pointer for that size
parameter makes the final one (that uses strlen) a lot easier.

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/4] UTF8 BOM follow-up
  2015-04-16 18:39       ` [PATCH v2 0/4] UTF8 BOM follow-up Junio C Hamano
                           ` (4 preceding siblings ...)
  2015-04-16 19:26         ` [PATCH v2 0/4] UTF8 BOM follow-up Jeff King
@ 2015-04-17 22:44         ` Karsten Blees
  2015-04-20 21:50           ` Junio C Hamano
  5 siblings, 1 reply; 24+ messages in thread
From: Karsten Blees @ 2015-04-17 22:44 UTC (permalink / raw)
  To: Junio C Hamano, git; +Cc: Carlos Martín Nieto, Jeff King

Am 16.04.2015 um 20:39 schrieb Junio C Hamano:
> This is on top of the ".gitignore can start with UTF8 BOM" patch
> from Carlos.
> 
> Second try; the first patch is new to clarify the logic in the
> codeflow after Carlos's patch, and the second one has been adjusted
> accordingly.
> 
> Junio C Hamano (4):
>   add_excludes_from_file: clarify the bom skipping logic
>   utf8-bom: introduce skip_utf8_bom() helper
>   config: use utf8_bom[] from utf.[ch] in git_parse_source()
>   attr: skip UTF8 BOM at the beginning of the input file
> 


Wouldn't it be better to just strip the BOM on commit, e.g. via a clean filter or pre-commit hook (as suggested in [1])? Or is this patch series only meant to supplement such a solution (i.e. only strip the BOM when reading files from the working-copy rather than the committed tree)?


According to rfc3629 chapter 6 [2], the use of a BOM as encoding signature should be forbidden if the encoding is *known* to be always UTF-8. And .gitignore, .gitattributes and .gitmodules contain path names, which are always UTF-8 as of Git for Windows v1.7.10.

IOW, allowing a BOM would mean that files *without* BOM are *not* UTF-8 and need to be decoded from e.g. system encoding (which unfortunately cannot be set to UTF-8 on Windows). But this makes no sense as the repository would not be portable. E.g. a .gitattributes file created on a Greek Windows, containing greek path names in Cp1253, would not work on platforms with different encoding.

On the other hand, just ignoring the BOM (as this patch series does) leaves us with two alternative binary representations of the same content file...i.e. we'll eventually end up with spurious 1st line changes as users add / remove BOMs from committed .git[ignore|attributes|modules] files, depending on their editor preference...


For local files (.gitconfig, .git/info/exclude, .git/COMMIT_EDITMSG...), auto-detecting encoding based on the presence of a BOM makes somewhat more sense. However, this will most likely break editors that follow the recommendation of the Unicode specification ("Use of a BOM is neither required nor recommended for UTF-8" [3]). So we'd probably need a core.editorEncoding or core.editorUseBom setting to tell git whether "no BOM" means UTF-8 or system encoding...

Just as a reminder: we should update the Git for Windows Unicode document [4] if we improve support for BOM-adamant editors.

Cheers,
Karsten

[1] http://stackoverflow.com/questions/27223985/git-ignore-bom-prevent-git-diff-from-showing-byte-order-mark-changes
[2] https://tools.ietf.org/html/rfc3629
[3] http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf  p.40
[4] https://github.com/msysgit/msysgit/wiki/Git-for-Windows-Unicode-Support#editor

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/4] UTF8 BOM follow-up
  2015-04-17 22:44         ` Karsten Blees
@ 2015-04-20 21:50           ` Junio C Hamano
  0 siblings, 0 replies; 24+ messages in thread
From: Junio C Hamano @ 2015-04-20 21:50 UTC (permalink / raw)
  To: Karsten Blees; +Cc: git, Carlos Martín Nieto, Jeff King

Karsten Blees <karsten.blees@gmail.com> writes:

> Wouldn't it be better to just strip the BOM on commit, e.g. via a
> clean filter or pre-commit hook (as suggested in [1])?

The users can do whatever they want and if they think having a BOM
in these files is a bad idea, I'd encourage them to use whatever
means to ensure that.  The code and history hygiene is a good thing.

But you should realize that $HOME/.gitconfig, $GIT_DIR/info/exclude,
$GIT_DIR/config, etc.  are not even committed files in the first
place.  These are not even defined to be "UTF-8 only" by us.  Their
contents is entirely up to the end users.

Here with these changes, we are only being nice to the users by
stripping a well-known two-byte sequence that is known to be left
commonly by some tools users would use. In a sense, this is the same
degree of niceness that we strip the CR at the end of the line
before LF.  Just like you _could_ have said these files must be
encoded in UTF-8 and must not have BOM at the beginning, we _could_
have defined that these files must be recorded with LF end-of-line.
But obviously we don't, as there is no need to make lives of end
users unnecessarily more complex, and it is easy to help users use
both LF and CRLF with simply stripping on our reader's side.  We do
this BOM stripping for the same reason to make it easier for users.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-04-20 21:50 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-16 14:05 [PATCH] dir: allow a BOM at the beginning of exclude files Carlos Martín Nieto
2015-04-16 15:03 ` Johannes Schindelin
2015-04-16 15:09   ` Carlos Martín Nieto
2015-04-16 15:10 ` Carlos Martín Nieto
2015-04-16 15:39 ` Junio C Hamano
2015-04-16 15:55   ` Jeff King
2015-04-16 17:16     ` Junio C Hamano
2015-04-16 17:52       ` [PATCH 0/3] UTF8 BOM follow-up Junio C Hamano
2015-04-16 17:52         ` [PATCH 1/3] utf8-bom: introduce skip_utf8_bom() helper Junio C Hamano
2015-04-16 18:14           ` Jeff King
2015-04-16 18:23             ` Junio C Hamano
2015-04-16 17:52         ` [PATCH 2/3] config: use utf8_bom[] from utf.[ch] in git_parse_source() Junio C Hamano
2015-04-16 17:52         ` [PATCH 3/3] attr: skip UTF8 BOM at the beginning of the input file Junio C Hamano
2015-04-16 18:27       ` [PATCH] dir: allow a BOM at the beginning of exclude files Carlos Martín Nieto
2015-04-16 18:39       ` [PATCH v2 0/4] UTF8 BOM follow-up Junio C Hamano
2015-04-16 18:39         ` [PATCH v2 1/4] add_excludes_from_file: clarify the bom skipping logic Junio C Hamano
2015-04-16 18:39         ` [PATCH v2 2/4] utf8-bom: introduce skip_utf8_bom() helper Junio C Hamano
2015-04-16 18:39         ` [PATCH v2 3/4] config: use utf8_bom[] from utf.[ch] in git_parse_source() Junio C Hamano
2015-04-16 18:39         ` [PATCH v2 4/4] attr: skip UTF8 BOM at the beginning of the input file Junio C Hamano
2015-04-16 19:26         ` [PATCH v2 0/4] UTF8 BOM follow-up Jeff King
2015-04-17 22:44         ` Karsten Blees
2015-04-20 21:50           ` Junio C Hamano
2015-04-16 16:08   ` [PATCH] dir: allow a BOM at the beginning of exclude files Johannes Schindelin
2015-04-16 16:10 ` Torsten Bögershausen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.