git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
@ 2014-04-28 16:16 Jeff King
  2014-04-28 19:17 ` Junio C Hamano
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Jeff King @ 2014-04-28 16:16 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git

If you have existing decomposed filenames in your git
repository (e.g., that were created with older versions of
git that did not precompose unicode), a modern git with
core.precomposeunicode set does not handle them well.

The problem is that we normalize the paths coming from the
disk into their precomposed form, and then compare them
against the literal bytes in the index. This makes things
better if you have the precomposed form in the index. It
makes things worse if you actually have the decomposed form
in the index.

As a result, paths with decomposed filenames may have their
precomposed variants listed as untracked files (even though
the precomposed variants do not exist on-disk at all).

This patch just adds a test to demonstrate the breakage.
Some possible fixes are:

  1. Tell everyone that NFD in the git repo is wrong, and
     they should make a new commit to normalize all their
     in-repo files to be precomposed.

     This is probably not the right thing to do, because it
     still doesn't fix checkouts of old history. And it
     spreads the problem to people on byte-preserving
     filesystems (like ext4), because now they have to start
     precomposing their filenames as they are adde to git.

  2. Do all index filename comparisons using a UTF-8 aware
     comparison function when core.precomposeunicode is set.
     This would probably have bad performance, and somewhat
     defeats the point of converting the filenames at the
     readdir level in the first place.

  3. Convert index filenames to their precomposed form when
     we read the index from disk. This would be efficient,
     but we would have to be careful not to write the
     precomposed forms back out to disk.

  4. Introduce some infrastructure to efficiently match up
     the precomposed/decomposed forms. We already do
     something similar for case-insensitive files using
     name-hash.c. We might be able to adapt that strategy
     here.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t3910-mac-os-precompose.sh | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh
index e4ba601..23aa61e 100755
--- a/t/t3910-mac-os-precompose.sh
+++ b/t/t3910-mac-os-precompose.sh
@@ -140,6 +140,16 @@ test_expect_success "Add long precomposed filename" '
 	git add * &&
 	git commit -m "Long filename"
 '
+
+test_expect_failure 'handle existing decomposed filenames' '
+	echo content >"verbatim.$Adiarnfd" &&
+	git -c core.precomposeunicode=false add "verbatim.$Adiarnfd" &&
+	git commit -m "existing decomposed file" &&
+	>expect &&
+	git ls-files --exclude-standard -o "verbatim*" >untracked &&
+	test_cmp expect untracked
+'
+
 # Test if the global core.precomposeunicode stops autosensing
 # Must be the last test case
 test_expect_success "respect git config --global core.precomposeunicode" '
-- 
1.9.1.656.ge8a0637

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-04-28 16:16 [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames Jeff King
@ 2014-04-28 19:17 ` Junio C Hamano
  2014-04-28 19:35   ` Jeff King
  2014-04-29 17:12 ` Junio C Hamano
  2014-05-04  6:13 ` Torsten Bögershausen
  2 siblings, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2014-04-28 19:17 UTC (permalink / raw)
  To: Jeff King; +Cc: Torsten Bögershausen, git

Jeff King <peff@peff.net> writes:

> This patch just adds a test to demonstrate the breakage.
> Some possible fixes are:
> ...
>   2. Do all index filename comparisons using a UTF-8 aware
>      comparison function when core.precomposeunicode is set.
>      This would probably have bad performance, and somewhat
>      defeats the point of converting the filenames at the
>      readdir level in the first place.

As we do this only when core.precomposeunicode is set, projects that
use pathnames encoded not in UTF-8 (e.g. 8859-1, EUC, etc.) will not
be affected by getting their path mangled, as long as we won't flip
the default to true (which I am not suggesting to do, by the way).

>   3. Convert index filenames to their precomposed form when
>      we read the index from disk. This would be efficient,
>      but we would have to be careful not to write the
>      precomposed forms back out to disk.

I think this may be the right approach, especially if you are going
to do this only when core.precomposeunicode is set.

the reasoning behind "we would have to be careful not to write"
part, is unclear to me, though.  Don't decomposing filesystems
perform the manglig from the precomposed form without even being
asked to do so, just like a case insensitive filesystem will
overwrite an existing "makefile" on a request to write to
"Makefile"?

>   4. Introduce some infrastructure to efficiently match up
>      the precomposed/decomposed forms. We already do
>      something similar for case-insensitive files using
>      name-hash.c. We might be able to adapt that strategy
>      here.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  t/t3910-mac-os-precompose.sh | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh
> index e4ba601..23aa61e 100755
> --- a/t/t3910-mac-os-precompose.sh
> +++ b/t/t3910-mac-os-precompose.sh
> @@ -140,6 +140,16 @@ test_expect_success "Add long precomposed filename" '
>  	git add * &&
>  	git commit -m "Long filename"
>  '
> +
> +test_expect_failure 'handle existing decomposed filenames' '
> +	echo content >"verbatim.$Adiarnfd" &&
> +	git -c core.precomposeunicode=false add "verbatim.$Adiarnfd" &&
> +	git commit -m "existing decomposed file" &&
> +	>expect &&
> +	git ls-files --exclude-standard -o "verbatim*" >untracked &&
> +	test_cmp expect untracked
> +'
> +
>  # Test if the global core.precomposeunicode stops autosensing
>  # Must be the last test case
>  test_expect_success "respect git config --global core.precomposeunicode" '

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-04-28 19:17 ` Junio C Hamano
@ 2014-04-28 19:35   ` Jeff King
  2014-04-28 19:52     ` Torsten Bögershausen
  2014-04-29  3:15     ` Jeff King
  0 siblings, 2 replies; 19+ messages in thread
From: Jeff King @ 2014-04-28 19:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Torsten Bögershausen, git

On Mon, Apr 28, 2014 at 12:17:28PM -0700, Junio C Hamano wrote:

> >   3. Convert index filenames to their precomposed form when
> >      we read the index from disk. This would be efficient,
> >      but we would have to be careful not to write the
> >      precomposed forms back out to disk.
> 
> I think this may be the right approach, especially if you are going
> to do this only when core.precomposeunicode is set.
> 
> the reasoning behind "we would have to be careful not to write"
> part, is unclear to me, though.  Don't decomposing filesystems
> perform the manglig from the precomposed form without even being
> asked to do so, just like a case insensitive filesystem will
> overwrite an existing "makefile" on a request to write to
> "Makefile"?

Sorry, I meant "do not write the precomposed forms back out to the
on-disk index". And by extension, do not update cache-tree and write
them out to git trees.

IOW, it is not enough to just set cache_entry->name to the normalized
form. You'd need to store both.

Since such entries are in the minority, and because cache_entry is
already a variable-length struct, I think you could get away with
sticking it after the "name" field, and then comparing like:

  const char *ce_normalized_name(struct cache_entry *ce, size_t *len)
  {
	const char *ret;

	/* Normal, fast path */
	if (!(ce->ce_flags & CE_NORMALIZED_NAME)) {
		len = ce_namelen(ce);
		return ce->name;
	}

	/* Slow path for normalized names */
	ret = ce->name + ce->namelen + 1;
	*len = strlen(name);
	return ret;
  }

The strlen is probably OK since such paths are presumably in the
minority (even for UTF-8 paths, we can avoid storing the extra copy if
they do not need any normalization). Or we could get fancy and encode
the length in front, but I am not sure it is worth the complexity.

Anyway, the tricky part is then making sure that all cache_entry name
comparisons use ce_normalized_name instead of ce->name.

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-04-28 19:35   ` Jeff King
@ 2014-04-28 19:52     ` Torsten Bögershausen
  2014-04-28 20:03       ` Jeff King
  2014-04-29  3:15     ` Jeff King
  1 sibling, 1 reply; 19+ messages in thread
From: Torsten Bögershausen @ 2014-04-28 19:52 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano; +Cc: git


On 28.04.14 21:35, Jeff King wrote:
> On Mon, Apr 28, 2014 at 12:17:28PM -0700, Junio C Hamano wrote:
>
>>>   3. Convert index filenames to their precomposed form when
>>>      we read the index from disk. This would be efficient,
>>>      but we would have to be careful not to write the
>>>      precomposed forms back out to disk.
>> I think this may be the right approach, especially if you are going
>> to do this only when core.precomposeunicode is set.
>>
>> the reasoning behind "we would have to be careful not to write"
>> part, is unclear to me, though.  Don't decomposing filesystems
>> perform the manglig from the precomposed form without even being
>> asked to do so, just like a case insensitive filesystem will
>> overwrite an existing "makefile" on a request to write to
>> "Makefile"?
> Sorry, I meant "do not write the precomposed forms back out to the
> on-disk index". And by extension, do not update cache-tree and write
> them out to git trees.
>
> IOW, it is not enough to just set cache_entry->name to the normalized
> form. You'd need to store both.
>
> Since such entries are in the minority, and because cache_entry is
> already a variable-length struct, I think you could get away with
> sticking it after the "name" field, and then comparing like:
>
>   const char *ce_normalized_name(struct cache_entry *ce, size_t *len)
>   {
> 	const char *ret;
>
> 	/* Normal, fast path */
> 	if (!(ce->ce_flags & CE_NORMALIZED_NAME)) {
> 		len = ce_namelen(ce);
> 		return ce->name;
> 	}
>
> 	/* Slow path for normalized names */
> 	ret = ce->name + ce->namelen + 1;
> 	*len = strlen(name);
> 	return ret;
>   }
>
> The strlen is probably OK since such paths are presumably in the
> minority (even for UTF-8 paths, we can avoid storing the extra copy if
> they do not need any normalization). Or we could get fancy and encode
> the length in front, but I am not sure it is worth the complexity.
>
> Anyway, the tricky part is then making sure that all cache_entry name
> comparisons use ce_normalized_name instead of ce->name.
>
> -Peff
To my knowledge repos with decomposed unicode should be rare in practice.
I only can speak for european (or latin based) or cyrillic languages myself:

- It is difficult (but not impossible) to enter decomposed unicode on the keyboard.
- Some programs under Mac OS X do not handle decomposed code points well,
  an "ä" may be displayed as "¨a" for example.
- Pushing and pulling to Windows or Linux is possible, but the same problems here:
  the keyboard is not prepared to enter the decomposed form, and the display may be wrong.

The only possible use case for decomposed unicode I am aware of is when you use git-bzr,
because bzr does not do the precomposition (and neither hg to my knowledge).

So for me the test case could sense, even if I think that nobody (TM) uses an old Git version
under Mac OS X which is not able to handle precomposed unicode.

Unless I have missed something.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-04-28 19:52     ` Torsten Bögershausen
@ 2014-04-28 20:03       ` Jeff King
  2014-04-28 20:49         ` Torsten Bögershausen
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2014-04-28 20:03 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: Junio C Hamano, git

On Mon, Apr 28, 2014 at 09:52:07PM +0200, Torsten Bögershausen wrote:

> To my knowledge repos with decomposed unicode should be rare in
> practice.  I only can speak for european (or latin based) or cyrillic
> languages myself:

I've run across several cases in the past few months, but only just
figured out what was going on. Most were tickets to GitHub support, but
we actually have such a case in our github/github repository. In most
cases, I think they were created on older versions of git on OS X,
either before core.precomposeunicode existed, or before it was turned on
by default. The decomposed form got baked into the tree (whatever the
user originally typed, git probably found out about it via "git add .").

I think reports are just coming in now because we didn't start turning
on core.precomposeunicode by default until v1.8.5, shipped in November.
And then, a person working on the repository would not notice anything,
since we only set the flag during clone. So it took time for people to
upgrade _and_ to make fresh clones.

> So for me the test case could sense, even if I think that nobody (TM)
> uses an old Git version under Mac OS X which is not able to handle
> precomposed unicode.

Even when they do not, the decomposed values are baked into history from
those old versions. So it is a matter of history created with older
versions not interacting well with newer versions.

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-04-28 20:03       ` Jeff King
@ 2014-04-28 20:49         ` Torsten Bögershausen
  2014-04-29  3:23           ` Jeff King
  0 siblings, 1 reply; 19+ messages in thread
From: Torsten Bögershausen @ 2014-04-28 20:49 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

On 2014-04-28 22.03, Jeff King wrote:
> On Mon, Apr 28, 2014 at 09:52:07PM +0200, Torsten Bögershausen wrote:
>
>> To my knowledge repos with decomposed unicode should be rare in
>> practice.  I only can speak for european (or latin based) or cyrillic
>> languages myself:
> I've run across several cases in the past few months, but only just
> figured out what was going on. Most were tickets to GitHub support, but
> we actually have such a case in our github/github repository. In most
> cases, I think they were created on older versions of git on OS X,
> either before core.precomposeunicode existed, or before it was turned on
> by default. The decomposed form got baked into the tree (whatever the
> user originally typed, git probably found out about it via "git add .").
>
> I think reports are just coming in now because we didn't start turning
> on core.precomposeunicode by default until v1.8.5, shipped in November.
> And then, a person working on the repository would not notice anything,
> since we only set the flag during clone. So it took time for people to
> upgrade _and_ to make fresh clones.
OK, thanks for the description.
In theory we can make Git "composition ignoring" by changing
index_file_exists() in name-hash.c.
(Both names must be precomposed first and compared then)

I don't know how much people are using Git before 1.7.12 (the
first version supporting precomposed unicode).

Could we simply ask them to upgrade ?
The next problem is that people need to agree if the repo should store
names in pre- or decomposed form.
(My voice is for precomposed)
Unfortunatly the core.precomposeunicode is repo-local, so everybody
needs to "agree globally" and "configure locally".

Side note:
I which we had this config variable travelling with the repo, like .gitattributes does
for text dealing with CRLF-LF.

I don't know how many reports you have, reading all this it feels as if the effected users
could "normalize" their repos and run "git config core.precomposeunicode true", followed
by "git config --global core.precomposeunicode true".
Does that sound like a possible way forward ?
>> So for me the test case could sense, even if I think that nobody (TM)
>> uses an old Git version under Mac OS X which is not able to handle
>> precomposed unicode.
> Even when they do not, the decomposed values are baked into history from
> those old versions. So it is a matter of history created with older
> versions not interacting well with newer versions.
I'm not sure if I understood all the details here, but I would be happy to help
with suggestions/tests/reviews.

> -Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-04-28 19:35   ` Jeff King
  2014-04-28 19:52     ` Torsten Bögershausen
@ 2014-04-29  3:15     ` Jeff King
  1 sibling, 0 replies; 19+ messages in thread
From: Jeff King @ 2014-04-29  3:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Torsten Bögershausen, git

On Mon, Apr 28, 2014 at 03:35:02PM -0400, Jeff King wrote:

> Since such entries are in the minority, and because cache_entry is
> already a variable-length struct, I think you could get away with
> sticking it after the "name" field, and then comparing like:
> 
>   const char *ce_normalized_name(struct cache_entry *ce, size_t *len)
>   {
> 	const char *ret;
> 
> 	/* Normal, fast path */
> 	if (!(ce->ce_flags & CE_NORMALIZED_NAME)) {
> 		len = ce_namelen(ce);
> 		return ce->name;
> 	}
> 
> 	/* Slow path for normalized names */
> 	ret = ce->name + ce->namelen + 1;
> 	*len = strlen(name);
> 	return ret;
>   }

That's the reading half. We would also need to create the normalized
names for each cache_entry. I took a look at that this afternoon. It
turns out we make cache_entry structs in quite a few places.  So I
thought I'd start with converting them all to a function like:

  struct cache_entry *cache_entry_alloc(const char *name, size_t len);

And then once converted, we could teach it to normalize the name as
appropriate. That interface does improve many of the callers, but there
are a few tricky ones.

For example, in checkout.c:update_some (and one or two other spots), we
actually have the path broken into two parts, and we combine them while
writing into the cache_entry. We could obviously combine them into a
single buffer beforehand, but that means extra copying in reasonably hot
code paths. It would be slightly ugly but perhaps reasonable to have:

  cache_entry_alloc_two(const char *one, size_t one_len,
                        const char *two, size_t two_len);

But then I got to unpack-trees. It has the whole path broken down across
a linked list.  I'm not sure what would be least terrible interface
here. Again, we could format to a buffer and copy, but I'm hesitant to
do it on this code path.

I'm not sure if I'm just being paranoid about the extra memcpys (it's
not _that_ tight a loop, since after all, we are generally zlib
inflating to get the tree data, and the filenames are not all _that_
long).

I dunno. I just hate the idea of tradeoffs for this OS-X-only fix
permeating the rest of the code on other platforms. But maybe Knuth
should be hitting me with his premature optimization clue-stick.

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-04-28 20:49         ` Torsten Bögershausen
@ 2014-04-29  3:23           ` Jeff King
  2014-04-29  7:39             ` Torsten Bögershausen
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2014-04-29  3:23 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: Junio C Hamano, git

On Mon, Apr 28, 2014 at 10:49:30PM +0200, Torsten Bögershausen wrote:

> OK, thanks for the description.
> In theory we can make Git "composition ignoring" by changing
> index_file_exists() in name-hash.c.
> (Both names must be precomposed first and compared then)

Yeah, we could perhaps get away without storing the extra precomposed
form if we just stored the entries under their precomposed hash, and
then taught same_name to use a slower precompose-aware comparison. But I
also see that we do binary searches in index_name_pos (called by
index_name_is_other). I don't know if we'd have to deal with this
problem there, too.

> I don't know how much people are using Git before 1.7.12 (the
> first version supporting precomposed unicode).
> 
> Could we simply ask them to upgrade ?

I'm not sure what you mean here. Upgrading won't help, because the
values are baked into existing history created with the old versions
forever. Any time I "git checkout v1.0" on the broken project, a modern
git will complain about the ghost untracked file.

> The next problem is that people need to agree if the repo should store
> names in pre- or decomposed form.
> (My voice is for precomposed)
> Unfortunatly the core.precomposeunicode is repo-local, so everybody
> needs to "agree globally" and "configure locally".

Yeah, that was sort of my "point 1" from the patch. I'm a bit worried
that it creates problems for people on other systems, though. Linux
people do not need to care about precomposed/decomposed at all, and any
attempt we make to automatically handle it is going to run afoul of
non-utf8 encodings. Not to mention that it does not solve the "git
checkout v1.0" problem above.

> Side note:
> I which we had this config variable travelling with the repo, like .gitattributes does
> for text dealing with CRLF-LF.

Yeah, I guess if we wanted to enforce it everywhere, you would have to
use .gitattributes since we cannot safely turn on such a feature by
default.

> I don't know how many reports you have, reading all this it feels as if the effected users
> could "normalize" their repos and run "git config core.precomposeunicode true", followed
> by "git config --global core.precomposeunicode true".
> Does that sound like a possible way forward ?

I have just a handful of reports. Maybe 3-4? I didn't dig them all up
today, as it would be a non-trivial effort. But I have no idea how good
a sampling that is.

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-04-29  3:23           ` Jeff King
@ 2014-04-29  7:39             ` Torsten Bögershausen
  0 siblings, 0 replies; 19+ messages in thread
From: Torsten Bögershausen @ 2014-04-29  7:39 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

On 04/29/2014 05:23 AM, Jeff King wrote:
> On Mon, Apr 28, 2014 at 10:49:30PM +0200, Torsten Bögershausen wrote:
>
>> OK, thanks for the description.
>> In theory we can make Git "composition ignoring" by changing
>> index_file_exists() in name-hash.c.
>> (Both names must be precomposed first and compared then)
> Yeah, we could perhaps get away without storing the extra precomposed
> form if we just stored the entries under their precomposed hash, and
> then taught same_name to use a slower precompose-aware comparison. But I
> also see that we do binary searches in index_name_pos (called by
> index_name_is_other). I don't know if we'd have to deal with this
> problem there, too.
Just loud thinking:
We precompose whenever we read file names from disc, that's done in 
readdir()
We precompose whenever we get an argv into Git, that's done in 
precompose_argv()
We precompose every time we read file names from the index file on 
disc(?) into memory.
That we don't do today, and my suggestion to hack name-hash.c isn't a 
good one.

Probably we need to go into read-cache.c, or more places. I'm not an 
expert here knowing
all Git internal details.
Basically all places where strings containing file names are put into 
memory are effected,
and I wouldn't be too concerned about CPU cycles.

>> I don't know how much people are using Git before 1.7.12 (the
>> first version supporting precomposed unicode).
>>
>> Could we simply ask them to upgrade ?
> I'm not sure what you mean here. Upgrading won't help, because the
> values are baked into existing history created with the old versions
> forever. Any time I "git checkout v1.0" on the broken project, a modern
> git will complain about the ghost untracked file.
It depends if all file names in a certain repo are stored decomposed,
(in this case everybody can set core.precomposeunicode false)
or if there is a mixture having precomposed and decomposed
in different comits/directories...
In this case we can normalize the master branch.
For older commit the users need to wait for an updated Git version,
until that they need to live with the ghosts as good as they can.

>
>> The next problem is that people need to agree if the repo should store
>> names in pre- or decomposed form.
>> (My voice is for precomposed)
>> Unfortunatly the core.precomposeunicode is repo-local, so everybody
>> needs to "agree globally" and "configure locally".
> Yeah, that was sort of my "point 1" from the patch. I'm a bit worried
> that it creates problems for people on other systems, though. Linux
> people do not need to care about precomposed/decomposed at all, and any
> attempt we make to automatically handle it is going to run afoul of
> non-utf8 encodings. Not to mention that it does not solve the "git
> checkout v1.0" problem above.
Not sure what is meant by non-utf8 encodings.
Mac OS X can only handle Unicode filenames,
and a single ISO-8859-1 will be returned as "%XY" from readdir().
So if you want to share a repo with Mac OS X (and/or Windows)
Unicode should be used.
Are you saying that there is a Linux station feeding in file names in  
e.g. 8859-1, EUC ?
My experience/knowledge is that you can not do that (in a useful way).


>> Side note:
>> I which we had this config variable travelling with the repo, like .gitattributes does
>> for text dealing with CRLF-LF.
> Yeah, I guess if we wanted to enforce it everywhere, you would have to
> use .gitattributes since we cannot safely turn on such a feature by
> default.
>
>> I don't know how many reports you have, reading all this it feels as if the effected users
>> could "normalize" their repos and run "git config core.precomposeunicode true", followed
>> by "git config --global core.precomposeunicode true".
>> Does that sound like a possible way forward ?
> I have just a handful of reports. Maybe 3-4? I didn't dig them all up
> today, as it would be a non-trivial effort. But I have no idea how good
> a sampling that is.
The following could help, may be:
git -c core.quotepath=false ls-files | iconv -f UTF-8-MAC -t UTF-8 >expected
git -c core.quotepath=false ls-files >actual
diff expected actual
>
> -Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-04-28 16:16 [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames Jeff King
  2014-04-28 19:17 ` Junio C Hamano
@ 2014-04-29 17:12 ` Junio C Hamano
  2014-04-29 18:02   ` Jeff King
  2014-05-04  6:13 ` Torsten Bögershausen
  2 siblings, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2014-04-29 17:12 UTC (permalink / raw)
  To: Jeff King; +Cc: Torsten Bögershausen, git

Jeff King <peff@peff.net> writes:

> This patch just adds a test to demonstrate the breakage.
> Some possible fixes are:
>
>   1. Tell everyone that NFD in the git repo is wrong, and
>      they should make a new commit to normalize all their
>      in-repo files to be precomposed.
>
>      This is probably not the right thing to do, because it
>      still doesn't fix checkouts of old history. And it
>      spreads the problem to people on byte-preserving
>      filesystems (like ext4), because now they have to start
>      precomposing their filenames as they are adde to git.

Hmm, have we taught the "compare precomposed" for codepaths that
compare two trees and a tree and the index, too?  Otherwise, we
would have the same issue with commits in the old history.

Do we have a similar issue for older commit in a history under
"ignore-case" as well?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-04-29 17:12 ` Junio C Hamano
@ 2014-04-29 18:02   ` Jeff King
  2014-04-29 18:49     ` Junio C Hamano
  2014-04-30 14:57     ` Torsten Bögershausen
  0 siblings, 2 replies; 19+ messages in thread
From: Jeff King @ 2014-04-29 18:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Torsten Bögershausen, git

On Tue, Apr 29, 2014 at 10:12:52AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > This patch just adds a test to demonstrate the breakage.
> > Some possible fixes are:
> >
> >   1. Tell everyone that NFD in the git repo is wrong, and
> >      they should make a new commit to normalize all their
> >      in-repo files to be precomposed.
> >
> >      This is probably not the right thing to do, because it
> >      still doesn't fix checkouts of old history. And it
> >      spreads the problem to people on byte-preserving
> >      filesystems (like ext4), because now they have to start
> >      precomposing their filenames as they are adde to git.
> 
> Hmm, have we taught the "compare precomposed" for codepaths that
> compare two trees and a tree and the index, too?  Otherwise, we
> would have the same issue with commits in the old history.

Ugh, yeah, I didn't think about that codepath. I think we would not want
to precompose in that case. IOW, git works byte-wise internally, but it
is only at the filesystem layer that we do such munging. The index
straddles the line between the filesystem and git's internal
representations.

I think my "keep the normalized names alongside index entries" approach
might still work there. But it means that we compare against the "real"
byte-wise names on the tree side, and against the normalized names on
the path side. But that means having two comparison/lookup functions for
the index, and always using the right one. And algorithms that rely on
traversing two sorted lists cannot work in both directions.

> Do we have a similar issue for older commit in a history under
> "ignore-case" as well?

I don't think so, because we handle ignorecase completely differently.
There we use the name-hash with a case-insensitive hash and a
case-insensitive comparison function. And we use strcasecmp liberally
throughout the code.

I don't think we have a "str_utf8_cmp" that ignores normalizations (or
maybe strcoll will do this?). But in theory we could use it everywhere
we use strcasecmp for ignore_case. And then we would not need to have
our readdir wrapper, maybe? I admit I haven't thought that much about
_either_ approach. But aside from some bugs in the hash system, I do not
recall seeing any design problems in the ignorecase code.

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-04-29 18:02   ` Jeff King
@ 2014-04-29 18:49     ` Junio C Hamano
  2014-04-29 19:46       ` Jeff King
  2014-04-30 14:57     ` Torsten Bögershausen
  1 sibling, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2014-04-29 18:49 UTC (permalink / raw)
  To: Jeff King; +Cc: Torsten Bögershausen, git

Jeff King <peff@peff.net> writes:

> I don't think we have a "str_utf8_cmp" that ignores normalizations (or
> maybe strcoll will do this?). But in theory we could use it everywhere
> we use strcasecmp for ignore_case. And then we would not need to have
> our readdir wrapper, maybe? I admit I haven't thought that much about
> _either_ approach. But aside from some bugs in the hash system, I do not
> recall seeing any design problems in the ignorecase code.

Our diffs and merges depend on walking two (or more) sorted lists,
and that sort order is baked in the tree objects when they are
created.  Using "normalized comparison" only when comparing the
earliest elements picked from these sorted lists would not give you
the correct comparison or merge results, would it?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-04-29 18:49     ` Junio C Hamano
@ 2014-04-29 19:46       ` Jeff King
  0 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2014-04-29 19:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Torsten Bögershausen, git

On Tue, Apr 29, 2014 at 11:49:30AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I don't think we have a "str_utf8_cmp" that ignores normalizations (or
> > maybe strcoll will do this?). But in theory we could use it everywhere
> > we use strcasecmp for ignore_case. And then we would not need to have
> > our readdir wrapper, maybe? I admit I haven't thought that much about
> > _either_ approach. But aside from some bugs in the hash system, I do not
> > recall seeing any design problems in the ignorecase code.
> 
> Our diffs and merges depend on walking two (or more) sorted lists,
> and that sort order is baked in the tree objects when they are
> created.  Using "normalized comparison" only when comparing the
> earliest elements picked from these sorted lists would not give you
> the correct comparison or merge results, would it?

Right, but we do not do normalized comparisons on that side. Not for
precompose, and not for ignorecase. The entry in the index _is_
case-sensitive[1], and we compare it as such to the tree side.

It is only when comparing the filesystem side to the index that we need
to care about such normalizations. And there we have the name-hash code
to handle ignorecase, but nothing to handle precompose.

-Peff

[1] This works because you typically get the case-sensitive entry via
    `git read-tree`, and then further update it from the filesystem. If
    you were to add a new entry "makefile", and somebody else added
    "Makefile", they would conflict.

    When you do "git add makefile" and "Makefile" _does_ exist, I am not
    sure, though, if it is git who handles making sure we find the
    correct entry, or if it is simply the fact that case insensitive
    filesystems are typically case-preserving (so you generally ask for
    "Makefile" anyway). If it is the latter, then that is a problem for
    precompose. HFS's NFD normalization is non-preserving.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-04-29 18:02   ` Jeff King
  2014-04-29 18:49     ` Junio C Hamano
@ 2014-04-30 14:57     ` Torsten Bögershausen
  2014-05-04 12:04       ` Torsten Bögershausen
  1 sibling, 1 reply; 19+ messages in thread
From: Torsten Bögershausen @ 2014-04-30 14:57 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano; +Cc: Torsten Bögershausen, git

On 29.04.14 20:02, Jeff King wrote:
> On Tue, Apr 29, 2014 at 10:12:52AM -0700, Junio C Hamano wrote:
> 
>> Jeff King <peff@peff.net> writes:
>>
>>> This patch just adds a test to demonstrate the breakage.
>>> Some possible fixes are:
>>>
>>>   1. Tell everyone that NFD in the git repo is wrong, and
>>>      they should make a new commit to normalize all their
>>>      in-repo files to be precomposed.
>>>
>>>      This is probably not the right thing to do, because it
>>>      still doesn't fix checkouts of old history. And it
>>>      spreads the problem to people on byte-preserving
>>>      filesystems (like ext4), because now they have to start
>>>      precomposing their filenames as they are adde to git.
>>
>> Hmm, have we taught the "compare precomposed" for codepaths that
>> compare two trees and a tree and the index, too?  Otherwise, we
>> would have the same issue with commits in the old history.
> 
> Ugh, yeah, I didn't think about that codepath. I think we would not want
> to precompose in that case. IOW, git works byte-wise internally, but it
> is only at the filesystem layer that we do such munging. The index
> straddles the line between the filesystem and git's internal
> representations.
>
[snip]
Please allow me to answer on this post-
I made a suggestion here:
https://github.com/tboegi/git/commit/85305ce306cb88a07dad6350d6ba8c5f2f817af6

The new test in t3910 passes, but the test suite hangs somewhere, there is a whitespace
in precompose_utf8.c, so I don't know what to say.
But in case someone wants to make a code review:


commit 85305ce306cb88a07dad6350d6ba8c5f2f817af6
Author: Torsten Bögershausen <tboegi@web.de>
Date:   Wed Apr 30 10:30:04 2014 +0200

    core.precomposeunicode with decomposed filenames
    
    Commit 750b2e4785e shows a failure of core.precomposeunicode
    when decomposed filenames are in the index.
    
    When decomposed file names are in the index and readdir()
    converts them into the decomposed form, "Git status" will report
    the precomposed file name as untracked.
    
    Solution:
    Precompose file names when reading the index file from disc into memory.

diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c
index 95fe849..40ebc2e 100644
--- a/compat/precompose_utf8.c
+++ b/compat/precompose_utf8.c
@@ -57,6 +57,19 @@ void probe_utf8_pathname_composition(char *path, int len)
 }
 
 
+char *precompose_str_len(const char *in, size_t insz, int *outsz)
+{
+	char *prec_str = NULL;
+	if (precomposed_unicode != 1)
+		return NULL;
+
+	if (has_non_ascii(in, insz, NULL))
+		prec_str = reencode_string_len(in, insz, repo_encoding, path_encoding, outsz);
+
+	return prec_str;
+}
+
+
 void precompose_argv(int argc, const char **argv)
 {
 	int i = 0;
diff --git a/compat/precompose_utf8.h b/compat/precompose_utf8.h
index 3b73585..28f6595 100644
--- a/compat/precompose_utf8.h
+++ b/compat/precompose_utf8.h
@@ -26,6 +26,7 @@ typedef struct {
 	struct dirent_prec_psx *dirent_nfc;
 } PREC_DIR;
 
+char *precompose_str_len(const char *in, size_t insz, int *outsz);
 void precompose_argv(int argc, const char **argv);
 void probe_utf8_pathname_composition(char *, int);
 
diff --git a/git-compat-util.h b/git-compat-util.h
index d493a8c..de117d1 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -180,7 +180,7 @@ typedef unsigned long uintptr_t;
 #ifdef PRECOMPOSE_UNICODE
 #include "compat/precompose_utf8.h"
 #else
-#define precompose_str(in,i_nfd2nfc)
+#define precompose_str_len(s,i,o) NULL
 #define precompose_argv(c,v)
 #define probe_utf8_pathname_composition(a,b)
 #endif
diff --git a/read-cache.c b/read-cache.c
index 4b4effd..0887835 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1330,7 +1330,7 @@ static inline uint32_t ntoh_l_force_align(void *p)
 #define ntoh_l(var) ntoh_l_force_align(&(var))
 #endif
 
-static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *ondisk,
+static struct cache_entry *cache_entry_from_ondisk_int(struct ondisk_cache_entry *ondisk,
 						   unsigned int flags,
 						   const char *name,
 						   size_t len)
@@ -1355,6 +1355,22 @@ static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *on
 	return ce;
 }
 
+static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *ondisk,
+							 unsigned int flags,
+							 const char *name,
+							 size_t len)
+{
+	int prec_len;
+	char *prec_name = precompose_str_len(name, len, &prec_len);
+	if (prec_name) {
+		struct cache_entry *ce;
+		ce = cache_entry_from_ondisk_int(ondisk, flags, prec_name, prec_len);
+		free(prec_name);
+		return ce;
+	}
+	return cache_entry_from_ondisk_int(ondisk, flags, name, len);
+}
+
 /*
  * Adjacent cache entries tend to share the leading paths, so it makes
  * sense to only store the differences in later entries.  In the v4
diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh
index 23aa61e..d27c018 100755
--- a/t/t3910-mac-os-precompose.sh
+++ b/t/t3910-mac-os-precompose.sh
@@ -141,7 +141,7 @@ test_expect_success "Add long precomposed filename" '
 	git commit -m "Long filename"
 '
 
-test_expect_failure 'handle existing decomposed filenames' '
+test_expect_success 'handle existing decomposed filenames' '
 	echo content >"verbatim.$Adiarnfd" &&
 	git -c core.precomposeunicode=false add "verbatim.$Adiarnfd" &&
 	git commit -m "existing decomposed file" &&
 

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-04-28 16:16 [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames Jeff King
  2014-04-28 19:17 ` Junio C Hamano
  2014-04-29 17:12 ` Junio C Hamano
@ 2014-05-04  6:13 ` Torsten Bögershausen
  2014-05-05 21:46   ` Jeff King
  2 siblings, 1 reply; 19+ messages in thread
From: Torsten Bögershausen @ 2014-05-04  6:13 UTC (permalink / raw)
  To: Jeff King; +Cc: git

(Sorry for the late reply, I'm handicapped by some local IT problems)
Peff, do you know if the fix below helps ?

On 2014-04-28 18.16, Jeff King wrote:
> If you have existing decomposed filenames in your git
> repository (e.g., that were created with older versions of
> git that did not precompose unicode), a modern git with
> core.precomposeunicode set does not handle them well.
Yes.
> The problem is that we normalize the paths coming from the
> disk into their precomposed form, and then compare them
> against the literal bytes in the index. This makes things
> better if you have the precomposed form in the index. It
> makes things worse if you actually have the decomposed form
> in the index.
>
> As a result, paths with decomposed filenames may have their
> precomposed variants listed as untracked files (even though
> the precomposed variants do not exist on-disk at all).
Yes
> This patch just adds a test to demonstrate the breakage.
> Some possible fixes are:
>
>   1. Tell everyone that NFD in the git repo is wrong, and
>      they should make a new commit to normalize all their
>      in-repo files to be precomposed.
>      This is probably not the right thing to do, because it
>      still doesn't fix checkouts of old history. And it
>      spreads the problem to people on byte-preserving
>      filesystems (like ext4), because now they have to start
>      precomposing their filenames as they are adde to git.
     (typo:                                                                  ^added)
I'm not sure if I follow. People running ext4 (or Linux in general,
or Windows, or Unix) do not suffer from file system
"feature" of Mac OS, which accepts precomposed/decomposed Unicode
but returns decompomsed.
>
>   2. Do all index filename comparisons using a UTF-8 aware
>      comparison function when core.precomposeunicode is set.
>      This would probably have bad performance, and somewhat
>      defeats the point of converting the filenames at the
>      readdir level in the first place.
>
>   3. Convert index filenames to their precomposed form when
>      we read the index from disk. This would be efficient,
>      but we would have to be careful not to write the
>      precomposed forms back out to disk.
Question:
How could we be careful?
Mac OS writes always decomposed Unicode to disk.
(And all other OS tend to use precomposed forms, mainly because the "keyboard
driver" generates it.)
>
>   4. Introduce some infrastructure to efficiently match up
>      the precomposed/decomposed forms. We already do
>      something similar for case-insensitive files using
>      name-hash.c. We might be able to adapt that strategy
>      here.
--------------------------

This is my understanding:
Some possible fixes are:

  1. Accept that NFD in a Git repo which is shared between Mac OS
     and Linux or Windows is problematic.
     Whenever core.precomposeunicode = true, do the following:
     Let Git under Mac OS change all file names in the index
     into the precomposed form when a new commit is done.
     This is probably not a wrong thing to do.

     When the index file is read into memory, precompose the file names and compare
     them with the precomposed form coming from precompose_utf8_readdir().
     This avoids decomposed file names to be reported as untracked by "git status.

  2. Do all index filename comparisons under Mac OS X using a UTF-8 aware
     comparison function regardless if core.precomposeunicode is set.
     This would probably have bad performance, and somewhat
     defeats the point of converting the filenames at the
     readdir level in the first place.

(The TC is good)

>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  t/t3910-mac-os-precompose.sh | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh
> index e4ba601..23aa61e 100755
> --- a/t/t3910-mac-os-precompose.sh
> +++ b/t/t3910-mac-os-precompose.sh
> @@ -140,6 +140,16 @@ test_expect_success "Add long precomposed filename" '
>  	git add * &&
>  	git commit -m "Long filename"
>  '
> +
> +test_expect_failure 'handle existing decomposed filenames' '
> +	echo content >"verbatim.$Adiarnfd" &&
> +	git -c core.precomposeunicode=false add "verbatim.$Adiarnfd" &&
> +	git commit -m "existing decomposed file" &&
> +	>expect &&
> +	git ls-files --exclude-standard -o "verbatim*" >untracked &&
> +	test_cmp expect untracked
> +'
> +
>  # Test if the global core.precomposeunicode stops autosensing
>  # Must be the last test case
>  test_expect_success "respect git config --global core.precomposeunicode" '
On top of the we need to following:

diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c
index 95fe849..5ed3d17 100644
--- a/compat/precompose_utf8.c
+++ b/compat/precompose_utf8.c
@@ -57,6 +57,19 @@ void probe_utf8_pathname_composition(char *path, int len)
 }
 
 
+char *precompose_str_len(const char *in, size_t insz, int *outsz)
+{
+    char *prec_str = NULL;
+    if (precomposed_unicode != 1)
+        return NULL;
+
+    if (has_non_ascii(in, insz, NULL))
+        prec_str = reencode_string_len(in, insz, repo_encoding, path_encoding, outsz);
+
+    return prec_str;
+}
+
+
 void precompose_argv(int argc, const char **argv)
 {
     int i = 0;
diff --git a/compat/precompose_utf8.h b/compat/precompose_utf8.h
index 3b73585..28f6595 100644
--- a/compat/precompose_utf8.h
+++ b/compat/precompose_utf8.h
@@ -26,6 +26,7 @@ typedef struct {
     struct dirent_prec_psx *dirent_nfc;
 } PREC_DIR;
 
+char *precompose_str_len(const char *in, size_t insz, int *outsz);
 void precompose_argv(int argc, const char **argv);
 void probe_utf8_pathname_composition(char *, int);
 
diff --git a/git-compat-util.h b/git-compat-util.h
index d493a8c..de117d1 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -180,7 +180,7 @@ typedef unsigned long uintptr_t;
 #ifdef PRECOMPOSE_UNICODE
 #include "compat/precompose_utf8.h"
 #else
-#define precompose_str(in,i_nfd2nfc)
+#define precompose_str_len(s,i,o) NULL
 #define precompose_argv(c,v)
 #define probe_utf8_pathname_composition(a,b)
 #endif
diff --git a/read-cache.c b/read-cache.c
index 4b4effd..0887835 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1330,7 +1330,7 @@ static inline uint32_t ntoh_l_force_align(void *p)
 #define ntoh_l(var) ntoh_l_force_align(&(var))
 #endif
 
-static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *ondisk,
+static struct cache_entry *cache_entry_from_ondisk_int(struct ondisk_cache_entry *ondisk,
                            unsigned int flags,
                            const char *name,
                            size_t len)
@@ -1355,6 +1355,22 @@ static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *on
     return ce;
 }
 
+static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *ondisk,
+                             unsigned int flags,
+                             const char *name,
+                             size_t len)
+{
+    int prec_len;
+    char *prec_name = precompose_str_len(name, len, &prec_len);
+    if (prec_name) {
+        struct cache_entry *ce;
+        ce = cache_entry_from_ondisk_int(ondisk, flags, prec_name, prec_len);
+        free(prec_name);
+        return ce;
+    }
+    return cache_entry_from_ondisk_int(ondisk, flags, name, len);
+}
+
 /*
  * Adjacent cache entries tend to share the leading paths, so it makes
  * sense to only store the differences in later entries.  In the v4
diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh
index 23aa61e..2c3965b 100755
--- a/t/t3910-mac-os-precompose.sh
+++ b/t/t3910-mac-os-precompose.sh
@@ -141,13 +141,14 @@ test_expect_success "Add long precomposed filename" '
     git commit -m "Long filename"
 '
 
-test_expect_failure 'handle existing decomposed filenames' '
+test_expect_success 'handle existing decomposed filenames' '
     echo content >"verbatim.$Adiarnfd" &&
     git -c core.precomposeunicode=false add "verbatim.$Adiarnfd" &&
     git commit -m "existing decomposed file" &&
     >expect &&
     git ls-files --exclude-standard -o "verbatim*" >untracked &&
     test_cmp expect untracked
+    exit 0
 '
 
 # Test if the global core.precomposeunicode stops autosensing

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-04-30 14:57     ` Torsten Bögershausen
@ 2014-05-04 12:04       ` Torsten Bögershausen
  0 siblings, 0 replies; 19+ messages in thread
From: Torsten Bögershausen @ 2014-05-04 12:04 UTC (permalink / raw)
  To: Torsten Bögershausen, Jeff King, Junio C Hamano; +Cc: git

On 2014-04-30 16.57, Torsten Bögershausen wrote:
There is something wrong with the patch, (test suite hangs or TC fail),
so I need to com back later. Sorry for the noise.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-05-04  6:13 ` Torsten Bögershausen
@ 2014-05-05 21:46   ` Jeff King
  2014-05-06 10:11     ` Erik Faye-Lund
  2014-05-07 19:16     ` Torsten Bögershausen
  0 siblings, 2 replies; 19+ messages in thread
From: Jeff King @ 2014-05-05 21:46 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git

On Sun, May 04, 2014 at 08:13:15AM +0200, Torsten Bögershausen wrote:

> >   1. Tell everyone that NFD in the git repo is wrong, and
> >      they should make a new commit to normalize all their
> >      in-repo files to be precomposed.
> >      This is probably not the right thing to do, because it
> >      still doesn't fix checkouts of old history. And it
> >      spreads the problem to people on byte-preserving
> >      filesystems (like ext4), because now they have to start
> >      precomposing their filenames as they are adde to git.
>      (typo:                                                                  ^added)
> I'm not sure if I follow. People running ext4 (or Linux in general,
> or Windows, or Unix) do not suffer from file system
> "feature" of Mac OS, which accepts precomposed/decomposed Unicode
> but returns decompomsed.

What I mean by "spreads the problem" is that git on Linux does not need
to care about utf8 at all. It treats filenames as a byte sequence. But
if we were to start enforcing "filenames should be precomposed utf8",
then people adding files on Linux would want to enforce that, too.

People on Linux could ignore the issue as they do now, but they would
then create problems for OS X users if they add decomposed filenames.
IOW, if the OS X code assumes "all repo filenames are precomposed", then
other systems become a possible vector for violating that assumption.

> >   3. Convert index filenames to their precomposed form when
> >      we read the index from disk. This would be efficient,
> >      but we would have to be careful not to write the
> >      precomposed forms back out to disk.
> Question:
> How could we be careful?
> Mac OS writes always decomposed Unicode to disk.
> (And all other OS tend to use precomposed forms, mainly because the "keyboard
> driver" generates it.)

Sorry, I should have been more clear here. I meant "do not write index
entries using the precomposed forms out to the on-disk index". Because
that would mean that git silently converts your filenames, and it would
look like you have changes to commit whenever you read in a tree with a
decomposed name.

Looking over the patch you sent earlier, I suspect that is part of its
problem (it stores the converted name in the index entry's name field).

> This is my understanding:
> Some possible fixes are:
> 
>   1. Accept that NFD in a Git repo which is shared between Mac OS
>      and Linux or Windows is problematic.
>      Whenever core.precomposeunicode = true, do the following:
>      Let Git under Mac OS change all file names in the index
>      into the precomposed form when a new commit is done.
>      This is probably not a wrong thing to do.
> 
>      When the index file is read into memory, precompose the file names and compare
>      them with the precomposed form coming from precompose_utf8_readdir().
>      This avoids decomposed file names to be reported as untracked by "git status.

This is the case I was specifically thinking of above (and I think what
your patch is doing).

>   2. Do all index filename comparisons under Mac OS X using a UTF-8 aware
>      comparison function regardless if core.precomposeunicode is set.
>      This would probably have bad performance, and somewhat
>      defeats the point of converting the filenames at the
>      readdir level in the first place.

Right, I'm concerned about performance here, but I wonder if we can
reuse the name-hash solutions from ignorecase.

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-05-05 21:46   ` Jeff King
@ 2014-05-06 10:11     ` Erik Faye-Lund
  2014-05-07 19:16     ` Torsten Bögershausen
  1 sibling, 0 replies; 19+ messages in thread
From: Erik Faye-Lund @ 2014-05-06 10:11 UTC (permalink / raw)
  To: Jeff King; +Cc: Torsten Bögershausen, GIT Mailing-list

On Mon, May 5, 2014 at 11:46 PM, Jeff King <peff@peff.net> wrote:
> On Sun, May 04, 2014 at 08:13:15AM +0200, Torsten Bögershausen wrote:
>
>> >   1. Tell everyone that NFD in the git repo is wrong, and
>> >      they should make a new commit to normalize all their
>> >      in-repo files to be precomposed.
>> >      This is probably not the right thing to do, because it
>> >      still doesn't fix checkouts of old history. And it
>> >      spreads the problem to people on byte-preserving
>> >      filesystems (like ext4), because now they have to start
>> >      precomposing their filenames as they are adde to git.
>>      (typo:                                                                  ^added)
>> I'm not sure if I follow. People running ext4 (or Linux in general,
>> or Windows, or Unix) do not suffer from file system
>> "feature" of Mac OS, which accepts precomposed/decomposed Unicode
>> but returns decompomsed.
>
> What I mean by "spreads the problem" is that git on Linux does not need
> to care about utf8 at all. It treats filenames as a byte sequence. But
> if we were to start enforcing "filenames should be precomposed utf8",
> then people adding files on Linux would want to enforce that, too.
>
> People on Linux could ignore the issue as they do now, but they would
> then create problems for OS X users if they add decomposed filenames.
> IOW, if the OS X code assumes "all repo filenames are precomposed", then
> other systems become a possible vector for violating that assumption.

FWIW, Git for Windows also doesn't deal with that "filenames are just
a byte-sequence"-notion. We have patches in place that require
filenames to *either* be valid UTF-8 or Windows-1252. Windows itself
treats filenames as Unicode characters, not arbitrary byte sequences.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames
  2014-05-05 21:46   ` Jeff King
  2014-05-06 10:11     ` Erik Faye-Lund
@ 2014-05-07 19:16     ` Torsten Bögershausen
  1 sibling, 0 replies; 19+ messages in thread
From: Torsten Bögershausen @ 2014-05-07 19:16 UTC (permalink / raw)
  To: Jeff King, Torsten Bögershausen; +Cc: git

On 2014-05-05 23.46, Jeff King wrote:
[]
>>   2. Do all index filename comparisons under Mac OS X using a UTF-8 aware
>>      comparison function regardless if core.precomposeunicode is set.
>>      This would probably have bad performance, and somewhat
>>      defeats the point of converting the filenames at the
>>      readdir level in the first place.
> 
> Right, I'm concerned about performance here, but I wonder if we can
> reuse the name-hash solutions from ignorecase.
Some short question, (sorry being short)
What do you think about this:

diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c
index 95fe849..c807f38 100644
--- a/compat/precompose_utf8.c
+++ b/compat/precompose_utf8.c
@@ -57,6 +57,22 @@ void probe_utf8_pathname_composition(char *path, int len)
 }
 
 
+char *inverscompose_str_len(const char *in, size_t insz, int *outsz)
+{
+	char *prec_str = NULL;
+
+	if (has_non_ascii(in, insz, NULL)) {
+		int old_errno = errno;
+		prec_str = reencode_string_len(in, (int)insz,
+					       precomposed_unicode ? path_encoding : repo_encoding,
+					       precomposed_unicode ? repo_encoding : path_encoding,
+					       outsz);
+		errno = old_errno;
+	}
+	return prec_str;
+}
+
+
 void precompose_argv(int argc, const char **argv)
 {
 	int i = 0;
diff --git a/compat/precompose_utf8.h b/compat/precompose_utf8.h
index 3b73585..b9ac099 100644
--- a/compat/precompose_utf8.h
+++ b/compat/precompose_utf8.h
@@ -26,6 +26,7 @@ typedef struct {
 	struct dirent_prec_psx *dirent_nfc;
 } PREC_DIR;
 
+char *inverscompose_str_len(const char *in, size_t insz, int *outsz);
 void precompose_argv(int argc, const char **argv);
 void probe_utf8_pathname_composition(char *, int);
 
diff --git a/git-compat-util.h b/git-compat-util.h
index f6d3a46..5ae5f57 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -177,7 +177,7 @@ typedef unsigned long uintptr_t;
 #ifdef PRECOMPOSE_UNICODE
 #include "compat/precompose_utf8.h"
 #else
-#define precompose_str(in,i_nfd2nfc)
+#define inverscompose_str_len(s,i,o) NULL
 #define precompose_argv(c,v)
 #define probe_utf8_pathname_composition(a,b)
 #endif
diff --git a/name-hash.c b/name-hash.c
index 97444d0..3c4a4ee 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -210,7 +210,7 @@ struct cache_entry *index_dir_exists(struct index_state *istate, const char *nam
 	return NULL;
 }
 
-struct cache_entry *index_file_exists(struct index_state *istate, const char *name, int namelen, int icase)
+static struct cache_entry *index_file_exists_int(struct index_state *istate, const char *name, int namelen, int icase)
 {
 	struct cache_entry *ce;
 	struct hashmap_entry key;
@@ -227,6 +227,25 @@ struct cache_entry *index_file_exists(struct index_state *istate, const char *na
 	return NULL;
 }
 
+
+struct cache_entry *index_file_exists(struct index_state *istate, const char *name, int namelen, int icase)
+{
+	struct cache_entry *ce;
+	ce = index_file_exists_int(istate, name, namelen, icase);
+	if (ce)
+		return ce;
+	else {
+		int prec_len;
+		char *prec_name = inverscompose_str_len(name, namelen, &prec_len);
+		if (prec_name) {
+			ce = index_file_exists_int(istate, prec_name, prec_len, icase);
+			free(prec_name);
+		}
+	}
+	return ce;
+}
+
+
 void free_name_hash(struct index_state *istate)
 {
 	if (!istate->name_hash_initialized)
diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh
index e4ba601..4414658 100755
--- a/t/t3910-mac-os-precompose.sh
+++ b/t/t3910-mac-os-precompose.sh
@@ -140,6 +140,22 @@ test_expect_success "Add long precomposed filename" '
 	git add * &&
 	git commit -m "Long filename"
 '
+
+test_expect_success 'handle existing decomposed filenames' '
+	git checkout master &&
+	git reset --hard &&
+	git checkout -b exist_decomposed &&
+	echo content >"verbatim.$Adiarnfd" &&
+	git -c core.precomposeunicode=false add "verbatim.$Adiarnfd" &&
+	git commit -m "existing decomposed file" &&
+	echo \"verbatim.A\\314\\210\" >expect &&
+	git ls-files >actual &&
+	test_cmp expect actual &&
+	>expect &&
+	git status | grep verbatim || : >actual &&
+	test_cmp expect actual
+'
+
 # Test if the global core.precomposeunicode stops autosensing
 # Must be the last test case
 test_expect_success "respect git config --global core.precomposeunicode" '
diff --git a/utf8.c b/utf8.c
index 77c28d4..fb8a99a 100644
--- a/utf8.c
+++ b/utf8.c
@@ -519,7 +519,7 @@ char *reencode_string_iconv(const char *in, size_t insz, iconv_t conv, int *outs
 	char *out, *outpos;
 	iconv_ibp cp;
 
-	outsz = insz;
+	outsz = 3*insz; /* for decompose */
 	outalloc = outsz + 1; /* for terminating NUL */
 	out = xmalloc(outalloc);
 	outpos = out;

^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2014-05-07 19:16 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-28 16:16 [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames Jeff King
2014-04-28 19:17 ` Junio C Hamano
2014-04-28 19:35   ` Jeff King
2014-04-28 19:52     ` Torsten Bögershausen
2014-04-28 20:03       ` Jeff King
2014-04-28 20:49         ` Torsten Bögershausen
2014-04-29  3:23           ` Jeff King
2014-04-29  7:39             ` Torsten Bögershausen
2014-04-29  3:15     ` Jeff King
2014-04-29 17:12 ` Junio C Hamano
2014-04-29 18:02   ` Jeff King
2014-04-29 18:49     ` Junio C Hamano
2014-04-29 19:46       ` Jeff King
2014-04-30 14:57     ` Torsten Bögershausen
2014-05-04 12:04       ` Torsten Bögershausen
2014-05-04  6:13 ` Torsten Bögershausen
2014-05-05 21:46   ` Jeff King
2014-05-06 10:11     ` Erik Faye-Lund
2014-05-07 19:16     ` Torsten Bögershausen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).