* [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames @ 2014-04-28 16:16 Jeff King 2014-04-28 19:17 ` Junio C Hamano ` (2 more replies) 0 siblings, 3 replies; 19+ messages in thread From: Jeff King @ 2014-04-28 16:16 UTC (permalink / raw) To: Torsten Bögershausen; +Cc: git If you have existing decomposed filenames in your git repository (e.g., that were created with older versions of git that did not precompose unicode), a modern git with core.precomposeunicode set does not handle them well. The problem is that we normalize the paths coming from the disk into their precomposed form, and then compare them against the literal bytes in the index. This makes things better if you have the precomposed form in the index. It makes things worse if you actually have the decomposed form in the index. As a result, paths with decomposed filenames may have their precomposed variants listed as untracked files (even though the precomposed variants do not exist on-disk at all). This patch just adds a test to demonstrate the breakage. Some possible fixes are: 1. Tell everyone that NFD in the git repo is wrong, and they should make a new commit to normalize all their in-repo files to be precomposed. This is probably not the right thing to do, because it still doesn't fix checkouts of old history. And it spreads the problem to people on byte-preserving filesystems (like ext4), because now they have to start precomposing their filenames as they are adde to git. 2. Do all index filename comparisons using a UTF-8 aware comparison function when core.precomposeunicode is set. This would probably have bad performance, and somewhat defeats the point of converting the filenames at the readdir level in the first place. 3. Convert index filenames to their precomposed form when we read the index from disk. This would be efficient, but we would have to be careful not to write the precomposed forms back out to disk. 4. Introduce some infrastructure to efficiently match up the precomposed/decomposed forms. We already do something similar for case-insensitive files using name-hash.c. We might be able to adapt that strategy here. Signed-off-by: Jeff King <peff@peff.net> --- t/t3910-mac-os-precompose.sh | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh index e4ba601..23aa61e 100755 --- a/t/t3910-mac-os-precompose.sh +++ b/t/t3910-mac-os-precompose.sh @@ -140,6 +140,16 @@ test_expect_success "Add long precomposed filename" ' git add * && git commit -m "Long filename" ' + +test_expect_failure 'handle existing decomposed filenames' ' + echo content >"verbatim.$Adiarnfd" && + git -c core.precomposeunicode=false add "verbatim.$Adiarnfd" && + git commit -m "existing decomposed file" && + >expect && + git ls-files --exclude-standard -o "verbatim*" >untracked && + test_cmp expect untracked +' + # Test if the global core.precomposeunicode stops autosensing # Must be the last test case test_expect_success "respect git config --global core.precomposeunicode" ' -- 1.9.1.656.ge8a0637 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-04-28 16:16 [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames Jeff King @ 2014-04-28 19:17 ` Junio C Hamano 2014-04-28 19:35 ` Jeff King 2014-04-29 17:12 ` Junio C Hamano 2014-05-04 6:13 ` Torsten Bögershausen 2 siblings, 1 reply; 19+ messages in thread From: Junio C Hamano @ 2014-04-28 19:17 UTC (permalink / raw) To: Jeff King; +Cc: Torsten Bögershausen, git Jeff King <peff@peff.net> writes: > This patch just adds a test to demonstrate the breakage. > Some possible fixes are: > ... > 2. Do all index filename comparisons using a UTF-8 aware > comparison function when core.precomposeunicode is set. > This would probably have bad performance, and somewhat > defeats the point of converting the filenames at the > readdir level in the first place. As we do this only when core.precomposeunicode is set, projects that use pathnames encoded not in UTF-8 (e.g. 8859-1, EUC, etc.) will not be affected by getting their path mangled, as long as we won't flip the default to true (which I am not suggesting to do, by the way). > 3. Convert index filenames to their precomposed form when > we read the index from disk. This would be efficient, > but we would have to be careful not to write the > precomposed forms back out to disk. I think this may be the right approach, especially if you are going to do this only when core.precomposeunicode is set. the reasoning behind "we would have to be careful not to write" part, is unclear to me, though. Don't decomposing filesystems perform the manglig from the precomposed form without even being asked to do so, just like a case insensitive filesystem will overwrite an existing "makefile" on a request to write to "Makefile"? > 4. Introduce some infrastructure to efficiently match up > the precomposed/decomposed forms. We already do > something similar for case-insensitive files using > name-hash.c. We might be able to adapt that strategy > here. > > Signed-off-by: Jeff King <peff@peff.net> > --- > t/t3910-mac-os-precompose.sh | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh > index e4ba601..23aa61e 100755 > --- a/t/t3910-mac-os-precompose.sh > +++ b/t/t3910-mac-os-precompose.sh > @@ -140,6 +140,16 @@ test_expect_success "Add long precomposed filename" ' > git add * && > git commit -m "Long filename" > ' > + > +test_expect_failure 'handle existing decomposed filenames' ' > + echo content >"verbatim.$Adiarnfd" && > + git -c core.precomposeunicode=false add "verbatim.$Adiarnfd" && > + git commit -m "existing decomposed file" && > + >expect && > + git ls-files --exclude-standard -o "verbatim*" >untracked && > + test_cmp expect untracked > +' > + > # Test if the global core.precomposeunicode stops autosensing > # Must be the last test case > test_expect_success "respect git config --global core.precomposeunicode" ' ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-04-28 19:17 ` Junio C Hamano @ 2014-04-28 19:35 ` Jeff King 2014-04-28 19:52 ` Torsten Bögershausen 2014-04-29 3:15 ` Jeff King 0 siblings, 2 replies; 19+ messages in thread From: Jeff King @ 2014-04-28 19:35 UTC (permalink / raw) To: Junio C Hamano; +Cc: Torsten Bögershausen, git On Mon, Apr 28, 2014 at 12:17:28PM -0700, Junio C Hamano wrote: > > 3. Convert index filenames to their precomposed form when > > we read the index from disk. This would be efficient, > > but we would have to be careful not to write the > > precomposed forms back out to disk. > > I think this may be the right approach, especially if you are going > to do this only when core.precomposeunicode is set. > > the reasoning behind "we would have to be careful not to write" > part, is unclear to me, though. Don't decomposing filesystems > perform the manglig from the precomposed form without even being > asked to do so, just like a case insensitive filesystem will > overwrite an existing "makefile" on a request to write to > "Makefile"? Sorry, I meant "do not write the precomposed forms back out to the on-disk index". And by extension, do not update cache-tree and write them out to git trees. IOW, it is not enough to just set cache_entry->name to the normalized form. You'd need to store both. Since such entries are in the minority, and because cache_entry is already a variable-length struct, I think you could get away with sticking it after the "name" field, and then comparing like: const char *ce_normalized_name(struct cache_entry *ce, size_t *len) { const char *ret; /* Normal, fast path */ if (!(ce->ce_flags & CE_NORMALIZED_NAME)) { len = ce_namelen(ce); return ce->name; } /* Slow path for normalized names */ ret = ce->name + ce->namelen + 1; *len = strlen(name); return ret; } The strlen is probably OK since such paths are presumably in the minority (even for UTF-8 paths, we can avoid storing the extra copy if they do not need any normalization). Or we could get fancy and encode the length in front, but I am not sure it is worth the complexity. Anyway, the tricky part is then making sure that all cache_entry name comparisons use ce_normalized_name instead of ce->name. -Peff ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-04-28 19:35 ` Jeff King @ 2014-04-28 19:52 ` Torsten Bögershausen 2014-04-28 20:03 ` Jeff King 2014-04-29 3:15 ` Jeff King 1 sibling, 1 reply; 19+ messages in thread From: Torsten Bögershausen @ 2014-04-28 19:52 UTC (permalink / raw) To: Jeff King, Junio C Hamano; +Cc: git On 28.04.14 21:35, Jeff King wrote: > On Mon, Apr 28, 2014 at 12:17:28PM -0700, Junio C Hamano wrote: > >>> 3. Convert index filenames to their precomposed form when >>> we read the index from disk. This would be efficient, >>> but we would have to be careful not to write the >>> precomposed forms back out to disk. >> I think this may be the right approach, especially if you are going >> to do this only when core.precomposeunicode is set. >> >> the reasoning behind "we would have to be careful not to write" >> part, is unclear to me, though. Don't decomposing filesystems >> perform the manglig from the precomposed form without even being >> asked to do so, just like a case insensitive filesystem will >> overwrite an existing "makefile" on a request to write to >> "Makefile"? > Sorry, I meant "do not write the precomposed forms back out to the > on-disk index". And by extension, do not update cache-tree and write > them out to git trees. > > IOW, it is not enough to just set cache_entry->name to the normalized > form. You'd need to store both. > > Since such entries are in the minority, and because cache_entry is > already a variable-length struct, I think you could get away with > sticking it after the "name" field, and then comparing like: > > const char *ce_normalized_name(struct cache_entry *ce, size_t *len) > { > const char *ret; > > /* Normal, fast path */ > if (!(ce->ce_flags & CE_NORMALIZED_NAME)) { > len = ce_namelen(ce); > return ce->name; > } > > /* Slow path for normalized names */ > ret = ce->name + ce->namelen + 1; > *len = strlen(name); > return ret; > } > > The strlen is probably OK since such paths are presumably in the > minority (even for UTF-8 paths, we can avoid storing the extra copy if > they do not need any normalization). Or we could get fancy and encode > the length in front, but I am not sure it is worth the complexity. > > Anyway, the tricky part is then making sure that all cache_entry name > comparisons use ce_normalized_name instead of ce->name. > > -Peff To my knowledge repos with decomposed unicode should be rare in practice. I only can speak for european (or latin based) or cyrillic languages myself: - It is difficult (but not impossible) to enter decomposed unicode on the keyboard. - Some programs under Mac OS X do not handle decomposed code points well, an "ä" may be displayed as "¨a" for example. - Pushing and pulling to Windows or Linux is possible, but the same problems here: the keyboard is not prepared to enter the decomposed form, and the display may be wrong. The only possible use case for decomposed unicode I am aware of is when you use git-bzr, because bzr does not do the precomposition (and neither hg to my knowledge). So for me the test case could sense, even if I think that nobody (TM) uses an old Git version under Mac OS X which is not able to handle precomposed unicode. Unless I have missed something. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-04-28 19:52 ` Torsten Bögershausen @ 2014-04-28 20:03 ` Jeff King 2014-04-28 20:49 ` Torsten Bögershausen 0 siblings, 1 reply; 19+ messages in thread From: Jeff King @ 2014-04-28 20:03 UTC (permalink / raw) To: Torsten Bögershausen; +Cc: Junio C Hamano, git On Mon, Apr 28, 2014 at 09:52:07PM +0200, Torsten Bögershausen wrote: > To my knowledge repos with decomposed unicode should be rare in > practice. I only can speak for european (or latin based) or cyrillic > languages myself: I've run across several cases in the past few months, but only just figured out what was going on. Most were tickets to GitHub support, but we actually have such a case in our github/github repository. In most cases, I think they were created on older versions of git on OS X, either before core.precomposeunicode existed, or before it was turned on by default. The decomposed form got baked into the tree (whatever the user originally typed, git probably found out about it via "git add ."). I think reports are just coming in now because we didn't start turning on core.precomposeunicode by default until v1.8.5, shipped in November. And then, a person working on the repository would not notice anything, since we only set the flag during clone. So it took time for people to upgrade _and_ to make fresh clones. > So for me the test case could sense, even if I think that nobody (TM) > uses an old Git version under Mac OS X which is not able to handle > precomposed unicode. Even when they do not, the decomposed values are baked into history from those old versions. So it is a matter of history created with older versions not interacting well with newer versions. -Peff ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-04-28 20:03 ` Jeff King @ 2014-04-28 20:49 ` Torsten Bögershausen 2014-04-29 3:23 ` Jeff King 0 siblings, 1 reply; 19+ messages in thread From: Torsten Bögershausen @ 2014-04-28 20:49 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, git On 2014-04-28 22.03, Jeff King wrote: > On Mon, Apr 28, 2014 at 09:52:07PM +0200, Torsten Bögershausen wrote: > >> To my knowledge repos with decomposed unicode should be rare in >> practice. I only can speak for european (or latin based) or cyrillic >> languages myself: > I've run across several cases in the past few months, but only just > figured out what was going on. Most were tickets to GitHub support, but > we actually have such a case in our github/github repository. In most > cases, I think they were created on older versions of git on OS X, > either before core.precomposeunicode existed, or before it was turned on > by default. The decomposed form got baked into the tree (whatever the > user originally typed, git probably found out about it via "git add ."). > > I think reports are just coming in now because we didn't start turning > on core.precomposeunicode by default until v1.8.5, shipped in November. > And then, a person working on the repository would not notice anything, > since we only set the flag during clone. So it took time for people to > upgrade _and_ to make fresh clones. OK, thanks for the description. In theory we can make Git "composition ignoring" by changing index_file_exists() in name-hash.c. (Both names must be precomposed first and compared then) I don't know how much people are using Git before 1.7.12 (the first version supporting precomposed unicode). Could we simply ask them to upgrade ? The next problem is that people need to agree if the repo should store names in pre- or decomposed form. (My voice is for precomposed) Unfortunatly the core.precomposeunicode is repo-local, so everybody needs to "agree globally" and "configure locally". Side note: I which we had this config variable travelling with the repo, like .gitattributes does for text dealing with CRLF-LF. I don't know how many reports you have, reading all this it feels as if the effected users could "normalize" their repos and run "git config core.precomposeunicode true", followed by "git config --global core.precomposeunicode true". Does that sound like a possible way forward ? >> So for me the test case could sense, even if I think that nobody (TM) >> uses an old Git version under Mac OS X which is not able to handle >> precomposed unicode. > Even when they do not, the decomposed values are baked into history from > those old versions. So it is a matter of history created with older > versions not interacting well with newer versions. I'm not sure if I understood all the details here, but I would be happy to help with suggestions/tests/reviews. > -Peff ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-04-28 20:49 ` Torsten Bögershausen @ 2014-04-29 3:23 ` Jeff King 2014-04-29 7:39 ` Torsten Bögershausen 0 siblings, 1 reply; 19+ messages in thread From: Jeff King @ 2014-04-29 3:23 UTC (permalink / raw) To: Torsten Bögershausen; +Cc: Junio C Hamano, git On Mon, Apr 28, 2014 at 10:49:30PM +0200, Torsten Bögershausen wrote: > OK, thanks for the description. > In theory we can make Git "composition ignoring" by changing > index_file_exists() in name-hash.c. > (Both names must be precomposed first and compared then) Yeah, we could perhaps get away without storing the extra precomposed form if we just stored the entries under their precomposed hash, and then taught same_name to use a slower precompose-aware comparison. But I also see that we do binary searches in index_name_pos (called by index_name_is_other). I don't know if we'd have to deal with this problem there, too. > I don't know how much people are using Git before 1.7.12 (the > first version supporting precomposed unicode). > > Could we simply ask them to upgrade ? I'm not sure what you mean here. Upgrading won't help, because the values are baked into existing history created with the old versions forever. Any time I "git checkout v1.0" on the broken project, a modern git will complain about the ghost untracked file. > The next problem is that people need to agree if the repo should store > names in pre- or decomposed form. > (My voice is for precomposed) > Unfortunatly the core.precomposeunicode is repo-local, so everybody > needs to "agree globally" and "configure locally". Yeah, that was sort of my "point 1" from the patch. I'm a bit worried that it creates problems for people on other systems, though. Linux people do not need to care about precomposed/decomposed at all, and any attempt we make to automatically handle it is going to run afoul of non-utf8 encodings. Not to mention that it does not solve the "git checkout v1.0" problem above. > Side note: > I which we had this config variable travelling with the repo, like .gitattributes does > for text dealing with CRLF-LF. Yeah, I guess if we wanted to enforce it everywhere, you would have to use .gitattributes since we cannot safely turn on such a feature by default. > I don't know how many reports you have, reading all this it feels as if the effected users > could "normalize" their repos and run "git config core.precomposeunicode true", followed > by "git config --global core.precomposeunicode true". > Does that sound like a possible way forward ? I have just a handful of reports. Maybe 3-4? I didn't dig them all up today, as it would be a non-trivial effort. But I have no idea how good a sampling that is. -Peff ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-04-29 3:23 ` Jeff King @ 2014-04-29 7:39 ` Torsten Bögershausen 0 siblings, 0 replies; 19+ messages in thread From: Torsten Bögershausen @ 2014-04-29 7:39 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, git On 04/29/2014 05:23 AM, Jeff King wrote: > On Mon, Apr 28, 2014 at 10:49:30PM +0200, Torsten Bögershausen wrote: > >> OK, thanks for the description. >> In theory we can make Git "composition ignoring" by changing >> index_file_exists() in name-hash.c. >> (Both names must be precomposed first and compared then) > Yeah, we could perhaps get away without storing the extra precomposed > form if we just stored the entries under their precomposed hash, and > then taught same_name to use a slower precompose-aware comparison. But I > also see that we do binary searches in index_name_pos (called by > index_name_is_other). I don't know if we'd have to deal with this > problem there, too. Just loud thinking: We precompose whenever we read file names from disc, that's done in readdir() We precompose whenever we get an argv into Git, that's done in precompose_argv() We precompose every time we read file names from the index file on disc(?) into memory. That we don't do today, and my suggestion to hack name-hash.c isn't a good one. Probably we need to go into read-cache.c, or more places. I'm not an expert here knowing all Git internal details. Basically all places where strings containing file names are put into memory are effected, and I wouldn't be too concerned about CPU cycles. >> I don't know how much people are using Git before 1.7.12 (the >> first version supporting precomposed unicode). >> >> Could we simply ask them to upgrade ? > I'm not sure what you mean here. Upgrading won't help, because the > values are baked into existing history created with the old versions > forever. Any time I "git checkout v1.0" on the broken project, a modern > git will complain about the ghost untracked file. It depends if all file names in a certain repo are stored decomposed, (in this case everybody can set core.precomposeunicode false) or if there is a mixture having precomposed and decomposed in different comits/directories... In this case we can normalize the master branch. For older commit the users need to wait for an updated Git version, until that they need to live with the ghosts as good as they can. > >> The next problem is that people need to agree if the repo should store >> names in pre- or decomposed form. >> (My voice is for precomposed) >> Unfortunatly the core.precomposeunicode is repo-local, so everybody >> needs to "agree globally" and "configure locally". > Yeah, that was sort of my "point 1" from the patch. I'm a bit worried > that it creates problems for people on other systems, though. Linux > people do not need to care about precomposed/decomposed at all, and any > attempt we make to automatically handle it is going to run afoul of > non-utf8 encodings. Not to mention that it does not solve the "git > checkout v1.0" problem above. Not sure what is meant by non-utf8 encodings. Mac OS X can only handle Unicode filenames, and a single ISO-8859-1 will be returned as "%XY" from readdir(). So if you want to share a repo with Mac OS X (and/or Windows) Unicode should be used. Are you saying that there is a Linux station feeding in file names in e.g. 8859-1, EUC ? My experience/knowledge is that you can not do that (in a useful way). >> Side note: >> I which we had this config variable travelling with the repo, like .gitattributes does >> for text dealing with CRLF-LF. > Yeah, I guess if we wanted to enforce it everywhere, you would have to > use .gitattributes since we cannot safely turn on such a feature by > default. > >> I don't know how many reports you have, reading all this it feels as if the effected users >> could "normalize" their repos and run "git config core.precomposeunicode true", followed >> by "git config --global core.precomposeunicode true". >> Does that sound like a possible way forward ? > I have just a handful of reports. Maybe 3-4? I didn't dig them all up > today, as it would be a non-trivial effort. But I have no idea how good > a sampling that is. The following could help, may be: git -c core.quotepath=false ls-files | iconv -f UTF-8-MAC -t UTF-8 >expected git -c core.quotepath=false ls-files >actual diff expected actual > > -Peff ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-04-28 19:35 ` Jeff King 2014-04-28 19:52 ` Torsten Bögershausen @ 2014-04-29 3:15 ` Jeff King 1 sibling, 0 replies; 19+ messages in thread From: Jeff King @ 2014-04-29 3:15 UTC (permalink / raw) To: Junio C Hamano; +Cc: Torsten Bögershausen, git On Mon, Apr 28, 2014 at 03:35:02PM -0400, Jeff King wrote: > Since such entries are in the minority, and because cache_entry is > already a variable-length struct, I think you could get away with > sticking it after the "name" field, and then comparing like: > > const char *ce_normalized_name(struct cache_entry *ce, size_t *len) > { > const char *ret; > > /* Normal, fast path */ > if (!(ce->ce_flags & CE_NORMALIZED_NAME)) { > len = ce_namelen(ce); > return ce->name; > } > > /* Slow path for normalized names */ > ret = ce->name + ce->namelen + 1; > *len = strlen(name); > return ret; > } That's the reading half. We would also need to create the normalized names for each cache_entry. I took a look at that this afternoon. It turns out we make cache_entry structs in quite a few places. So I thought I'd start with converting them all to a function like: struct cache_entry *cache_entry_alloc(const char *name, size_t len); And then once converted, we could teach it to normalize the name as appropriate. That interface does improve many of the callers, but there are a few tricky ones. For example, in checkout.c:update_some (and one or two other spots), we actually have the path broken into two parts, and we combine them while writing into the cache_entry. We could obviously combine them into a single buffer beforehand, but that means extra copying in reasonably hot code paths. It would be slightly ugly but perhaps reasonable to have: cache_entry_alloc_two(const char *one, size_t one_len, const char *two, size_t two_len); But then I got to unpack-trees. It has the whole path broken down across a linked list. I'm not sure what would be least terrible interface here. Again, we could format to a buffer and copy, but I'm hesitant to do it on this code path. I'm not sure if I'm just being paranoid about the extra memcpys (it's not _that_ tight a loop, since after all, we are generally zlib inflating to get the tree data, and the filenames are not all _that_ long). I dunno. I just hate the idea of tradeoffs for this OS-X-only fix permeating the rest of the code on other platforms. But maybe Knuth should be hitting me with his premature optimization clue-stick. -Peff ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-04-28 16:16 [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames Jeff King 2014-04-28 19:17 ` Junio C Hamano @ 2014-04-29 17:12 ` Junio C Hamano 2014-04-29 18:02 ` Jeff King 2014-05-04 6:13 ` Torsten Bögershausen 2 siblings, 1 reply; 19+ messages in thread From: Junio C Hamano @ 2014-04-29 17:12 UTC (permalink / raw) To: Jeff King; +Cc: Torsten Bögershausen, git Jeff King <peff@peff.net> writes: > This patch just adds a test to demonstrate the breakage. > Some possible fixes are: > > 1. Tell everyone that NFD in the git repo is wrong, and > they should make a new commit to normalize all their > in-repo files to be precomposed. > > This is probably not the right thing to do, because it > still doesn't fix checkouts of old history. And it > spreads the problem to people on byte-preserving > filesystems (like ext4), because now they have to start > precomposing their filenames as they are adde to git. Hmm, have we taught the "compare precomposed" for codepaths that compare two trees and a tree and the index, too? Otherwise, we would have the same issue with commits in the old history. Do we have a similar issue for older commit in a history under "ignore-case" as well? ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-04-29 17:12 ` Junio C Hamano @ 2014-04-29 18:02 ` Jeff King 2014-04-29 18:49 ` Junio C Hamano 2014-04-30 14:57 ` Torsten Bögershausen 0 siblings, 2 replies; 19+ messages in thread From: Jeff King @ 2014-04-29 18:02 UTC (permalink / raw) To: Junio C Hamano; +Cc: Torsten Bögershausen, git On Tue, Apr 29, 2014 at 10:12:52AM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > This patch just adds a test to demonstrate the breakage. > > Some possible fixes are: > > > > 1. Tell everyone that NFD in the git repo is wrong, and > > they should make a new commit to normalize all their > > in-repo files to be precomposed. > > > > This is probably not the right thing to do, because it > > still doesn't fix checkouts of old history. And it > > spreads the problem to people on byte-preserving > > filesystems (like ext4), because now they have to start > > precomposing their filenames as they are adde to git. > > Hmm, have we taught the "compare precomposed" for codepaths that > compare two trees and a tree and the index, too? Otherwise, we > would have the same issue with commits in the old history. Ugh, yeah, I didn't think about that codepath. I think we would not want to precompose in that case. IOW, git works byte-wise internally, but it is only at the filesystem layer that we do such munging. The index straddles the line between the filesystem and git's internal representations. I think my "keep the normalized names alongside index entries" approach might still work there. But it means that we compare against the "real" byte-wise names on the tree side, and against the normalized names on the path side. But that means having two comparison/lookup functions for the index, and always using the right one. And algorithms that rely on traversing two sorted lists cannot work in both directions. > Do we have a similar issue for older commit in a history under > "ignore-case" as well? I don't think so, because we handle ignorecase completely differently. There we use the name-hash with a case-insensitive hash and a case-insensitive comparison function. And we use strcasecmp liberally throughout the code. I don't think we have a "str_utf8_cmp" that ignores normalizations (or maybe strcoll will do this?). But in theory we could use it everywhere we use strcasecmp for ignore_case. And then we would not need to have our readdir wrapper, maybe? I admit I haven't thought that much about _either_ approach. But aside from some bugs in the hash system, I do not recall seeing any design problems in the ignorecase code. -Peff ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-04-29 18:02 ` Jeff King @ 2014-04-29 18:49 ` Junio C Hamano 2014-04-29 19:46 ` Jeff King 2014-04-30 14:57 ` Torsten Bögershausen 1 sibling, 1 reply; 19+ messages in thread From: Junio C Hamano @ 2014-04-29 18:49 UTC (permalink / raw) To: Jeff King; +Cc: Torsten Bögershausen, git Jeff King <peff@peff.net> writes: > I don't think we have a "str_utf8_cmp" that ignores normalizations (or > maybe strcoll will do this?). But in theory we could use it everywhere > we use strcasecmp for ignore_case. And then we would not need to have > our readdir wrapper, maybe? I admit I haven't thought that much about > _either_ approach. But aside from some bugs in the hash system, I do not > recall seeing any design problems in the ignorecase code. Our diffs and merges depend on walking two (or more) sorted lists, and that sort order is baked in the tree objects when they are created. Using "normalized comparison" only when comparing the earliest elements picked from these sorted lists would not give you the correct comparison or merge results, would it? ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-04-29 18:49 ` Junio C Hamano @ 2014-04-29 19:46 ` Jeff King 0 siblings, 0 replies; 19+ messages in thread From: Jeff King @ 2014-04-29 19:46 UTC (permalink / raw) To: Junio C Hamano; +Cc: Torsten Bögershausen, git On Tue, Apr 29, 2014 at 11:49:30AM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > I don't think we have a "str_utf8_cmp" that ignores normalizations (or > > maybe strcoll will do this?). But in theory we could use it everywhere > > we use strcasecmp for ignore_case. And then we would not need to have > > our readdir wrapper, maybe? I admit I haven't thought that much about > > _either_ approach. But aside from some bugs in the hash system, I do not > > recall seeing any design problems in the ignorecase code. > > Our diffs and merges depend on walking two (or more) sorted lists, > and that sort order is baked in the tree objects when they are > created. Using "normalized comparison" only when comparing the > earliest elements picked from these sorted lists would not give you > the correct comparison or merge results, would it? Right, but we do not do normalized comparisons on that side. Not for precompose, and not for ignorecase. The entry in the index _is_ case-sensitive[1], and we compare it as such to the tree side. It is only when comparing the filesystem side to the index that we need to care about such normalizations. And there we have the name-hash code to handle ignorecase, but nothing to handle precompose. -Peff [1] This works because you typically get the case-sensitive entry via `git read-tree`, and then further update it from the filesystem. If you were to add a new entry "makefile", and somebody else added "Makefile", they would conflict. When you do "git add makefile" and "Makefile" _does_ exist, I am not sure, though, if it is git who handles making sure we find the correct entry, or if it is simply the fact that case insensitive filesystems are typically case-preserving (so you generally ask for "Makefile" anyway). If it is the latter, then that is a problem for precompose. HFS's NFD normalization is non-preserving. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-04-29 18:02 ` Jeff King 2014-04-29 18:49 ` Junio C Hamano @ 2014-04-30 14:57 ` Torsten Bögershausen 2014-05-04 12:04 ` Torsten Bögershausen 1 sibling, 1 reply; 19+ messages in thread From: Torsten Bögershausen @ 2014-04-30 14:57 UTC (permalink / raw) To: Jeff King, Junio C Hamano; +Cc: Torsten Bögershausen, git On 29.04.14 20:02, Jeff King wrote: > On Tue, Apr 29, 2014 at 10:12:52AM -0700, Junio C Hamano wrote: > >> Jeff King <peff@peff.net> writes: >> >>> This patch just adds a test to demonstrate the breakage. >>> Some possible fixes are: >>> >>> 1. Tell everyone that NFD in the git repo is wrong, and >>> they should make a new commit to normalize all their >>> in-repo files to be precomposed. >>> >>> This is probably not the right thing to do, because it >>> still doesn't fix checkouts of old history. And it >>> spreads the problem to people on byte-preserving >>> filesystems (like ext4), because now they have to start >>> precomposing their filenames as they are adde to git. >> >> Hmm, have we taught the "compare precomposed" for codepaths that >> compare two trees and a tree and the index, too? Otherwise, we >> would have the same issue with commits in the old history. > > Ugh, yeah, I didn't think about that codepath. I think we would not want > to precompose in that case. IOW, git works byte-wise internally, but it > is only at the filesystem layer that we do such munging. The index > straddles the line between the filesystem and git's internal > representations. > [snip] Please allow me to answer on this post- I made a suggestion here: https://github.com/tboegi/git/commit/85305ce306cb88a07dad6350d6ba8c5f2f817af6 The new test in t3910 passes, but the test suite hangs somewhere, there is a whitespace in precompose_utf8.c, so I don't know what to say. But in case someone wants to make a code review: commit 85305ce306cb88a07dad6350d6ba8c5f2f817af6 Author: Torsten Bögershausen <tboegi@web.de> Date: Wed Apr 30 10:30:04 2014 +0200 core.precomposeunicode with decomposed filenames Commit 750b2e4785e shows a failure of core.precomposeunicode when decomposed filenames are in the index. When decomposed file names are in the index and readdir() converts them into the decomposed form, "Git status" will report the precomposed file name as untracked. Solution: Precompose file names when reading the index file from disc into memory. diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c index 95fe849..40ebc2e 100644 --- a/compat/precompose_utf8.c +++ b/compat/precompose_utf8.c @@ -57,6 +57,19 @@ void probe_utf8_pathname_composition(char *path, int len) } +char *precompose_str_len(const char *in, size_t insz, int *outsz) +{ + char *prec_str = NULL; + if (precomposed_unicode != 1) + return NULL; + + if (has_non_ascii(in, insz, NULL)) + prec_str = reencode_string_len(in, insz, repo_encoding, path_encoding, outsz); + + return prec_str; +} + + void precompose_argv(int argc, const char **argv) { int i = 0; diff --git a/compat/precompose_utf8.h b/compat/precompose_utf8.h index 3b73585..28f6595 100644 --- a/compat/precompose_utf8.h +++ b/compat/precompose_utf8.h @@ -26,6 +26,7 @@ typedef struct { struct dirent_prec_psx *dirent_nfc; } PREC_DIR; +char *precompose_str_len(const char *in, size_t insz, int *outsz); void precompose_argv(int argc, const char **argv); void probe_utf8_pathname_composition(char *, int); diff --git a/git-compat-util.h b/git-compat-util.h index d493a8c..de117d1 100644 --- a/git-compat-util.h +++ b/git-compat-util.h @@ -180,7 +180,7 @@ typedef unsigned long uintptr_t; #ifdef PRECOMPOSE_UNICODE #include "compat/precompose_utf8.h" #else -#define precompose_str(in,i_nfd2nfc) +#define precompose_str_len(s,i,o) NULL #define precompose_argv(c,v) #define probe_utf8_pathname_composition(a,b) #endif diff --git a/read-cache.c b/read-cache.c index 4b4effd..0887835 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1330,7 +1330,7 @@ static inline uint32_t ntoh_l_force_align(void *p) #define ntoh_l(var) ntoh_l_force_align(&(var)) #endif -static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *ondisk, +static struct cache_entry *cache_entry_from_ondisk_int(struct ondisk_cache_entry *ondisk, unsigned int flags, const char *name, size_t len) @@ -1355,6 +1355,22 @@ static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *on return ce; } +static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *ondisk, + unsigned int flags, + const char *name, + size_t len) +{ + int prec_len; + char *prec_name = precompose_str_len(name, len, &prec_len); + if (prec_name) { + struct cache_entry *ce; + ce = cache_entry_from_ondisk_int(ondisk, flags, prec_name, prec_len); + free(prec_name); + return ce; + } + return cache_entry_from_ondisk_int(ondisk, flags, name, len); +} + /* * Adjacent cache entries tend to share the leading paths, so it makes * sense to only store the differences in later entries. In the v4 diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh index 23aa61e..d27c018 100755 --- a/t/t3910-mac-os-precompose.sh +++ b/t/t3910-mac-os-precompose.sh @@ -141,7 +141,7 @@ test_expect_success "Add long precomposed filename" ' git commit -m "Long filename" ' -test_expect_failure 'handle existing decomposed filenames' ' +test_expect_success 'handle existing decomposed filenames' ' echo content >"verbatim.$Adiarnfd" && git -c core.precomposeunicode=false add "verbatim.$Adiarnfd" && git commit -m "existing decomposed file" && ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-04-30 14:57 ` Torsten Bögershausen @ 2014-05-04 12:04 ` Torsten Bögershausen 0 siblings, 0 replies; 19+ messages in thread From: Torsten Bögershausen @ 2014-05-04 12:04 UTC (permalink / raw) To: Torsten Bögershausen, Jeff King, Junio C Hamano; +Cc: git On 2014-04-30 16.57, Torsten Bögershausen wrote: There is something wrong with the patch, (test suite hangs or TC fail), so I need to com back later. Sorry for the noise. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-04-28 16:16 [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames Jeff King 2014-04-28 19:17 ` Junio C Hamano 2014-04-29 17:12 ` Junio C Hamano @ 2014-05-04 6:13 ` Torsten Bögershausen 2014-05-05 21:46 ` Jeff King 2 siblings, 1 reply; 19+ messages in thread From: Torsten Bögershausen @ 2014-05-04 6:13 UTC (permalink / raw) To: Jeff King; +Cc: git (Sorry for the late reply, I'm handicapped by some local IT problems) Peff, do you know if the fix below helps ? On 2014-04-28 18.16, Jeff King wrote: > If you have existing decomposed filenames in your git > repository (e.g., that were created with older versions of > git that did not precompose unicode), a modern git with > core.precomposeunicode set does not handle them well. Yes. > The problem is that we normalize the paths coming from the > disk into their precomposed form, and then compare them > against the literal bytes in the index. This makes things > better if you have the precomposed form in the index. It > makes things worse if you actually have the decomposed form > in the index. > > As a result, paths with decomposed filenames may have their > precomposed variants listed as untracked files (even though > the precomposed variants do not exist on-disk at all). Yes > This patch just adds a test to demonstrate the breakage. > Some possible fixes are: > > 1. Tell everyone that NFD in the git repo is wrong, and > they should make a new commit to normalize all their > in-repo files to be precomposed. > This is probably not the right thing to do, because it > still doesn't fix checkouts of old history. And it > spreads the problem to people on byte-preserving > filesystems (like ext4), because now they have to start > precomposing their filenames as they are adde to git. (typo: ^added) I'm not sure if I follow. People running ext4 (or Linux in general, or Windows, or Unix) do not suffer from file system "feature" of Mac OS, which accepts precomposed/decomposed Unicode but returns decompomsed. > > 2. Do all index filename comparisons using a UTF-8 aware > comparison function when core.precomposeunicode is set. > This would probably have bad performance, and somewhat > defeats the point of converting the filenames at the > readdir level in the first place. > > 3. Convert index filenames to their precomposed form when > we read the index from disk. This would be efficient, > but we would have to be careful not to write the > precomposed forms back out to disk. Question: How could we be careful? Mac OS writes always decomposed Unicode to disk. (And all other OS tend to use precomposed forms, mainly because the "keyboard driver" generates it.) > > 4. Introduce some infrastructure to efficiently match up > the precomposed/decomposed forms. We already do > something similar for case-insensitive files using > name-hash.c. We might be able to adapt that strategy > here. -------------------------- This is my understanding: Some possible fixes are: 1. Accept that NFD in a Git repo which is shared between Mac OS and Linux or Windows is problematic. Whenever core.precomposeunicode = true, do the following: Let Git under Mac OS change all file names in the index into the precomposed form when a new commit is done. This is probably not a wrong thing to do. When the index file is read into memory, precompose the file names and compare them with the precomposed form coming from precompose_utf8_readdir(). This avoids decomposed file names to be reported as untracked by "git status. 2. Do all index filename comparisons under Mac OS X using a UTF-8 aware comparison function regardless if core.precomposeunicode is set. This would probably have bad performance, and somewhat defeats the point of converting the filenames at the readdir level in the first place. (The TC is good) > > Signed-off-by: Jeff King <peff@peff.net> > --- > t/t3910-mac-os-precompose.sh | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh > index e4ba601..23aa61e 100755 > --- a/t/t3910-mac-os-precompose.sh > +++ b/t/t3910-mac-os-precompose.sh > @@ -140,6 +140,16 @@ test_expect_success "Add long precomposed filename" ' > git add * && > git commit -m "Long filename" > ' > + > +test_expect_failure 'handle existing decomposed filenames' ' > + echo content >"verbatim.$Adiarnfd" && > + git -c core.precomposeunicode=false add "verbatim.$Adiarnfd" && > + git commit -m "existing decomposed file" && > + >expect && > + git ls-files --exclude-standard -o "verbatim*" >untracked && > + test_cmp expect untracked > +' > + > # Test if the global core.precomposeunicode stops autosensing > # Must be the last test case > test_expect_success "respect git config --global core.precomposeunicode" ' On top of the we need to following: diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c index 95fe849..5ed3d17 100644 --- a/compat/precompose_utf8.c +++ b/compat/precompose_utf8.c @@ -57,6 +57,19 @@ void probe_utf8_pathname_composition(char *path, int len) } +char *precompose_str_len(const char *in, size_t insz, int *outsz) +{ + char *prec_str = NULL; + if (precomposed_unicode != 1) + return NULL; + + if (has_non_ascii(in, insz, NULL)) + prec_str = reencode_string_len(in, insz, repo_encoding, path_encoding, outsz); + + return prec_str; +} + + void precompose_argv(int argc, const char **argv) { int i = 0; diff --git a/compat/precompose_utf8.h b/compat/precompose_utf8.h index 3b73585..28f6595 100644 --- a/compat/precompose_utf8.h +++ b/compat/precompose_utf8.h @@ -26,6 +26,7 @@ typedef struct { struct dirent_prec_psx *dirent_nfc; } PREC_DIR; +char *precompose_str_len(const char *in, size_t insz, int *outsz); void precompose_argv(int argc, const char **argv); void probe_utf8_pathname_composition(char *, int); diff --git a/git-compat-util.h b/git-compat-util.h index d493a8c..de117d1 100644 --- a/git-compat-util.h +++ b/git-compat-util.h @@ -180,7 +180,7 @@ typedef unsigned long uintptr_t; #ifdef PRECOMPOSE_UNICODE #include "compat/precompose_utf8.h" #else -#define precompose_str(in,i_nfd2nfc) +#define precompose_str_len(s,i,o) NULL #define precompose_argv(c,v) #define probe_utf8_pathname_composition(a,b) #endif diff --git a/read-cache.c b/read-cache.c index 4b4effd..0887835 100644 --- a/read-cache.c +++ b/read-cache.c @@ -1330,7 +1330,7 @@ static inline uint32_t ntoh_l_force_align(void *p) #define ntoh_l(var) ntoh_l_force_align(&(var)) #endif -static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *ondisk, +static struct cache_entry *cache_entry_from_ondisk_int(struct ondisk_cache_entry *ondisk, unsigned int flags, const char *name, size_t len) @@ -1355,6 +1355,22 @@ static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *on return ce; } +static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *ondisk, + unsigned int flags, + const char *name, + size_t len) +{ + int prec_len; + char *prec_name = precompose_str_len(name, len, &prec_len); + if (prec_name) { + struct cache_entry *ce; + ce = cache_entry_from_ondisk_int(ondisk, flags, prec_name, prec_len); + free(prec_name); + return ce; + } + return cache_entry_from_ondisk_int(ondisk, flags, name, len); +} + /* * Adjacent cache entries tend to share the leading paths, so it makes * sense to only store the differences in later entries. In the v4 diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh index 23aa61e..2c3965b 100755 --- a/t/t3910-mac-os-precompose.sh +++ b/t/t3910-mac-os-precompose.sh @@ -141,13 +141,14 @@ test_expect_success "Add long precomposed filename" ' git commit -m "Long filename" ' -test_expect_failure 'handle existing decomposed filenames' ' +test_expect_success 'handle existing decomposed filenames' ' echo content >"verbatim.$Adiarnfd" && git -c core.precomposeunicode=false add "verbatim.$Adiarnfd" && git commit -m "existing decomposed file" && >expect && git ls-files --exclude-standard -o "verbatim*" >untracked && test_cmp expect untracked + exit 0 ' # Test if the global core.precomposeunicode stops autosensing ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-05-04 6:13 ` Torsten Bögershausen @ 2014-05-05 21:46 ` Jeff King 2014-05-06 10:11 ` Erik Faye-Lund 2014-05-07 19:16 ` Torsten Bögershausen 0 siblings, 2 replies; 19+ messages in thread From: Jeff King @ 2014-05-05 21:46 UTC (permalink / raw) To: Torsten Bögershausen; +Cc: git On Sun, May 04, 2014 at 08:13:15AM +0200, Torsten Bögershausen wrote: > > 1. Tell everyone that NFD in the git repo is wrong, and > > they should make a new commit to normalize all their > > in-repo files to be precomposed. > > This is probably not the right thing to do, because it > > still doesn't fix checkouts of old history. And it > > spreads the problem to people on byte-preserving > > filesystems (like ext4), because now they have to start > > precomposing their filenames as they are adde to git. > (typo: ^added) > I'm not sure if I follow. People running ext4 (or Linux in general, > or Windows, or Unix) do not suffer from file system > "feature" of Mac OS, which accepts precomposed/decomposed Unicode > but returns decompomsed. What I mean by "spreads the problem" is that git on Linux does not need to care about utf8 at all. It treats filenames as a byte sequence. But if we were to start enforcing "filenames should be precomposed utf8", then people adding files on Linux would want to enforce that, too. People on Linux could ignore the issue as they do now, but they would then create problems for OS X users if they add decomposed filenames. IOW, if the OS X code assumes "all repo filenames are precomposed", then other systems become a possible vector for violating that assumption. > > 3. Convert index filenames to their precomposed form when > > we read the index from disk. This would be efficient, > > but we would have to be careful not to write the > > precomposed forms back out to disk. > Question: > How could we be careful? > Mac OS writes always decomposed Unicode to disk. > (And all other OS tend to use precomposed forms, mainly because the "keyboard > driver" generates it.) Sorry, I should have been more clear here. I meant "do not write index entries using the precomposed forms out to the on-disk index". Because that would mean that git silently converts your filenames, and it would look like you have changes to commit whenever you read in a tree with a decomposed name. Looking over the patch you sent earlier, I suspect that is part of its problem (it stores the converted name in the index entry's name field). > This is my understanding: > Some possible fixes are: > > 1. Accept that NFD in a Git repo which is shared between Mac OS > and Linux or Windows is problematic. > Whenever core.precomposeunicode = true, do the following: > Let Git under Mac OS change all file names in the index > into the precomposed form when a new commit is done. > This is probably not a wrong thing to do. > > When the index file is read into memory, precompose the file names and compare > them with the precomposed form coming from precompose_utf8_readdir(). > This avoids decomposed file names to be reported as untracked by "git status. This is the case I was specifically thinking of above (and I think what your patch is doing). > 2. Do all index filename comparisons under Mac OS X using a UTF-8 aware > comparison function regardless if core.precomposeunicode is set. > This would probably have bad performance, and somewhat > defeats the point of converting the filenames at the > readdir level in the first place. Right, I'm concerned about performance here, but I wonder if we can reuse the name-hash solutions from ignorecase. -Peff ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-05-05 21:46 ` Jeff King @ 2014-05-06 10:11 ` Erik Faye-Lund 2014-05-07 19:16 ` Torsten Bögershausen 1 sibling, 0 replies; 19+ messages in thread From: Erik Faye-Lund @ 2014-05-06 10:11 UTC (permalink / raw) To: Jeff King; +Cc: Torsten Bögershausen, GIT Mailing-list On Mon, May 5, 2014 at 11:46 PM, Jeff King <peff@peff.net> wrote: > On Sun, May 04, 2014 at 08:13:15AM +0200, Torsten Bögershausen wrote: > >> > 1. Tell everyone that NFD in the git repo is wrong, and >> > they should make a new commit to normalize all their >> > in-repo files to be precomposed. >> > This is probably not the right thing to do, because it >> > still doesn't fix checkouts of old history. And it >> > spreads the problem to people on byte-preserving >> > filesystems (like ext4), because now they have to start >> > precomposing their filenames as they are adde to git. >> (typo: ^added) >> I'm not sure if I follow. People running ext4 (or Linux in general, >> or Windows, or Unix) do not suffer from file system >> "feature" of Mac OS, which accepts precomposed/decomposed Unicode >> but returns decompomsed. > > What I mean by "spreads the problem" is that git on Linux does not need > to care about utf8 at all. It treats filenames as a byte sequence. But > if we were to start enforcing "filenames should be precomposed utf8", > then people adding files on Linux would want to enforce that, too. > > People on Linux could ignore the issue as they do now, but they would > then create problems for OS X users if they add decomposed filenames. > IOW, if the OS X code assumes "all repo filenames are precomposed", then > other systems become a possible vector for violating that assumption. FWIW, Git for Windows also doesn't deal with that "filenames are just a byte-sequence"-notion. We have patches in place that require filenames to *either* be valid UTF-8 or Windows-1252. Windows itself treats filenames as Unicode characters, not arbitrary byte sequences. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames 2014-05-05 21:46 ` Jeff King 2014-05-06 10:11 ` Erik Faye-Lund @ 2014-05-07 19:16 ` Torsten Bögershausen 1 sibling, 0 replies; 19+ messages in thread From: Torsten Bögershausen @ 2014-05-07 19:16 UTC (permalink / raw) To: Jeff King, Torsten Bögershausen; +Cc: git On 2014-05-05 23.46, Jeff King wrote: [] >> 2. Do all index filename comparisons under Mac OS X using a UTF-8 aware >> comparison function regardless if core.precomposeunicode is set. >> This would probably have bad performance, and somewhat >> defeats the point of converting the filenames at the >> readdir level in the first place. > > Right, I'm concerned about performance here, but I wonder if we can > reuse the name-hash solutions from ignorecase. Some short question, (sorry being short) What do you think about this: diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c index 95fe849..c807f38 100644 --- a/compat/precompose_utf8.c +++ b/compat/precompose_utf8.c @@ -57,6 +57,22 @@ void probe_utf8_pathname_composition(char *path, int len) } +char *inverscompose_str_len(const char *in, size_t insz, int *outsz) +{ + char *prec_str = NULL; + + if (has_non_ascii(in, insz, NULL)) { + int old_errno = errno; + prec_str = reencode_string_len(in, (int)insz, + precomposed_unicode ? path_encoding : repo_encoding, + precomposed_unicode ? repo_encoding : path_encoding, + outsz); + errno = old_errno; + } + return prec_str; +} + + void precompose_argv(int argc, const char **argv) { int i = 0; diff --git a/compat/precompose_utf8.h b/compat/precompose_utf8.h index 3b73585..b9ac099 100644 --- a/compat/precompose_utf8.h +++ b/compat/precompose_utf8.h @@ -26,6 +26,7 @@ typedef struct { struct dirent_prec_psx *dirent_nfc; } PREC_DIR; +char *inverscompose_str_len(const char *in, size_t insz, int *outsz); void precompose_argv(int argc, const char **argv); void probe_utf8_pathname_composition(char *, int); diff --git a/git-compat-util.h b/git-compat-util.h index f6d3a46..5ae5f57 100644 --- a/git-compat-util.h +++ b/git-compat-util.h @@ -177,7 +177,7 @@ typedef unsigned long uintptr_t; #ifdef PRECOMPOSE_UNICODE #include "compat/precompose_utf8.h" #else -#define precompose_str(in,i_nfd2nfc) +#define inverscompose_str_len(s,i,o) NULL #define precompose_argv(c,v) #define probe_utf8_pathname_composition(a,b) #endif diff --git a/name-hash.c b/name-hash.c index 97444d0..3c4a4ee 100644 --- a/name-hash.c +++ b/name-hash.c @@ -210,7 +210,7 @@ struct cache_entry *index_dir_exists(struct index_state *istate, const char *nam return NULL; } -struct cache_entry *index_file_exists(struct index_state *istate, const char *name, int namelen, int icase) +static struct cache_entry *index_file_exists_int(struct index_state *istate, const char *name, int namelen, int icase) { struct cache_entry *ce; struct hashmap_entry key; @@ -227,6 +227,25 @@ struct cache_entry *index_file_exists(struct index_state *istate, const char *na return NULL; } + +struct cache_entry *index_file_exists(struct index_state *istate, const char *name, int namelen, int icase) +{ + struct cache_entry *ce; + ce = index_file_exists_int(istate, name, namelen, icase); + if (ce) + return ce; + else { + int prec_len; + char *prec_name = inverscompose_str_len(name, namelen, &prec_len); + if (prec_name) { + ce = index_file_exists_int(istate, prec_name, prec_len, icase); + free(prec_name); + } + } + return ce; +} + + void free_name_hash(struct index_state *istate) { if (!istate->name_hash_initialized) diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh index e4ba601..4414658 100755 --- a/t/t3910-mac-os-precompose.sh +++ b/t/t3910-mac-os-precompose.sh @@ -140,6 +140,22 @@ test_expect_success "Add long precomposed filename" ' git add * && git commit -m "Long filename" ' + +test_expect_success 'handle existing decomposed filenames' ' + git checkout master && + git reset --hard && + git checkout -b exist_decomposed && + echo content >"verbatim.$Adiarnfd" && + git -c core.precomposeunicode=false add "verbatim.$Adiarnfd" && + git commit -m "existing decomposed file" && + echo \"verbatim.A\\314\\210\" >expect && + git ls-files >actual && + test_cmp expect actual && + >expect && + git status | grep verbatim || : >actual && + test_cmp expect actual +' + # Test if the global core.precomposeunicode stops autosensing # Must be the last test case test_expect_success "respect git config --global core.precomposeunicode" ' diff --git a/utf8.c b/utf8.c index 77c28d4..fb8a99a 100644 --- a/utf8.c +++ b/utf8.c @@ -519,7 +519,7 @@ char *reencode_string_iconv(const char *in, size_t insz, iconv_t conv, int *outs char *out, *outpos; iconv_ibp cp; - outsz = insz; + outsz = 3*insz; /* for decompose */ outalloc = outsz + 1; /* for terminating NUL */ out = xmalloc(outalloc); outpos = out; ^ permalink raw reply related [flat|nested] 19+ messages in thread
end of thread, other threads:[~2014-05-07 19:16 UTC | newest] Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2014-04-28 16:16 [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames Jeff King 2014-04-28 19:17 ` Junio C Hamano 2014-04-28 19:35 ` Jeff King 2014-04-28 19:52 ` Torsten Bögershausen 2014-04-28 20:03 ` Jeff King 2014-04-28 20:49 ` Torsten Bögershausen 2014-04-29 3:23 ` Jeff King 2014-04-29 7:39 ` Torsten Bögershausen 2014-04-29 3:15 ` Jeff King 2014-04-29 17:12 ` Junio C Hamano 2014-04-29 18:02 ` Jeff King 2014-04-29 18:49 ` Junio C Hamano 2014-04-29 19:46 ` Jeff King 2014-04-30 14:57 ` Torsten Bögershausen 2014-05-04 12:04 ` Torsten Bögershausen 2014-05-04 6:13 ` Torsten Bögershausen 2014-05-05 21:46 ` Jeff King 2014-05-06 10:11 ` Erik Faye-Lund 2014-05-07 19:16 ` Torsten Bögershausen
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).