* Bugs in git filter-branch (git replace related) @ 2016-01-28 14:46 Anatoly Borodin 2016-01-29 6:18 ` Jeff King 0 siblings, 1 reply; 6+ messages in thread From: Anatoly Borodin @ 2016-01-28 14:46 UTC (permalink / raw) To: git Hi All! There are two bugs in `git filter-branch`, present in the most recent versions (d10e2cb9d0299a26f43d57dd5bdcf2b3f86a30b3), as well as in the old ones (I couldn't find a version where it works properly). The script: #!/bin/sh set -e GIT_EXEC_PATH=/tmp/git export GIT_EXEC_PATH GIT=$GIT_EXEC_PATH/git rm -rf a mkdir a cd a $GIT init echo aaa > a.txt && $GIT add a.txt && $GIT commit -a -m a echo bbb > a.txt && $GIT add a.txt && $GIT commit -a -m b echo ccc > a.txt && $GIT add a.txt && $GIT commit -a -m c $GIT replace f761ec192d9f0dca3329044b96ebdb12839dbff6 72943a16fb2c8f38f9dde202b7a70ccc19c52f34 echo && echo One && $GIT filter-branch --prune-empty -- master echo && echo Two && $GIT filter-branch --prune-empty -- --all The output is: ... One Rewrite 98af0305b778bf56e25a0d4f85acdf82f435f9b3 (3/3) (0 seconds passed, remaining 0 predicted) WARNING: Ref 'refs/heads/master' is unchanged Two Rewrite 98af0305b778bf56e25a0d4f85acdf82f435f9b3 (3/3) (0 seconds passed, remaining 0 predicted) WARNING: Ref 'refs/heads/master' is unchanged error: object 72943a16fb2c8f38f9dde202b7a70ccc19c52f34 is a blob, not a commit error: object 72943a16fb2c8f38f9dde202b7a70ccc19c52f34 is a blob, not a commit fatal: ambiguous argument 'refs/replace/f761ec192d9f0dca3329044b96ebdb12839dbff6^0': unknown revision or path not in the working tree. Use '--' to separate paths from revisions, like this: 'git <command> [<revision>...] -- [<file>...]' WARNING: Ref 'refs/replace/f761ec192d9f0dca3329044b96ebdb12839dbff6' is unchanged The `git replace` makes the second commit empty (use the file content from the first commit). It should disappear after `git filter-branch`, but it doesn't happen. Bug 1: the empty commit stays. Bug 2: the replace refs are not ignored (they can epresent blobs, trees etc, but even if they represent commits - should they be rewritten?). Any ideas? PS I've found http://article.gmane.org/gmane.comp.version-control.git/220931 , seems like the bug 1. But it's old! ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Bugs in git filter-branch (git replace related) 2016-01-28 14:46 Bugs in git filter-branch (git replace related) Anatoly Borodin @ 2016-01-29 6:18 ` Jeff King 2016-01-29 18:24 ` Anatoly Borodin 0 siblings, 1 reply; 6+ messages in thread From: Jeff King @ 2016-01-29 6:18 UTC (permalink / raw) To: Anatoly Borodin; +Cc: git On Thu, Jan 28, 2016 at 02:46:40PM +0000, Anatoly Borodin wrote: > The `git replace` makes the second commit empty (use the file content from > the first commit). It should disappear after `git filter-branch`, but it > doesn't happen. > > Bug 1: the empty commit stays. I'm not sure if this is a bug or not. The "empty commit" check works by checking the tree sha1s, without doing a full diff respecting replace refs. You're expecting git to notice a tree change, even though it never even examined the tree in the first place (because you didn't give it a tree or index filter). Try: git filter-branch --prune-empty --tree-filter true master which will force git to go through the motions of checking out the replaced content and re-examining it. This will run much more slowly, as it actually touches the filesystem. In the general case, it would be interesting if filter-branch (or a similar tool) could "cement" replacement objects into place as efficiently as possible. But I'm not sure whether that should be the default mode for filter-branch. > Bug 2: the replace refs are not ignored (they can epresent blobs, trees etc, > but even if they represent commits - should they be rewritten?). You told it "--all", which is passed to rev-list, where it means "all refs". I agree that running filter-branch on refs/replace is probably not going to yield useful results, but I'm not sure if it is filter-branch's responsibility to second-guess the rev-list options. Probably the documentation for filter-branch should recommend "--branches --tags" instead of "--all", though. -Peff ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Bugs in git filter-branch (git replace related) 2016-01-29 6:18 ` Jeff King @ 2016-01-29 18:24 ` Anatoly Borodin 2016-01-29 23:11 ` Jeff King 0 siblings, 1 reply; 6+ messages in thread From: Anatoly Borodin @ 2016-01-29 18:24 UTC (permalink / raw) To: git Hi Jeff, I've created a gist with the script https://gist.github.com/anatolyborodin/6505a364a68584f13846 I've added some changes and a second test (will be discussed in the comments). Jeff King <peff@peff.net> wrote: > I'm not sure if this is a bug or not. The "empty commit" check works by > checking the tree sha1s, without doing a full diff respecting replace > refs. > > You're expecting git to notice a tree change, even though it never even > examined the tree in the first place (because you didn't give it a tree > or index filter). When git-filter-branch(1) says "If you have any grafts or replacement refs defined, running this command will make them permanent.", and it doesn't work like that because of some optimization, it *is* a bug. > Try: > > git filter-branch --prune-empty --tree-filter true master > > which will force git to go through the motions of checking out the > replaced content and re-examining it. Thank you, I've added this command to the script, and it works! I think it should be added to git-filter-branch(1), if there is no other way to rewrite the optimization. >> Bug 2: the replace refs are not ignored (they can epresent blobs, trees etc, >> but even if they represent commits - should they be rewritten?). > > You told it "--all", which is passed to rev-list, where it means "all > refs". I agree that running filter-branch on refs/replace is probably > not going to yield useful results, but I'm not sure if it is > filter-branch's responsibility to second-guess the rev-list options. Look how `git log --all` works (see the second test in the script): it ignores (without any messages) the blobs, and writes only the commits. For example, the same commit log message is printed twice in the second testcase. Maybe it makes sence, I don't know. I would suggest that all refs/replace/* heads should be ignored by `git log`. `git rev-list --no-replace` maybe? > Probably the documentation for filter-branch should recommend > "--branches --tags" instead of "--all", though. Or redefine `--all` as "all refs excepting refs/replace/*". Who would really want to run `--all` the way it works now? The blobs replaces should be ignored, as in `git log --all`. Is there any reason to rewrite refs/rebase/hash if it's a replace commit? -- Mit freundlichen Grüßen, Anatoly Borodin ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Bugs in git filter-branch (git replace related) 2016-01-29 18:24 ` Anatoly Borodin @ 2016-01-29 23:11 ` Jeff King 2016-02-08 23:55 ` A different bug in git-filter-branch (v2.7.0) Anatoly Borodin 0 siblings, 1 reply; 6+ messages in thread From: Jeff King @ 2016-01-29 23:11 UTC (permalink / raw) To: Anatoly Borodin; +Cc: git On Fri, Jan 29, 2016 at 06:24:07PM +0000, Anatoly Borodin wrote: > > You're expecting git to notice a tree change, even though it never even > > examined the tree in the first place (because you didn't give it a tree > > or index filter). > > When git-filter-branch(1) says "If you have any grafts or replacement > refs defined, running this command will make them permanent.", and it > doesn't work like that because of some optimization, it *is* a bug. I think the bug is in the documentation. That has always worked for commit grafts and replacements, but never for blob and tree replacements (and in fact, filter-branch quite predates refs/replace). I don't think this is just an optimization; filter-branch does not touch or rewrite bits that you did not ask it to touch, and that is a user-visible behavior. > Thank you, I've added this command to the script, and it works! I think > it should be added to git-filter-branch(1), if there is no other way to > rewrite the optimization. Yeah, I agree. Documentation patches are welcome. I think with an "--index-filter", you could cement a replacement tree into place, but you need a "--tree-filter" to do a blob replacement (because otherwise we insert only the name of the blob sha1 into the index). > > You told it "--all", which is passed to rev-list, where it means "all > > refs". I agree that running filter-branch on refs/replace is probably > > not going to yield useful results, but I'm not sure if it is > > filter-branch's responsibility to second-guess the rev-list options. > > Look how `git log --all` works (see the second test in the script): it > ignores (without any messages) the blobs, and writes only the commits. > For example, the same commit log message is printed twice in the second > testcase. I'm not sure what you mean here. "git log --all" is not looking at blobs at all (you did not ask it do any diffs, nor simplify history based on TREESAME commits). The "--all" here means "start traversing from commits found at all ref tips". It _does_ look at refs/replace, but there are no commits to traverse there. Likewise, "git rev-list --all" would not show any. But "git rev-list --objects --all" would (both the refs/replace tips, as well as objects reachable from the other commits). > Maybe it makes sence, I don't know. I would suggest that all > refs/replace/* heads should be ignored by `git log`. `git rev-list > --no-replace` maybe? It is too late for that without an extra option. "rev-list --all" already has a well-established meaning, and changing it would break other uses (e.g., commit reachability done during object transfer and repacking, not to mention any third-party scripts). But... > > Probably the documentation for filter-branch should recommend > > "--branches --tags" instead of "--all", though. > > Or redefine `--all` as "all refs excepting refs/replace/*". Who would > really want to run `--all` the way it works now? If you mean "rev-list --all", then: lots of things that aren't filter-branch. :) If we were to change anything, it would be to intercept "--all" in filter-branch and rewrite it to "--exclude=refs/replace/* --all". I'm slightly negative on that, just because we advertise filter-branch as taking arbitrary rev-list args, and this is violating that. I think I'd be more in favor if it were done robustly and clearly, like: - teach rev-list --all-does-not-include-refs-replace (but with a less horrible name) - advertise that filter-branch always passes that option to rev-list Then at least it is robust (we are not mucking with rev-list options, which we cannot do completely accurately without having the full knowledge of all of its options), and simple to explain to the user (if they want to work around it, they simple add "--glob=refs/replace/*" to their invocation). -Peff ^ permalink raw reply [flat|nested] 6+ messages in thread
* A different bug in git-filter-branch (v2.7.0) 2016-01-29 23:11 ` Jeff King @ 2016-02-08 23:55 ` Anatoly Borodin 2016-02-22 21:13 ` Jeff King 0 siblings, 1 reply; 6+ messages in thread From: Anatoly Borodin @ 2016-02-08 23:55 UTC (permalink / raw) To: git Hi Jeff, unfortunately, `--tree-filter true` doesn't solve the problem with the repo at my work: not all old blobs are replaced with the new ones. I've made a test repository to demonstrate it; it's a huge one (115M), but I couldn't make it much smaller, because the bug fails to reproduce if the repo is not big enough: https://github.com/anatolyborodin/git-filter-branch-bug There are some description and instructions in `README.md`. I hope you will be able to reproduce it on your machine, if not - just add more files :) I've debugged the test case and found the place where `git diff-index` behaves differently regarding `aa/bb.dat`: read-cache.c +351 ie_match_stat(): ... if (!changed && is_racy_timestamp(istate, ce)) { if (assume_racy_is_modified) changed |= DATA_CHANGED; else changed |= ce_modified_check_fs(ce, st); } ... After `git-checkout-index` the blob hash for `aa/bb.dat` in the index is 88a0f09b9b2e4ccf2faec89ab37d416fba4ee79d (the huge binary), but the file on disk is a text file "This file was to big, and it has been removed." with the hash 16e0939430610600301680d5bf8a24a22ff8b6c4. In the case of a "good behaving" commit, the timestamps of the index and the cache entry are the same, is_racy_timestamp() returns 1, and ce_modified_check_fs() finds that the content of the file has changed. `git diff-index` lists the file `aa/bb.dat`. In the case of a "bad behaving" commit, the timestamps of the index and the cache entry are different (the index is 1 second newer), is_racy_timestamp() returns 0, and the file is assumed unchanged; `git diff-index` prints nothing. I don't know if it should be considered to be a bug in in the logic of `git checkout-index`, or `git diff-index` / `git update-index`. -- Mit freundlichen Grüßen, Anatoly Borodin ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: A different bug in git-filter-branch (v2.7.0) 2016-02-08 23:55 ` A different bug in git-filter-branch (v2.7.0) Anatoly Borodin @ 2016-02-22 21:13 ` Jeff King 0 siblings, 0 replies; 6+ messages in thread From: Jeff King @ 2016-02-22 21:13 UTC (permalink / raw) To: Anatoly Borodin; +Cc: git On Mon, Feb 08, 2016 at 11:55:37PM +0000, Anatoly Borodin wrote: > unfortunately, `--tree-filter true` doesn't solve the problem with the > repo at my work: not all old blobs are replaced with the new ones. I've > made a test repository to demonstrate it; it's a huge one (115M), but I > couldn't make it much smaller, because the bug fails to reproduce if the > repo is not big enough: > > https://github.com/anatolyborodin/git-filter-branch-bug > > There are some description and instructions in `README.md`. I hope you > will be able to reproduce it on your machine, if not - just add more > files :) > > I've debugged the test case and found the place where `git diff-index` > behaves differently regarding `aa/bb.dat`: > > read-cache.c +351 ie_match_stat(): > ... > if (!changed && is_racy_timestamp(istate, ce)) { > if (assume_racy_is_modified) > changed |= DATA_CHANGED; > else > changed |= ce_modified_check_fs(ce, st); > } > ... > > After `git-checkout-index` the blob hash for `aa/bb.dat` in the index is > 88a0f09b9b2e4ccf2faec89ab37d416fba4ee79d (the huge binary), but the file > on disk is a text file "This file was to big, and it has been removed." > with the hash 16e0939430610600301680d5bf8a24a22ff8b6c4. Right, that makes sense. The index doesn't know anything about the replace mechanism. So it assumes that what it wrote matches what is in the stat cache (i.e., some sha1 and the matching stat info). Later, when git wants to know "should I bother reading this file back in and computing its sha1", the stat cache says no. And then as you noticed, it sometimes "works" because if the file and index timestamps are the same, we err on the side of re-reading. So more files means more likelihood of crossing the one-second boundary. > I don't know if it should be considered to be a bug in in the logic of > `git checkout-index`, or `git diff-index` / `git update-index`. I'd say if any, it is the fault of checkout-index for writing out content that does not match the sha1 in the index, but writing out the stat information as if it is clean. For your case, you'd obviously prefer that the file be marked dirty and re-read later. But I'm not sure whether other users of replace refs would want the same behavior. They may want to silently pretend as if the data is unmodified. I'm not sure if anyone is doing that in practice, though; as you noted, the results are not deterministic, so it probably ends up just being a huge pain. So perhaps it would make sense to make the checkout operation aware of replaced blobs. As a workaround for your filter-branch, I suspect you could do `--tree-filter 'git ls-files | xargs touch'` or something, but that is going to be rather inefficient. By now you've probably realized that replaced blobs are not widely used, and there are a lot of corner cases around checking them out. So let's take a step back for a moment. What are you trying to accomplish with your filter-branch? If you just want to replace blob A with blob B, I think it might be easier to skip refs/replace entirely and just do so explicitly in an index-filter, like: --index-filter ' git ls-files -s | sed "s/$old_sha1/$new_sha1/" | git update-index --index-info ' -Peff ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2016-02-22 21:13 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-01-28 14:46 Bugs in git filter-branch (git replace related) Anatoly Borodin 2016-01-29 6:18 ` Jeff King 2016-01-29 18:24 ` Anatoly Borodin 2016-01-29 23:11 ` Jeff King 2016-02-08 23:55 ` A different bug in git-filter-branch (v2.7.0) Anatoly Borodin 2016-02-22 21:13 ` Jeff King
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.