All of lore.kernel.org
 help / color / mirror / Atom feed
* Bugs in git filter-branch (git replace related)
@ 2016-01-28 14:46 Anatoly Borodin
  2016-01-29  6:18 ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Anatoly Borodin @ 2016-01-28 14:46 UTC (permalink / raw)
  To: git

Hi All!

There are two bugs in `git filter-branch`, present in the most recent
versions (d10e2cb9d0299a26f43d57dd5bdcf2b3f86a30b3), as well as in the old
ones (I couldn't find a version where it works properly).

The script:

#!/bin/sh

set -e

GIT_EXEC_PATH=/tmp/git
export GIT_EXEC_PATH
GIT=$GIT_EXEC_PATH/git

rm -rf a
mkdir a
cd a
$GIT init
echo aaa > a.txt && $GIT add a.txt && $GIT commit -a -m a
echo bbb > a.txt && $GIT add a.txt && $GIT commit -a -m b
echo ccc > a.txt && $GIT add a.txt && $GIT commit -a -m c
$GIT replace f761ec192d9f0dca3329044b96ebdb12839dbff6
72943a16fb2c8f38f9dde202b7a70ccc19c52f34
echo && echo One && $GIT filter-branch --prune-empty -- master
echo && echo Two && $GIT filter-branch --prune-empty -- --all

The output is:

...
One
Rewrite 98af0305b778bf56e25a0d4f85acdf82f435f9b3 (3/3) (0 seconds passed,
remaining 0 predicted)    
WARNING: Ref 'refs/heads/master' is unchanged

Two
Rewrite 98af0305b778bf56e25a0d4f85acdf82f435f9b3 (3/3) (0 seconds passed,
remaining 0 predicted)    
WARNING: Ref 'refs/heads/master' is unchanged
error: object 72943a16fb2c8f38f9dde202b7a70ccc19c52f34 is a blob, not a commit
error: object 72943a16fb2c8f38f9dde202b7a70ccc19c52f34 is a blob, not a commit
fatal: ambiguous argument
'refs/replace/f761ec192d9f0dca3329044b96ebdb12839dbff6^0': unknown revision
or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
WARNING: Ref 'refs/replace/f761ec192d9f0dca3329044b96ebdb12839dbff6' is
unchanged


The `git replace` makes the second commit empty (use the file content from
the first commit). It should disappear after `git filter-branch`, but it
doesn't happen.

Bug 1: the empty commit stays.
Bug 2: the replace refs are not ignored (they can epresent blobs, trees etc,
but even if they represent commits - should they be rewritten?).

Any ideas?

PS I've found http://article.gmane.org/gmane.comp.version-control.git/220931
, seems like the bug 1. But it's old!

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Bugs in git filter-branch (git replace related)
  2016-01-28 14:46 Bugs in git filter-branch (git replace related) Anatoly Borodin
@ 2016-01-29  6:18 ` Jeff King
  2016-01-29 18:24   ` Anatoly Borodin
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2016-01-29  6:18 UTC (permalink / raw)
  To: Anatoly Borodin; +Cc: git

On Thu, Jan 28, 2016 at 02:46:40PM +0000, Anatoly Borodin wrote:

> The `git replace` makes the second commit empty (use the file content from
> the first commit). It should disappear after `git filter-branch`, but it
> doesn't happen.
> 
> Bug 1: the empty commit stays.

I'm not sure if this is a bug or not. The "empty commit" check works by
checking the tree sha1s, without doing a full diff respecting replace
refs.

You're expecting git to notice a tree change, even though it never even
examined the tree in the first place (because you didn't give it a tree
or index filter).

Try:

  git filter-branch --prune-empty --tree-filter true master

which will force git to go through the motions of checking out the
replaced content and re-examining it.

This will run much more slowly, as it actually touches the filesystem.
In the general case, it would be interesting if filter-branch (or a
similar tool) could "cement" replacement objects into place as
efficiently as possible. But I'm not sure whether that should be the
default mode for filter-branch.

> Bug 2: the replace refs are not ignored (they can epresent blobs, trees etc,
> but even if they represent commits - should they be rewritten?).

You told it "--all", which is passed to rev-list, where it means "all
refs". I agree that running filter-branch on refs/replace is probably
not going to yield useful results, but I'm not sure if it is
filter-branch's responsibility to second-guess the rev-list options.

Probably the documentation for filter-branch should recommend
"--branches --tags" instead of "--all", though.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Bugs in git filter-branch (git replace related)
  2016-01-29  6:18 ` Jeff King
@ 2016-01-29 18:24   ` Anatoly Borodin
  2016-01-29 23:11     ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Anatoly Borodin @ 2016-01-29 18:24 UTC (permalink / raw)
  To: git

Hi Jeff,


I've created a gist with the script
https://gist.github.com/anatolyborodin/6505a364a68584f13846

I've added some changes and a second test (will be discussed in the
comments).


Jeff King <peff@peff.net> wrote:
> I'm not sure if this is a bug or not. The "empty commit" check works by
> checking the tree sha1s, without doing a full diff respecting replace
> refs.
> 
> You're expecting git to notice a tree change, even though it never even
> examined the tree in the first place (because you didn't give it a tree
> or index filter).

When git-filter-branch(1) says "If you have any grafts or replacement
refs defined, running this command will make them permanent.", and it
doesn't work like that because of some optimization, it *is* a bug.

> Try:
> 
>   git filter-branch --prune-empty --tree-filter true master
> 
> which will force git to go through the motions of checking out the
> replaced content and re-examining it.

Thank you, I've added this command to the script, and it works! I think
it should be added to git-filter-branch(1), if there is no other way to
rewrite the optimization.

>> Bug 2: the replace refs are not ignored (they can epresent blobs, trees etc,
>> but even if they represent commits - should they be rewritten?).
> 
> You told it "--all", which is passed to rev-list, where it means "all
> refs". I agree that running filter-branch on refs/replace is probably
> not going to yield useful results, but I'm not sure if it is
> filter-branch's responsibility to second-guess the rev-list options.

Look how `git log --all` works (see the second test in the script): it
ignores (without any messages) the blobs, and writes only the commits.
For example, the same commit log message is printed twice in the second
testcase.

Maybe it makes sence, I don't know. I would suggest that all
refs/replace/* heads should be ignored by `git log`. `git rev-list
--no-replace` maybe?

> Probably the documentation for filter-branch should recommend
> "--branches --tags" instead of "--all", though.

Or redefine `--all` as "all refs excepting refs/replace/*". Who would
really want to run `--all` the way it works now?

The blobs replaces should be ignored, as in `git log --all`. Is there
any reason to rewrite refs/rebase/hash if it's a replace commit?


-- 
Mit freundlichen Grüßen,
Anatoly Borodin

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Bugs in git filter-branch (git replace related)
  2016-01-29 18:24   ` Anatoly Borodin
@ 2016-01-29 23:11     ` Jeff King
  2016-02-08 23:55       ` A different bug in git-filter-branch (v2.7.0) Anatoly Borodin
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2016-01-29 23:11 UTC (permalink / raw)
  To: Anatoly Borodin; +Cc: git

On Fri, Jan 29, 2016 at 06:24:07PM +0000, Anatoly Borodin wrote:

> > You're expecting git to notice a tree change, even though it never even
> > examined the tree in the first place (because you didn't give it a tree
> > or index filter).
> 
> When git-filter-branch(1) says "If you have any grafts or replacement
> refs defined, running this command will make them permanent.", and it
> doesn't work like that because of some optimization, it *is* a bug.

I think the bug is in the documentation. That has always worked for
commit grafts and replacements, but never for blob and tree replacements
(and in fact, filter-branch quite predates refs/replace).

I don't think this is just an optimization; filter-branch does not touch
or rewrite bits that you did not ask it to touch, and that is a
user-visible behavior.

> Thank you, I've added this command to the script, and it works! I think
> it should be added to git-filter-branch(1), if there is no other way to
> rewrite the optimization.

Yeah, I agree.  Documentation patches are welcome.

I think with an "--index-filter", you could cement a replacement tree
into place, but you need a "--tree-filter" to do a blob replacement
(because otherwise we insert only the name of the blob sha1 into the
index).

> > You told it "--all", which is passed to rev-list, where it means "all
> > refs". I agree that running filter-branch on refs/replace is probably
> > not going to yield useful results, but I'm not sure if it is
> > filter-branch's responsibility to second-guess the rev-list options.
> 
> Look how `git log --all` works (see the second test in the script): it
> ignores (without any messages) the blobs, and writes only the commits.
> For example, the same commit log message is printed twice in the second
> testcase.

I'm not sure what you mean here. "git log --all" is not looking at blobs
at all (you did not ask it do any diffs, nor simplify history based on
TREESAME commits). The "--all" here means "start traversing from commits
found at all ref tips". It _does_ look at refs/replace, but there are no
commits to traverse there.

Likewise, "git rev-list --all" would not show any. But "git rev-list
--objects --all" would (both the refs/replace tips, as well as objects
reachable from the other commits).

> Maybe it makes sence, I don't know. I would suggest that all
> refs/replace/* heads should be ignored by `git log`. `git rev-list
> --no-replace` maybe?

It is too late for that without an extra option. "rev-list --all"
already has a well-established meaning, and changing it would break
other uses (e.g., commit reachability done during object transfer and
repacking, not to mention any third-party scripts). But...

> > Probably the documentation for filter-branch should recommend
> > "--branches --tags" instead of "--all", though.
> 
> Or redefine `--all` as "all refs excepting refs/replace/*". Who would
> really want to run `--all` the way it works now?

If you mean "rev-list --all", then: lots of things that aren't
filter-branch. :)

If we were to change anything, it would be to intercept "--all" in
filter-branch and rewrite it to "--exclude=refs/replace/* --all". I'm
slightly negative on that, just because we advertise filter-branch as
taking arbitrary rev-list args, and this is violating that. I think I'd
be more in favor if it were done robustly and clearly, like:

  - teach rev-list --all-does-not-include-refs-replace (but with a less
    horrible name)

  - advertise that filter-branch always passes that option to rev-list

Then at least it is robust (we are not mucking with rev-list options,
which we cannot do completely accurately without having the full
knowledge of all of its options), and simple to explain to the user (if
they want to work around it, they simple add "--glob=refs/replace/*" to
their invocation).

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* A different bug in git-filter-branch (v2.7.0)
  2016-01-29 23:11     ` Jeff King
@ 2016-02-08 23:55       ` Anatoly Borodin
  2016-02-22 21:13         ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Anatoly Borodin @ 2016-02-08 23:55 UTC (permalink / raw)
  To: git

Hi Jeff,


unfortunately, `--tree-filter true` doesn't solve the problem with the
repo at my work: not all old blobs are replaced with the new ones. I've
made a test repository to demonstrate it; it's a huge one (115M), but I
couldn't make it much smaller, because the bug fails to reproduce if the
repo is not big enough:

https://github.com/anatolyborodin/git-filter-branch-bug

There are some description and instructions in `README.md`. I hope you
will be able to reproduce it on your machine, if not - just add more
files :)

I've debugged the test case and found the place where `git diff-index`
behaves differently regarding `aa/bb.dat`:

read-cache.c +351	ie_match_stat():
...
	if (!changed && is_racy_timestamp(istate, ce)) {
		if (assume_racy_is_modified)
			changed |= DATA_CHANGED;
		else
			changed |= ce_modified_check_fs(ce, st);
	}
...

After `git-checkout-index` the blob hash for `aa/bb.dat` in the index is
88a0f09b9b2e4ccf2faec89ab37d416fba4ee79d (the huge binary), but the file
on disk is a text file "This file was to big, and it has been removed."
with the hash 16e0939430610600301680d5bf8a24a22ff8b6c4.

In the case of a "good behaving" commit, the timestamps of the index and
the cache entry are the same, is_racy_timestamp() returns 1, and
ce_modified_check_fs() finds that the content of the file has changed.
`git diff-index` lists the file `aa/bb.dat`.

In the case of a "bad behaving" commit, the timestamps of the index and
the cache entry are different (the index is 1 second newer),
is_racy_timestamp() returns 0, and the file is assumed unchanged; `git
diff-index` prints nothing.

I don't know if it should be considered to be a bug in in the logic of
`git checkout-index`, or `git diff-index` / `git update-index`.


-- 
Mit freundlichen Grüßen,
Anatoly Borodin

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: A different bug in git-filter-branch (v2.7.0)
  2016-02-08 23:55       ` A different bug in git-filter-branch (v2.7.0) Anatoly Borodin
@ 2016-02-22 21:13         ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2016-02-22 21:13 UTC (permalink / raw)
  To: Anatoly Borodin; +Cc: git

On Mon, Feb 08, 2016 at 11:55:37PM +0000, Anatoly Borodin wrote:

> unfortunately, `--tree-filter true` doesn't solve the problem with the
> repo at my work: not all old blobs are replaced with the new ones. I've
> made a test repository to demonstrate it; it's a huge one (115M), but I
> couldn't make it much smaller, because the bug fails to reproduce if the
> repo is not big enough:
> 
> https://github.com/anatolyborodin/git-filter-branch-bug
> 
> There are some description and instructions in `README.md`. I hope you
> will be able to reproduce it on your machine, if not - just add more
> files :)
> 
> I've debugged the test case and found the place where `git diff-index`
> behaves differently regarding `aa/bb.dat`:
> 
> read-cache.c +351	ie_match_stat():
> ...
> 	if (!changed && is_racy_timestamp(istate, ce)) {
> 		if (assume_racy_is_modified)
> 			changed |= DATA_CHANGED;
> 		else
> 			changed |= ce_modified_check_fs(ce, st);
> 	}
> ...
> 
> After `git-checkout-index` the blob hash for `aa/bb.dat` in the index is
> 88a0f09b9b2e4ccf2faec89ab37d416fba4ee79d (the huge binary), but the file
> on disk is a text file "This file was to big, and it has been removed."
> with the hash 16e0939430610600301680d5bf8a24a22ff8b6c4.

Right, that makes sense. The index doesn't know anything about the
replace mechanism. So it assumes that what it wrote matches what is in
the stat cache (i.e., some sha1 and the matching stat info). Later, when
git wants to know "should I bother reading this file back in and
computing its sha1", the stat cache says no.

And then as you noticed, it sometimes "works" because if the file and
index timestamps are the same, we err on the side of re-reading. So more
files means more likelihood of crossing the one-second boundary.

> I don't know if it should be considered to be a bug in in the logic of
> `git checkout-index`, or `git diff-index` / `git update-index`.

I'd say if any, it is the fault of checkout-index for writing out
content that does not match the sha1 in the index, but writing out the
stat information as if it is clean. For your case, you'd obviously
prefer that the file be marked dirty and re-read later.

But I'm not sure whether other users of replace refs would want the same
behavior. They may want to silently pretend as if the data is
unmodified. I'm not sure if anyone is doing that in practice, though; as
you noted, the results are not deterministic, so it probably ends up
just being a huge pain. So perhaps it would make sense to make the
checkout operation aware of replaced blobs.

As a workaround for your filter-branch, I suspect you could do
`--tree-filter 'git ls-files | xargs touch'` or something, but that is
going to be rather inefficient.

By now you've probably realized that replaced blobs are not widely used,
and there are a lot of corner cases around checking them out. So let's
take a step back for a moment. What are you trying to accomplish with
your filter-branch? If you just want to replace blob A with blob B, I
think it might be easier to skip refs/replace entirely and just do so
explicitly in an index-filter, like:

  --index-filter '
    git ls-files -s |
    sed "s/$old_sha1/$new_sha1/" |
    git update-index --index-info
  '

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2016-02-22 21:13 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-28 14:46 Bugs in git filter-branch (git replace related) Anatoly Borodin
2016-01-29  6:18 ` Jeff King
2016-01-29 18:24   ` Anatoly Borodin
2016-01-29 23:11     ` Jeff King
2016-02-08 23:55       ` A different bug in git-filter-branch (v2.7.0) Anatoly Borodin
2016-02-22 21:13         ` Jeff King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.