* Blobs not referenced by file (anymore) are not removed by GC
@ 2014-12-08 16:22 Martin Scherer
       [not found] ` <CAFY1edaEq1zYV0vgSfiPAXU6bqVBzaA-apVnSn8DBMbzcAa2tQ@mail.gmail.com>
  2014-12-09 14:14 ` Jeff King
  0 siblings, 2 replies; 9+ messages in thread
From: Martin Scherer @ 2014-12-08 16:22 UTC (permalink / raw)
  To: git

Hi,

after running BFG on a repo with certain directory globs, all of the
matching file names are gone from history, but the underlying blobs are
not removed by garbage collection. So only the file names are no longer
associated with the blobs; the blobs themselves are still there. I
wonder whether I have discovered a bug (at least in BFG). I would expect
git to notice that these blobs are not used in any way (so they must
still be associated with something, right?)

# invoke bfg --delete-folders something multiple times with different
# patterns.

# try to cleanup

git gc --aggressive --prune=now # big blobs still in history
git fsck # no results
git fsck --full  --unreachable --dangling # no results

to verify if the blobs are still there, see the output of

git gc && git verify-pack -v .git/objects/pack/pack-*.idx |
  egrep "^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$" |
  sort -k 3 -n -r > bigobjects.txt

head bigobjects.txt # outputs:
# 9451427d7335395779b91864418630d2f0af780a blob   7895212 1869047 7657491


Also, when bfg is told to remove the biggest blob (bfg -B 1) with
--no-blob-protection, it does not succeed in removing it.

--- output of bfg -B 1

Found 1 blob ids for large blobs - biggest=7895212 smallest=7895212
....

BFG aborting: No refs to update - no dirty commits found??
---

The repo can be found here.

https://github.com/marscher/stallone_stale_objects

I will start over to clean up the history, but I guess this might
be interesting for git developers.


Best,
Martin


* Re: Blobs not referenced by file (anymore) are not removed by GC
       [not found] ` <CAFY1edaEq1zYV0vgSfiPAXU6bqVBzaA-apVnSn8DBMbzcAa2tQ@mail.gmail.com>
@ 2014-12-08 16:47   ` Roberto Tyley
  0 siblings, 0 replies; 9+ messages in thread
From: Roberto Tyley @ 2014-12-08 16:47 UTC (permalink / raw)
  To: git; +Cc: Martin Scherer

Hi Martin, I'm the developer of the BFG - I'd guess that there
probably isn't a bug for Git developers here, so you might want to
open one or more issues at
https://github.com/rtyley/bfg-repo-cleaner/issues, where I'd be happy
to take a look.

best regards,
Roberto



* Re: Blobs not referenced by file (anymore) are not removed by GC
  2014-12-08 16:22 Blobs not referenced by file (anymore) are not removed by GC Martin Scherer
       [not found] ` <CAFY1edaEq1zYV0vgSfiPAXU6bqVBzaA-apVnSn8DBMbzcAa2tQ@mail.gmail.com>
@ 2014-12-09 14:14 ` Jeff King
  2014-12-09 16:01   ` Roberto Tyley
  1 sibling, 1 reply; 9+ messages in thread
From: Jeff King @ 2014-12-09 14:14 UTC (permalink / raw)
  To: Martin Scherer; +Cc: git

On Mon, Dec 08, 2014 at 05:22:23PM +0100, Martin Scherer wrote:

> # invoke bfg --delete-folders something multiple times with different
> pattern.
> 
> # try to cleanup
> 
> git gc --aggressive --prune=now # big blobs still in history
> git fsck # no results
> git fsck --full  --unreachable --dangling # no results

Might you still have reflogs pointing to the objects? Try:

  git reflog expire --expire-unreachable=now --all

I also don't know if BFG keeps backup refs around (filter-branch, for
example, writes a copy of the original refs into refs/original; you
would want to delete that if you're trying to slim down the repo).
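As a rough sketch of how one might check what is still holding on to a
given blob (assuming a POSIX shell; `who_holds` is a hypothetical helper
name, not a git command), the object list reachable from the ref tips can
be compared with the list that also includes reflog entries:

```shell
# Hypothetical helper: report what still keeps object $1 alive.
who_holds() {
  obj=$1
  # Reachable from the current tip of some ref (refs/original included)?
  if git rev-list --objects --all | grep "^$obj" >/dev/null; then
    echo "refs"
  # Not reachable from any tip, but still reachable through a reflog entry?
  elif git rev-list --objects --all --reflog | grep "^$obj" >/dev/null; then
    echo "reflog"
  else
    echo "unreachable"
  fi
}
```

Called as `who_holds 9451427d7335395779b91864418630d2f0af780a`, an answer
of "reflog" would suggest that expiring the reflogs as above should let gc
prune the object, while "refs" would point at a leftover ref.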

In general, you can see the on-disk size of the objects required for a
particular ref with something like:

  size() {
    git rev-list --objects "$@" |
    cut -d' ' -f1 |
    git cat-file --batch-check='%(objectsize:disk)' |
    perl -lne '$t += $_; END { print $t }'
  }

  # size of master branch
  size master

  # size of each ref on top of what is in the master branch
  git for-each-ref --format='%(refname)' |
  while read ref; do
    echo "$(size master..$ref) $ref"
  done | sort -rn


Note that these sizes are somewhat approximate. We may store object X
needed by one ref as a delta against Y used by another ref. The
accounting shows X as tiny compared to Y. And then a repack may find the
delta in the opposite direction. But if you're talking about rewriting
history to drop a bunch of gigantic objects, the output of the final
loop is a good way to see which refs are still referring to the old
history.
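The same batch-check format could also stand in for the verify-pack
pipeline from the original report. The following is only a sketch under
that assumption; `largest_blobs` is a made-up name, and the sizes carry
the same delta caveat as above:

```shell
# Hypothetical helper: list the N largest blobs (by on-disk size) that are
# reachable from any ref or any reflog entry. Output columns:
#   <disk size> <type> <object name>
largest_blobs() {
  n=${1:-10}
  git rev-list --objects --all --reflog |
  cut -d' ' -f1 |
  git cat-file --batch-check='%(objectsize:disk) %(objecttype) %(objectname)' |
  awk '$2 == "blob"' |
  sort -rn |
  awk -v n="$n" 'NR <= n'
}
```

Because `--reflog` is part of the traversal, a blob that shows up here but
not in the per-ref loop above is probably being kept alive by a reflog
entry only.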

-Peff


* Re: Blobs not referenced by file (anymore) are not removed by GC
  2014-12-09 14:14 ` Jeff King
@ 2014-12-09 16:01   ` Roberto Tyley
  2014-12-09 16:11     ` Jeff King
  0 siblings, 1 reply; 9+ messages in thread
From: Roberto Tyley @ 2014-12-09 16:01 UTC (permalink / raw)
  To: Jeff King; +Cc: Martin Scherer, git

On 9 December 2014 at 14:14, Jeff King <peff@peff.net> wrote:
> On Mon, Dec 08, 2014 at 05:22:23PM +0100, Martin Scherer wrote:
>
>> # invoke bfg --delete-folders something multiple times with different
>> pattern.
>>
>> # try to cleanup
>>
>> git gc --aggressive --prune=now # big blobs still in history
>> git fsck # no results
>> git fsck --full  --unreachable --dangling # no results
>
> Might you still have reflogs pointing to the objects? Try:
>
>   git reflog expire --expire-unreachable=now --all

Yeah, we figured that's what it was!

https://github.com/rtyley/bfg-repo-cleaner/issues/62#issuecomment-66152559

> I also don't know if BFG keeps backup refs around (filter-branch, for
> example, writes a copy of the original refs into refs/original; you
> would want to delete that if you're trying to slim down the repo).

The BFG reports the ref changes to the command line (and outputs a
full list of changed object-ids in
repo-name.git.bfg-report/[datetime]/object-id-map.old-new.txt) but
doesn't keep refs (like refs/original) around because that would get
in the way of the BFG's explicit intended use-case of removing
unwanted data.

Thanks for the object-size checking scripts, very useful.

Roberto


* Re: Blobs not referenced by file (anymore) are not removed by GC
  2014-12-09 16:01   ` Roberto Tyley
@ 2014-12-09 16:11     ` Jeff King
  2014-12-09 22:15       ` Roberto Tyley
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff King @ 2014-12-09 16:11 UTC (permalink / raw)
  To: Roberto Tyley; +Cc: Martin Scherer, git

On Tue, Dec 09, 2014 at 04:01:50PM +0000, Roberto Tyley wrote:

> > I also don't know if BFG keeps backup refs around (filter-branch, for
> > example, writes a copy of the original refs into refs/original; you
> > would want to delete that if you're trying to slim down the repo).
> 
> The BFG reports the ref changes to the command line (and outputs a
> full list of changed object-ids in
> repo-name.git.bfg-report/[datetime]/object-id-map.old-new.txt) but
> doesn't keep refs (like refs/original) around because that would get
> in the way of the BFG's explicit intended use-case of removing
> unwanted data.

Thanks for explaining; that information may come in handy.

I actually think filter-branch's "refs/original" is a bit outdated at
this point. The information is there in the reflogs already, and
dealing with refs/original often causes confusion in my experience. It
could probably use a "git filter-branch --restore" or something to
switch each $ref to $ref@{1} (after making sure that the reflog entry
was from filter-branch, of course).
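To make the idea concrete, here is a rough sketch of what such a restore
step might look like as a shell function. Nothing here is an existing git
feature: the function name is invented, and it assumes that filter-branch
records its ref updates with a reflog message starting with
"filter-branch:".

```shell
# Hypothetical sketch of the "--restore" idea; not an existing git command.
restore_from_filter_branch() {
  git for-each-ref --format='%(refname)' |
  while read -r ref; do
    # Message of the newest reflog entry for this ref (empty if no reflog).
    msg=$(git log -g -1 --format=%gs "$ref" 2>/dev/null)
    case $msg in
      filter-branch:*)
        # Assumption: the previous reflog entry is the pre-rewrite value.
        git update-ref -m "restore: undo filter-branch" "$ref" "$ref@{1}" &&
        echo "restored $ref"
        ;;
    esac
  done
}
```

Note that update-ref only moves the ref; a checked-out branch would still
need its index and working tree brought back in line afterwards.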

Not that I expect you to want to work on filter-branch. :) But maybe
food for thought for a BFG feature.

-Peff


* Re: Blobs not referenced by file (anymore) are not removed by GC
  2014-12-09 16:11     ` Jeff King
@ 2014-12-09 22:15       ` Roberto Tyley
  2014-12-10  7:11         ` Jeff King
  0 siblings, 1 reply; 9+ messages in thread
From: Roberto Tyley @ 2014-12-09 22:15 UTC (permalink / raw)
  To: Jeff King; +Cc: Martin Scherer, git

On Tuesday, 9 December 2014, Jeff King <peff@peff.net> wrote:
> I actually think filter-branch's "refs/original" is a bit outdated at
> this point. The information is there in the reflogs already, and
> dealing with refs/original often causes confusion in my experience. It
> could probably use a "git filter-branch --restore" or something to
> switch each $ref to $ref@{1} (after making sure that the reflog entry
> was from filter-branch, of course).

Yeah, I'd agree that refs/original can cause confusion.


> Not that I expect you to want to work on filter-branch. :) But maybe
> food for thought for a BFG feature.

I haven't heard much demand for a recover/restore feature on the BFG
(I think by the time people get to the BFG, they're pretty sure they
want to go ahead with the procedure!) but I'll bear it in mind. Mind
you, to make the post-rewrite clean-up easier, I'd be happy to
contribute a patch that gives 'gc' a flag to do the equivalent of:

git reflog expire --expire=now --all && git gc --prune=now --aggressive

Maybe:

git gc --purge

??


* Re: Blobs not referenced by file (anymore) are not removed by GC
  2014-12-09 22:15       ` Roberto Tyley
@ 2014-12-10  7:11         ` Jeff King
  2014-12-10 16:07           ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff King @ 2014-12-10  7:11 UTC (permalink / raw)
  To: Roberto Tyley; +Cc: Martin Scherer, git

On Tue, Dec 09, 2014 at 10:15:31PM +0000, Roberto Tyley wrote:

> > Not that I expect you to want to work on filter-branch. :) But maybe
> > food for thought for a BFG feature.
> 
> I haven't heard much demand for a recover/restore feature on the BFG
> (I think by the time people get to the BFG, they're pretty sure they
> want to go ahead with the procedure!) but I'll bear it in mind. Mind
> you, to make the post-rewrite clean-up easier, I'd be happy to
> contribute a patch that gives 'gc' a flag to do the equivalent of:
> 
> git reflog expire --expire=now --all && git gc --prune=now --aggressive
> 
> Maybe:
> 
> git gc --purge

Yeah, that is common enough that it might be worthwhile (you probably
want --expire-unreachable in the reflog invocation, though).

-Peff


* Re: Blobs not referenced by file (anymore) are not removed by GC
  2014-12-10  7:11         ` Jeff King
@ 2014-12-10 16:07           ` Junio C Hamano
  2014-12-10 23:41             ` Roberto Tyley
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2014-12-10 16:07 UTC (permalink / raw)
  To: Jeff King; +Cc: Roberto Tyley, Martin Scherer, git

Jeff King <peff@peff.net> writes:

>> ... I'd be happy to
>> contribute a patch that gives 'gc' a flag to do the equivalent of:
>> 
>> git reflog expire --expire=now --all && git gc --prune=now --aggressive
>> 
>> Maybe:
>> 
>> git gc --purge
>
> Yeah, that is common enough that it might be worthwhile (you probably
> want --expire-unreachable in the reflog invocation, though).

Also you would not want an unconditional --aggressive.


* Re: Blobs not referenced by file (anymore) are not removed by GC
  2014-12-10 16:07           ` Junio C Hamano
@ 2014-12-10 23:41             ` Roberto Tyley
  0 siblings, 0 replies; 9+ messages in thread
From: Roberto Tyley @ 2014-12-10 23:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Martin Scherer, git

On 10 December 2014 at 16:07, Junio C Hamano <gitster@pobox.com> wrote:
> Jeff King <peff@peff.net> writes:
>>> git reflog expire --expire=now --all && git gc --prune=now --aggressive
>>>
>>> Maybe:
>>>
>>> git gc --purge
>>
>> Yeah, that is common enough that it might be worthwhile (you probably
>> want --expire-unreachable in the reflog invocation, though).
>
> Also you would not want an unconditional --aggressive.

After a big rewrite deleting files the re-optimisation of --aggressive
can make a big difference to packsize - for instance 1.2GB to 768MB in
a test I just ran - but of course it is *much* slower, so I suspect
you're right about not including it.

I wasn't aware of the '--expire-unreachable=all' switch, though it
seems like a 'milder' version of the '--expire=now' switch, in that
it would keep reflog entries that are still reachable from the current
tip of the ref, which is fair enough and compatible with the 'purge'
goal.
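Pulling the suggestions in this thread together, the proposed behaviour
could be approximated today with a small wrapper. This is only a sketch:
`git_purge` is a hypothetical name standing in for the proposed
'gc --purge', and any extra arguments are simply passed through to gc so
that --aggressive stays opt-in.

```shell
# Hypothetical stand-in for the "git gc --purge" idea discussed above.
git_purge() {
  # Drop only reflog entries that are unreachable from the current tips,
  # then repack and prune everything that is no longer referenced.
  git reflog expire --expire-unreachable=now --all &&
  git gc --prune=now "$@"
}
```

Usage would be `git_purge` for the fast path, or `git_purge --aggressive`
when the slower re-optimisation is worth the wait.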


