git.vger.kernel.org archive mirror
From: Jeff King <peff@peff.net>
To: Simon Holmberg <simon.holmberg@avalanchestudios.se>
Cc: git@vger.kernel.org
Subject: Re: Partial Clone garbage collection?
Date: Wed, 30 Oct 2019 16:37:14 -0400
Message-ID: <20191030203714.GC29013@sigill.intra.peff.net>
In-Reply-To: <CA+M_GG1SfxGW=p_=418hdR1ypB3v-4GrooK6_75UUNJDb+kk2Q@mail.gmail.com>

On Wed, Oct 30, 2019 at 06:08:18PM +0100, Simon Holmberg wrote:

> I've been experimenting with the new Partial Clone feature, attempting
> to use it to filter out the otherwise full history of the large binary
> resources in our repos. It works really well on the initial clone. But
> once you start jumping around in history a lot, the repo will grow out
> of proportion again as promised pack files are fetched.
> 
> Are there any plans to add a --filter parameter to git gc as well,
> that would be able to prune past history of objects and convert them
> back into pack promises? Or am I wrong in assuming that this could
> ever act as a native replacement for LFS? Without this, a repo would
> only continue to grow ad infinitum, resulting in the same issues as
> before unless you actively choose to delete your entire clone and
> re-clone it from upstream once in a while.

I don't recall seeing anybody actively working on this, but I think it
would be a good idea. You'd probably want to be able to specify it in
your config somehow, so that subsequent repacks pruned as necessary
without you having to remember to do it each time.

You could naively just drop everything that matches the filter, and then
re-fetch it as needed. But for efficiency, you may want to keep some
other objects:

  - objects mentioned directly in the index, or the tree of HEAD; you'd
    end up re-fetching these next time you "git checkout"

  - perhaps objects fetched recently are more worth keeping (e.g., ones
    with an mtime less than a day or two old). I don't know if that helps,
    though. What you really care about is how recently they were
    accessed (assuming there's some locality there), not written. A
    frequently-accessed object may have been fetched immediately after
    you cloned, giving it an old mtime.

    Since we can get any of the objects again if we want and we're just
    optimizing, this is really just a cache-expiration problem. But it
    may be hard to implement any of the stock algorithms without having
    logs of which objects were accessed.
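
To make the heuristic above concrete, here is a minimal sketch in Python
of how the set of objects to drop might be chosen. It is an illustration
under stated assumptions, not anything Git implements: the function name
objects_to_drop and its parameters are hypothetical, and it assumes the
caller has already collected the filter-matching objects with their
mtimes, plus the object ids pinned by the index and the tree of HEAD
(which could be gathered from the output of "git ls-files -s" and
"git ls-tree -r HEAD").

```python
import time
from typing import Dict, Iterable, Optional, Set


def objects_to_drop(
    filtered: Dict[str, float],
    pinned: Iterable[str],
    max_age_seconds: float,
    now: Optional[float] = None,
) -> Set[str]:
    """Pick which filter-matching objects to drop (hypothetical helper).

    filtered: object id -> mtime (seconds since the epoch) for every
        local object that matches the partial-clone filter.
    pinned: object ids mentioned in the index or the tree of HEAD.
    max_age_seconds: objects written more recently than this are kept.
    """
    if now is None:
        now = time.time()
    keep = set(pinned)
    drop = set()
    for oid, mtime in filtered.items():
        if oid in keep:
            continue  # would be re-fetched on the next "git checkout"
        if now - mtime < max_age_seconds:
            continue  # recently written; mtime is only a proxy for access
        drop.add(oid)
    return drop


# Example with made-up object ids: "aaa" is pinned, "ccc" is an hour old.
now = 1_000_000.0
day = 86400.0
filtered = {"aaa": now - 10 * day, "bbb": now - 10 * day, "ccc": now - 3600}
result = objects_to_drop(filtered, pinned={"aaa"}, max_age_seconds=2 * day, now=now)
print(sorted(result))  # prints ['bbb']
```

Note that this uses mtime only, so it shares the weakness described
above: a frequently accessed object with an old mtime would still be
dropped unless it happens to be pinned.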

-Peff
