All of lore.kernel.org
 help / color / mirror / Atom feed
From: Son Luong Ngoc <sluongng@gmail.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git <git@vger.kernel.org>,
	sandals@crustytoothpaste.net, steadmon@google.com,
	jrnieder@gmail.com, Jeff King <peff@peff.net>,
	congdanhqx@gmail.com, phillip.wood123@gmail.com,
	Emily Shaffer <emilyshaffer@google.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Derrick Stolee <derrickstolee@github.com>,
	Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH 8/9] maintenance: auto-size incremental-repack batch
Date: Thu, 6 Aug 2020 19:02:30 +0200	[thread overview]
Message-ID: <6FA8F54A-C92D-497B-895F-AC6E8287AACD@gmail.com> (raw)
In-Reply-To: <c3487fb8e3e72949080f51f341655f37b5f2f03f.1596731425.git.gitgitgadget@gmail.com>

Hi Derrick,

> On Aug 6, 2020, at 18:30, Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote:
> 
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> When repacking during the 'incremental-repack' task, we use the
> --batch-size option in 'git multi-pack-index repack'. The initial setting
> used --batch-size=0 to repack everything into a single pack-file. This is
> not sustainable for a large repository. The amount of work required is
> also likely to use too many system resources for a background job.
> 
> Update the 'incremental-repack' task by dynamically computing a
> --batch-size option based on the current pack-file structure.
> 
> The dynamic default size is computed with this idea in mind for a client
> repository that was cloned from a very large remote: there is likely one
> "big" pack-file that was created at clone time. Thus, do not try
> repacking it as it is likely packed efficiently by the server.
> 
> Instead, we select the second-largest pack-file, and create a batch size
> that is one larger than that pack-file. If there are three or more
> pack-files, then this guarantees that at least two will be combined into
> a new pack-file.

I have been using this strategy with git-care.sh [1] with large success.
However it worth to note that there are still edge case where I observed that
pack count keep increasing because using '--batch-size=<second-biggest-pack>+1'
did not resulted in any repacking.
In one case, I have observed a local copy went up to 160+ packs without being able
to repack.

I have been considering whether a strategy such as falling back to the '(3rd biggest
pack size) + 1' and 4th and 5th and so on... when midx repack call resulted in no-op,
as that was how I fixed my repo when the edge case happen.

Such strategy would require a way to detect midx repack to signal when no-op happen,
so something like 'git multi-pack-index repack --batch-size=123456 --exit-code' would
be much desirable.

> 
> Of course, this means that the second-largest pack-file size is likely
> to grow over time and may eventually surpass the initially-cloned
> pack-file. Recall that the pack-file batch is selected in a greedy
> manner: the packs are considered from oldest to newest and are selected
> if they have size smaller than the batch size until the total selected
> size is larger than the batch size. Thus, that oldest "clone" pack will
> be first to repack after the new data creates a pack larger than that.
> 
> We also want to place some limits on how large these pack-files become,
> in order to bound the amount of time spent repacking. A maximum
> batch-size of two gigabytes means that large repositories will never be
> packed into a single pack-file using this job, but also that repack is
> rather expensive. This is a trade-off that is valuable to have if the
> maintenance is being run automatically or in the background. Users who
> truly want to optimize for space and performance (and are willing to pay
> the upfront cost of a full repack) can use the 'gc' task to do so.
> 
> Create a test for this two gigabyte limit by creating an EXPENSIVE test
> that generates two pack-files of roughly 2.5 gigabytes in size, then
> performs an incremental repack. Check that the --batch-size argument in
> the subcommand uses the hard-coded maximum.
> 
> Helped-by: Chris Torek <chris.torek@gmail.com>
> Reported-by: Son Luong Ngoc <sluongng@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Generally, I have found working with '--batch-size' to be a bit unpredictable.
I wonder if we could tweak the behavior somewhat so that its more consistent
to use and test?

Thanks a lot for making this happen.
Hope this patch would make it in stable soon

Cheers,
Son Luong.

[1]: https://github.com/sluongng/git-care

  reply	other threads:[~2020-08-06 17:09 UTC|newest]

Thread overview: 66+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-06 16:30 [PATCH 0/9] Maintenance II: prefetch, loose-objects, incremental-repack tasks Derrick Stolee via GitGitGadget
2020-08-06 16:30 ` [PATCH 1/9] fetch: optionally allow disabling FETCH_HEAD update Junio C Hamano via GitGitGadget
2020-08-12 23:10   ` Emily Shaffer
2020-08-13  0:03     ` Junio C Hamano
2020-08-13  1:45       ` Jonathan Nieder
2020-08-13  4:37       ` [PATCH v3] " Junio C Hamano
2020-08-14  1:13         ` Derrick Stolee
2020-08-14  1:32           ` Junio C Hamano
2020-08-06 16:30 ` [PATCH 2/9] maintenance: add prefetch task Derrick Stolee via GitGitGadget
2020-08-12 23:10   ` Emily Shaffer
2020-08-14  1:28     ` Derrick Stolee
2020-08-06 16:30 ` [PATCH 3/9] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-08-12 23:10   ` Emily Shaffer
2020-08-14  1:46     ` Derrick Stolee
2020-08-06 16:30 ` [PATCH 4/9] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-08-06 16:30 ` [PATCH 5/9] midx: enable core.multiPackIndex by default Derrick Stolee via GitGitGadget
2020-08-06 16:30 ` [PATCH 6/9] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-08-06 16:30 ` [PATCH 7/9] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
2020-08-06 16:30 ` [PATCH 8/9] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
2020-08-06 17:02   ` Son Luong Ngoc [this message]
2020-08-06 18:13     ` Derrick Stolee
2020-08-06 16:30 ` [PATCH 9/9] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget
2020-08-18 14:25 ` [PATCH v2 0/9] Maintenance II: prefetch, loose-objects, incremental-repack tasks Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 1/9] fetch: optionally allow disabling FETCH_HEAD update Junio C Hamano via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 2/9] maintenance: add prefetch task Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 3/9] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 4/9] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 5/9] midx: enable core.multiPackIndex by default Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 6/9] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 7/9] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 8/9] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 9/9] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget
2020-08-25 18:36   ` [PATCH v3 0/8] Maintenance II: prefetch, loose-objects, incremental-repack tasks Derrick Stolee via GitGitGadget
2020-08-25 18:36     ` [PATCH v3 1/8] maintenance: add prefetch task Derrick Stolee via GitGitGadget
2020-09-22 23:05       ` Jonathan Tan
2020-08-25 18:36     ` [PATCH v3 2/8] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-09-22 23:09       ` Jonathan Tan
2020-09-24 13:45         ` Derrick Stolee
2020-08-25 18:36     ` [PATCH v3 3/8] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-09-22 23:15       ` Jonathan Tan
2020-09-24 13:51         ` Derrick Stolee
2020-08-25 18:36     ` [PATCH v3 4/8] midx: enable core.multiPackIndex by default Derrick Stolee via GitGitGadget
2020-09-22 23:16       ` Jonathan Tan
2020-09-24 13:53         ` Derrick Stolee
2020-08-25 18:36     ` [PATCH v3 5/8] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-08-25 18:36     ` [PATCH v3 6/8] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
2020-09-22 23:26       ` Jonathan Tan
2020-09-24 14:05         ` Derrick Stolee
2020-09-24 22:01           ` Jonathan Tan
2020-08-25 18:36     ` [PATCH v3 7/8] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
2020-08-25 18:36     ` [PATCH v3 8/8] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget
2020-09-22 23:52       ` Jonathan Tan
2020-08-25 20:59     ` [PATCH v3 0/8] Maintenance II: prefetch, loose-objects, incremental-repack tasks Junio C Hamano
2020-08-26 15:15     ` Son Luong Ngoc
2020-08-26 16:21       ` Derrick Stolee
2020-09-25 12:33     ` [PATCH v4 " Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 1/8] maintenance: add prefetch task Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 2/8] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 3/8] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-09-25 18:00         ` Junio C Hamano
2020-09-25 18:43           ` Derrick Stolee
2020-09-25 12:33       ` [PATCH v4 4/8] midx: enable core.multiPackIndex by default Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 5/8] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 6/8] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 7/8] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 8/8] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6FA8F54A-C92D-497B-895F-AC6E8287AACD@gmail.com \
    --to=sluongng@gmail.com \
    --cc=congdanhqx@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=dstolee@microsoft.com \
    --cc=emilyshaffer@google.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=jonathantanmy@google.com \
    --cc=jrnieder@gmail.com \
    --cc=peff@peff.net \
    --cc=phillip.wood123@gmail.com \
    --cc=sandals@crustytoothpaste.net \
    --cc=steadmon@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.