From: Derrick Stolee <stolee@gmail.com>
To: Johannes Berg <johannes@sipsolutions.net>, git@vger.kernel.org
Subject: Re: [PATCH] pack-format: correct multi-pack-index description
Date: Mon, 10 Feb 2020 10:02:01 -0500
Message-ID: <08dbc3be-34a7-fb8d-e0bd-56a79ab5b65a@gmail.com>
In-Reply-To: <c077a2100038edf2b0c486c0d364bd00f3921074.camel@sipsolutions.net>

On 2/10/2020 9:50 AM, Johannes Berg wrote:
> On Mon, 2020-02-10 at 09:46 -0500, Derrick Stolee wrote:
> 
>> Part of my initial plan was to have this incremental file format.
>> The commit-graph uses a very similar mechanism. The difference may
>> be that you likely allow multiple .midx files found by scanning the
>> pack directory, 
> 
> Right, just scan and use any midx that exist, then compare the packs in
> there against all the packs found, and then remove any packs that
> actually *are* in an midx from the search list. That leaves you with all
> information, but optimised by midx where possible.
> 
>> but I would expect something like the
>> "commit-graph-chain" file that provides an ordered list of the
>> incremental files. This can be important for deciding when to merge
>> layers or delete old files, and would be critical to the possibility
>> of converting reachability bitmaps to rely on a stable object order
>> stored in the multi-pack-index instead of pack-order.
> 
> Right, if we delete then we have to also remove any midx covering the
> deleted pack, that's pretty rare in bup as a backup tool though.
> 
>> The reason the multi-pack-index has not become incremental is that
>> VFS for Git no longer needs to write it very often. We write the
>> entire multi-pack-index during a background job that triggers once
>> per day. If we needed to write it more frequently, then the incremental
>> format would be more important to us.
> 
> So, wait, what if a new pack is created? Does it just get used in
> addition to the multi-pack-index, if it's not covered by it, like I
> described above?
> 
> If so, I guess it wouldn't actually really matter here. I was afraid
> (but didn't check yet) that git would always use only the single multi-
> pack-index file, and not also search additional packs, so that it always
> has to be maintained in "perfect order" ...

Git loads the multi-pack-index file, which includes a sorted list of
the packs it covers. It then scans the "pack" directory for pack-indexes
and checks if they are covered by the multi-pack-index. If not, then
Git will add them to the packed_git struct and use them as normal.
The hope is that this list of "uncovered" packs is small compared to
the data covered by the multi-pack-index.

This allows Git to continue functioning after an action like "git fetch"
that adds a new pack but may not want to rewrite the multi-pack-index.
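
As a rough illustration of that lookup (a sketch in Python, not Git's
actual C code), the split between covered and uncovered packs could look
like the following. The function name split_packs and the sample file
names are hypothetical, and the sketch assumes the multi-pack-index
lists its packs by their ".idx" file names.

```python
from typing import Iterable, List, Set, Tuple


def split_packs(
    midx_packs: Iterable[str], pack_dir_files: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split the pack-indexes in a pack directory into (covered, uncovered).

    midx_packs: pack-index names listed in the multi-pack-index
                (assumed here to be ".idx" file names).
    pack_dir_files: every file name found in the "pack" directory.
    """
    covered_names: Set[str] = set(midx_packs)
    # Only pack-indexes matter for this check; skip .pack files and others.
    idx_files = sorted(f for f in pack_dir_files if f.endswith(".idx"))
    covered = [f for f in idx_files if f in covered_names]
    # Uncovered packs are the ones that still have to be used "as normal".
    uncovered = [f for f in idx_files if f not in covered_names]
    return covered, uncovered


if __name__ == "__main__":
    # Hypothetical state after a fetch added pack-ccc without a midx rewrite.
    covered, uncovered = split_packs(
        ["pack-aaa.idx", "pack-bbb.idx"],
        [
            "multi-pack-index",
            "pack-aaa.idx", "pack-aaa.pack",
            "pack-bbb.idx", "pack-bbb.pack",
            "pack-ccc.idx", "pack-ccc.pack",
        ],
    )
    print("covered:", covered)
    print("uncovered:", uncovered)
```

In this example only pack-ccc.idx ends up in the uncovered list, which
is the small remainder that would be searched individually.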

Our background maintenance essentially runs these commands:

 1. git multi-pack-index write
 2. git multi-pack-index expire
 3. git multi-pack-index repack

Step 1 ensures all packs are pulled into the multi-pack-index. Step 2
deletes any pack-files whose objects are contained in newer pack-files.
Step 3 creates a new pack-file containing all objects from a set of
small pack-files (using the --batch-size=X option). This process helps
incrementally reduce the size and number of packs. That may be helpful
for your backup tool, too.
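
A minimal sketch of how a tool might drive those three steps, assuming
it shells out to git from Python. The helper names
(maintenance_commands, run_maintenance) are hypothetical, and the batch
size is left to the caller because no concrete value is given here.

```python
import subprocess
from typing import List


def maintenance_commands(batch_size_bytes: int) -> List[List[str]]:
    """Build the argv lists for the three steps, in the order listed above."""
    base = ["git", "multi-pack-index"]
    return [
        base + ["write"],   # step 1: cover all packs with the midx
        base + ["expire"],  # step 2: delete packs made redundant
        # step 3: combine a batch of small packs into a new pack
        base + ["repack", f"--batch-size={batch_size_bytes}"],
    ]


def run_maintenance(repo_dir: str, batch_size_bytes: int) -> None:
    """Run each step inside repo_dir, stopping at the first failure."""
    for argv in maintenance_commands(batch_size_bytes):
        subprocess.run(argv, cwd=repo_dir, check=True)
```

A backup tool could call run_maintenance("/path/to/repo", some_size)
from a scheduled job; both arguments there are placeholders.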

Perhaps after an incremental multi-pack-index is added, then Git could
(optionally) have a mode that only checks the multi-pack-index to
avoid scanning the packs directory. It would require inserting a
multi-pack-index write into the index-pack logic.

I'm not sure if that mode would be helpful, since the pack directory
scan is typically done once per command and is relatively fast.

>> That said: if someone wanted to contribute an incremental format,
>> then I would be happy to review it!
> 
> I might still get motivated to do so :-)

YOU CAN DO IT! (Did that help?)

-Stolee
