git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Taylor Blau <me@ttaylorr.com>
Cc: Junio C Hamano <gitster@pobox.com>,
	git@vger.kernel.org, Patrick Steinhardt <ps@pks.im>
Subject: Re: [PATCH 05/24] midx: implement `DISP` chunk
Date: Tue, 12 Dec 2023 03:03:32 -0500	[thread overview]
Message-ID: <20231212080332.GC1117953@coredump.intra.peff.net> (raw)
In-Reply-To: <ZXPRL5yCTQKr0k7e@nand.local>

On Fri, Dec 08, 2023 at 09:30:07PM -0500, Taylor Blau wrote:

> > I did wonder how expensive to recompute and validate the "distinct"
> > information (in other words, is it too expensive for the consumers
> > of an existing midx file to see which packs are distinct on demand
> > before they stream contents out of the underlying packs?), as the
> > way the packs are marked as distinct looked rather error prone (you
> > can very easily list packfiles with overlaps with "+" prefix and the
> > DISK chunk writer does not even notice that you lied to it).  As long
> > as "git fsck" catches when two packs that are marked as distinct share
> > an object, that is OK, but the arrangement did look rather brittle
> > to me.
> 
> It's likely too expensive to do on the reading side for every
> pack-objects operation or MIDX load.

This made me think a bit. Obviously we can check for disjointedness in
O(n log k), where n is the total number of objects and k is the number
of packs, by doing a k-merge of the sorted lists. But that's a
non-starter, because we may be serving a request that is much smaller
than all "n" objects (e.g., any small fetch, but also even clones when
the pack contains a bunch of irrelevant objects).

But we can relax our condition a bit. The packs only need to be disjoint
with respect to the set of objects that we are going to output (we'll
call that "m"). So at the very least, you can do O(mk) lookups (each one
itself "log n", of course). We know that the work is already going to
be linear in "m". In theory we want to generally keep "k" small, but
part of the point of using midx in this way is to let "k" grow a bit.
So that might turn out pretty bad in practice.

So let's take one more step back. One thing I didn't feel like I saw
either in this patch or the cover letter is exactly why we care about
disjointedness. IIRC, the main issue is making sure that for any object
X we send verbatim, it is either a non-delta or its delta base is
viable. And the two reasons I can think of for the base to be non-viable
are:

  1. We are not sending the base at all.

  2. The base is in another pack, and we are worried about creating a
     cycle (i.e., in pack A we have X as a delta against Y, and in pack
     B we have Y as a delta against X, and we send both deltas).

We already deal with (1) for the single-pack case by finding the base
object offset, converting it to a pack position, and then making sure
that position is also marked for verbatim reuse.

The interesting one is (2), which is new for midx (since a single pack
cannot have two copies of an object). But I'm not sure if it's possible.
The verbatim reuse code depends on using bitmaps in the first place. And
there is already a preference-order to the packs in the midx bitmaps.

That is, we'll choose all of the objects for pack A over any duplicates
in B, and duplicates from B over C, and so on. If we likewise try
verbatim reuse in that order, then we know that an object in pack A can
never have a base that is selected from pack B or C (because if they do
have duplicates, we'd have selected A's copy to put in the midx bitmap).
And likewise, a copy of an object in pack B will always have its base
either in A or B, but never in C.

So it kind of seems to me that this would "just work" if
try_partial_reuse() did something like for the midx case:

  - convert offset of base object into an object id hash using the pack
    revindex (similar to offset_to_pack_pos)

  - look up the object id in the midx to get a pack/offset combo

  - use the midx revindex to convert that into a bit position

  - check if that bit is marked for verbatim reuse

The assumption there is that in the second step, the midx has resolved
duplicates by putting all of pack A before pack B, and so on, as above.
It also assumes that we are trying verbatim reuse in the same order
(though a different order would not produce wrong results, it would only
result in less verbatim reuse).

All of which makes me think I'm missing some other case that is a
problem. While I wait for you to explain it, though, let's continue our
thought experiment for a moment.

If we assume that any cross-pack deltas are a problem, we could always
just skip them for verbatim reuse. That is, once we look up the object
id in the midx, we can see if it's in the same pack we're currently
processing. If not, we could punt and let the non-verbatim code paths
handle it as usual.

That still leaves the problem of realizing we skipped over a chunk of
the packfile (say we are pulling object X from pack B, and it is a delta
of Y, but we already decided to reuse Y from pack A). But the reuse code
already has to accommodate gaps. I think it happens naturally in
write_reused_pack_one(), where we feed the actual byte offsets into
record_reused_object(). You'd have to take some care not to go past gaps
in the blind copy that happens write_reused_pack_verbatim(). So you
might need to mark the first such gap somehow (if it's hard, I'd suggest
just skipping write_reused_pack_verbatim() entirely; it is a fairly
minor optimization compared to the rest of it).

And of course there's a bunch of hard work teaching all of those
functions about midx's and multiple packs in the first place, but you
already had to do all that in the latter part of your series, and it
would still be applicable.

OK, this is the part where you tell me what I'm missing. ;)

-Peff

  reply	other threads:[~2023-12-12  8:03 UTC|newest]

Thread overview: 107+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-11-28 19:07 [PATCH 00/24] pack-objects: multi-pack verbatim reuse Taylor Blau
2023-11-28 19:07 ` [PATCH 01/24] pack-objects: free packing_data in more places Taylor Blau
2023-11-30 10:18   ` Patrick Steinhardt
2023-11-30 19:08     ` Taylor Blau
2023-11-28 19:07 ` [PATCH 02/24] pack-bitmap-write: deep-clear the `bb_commit` slab Taylor Blau
2023-11-30 10:18   ` Patrick Steinhardt
2023-11-30 19:11     ` Taylor Blau
2023-12-12  7:04   ` Jeff King
2023-12-12 16:48     ` Taylor Blau
2023-11-28 19:08 ` [PATCH 03/24] pack-bitmap: plug leak in find_objects() Taylor Blau
2023-12-12  7:04   ` Jeff King
2023-11-28 19:08 ` [PATCH 04/24] midx: factor out `fill_pack_info()` Taylor Blau
2023-11-30 10:18   ` Patrick Steinhardt
2023-11-30 19:19     ` Taylor Blau
2023-11-28 19:08 ` [PATCH 05/24] midx: implement `DISP` chunk Taylor Blau
2023-11-30 10:18   ` Patrick Steinhardt
2023-11-30 19:27     ` Taylor Blau
2023-12-03 13:15   ` Junio C Hamano
2023-12-05 19:26     ` Taylor Blau
2023-12-09  1:40       ` Junio C Hamano
2023-12-09  2:30         ` Taylor Blau
2023-12-12  8:03           ` Jeff King [this message]
2023-12-13 18:28             ` Taylor Blau
2023-12-13 19:20               ` Junio C Hamano
2023-11-28 19:08 ` [PATCH 06/24] midx: implement `midx_locate_pack()` Taylor Blau
2023-11-28 19:08 ` [PATCH 07/24] midx: implement `--retain-disjoint` mode Taylor Blau
2023-11-30 10:18   ` Patrick Steinhardt
2023-11-30 19:29     ` Taylor Blau
2023-12-01  8:02       ` Patrick Steinhardt
2023-11-28 19:08 ` [PATCH 08/24] pack-objects: implement `--ignore-disjoint` mode Taylor Blau
2023-11-30 10:18   ` Patrick Steinhardt
2023-11-30 19:32     ` Taylor Blau
2023-12-01  8:17       ` Patrick Steinhardt
2023-12-01 19:58         ` Taylor Blau
2023-11-28 19:08 ` [PATCH 09/24] repack: implement `--extend-disjoint` mode Taylor Blau
2023-12-07 13:13   ` Patrick Steinhardt
2023-12-07 20:28     ` Taylor Blau
2023-12-08  8:19       ` Patrick Steinhardt
2023-12-08 22:48         ` Taylor Blau
2023-12-11  8:18           ` Patrick Steinhardt
2023-12-11 19:59             ` Taylor Blau
2023-11-28 19:08 ` [PATCH 10/24] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions Taylor Blau
2023-12-07 13:13   ` Patrick Steinhardt
2023-12-07 20:34     ` Taylor Blau
2023-12-08  8:19       ` Patrick Steinhardt
2023-11-28 19:08 ` [PATCH 11/24] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature Taylor Blau
2023-12-07 13:13   ` Patrick Steinhardt
2023-12-07 14:36     ` Taylor Blau
2023-11-28 19:08 ` [PATCH 12/24] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()` Taylor Blau
2023-11-28 19:08 ` [PATCH 13/24] pack-objects: parameterize pack-reuse routines over a single pack Taylor Blau
2023-11-28 19:08 ` [PATCH 14/24] pack-objects: keep track of `pack_start` for each reuse pack Taylor Blau
2023-12-07 13:13   ` Patrick Steinhardt
2023-12-07 20:43     ` Taylor Blau
2023-11-28 19:08 ` [PATCH 15/24] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions Taylor Blau
2023-11-28 19:08 ` [PATCH 16/24] pack-objects: prepare `write_reused_pack()` for multi-pack reuse Taylor Blau
2023-12-07 13:13   ` Patrick Steinhardt
2023-12-07 20:47     ` Taylor Blau
2023-11-28 19:08 ` [PATCH 17/24] pack-objects: prepare `write_reused_pack_verbatim()` " Taylor Blau
2023-11-28 19:08 ` [PATCH 18/24] pack-objects: include number of packs reused in output Taylor Blau
2023-11-28 19:08 ` [PATCH 19/24] pack-bitmap: prepare to mark objects from multiple packs for reuse Taylor Blau
2023-11-28 19:08 ` [PATCH 20/24] pack-objects: add tracing for various packfile metrics Taylor Blau
2023-11-28 19:08 ` [PATCH 21/24] t/test-lib-functions.sh: implement `test_trace2_data` helper Taylor Blau
2023-11-28 19:08 ` [PATCH 22/24] pack-objects: allow setting `pack.allowPackReuse` to "single" Taylor Blau
2023-11-28 19:08 ` [PATCH 23/24] pack-bitmap: reuse objects from all disjoint packs Taylor Blau
2023-11-28 19:08 ` [PATCH 24/24] t/perf: add performance tests for multi-pack reuse Taylor Blau
2023-11-30 10:18 ` [PATCH 00/24] pack-objects: multi-pack verbatim reuse Patrick Steinhardt
2023-11-30 19:39   ` Taylor Blau
2023-12-01  8:31     ` Patrick Steinhardt
2023-12-01 20:02       ` Taylor Blau
2023-12-04  8:49         ` Patrick Steinhardt
2023-12-12  8:12 ` Jeff King
2023-12-15 15:37   ` Taylor Blau
2023-12-21 11:13     ` Jeff King
2024-01-04 22:22       ` Taylor Blau
2023-12-14 22:23 ` [PATCH v2 00/26] " Taylor Blau
2023-12-14 22:23   ` [PATCH v2 01/26] pack-objects: free packing_data in more places Taylor Blau
2023-12-14 22:23   ` [PATCH v2 02/26] pack-bitmap-write: deep-clear the `bb_commit` slab Taylor Blau
2023-12-14 22:23   ` [PATCH v2 03/26] pack-bitmap: plug leak in find_objects() Taylor Blau
2023-12-14 22:23   ` [PATCH v2 04/26] midx: factor out `fill_pack_info()` Taylor Blau
2023-12-14 22:23   ` [PATCH v2 05/26] midx: implement `BTMP` chunk Taylor Blau
2023-12-14 22:23   ` [PATCH v2 06/26] midx: implement `midx_locate_pack()` Taylor Blau
2023-12-14 22:23   ` [PATCH v2 07/26] pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions Taylor Blau
2023-12-14 22:23   ` [PATCH v2 08/26] ewah: implement `bitmap_is_empty()` Taylor Blau
2023-12-14 22:24   ` [PATCH v2 09/26] pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature Taylor Blau
2023-12-14 22:24   ` [PATCH v2 10/26] pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()` Taylor Blau
2023-12-14 22:24   ` [PATCH v2 11/26] pack-objects: parameterize pack-reuse routines over a single pack Taylor Blau
2023-12-14 22:24   ` [PATCH v2 12/26] pack-objects: keep track of `pack_start` for each reuse pack Taylor Blau
2023-12-14 22:24   ` [PATCH v2 13/26] pack-objects: pass `bitmapped_pack`'s to pack-reuse functions Taylor Blau
2023-12-14 22:24   ` [PATCH v2 14/26] pack-objects: prepare `write_reused_pack()` for multi-pack reuse Taylor Blau
2023-12-14 22:24   ` [PATCH v2 15/26] pack-objects: prepare `write_reused_pack_verbatim()` " Taylor Blau
2023-12-14 22:24   ` [PATCH v2 16/26] pack-objects: include number of packs reused in output Taylor Blau
2023-12-14 22:24   ` [PATCH v2 17/26] git-compat-util.h: implement checked size_t to uint32_t conversion Taylor Blau
2023-12-14 22:24   ` [PATCH v2 18/26] midx: implement `midx_preferred_pack()` Taylor Blau
2023-12-14 22:24   ` [PATCH v2 19/26] pack-revindex: factor out `midx_key_to_pack_pos()` helper Taylor Blau
2023-12-14 22:24   ` [PATCH v2 20/26] pack-revindex: implement `midx_pair_to_pack_pos()` Taylor Blau
2023-12-14 22:24   ` [PATCH v2 21/26] pack-bitmap: prepare to mark objects from multiple packs for reuse Taylor Blau
2023-12-14 22:24   ` [PATCH v2 22/26] pack-objects: add tracing for various packfile metrics Taylor Blau
2023-12-14 22:24   ` [PATCH v2 23/26] t/test-lib-functions.sh: implement `test_trace2_data` helper Taylor Blau
2023-12-14 22:24   ` [PATCH v2 24/26] pack-objects: allow setting `pack.allowPackReuse` to "single" Taylor Blau
2023-12-14 22:24   ` [PATCH v2 25/26] pack-bitmap: enable reuse from all bitmapped packs Taylor Blau
2023-12-14 22:24   ` [PATCH v2 26/26] t/perf: add performance tests for multi-pack reuse Taylor Blau
2023-12-15  0:06   ` [PATCH v2 00/26] pack-objects: multi-pack verbatim reuse Junio C Hamano
2023-12-15  0:38     ` Taylor Blau
2023-12-15  0:40     ` Junio C Hamano
2023-12-15  1:25       ` Taylor Blau
2023-12-21 10:51         ` Jeff King
2024-01-04 22:24           ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231212080332.GC1117953@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=me@ttaylorr.com \
    --cc=ps@pks.im \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).