All of lore.kernel.org
 help / color / mirror / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Robert Coup <robert.coup@koordinates.com>
Cc: Taylor Blau <me@ttaylorr.com>, John Cai <johncai86@gmail.com>,
	John Cai via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org, Derrick Stolee <stolee@gmail.com>,
	Christian Couder <christian.couder@gmail.com>
Subject: Re: [PATCH v2 0/4] [RFC] repack: add --filter=
Date: Mon, 21 Feb 2022 12:57:03 -0500	[thread overview]
Message-ID: <YhPSb74x7davprsz@nand.local> (raw)
In-Reply-To: <CAFLLRp+JHi6B-RTeaWVPy2bZVHJ-y7EyMpymQy2LBynbZ8RzNA@mail.gmail.com>

On Mon, Feb 21, 2022 at 03:38:03PM +0000, Robert Coup wrote:
> Hi,
>
> On Mon, 21 Feb 2022 at 03:11, Taylor Blau <me@ttaylorr.com> wrote:
> >
> > we would still be leaving repository
> > corruption on the table, just making it marginally more difficult to
> > achieve.
>
> While reviewing John's patch I initially wondered if a better approach
> might be something like `git repack -a -d --exclude-stdin`, passing a
> list of specific objects to exclude from the new pack (sourced from
> rev-list via a filter, etc). To me this seems like a less dangerous
> approach, but my concern was it doesn't use the existing filter
> capabilities of pack-objects, and we end up generating and passing
> around a huge list of oids. And of course any mistakes in the list
> generation aren't visible until it's too late.

Yeah; I think the most elegant approach would have pack-objects do as
much work as possible, and have repack be in charge of coordinating what
all the pack-objects invocation(s) have to do.

> I also wonder whether there's a race condition if the repository gets
> updated? If you're moving large objects out in advance, then filtering
> the remainder there's nothing to stop a new large object being pushed
> between those two steps and getting dropped.

Yeah, we will want to make sure that we're operating on a consistent
view of the repository. If this is all done in-process, it won't be a
problem since we'll capture an atomic snapshot of the reference states
once. If this is done across multiple processes, we'll need to make sure
we're passing around that snapshot where appropriate.

See the `--refs-snapshot`-related code in git-repack for when we write a
multi-pack bitmap for an example of the latter.

> My other idea, which is growing on me, is whether repack could
> generate two valid packs: one for the included objects via the filter
> (as John's change does now), and one containing the filtered-out
> objects. `git repack -a -d --split-filter=<filter>` Then a user could
> then move/extract the second packfile to object storage, but there'd
> be no way to *accidentally* corrupt the repository by using a bad
> option. With this approach the race condition above goes away too.
>
>     $ git repack -a -d -q --split-filter=blob:limit=1m
>     pack-37b7443e3123549a2ddee31f616ae272c51cae90
>     pack-10789d94fcd99ffe1403b63b167c181a9df493cd
>
> First pack identifier being the objects that match the filter (ie:
> commits/trees/blobs <1m), and the second pack identifier being the
> objects that are excluded by the filter (blobs >1m).

I like this idea quite a bit. We also have a lot of existing tools that
would make an implementation fairly lightweight, namely pack-objects'
`--stdin-packs` mode.

Using that would look something like first having `repack` generate the
filtered pack first, remembering its name [1]. After that, we would run
`pack-objects` again, this time with `--stdin-packs`, where the positive
packs are the ones we're going to delete, and the negative pack(s)
is/are the filtered one generated in the last step.

The second invocation would leave us with a single pack which represents
all of the objects in packs we are about to delete, skipping any objects
that are in the filtered pack we just generated. In other words, it
would leave the repository with two packs: one with all of the objects
that met the filter criteria, and one with all objects that don't meet
the filter criteria.

A client could then upload the "doesn't meet the filter criteria" pack
off elsewhere, and then delete it locally. (I'm assuming this last part
in particular is orchestrated by some other command, and we aren't
encouraging users to run "rm" inside of .git/objects/pack!)

> An astute --i-know-what-im-doing reader could spot that you could just
> delete the second packfile and achieve the same outcome as the current
> proposed patch, subject to being confident the race condition hadn't
> happened to you.

Yeah, and I think this goes to my joking remark in the last paragraph.
If we allow users to delete packs at will, all bets are off regardless
of how safely we generate those packs. But I think splitting the
repository into two packs and _then_ dealing with one of them separately
as opposed to deleting objects which don't meet the filter criteria
immediately is moving in a safer direction.

> Thanks,
> Rob :)

Thanks,
Taylor

[1]: Optionally "name_s_", if we passed the `--max-pack-size` option to
`git pack-objects`, which we can trigger via a `git repack` option of
the same name.

  reply	other threads:[~2022-02-21 18:09 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-27  1:49 [PATCH 0/2] repack: add --filter= John Cai via GitGitGadget
2022-01-27  1:49 ` [PATCH 1/2] pack-objects: allow --filter without --stdout John Cai via GitGitGadget
2022-01-27  1:49 ` [PATCH 2/2] repack: add --filter=<filter-spec> option John Cai via GitGitGadget
2022-01-27 15:03   ` Derrick Stolee
2022-01-29 19:14     ` John Cai
2022-01-30  8:16       ` Christian Couder
2022-01-30 13:02       ` John Cai
2022-02-09  2:10 ` [PATCH v2 0/4] [RFC] repack: add --filter= John Cai via GitGitGadget
2022-02-09  2:10   ` [PATCH v2 1/4] pack-objects: allow --filter without --stdout John Cai via GitGitGadget
2022-02-09  2:10   ` [PATCH v2 2/4] repack: add --filter=<filter-spec> option John Cai via GitGitGadget
2022-02-09  2:10   ` [PATCH v2 3/4] upload-pack: allow missing promisor objects John Cai via GitGitGadget
2022-02-09  2:10   ` [PATCH v2 4/4] tests for repack --filter mode John Cai via GitGitGadget
2022-02-17 16:14     ` Robert Coup
2022-02-17 20:36       ` John Cai
2022-02-09  2:27   ` [PATCH v2 0/4] [RFC] repack: add --filter= John Cai
2022-02-16 15:39   ` Robert Coup
2022-02-16 21:07     ` John Cai
2022-02-21  3:11       ` Taylor Blau
2022-02-21 15:38         ` Robert Coup
2022-02-21 17:57           ` Taylor Blau [this message]
2022-02-21 21:10         ` Christian Couder
2022-02-21 21:42           ` Taylor Blau
2022-02-22 17:11             ` Christian Couder
2022-02-22 17:33               ` Taylor Blau
2022-02-23 15:40               ` Robert Coup
2022-02-23 19:31               ` Junio C Hamano
2022-02-26 16:01                 ` John Cai
2022-02-26 17:29                   ` Taylor Blau
2022-02-26 20:19                     ` John Cai
2022-02-26 20:30                       ` Taylor Blau
2022-02-26 21:05                         ` John Cai
2022-02-26 21:44                           ` Taylor Blau
2022-02-22 18:52             ` John Cai
2022-02-22 19:35               ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YhPSb74x7davprsz@nand.local \
    --to=me@ttaylorr.com \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=johncai86@gmail.com \
    --cc=robert.coup@koordinates.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.