git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* rather slow 'git repack' in 'blob:none' partial clones
@ 2021-04-03  9:04 SZEDER Gábor
  2021-04-05  1:02 ` Rafael Silva
  2021-04-14 19:14 ` [PATCH 0/2] prevent `repack` to unpack and delete promisor objects Rafael Silva
  0 siblings, 2 replies; 46+ messages in thread
From: SZEDER Gábor @ 2021-04-03  9:04 UTC (permalink / raw)
  To: Git Mailing List

Hi,

here are trace timings of running 'git gc' in a "normal" and in a
'blob:none' partial clone:

  $ git clone --bare https://github.com/git/git git-full.git
  $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-full.git/ gc
  10:35:24.007277 trace.c:487             performance: 0.001550225 s: git command: /usr/local/libexec/git-core/git pack-refs --all --prune
  10:35:24.044641 trace.c:487             performance: 0.035631270 s: git command: /usr/local/libexec/git-core/git reflog expire --all
  10:35:24.061070 read-cache.c:2315       performance: 0.000008506 s:  read cache ./index
  Enumerating objects: 305283, done.
  Counting objects: 100% (305283/305283), done.
  Delta compression using up to 4 threads
  Compressing objects: 100% (75016/75016), done.
  Writing objects: 100% (305283/305283), done.
  Total 305283 (delta 227928), reused 305283 (delta 227928), pack-reused 0
  10:35:32.604546 trace.c:487             performance: 8.555651283 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946975-pack --keep-true-parents --honor-pack-keep --non-empty --all --reflog --indexed-objects --unpack-unreachable=2.weeks.ago
  10:35:32.680597 trace.c:487             performance: 8.633068356 s: git command: /usr/local/libexec/git-core/git repack -d -l -A --unpack-unreachable=2.weeks.ago
  10:35:32.683130 trace.c:487             performance: 0.000959377 s: git command: /usr/local/libexec/git-core/git prune --expire 2.weeks.ago
  10:35:32.684401 trace.c:487             performance: 0.000180173 s: git command: /usr/local/libexec/git-core/git worktree prune --expire 3.months.ago
  10:35:32.685730 trace.c:487             performance: 0.000263898 s: git command: /usr/local/libexec/git-core/git rerere gc
  10:35:33.514816 trace.c:487             performance: 9.511597988 s: git command: git -C git-full.git/ gc
  elapsed: 0:09.51  max RSS: 358964k

  $ git clone --bare --filter=blob:none https://github.com/git/git git-partial.git
  $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-partial.git/ gc
  10:35:47.637735 trace.c:487             performance: 0.000872539 s: git command: /usr/local/libexec/git-core/git pack-refs --all --prune
  10:35:47.675498 trace.c:487             performance: 0.036246403 s: git command: /usr/local/libexec/git-core/git reflog expire --all
  Enumerating objects: 188205, done.
  Counting objects: 100% (188205/188205), done.
  Delta compression using up to 4 threads
  Compressing objects: 100% (66520/66520), done.
  Writing objects: 100% (188205/188205), done.
  Total 188205 (delta 119967), reused 188205 (delta 119967), pack-reused 0
  10:35:50.081709 trace.c:487             performance: 2.402625839 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack
  10:35:50.100131 read-cache.c:2315       performance: 0.000009979 s:  read cache ./index
  10:37:04.973541 trace.c:487             performance: 74.885793630 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack --keep-true-parents --honor-pack-keep --non-empty --all --reflog --indexed-objects --exclude-promisor-objects --unpack-unreachable=2.weeks.ago
  Removing duplicate objects: 100% (256/256), done.
  10:37:07.482791 trace.c:487             performance: 79.804973525 s: git command: /usr/local/libexec/git-core/git repack -d -l -A --unpack-unreachable=2.weeks.ago
  10:37:07.549333 trace.c:487             performance: 0.008025426 s: git command: /usr/local/libexec/git-core/git prune --expire 2.weeks.ago --exclude-promisor-objects
  10:37:07.552499 trace.c:487             performance: 0.000362981 s: git command: /usr/local/libexec/git-core/git worktree prune --expire 3.months.ago
  10:37:07.554521 trace.c:487             performance: 0.000273834 s: git command: /usr/local/libexec/git-core/git rerere gc
  10:37:10.168233 trace.c:487             performance: 82.533331484 s: git command: git -C git-partial.git/ gc
  elapsed: 1:22.54  max RSS: 1891832k

Notice the ~9s vs. 82s runtime and ~350M vs. 1.9G memory consumption
increase.  What's going on here?

Also note that that second 'git pack-objects' invocation doesn't show
any progress for ~75s.

FWIW, doing the same in a 'tree:0' partial clone is fast.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-03  9:04 rather slow 'git repack' in 'blob:none' partial clones SZEDER Gábor
@ 2021-04-05  1:02 ` Rafael Silva
  2021-04-07 21:17   ` Jeff King
  2021-04-11 10:59   ` SZEDER Gábor
  2021-04-14 19:14 ` [PATCH 0/2] prevent `repack` to unpack and delete promisor objects Rafael Silva
  1 sibling, 2 replies; 46+ messages in thread
From: Rafael Silva @ 2021-04-05  1:02 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Git Mailing List


SZEDER Gábor <szeder.dev@gmail.com> writes:

> Hi,
>
> here are trace timings of running 'git gc' in a "normal" and in a
> 'blob:none' partial clone:
>
>   $ git clone --bare https://github.com/git/git git-full.git
>   $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-full.git/ gc
>   10:35:24.007277 trace.c:487             performance: 0.001550225 s: git command: /usr/local/libexec/git-core/git pack-refs --all --prune
>   10:35:24.044641 trace.c:487             performance: 0.035631270 s: git command: /usr/local/libexec/git-core/git reflog expire --all
>   10:35:24.061070 read-cache.c:2315       performance: 0.000008506 s:  read cache ./index
>   Enumerating objects: 305283, done.
>   Counting objects: 100% (305283/305283), done.
>   Delta compression using up to 4 threads
>   Compressing objects: 100% (75016/75016), done.
>   Writing objects: 100% (305283/305283), done.
>   Total 305283 (delta 227928), reused 305283 (delta 227928), pack-reused 0
>   10:35:32.604546 trace.c:487             performance: 8.555651283 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946975-pack --keep-true-parents --honor-pack-keep --non-empty --all --reflog --indexed-objects --unpack-unreachable=2.weeks.ago
>   10:35:32.680597 trace.c:487             performance: 8.633068356 s: git command: /usr/local/libexec/git-core/git repack -d -l -A --unpack-unreachable=2.weeks.ago
>   10:35:32.683130 trace.c:487             performance: 0.000959377 s: git command: /usr/local/libexec/git-core/git prune --expire 2.weeks.ago
>   10:35:32.684401 trace.c:487             performance: 0.000180173 s: git command: /usr/local/libexec/git-core/git worktree prune --expire 3.months.ago
>   10:35:32.685730 trace.c:487             performance: 0.000263898 s: git command: /usr/local/libexec/git-core/git rerere gc
>   10:35:33.514816 trace.c:487             performance: 9.511597988 s: git command: git -C git-full.git/ gc
>   elapsed: 0:09.51  max RSS: 358964k
>
>   $ git clone --bare --filter=blob:none https://github.com/git/git git-partial.git
>   $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-partial.git/ gc
>   10:35:47.637735 trace.c:487             performance: 0.000872539 s: git command: /usr/local/libexec/git-core/git pack-refs --all --prune
>   10:35:47.675498 trace.c:487             performance: 0.036246403 s: git command: /usr/local/libexec/git-core/git reflog expire --all
>   Enumerating objects: 188205, done.
>   Counting objects: 100% (188205/188205), done.
>   Delta compression using up to 4 threads
>   Compressing objects: 100% (66520/66520), done.
>   Writing objects: 100% (188205/188205), done.
>   Total 188205 (delta 119967), reused 188205 (delta 119967), pack-reused 0
>   10:35:50.081709 trace.c:487             performance: 2.402625839 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack
>   10:35:50.100131 read-cache.c:2315       performance: 0.000009979 s:  read cache ./index
>   10:37:04.973541 trace.c:487             performance: 74.885793630 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack --keep-true-parents --honor-pack-keep --non-empty --all --reflog --indexed-objects --exclude-promisor-objects --unpack-unreachable=2.weeks.ago
>   Removing duplicate objects: 100% (256/256), done.
>   10:37:07.482791 trace.c:487             performance: 79.804973525 s: git command: /usr/local/libexec/git-core/git repack -d -l -A --unpack-unreachable=2.weeks.ago
>   10:37:07.549333 trace.c:487             performance: 0.008025426 s: git command: /usr/local/libexec/git-core/git prune --expire 2.weeks.ago --exclude-promisor-objects
>   10:37:07.552499 trace.c:487             performance: 0.000362981 s: git command: /usr/local/libexec/git-core/git worktree prune --expire 3.months.ago
>   10:37:07.554521 trace.c:487             performance: 0.000273834 s: git command: /usr/local/libexec/git-core/git rerere gc
>   10:37:10.168233 trace.c:487             performance: 82.533331484 s: git command: git -C git-partial.git/ gc
>   elapsed: 1:22.54  max RSS: 1891832k
>
> Notice the ~9s vs. 82s runtime and ~350M vs. 1.9G memory consumption
> increase.  What's going on here?
>
> Also note that that second 'git pack-objects' invocation doesn't show
> any progress for ~75s.
>
> FWIW, doing the same in a 'tree:0' partial clone is fast.

I'm not expert on the area - by "area": the entire git code base :).
However, I was intrigued by this performance numbers and decided to give
it a try on the investigation, mostly for learning. While I'm not sure
about the solution of the problem, I decided to share it here with the
hope that at least I'll be saving someone else time.

When I was digging into the code and adding trace2_region_*() calls, I
notice most of the time spent on the `git gc` (for the reported
situation) was in:

       # In builtin/pack-objects.c
       static void get_object_list(int ac, const char **av)
       {
               ...
               if (unpack_unreachable)
                       loosen_unused_packed_objects();
               ...
       }

The loosen_unused_packed_objects() will unpack unreachable objects as
loose objects, and given that the partial cloned .pack file is
incomplete, this result in writing a lot of loose objects in $GIT_DIR
increasing the execution time and memory consumption. This can be seen
by watching the $GIT_DIR/objects/ during the `git gc` execution on the
partial cloned repo.  On the fully clone repository all the objects
exist, at least on the fresh clone like in your report thus no object is
loose from the .pack file.

To provide some insight in the magnitude of the written loose objects,
I counted the number of objects that was being feed into
force_object_loose() with the following patch:

-- >8 --
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 525c2d8552..f912b54a5f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3478,7 +3478,7 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
 static void loosen_unused_packed_objects(void)
 {
 	struct packed_git *p;
-	uint32_t i;
+	uint32_t i, loosen_obj_counter = 0;
 	struct object_id oid;
 
 	for (p = get_all_packs(the_repository); p; p = p->next) {
@@ -3492,10 +3492,13 @@ static void loosen_unused_packed_objects(void)
 			nth_packed_object_id(&oid, p, i);
 			if (!packlist_find(&to_pack, &oid) &&
 			    !has_sha1_pack_kept_or_nonlocal(&oid) &&
-			    !loosened_object_can_be_discarded(&oid, p->mtime))
+			    !loosened_object_can_be_discarded(&oid, p->mtime) &&
+			    ++loosen_obj_counter)
 				if (force_object_loose(&oid, p->mtime))
 					die(_("unable to force loose object"));
 		}
+		fprintf(stderr, "loosen_unused_packed_objects() total of objects: %d\n", p->num_objects);
+		fprintf(stderr, "loosen_unused_packed_objects() objects that is being loosed: %d\n", loosen_obj_counter);
 	}
 }
 
-- >8 --

Running on a fresh and fully cloned git.git repo, there are (obviously)
0 unreachable objects that are being written as loose objects:

    $ bin-wrappers/git clone --bare https://github.com/git/git git-full.git
    $ time --format='elapsed: %E | max RSS: %Mk' bin-wrappers/git -C git-full.git gc
    loosen_unused_packed_objects() total of objects: 305292
    loosen_unused_packed_objects() objects that is being loosed: 0
    Enumerating objects: 305292, done.
    Counting objects: 100% (305292/305292), done.
    Delta compression using up to 8 threads
    Compressing objects: 100% (75035/75035), done.
    Writing objects: 100% (305292/305292), done.
    Selecting bitmap commits: 63006, done.
    Building bitmaps: 100% (317/317), done.
    Total 305292 (delta 227918), reused 305292 (delta 227918), pack-reused 0
    elapsed: 0:09.23 | max RSS: 438628k

On the other hand, when running on a fresh and partial cloned repo, we
can see that all the objects (at least according to my findings) are
being move out of .pack file into a loose object.

    $ bin-wrappers/git clone --bare --filter=blob:none https://github.com/git/git git-partial.git
    $ time --format='elapsed: %E | max RSS: %Mk' bin-wrappers/git -C git-partial.git gc
    Enumerating objects: 188213, done.
    Counting objects: 100% (188213/188213), done.
    Delta compression using up to 8 threads
    Compressing objects: 100% (66524/66524), done.
    Writing objects: 100% (188213/188213), done.
    Total 188213 (delta 119971), reused 188213 (delta 119971), pack-reused 0
    loosen_unused_packed_objects() total of objects: 188213
    loosen_unused_packed_objects() objects that is being loosed: 188213
    loosen_unused_packed_objects() total of objects: 188213
    loosen_unused_packed_objects() objects that is being loosed: 376426
    Removing duplicate objects: 100% (256/256), done.
    elapsed: 3:24.86 | max RSS: 2085552k

Another interesting thing is, the loosen_unused_packed_objects()
function is being called twice because the function loads all packs
files, via get_all_packs(), which will return the .temp-*pack file that
is created by the `git pack-objects` child process from `git gc`:

    git pack-objects ... --delta-base-offset objects/pack/.tmp-82853-pack ...

For this specific case, this make the situation worse as we end up
processing the object from both packfiles.  Also, the second
execution, which operates on the normal .pack file, is processing 2x
"objects" (in terms of counting the execution of
force_object_loose()), I couldn't quite figure out why.

I'm not entirely sure about this (not this late in the day), but it seems to
me that we should simply skip the "missing" (promisor) files when
operating on a partial clone.

Perhaps something like:

--- >8 ---
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 525c2d8552..fedf58323d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3468,6 +3468,8 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
 {
        if (!unpack_unreachable_expiration)
                return 0;
+       if (exclude_promisor_objects && is_promisor_object(oid))
+               return 1;
        if (mtime > unpack_unreachable_expiration)
                return 0;
        if (oid_array_lookup(&recent_objects, oid) >= 0)
--- >8 ---

I'll try to prepare a patch for this change with proper testing, if this
turns out to be proper way to handle partial clone repository.

A quick benchmark did show some promising result:

    # built from: 2e36527f23 (The sixth batch, 2021-04-02)
    Benchmark #1: ./bin-wrappers/git -C git.git gc
          Time (mean ± σ):     135.669 s ±  0.665 s    [User: 42.789 s, System: 91.332 s]
          Range (min … max):   134.905 s … 136.115 s    3 runs

    # built from: 2e36527f23 + minor patch (from above)
    Benchmark #2: ./bin-wrappers/git -C git.git gc
          Time (mean ± σ):     12.586 s ±  0.031 s    [User: 11.462 s, System: 1.365 s]
          Range (min … max):   12.553 s … 12.616 s    3 runs

    Summary:
          'Benchmark #2' ran 10.78 ± 0.06 times faster than 'Benchmark #1'


-- 
Thanks
Rafael

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-05  1:02 ` Rafael Silva
@ 2021-04-07 21:17   ` Jeff King
  2021-04-08  0:02     ` Jonathan Tan
                       ` (2 more replies)
  2021-04-11 10:59   ` SZEDER Gábor
  1 sibling, 3 replies; 46+ messages in thread
From: Jeff King @ 2021-04-07 21:17 UTC (permalink / raw)
  To: Rafael Silva; +Cc: Jonathan Tan, SZEDER Gábor, Git Mailing List

On Mon, Apr 05, 2021 at 03:02:33AM +0200, Rafael Silva wrote:

> When I was digging into the code and adding trace2_region_*() calls, I
> notice most of the time spent on the `git gc` (for the reported
> situation) was in:
> 
>        # In builtin/pack-objects.c
>        static void get_object_list(int ac, const char **av)
>        {
>                ...
>                if (unpack_unreachable)
>                        loosen_unused_packed_objects();
>                ...
>        }

Yeah, good find.

This is my first time looking at the repacking strategy for partial
clones. It looks like we run an initial pack-objects to cover all the
promisor objects, and then do the "real" repack for everything else,
with "--exclude-promisor-objects".

The purpose of loosen_unused_packed_objects() is to catch any objects
that will be lost when our caller deletes all of the packs. But in this
case, those promisor objects are in a pack which won't be deleted, so
they should not be included.

> Another interesting thing is, the loosen_unused_packed_objects()
> function is being called twice because the function loads all packs
> files, via get_all_packs(), which will return the .temp-*pack file that
> is created by the `git pack-objects` child process from `git gc`:
> 
>     git pack-objects ... --delta-base-offset objects/pack/.tmp-82853-pack ...

Yes, this is the "promisor" pack created by git-repack. It seems like
git-repack should tell pack-objects about the new pack with --keep-pack,
so that we know it is not going to be deleted.

That would also solve the rest of the problem, I _think_. In your
suggestion here:

> I'm not entirely sure about this (not this late in the day), but it seems to
> me that we should simply skip the "missing" (promisor) files when
> operating on a partial clone.
> 
> Perhaps something like:
> 
> --- >8 ---
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 525c2d8552..fedf58323d 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3468,6 +3468,8 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
>  {
>         if (!unpack_unreachable_expiration)
>                 return 0;
> +       if (exclude_promisor_objects && is_promisor_object(oid))
> +               return 1;
>         if (mtime > unpack_unreachable_expiration)
>                 return 0;
>         if (oid_array_lookup(&recent_objects, oid) >= 0)
> --- >8 ---

you are avoiding writing out the file. But we should realize much
earlier that it is not something we need to even consider loosening.

In the loop in loosen_unused_packed_objects(), we skip packs that are
marked as "keep", so we'd skip the new promisor pack entirely. But we'd
still see all these objects in the _old_ promisor pack. However, for
each object there, we call has_sha1_pack_kept_or_nonlocal(), so that
would likewise realize that each object is already being kept in the
other pack.

Something like this seems to work, but I only lightly tested it, and it
could probably use some refactoring to make it less horrible:

diff --git a/builtin/repack.c b/builtin/repack.c
index fdee8e4578..457525953a 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -574,6 +574,23 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		repack_promisor_objects(&po_args, &names);
 
 		if (existing_packs.nr && delete_redundant) {
+			/*
+			 * tell pack-objects about our new promisor pack, which
+			 * we will also be keeping
+			 */
+			for_each_string_list_item(item, &names) {
+				/*
+				 * yuck, we seem to only have the name with the
+				 * packdir prefixed
+				 */
+				const char *prefix;
+				if (!skip_prefix(packtmp, packdir, &prefix) ||
+				    *prefix++ != '/')
+					BUG("confused by packtmp");
+				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
+					     prefix, item->string);
+			}
+
 			if (unpack_unreachable) {
 				strvec_pushf(&cmd.args,
 					     "--unpack-unreachable=%s",

Do you want to try to work with that?

> A quick benchmark did show some promising result:
> 
>     # built from: 2e36527f23 (The sixth batch, 2021-04-02)
>     Benchmark #1: ./bin-wrappers/git -C git.git gc
>           Time (mean ± σ):     135.669 s ±  0.665 s    [User: 42.789 s, System: 91.332 s]
>           Range (min … max):   134.905 s … 136.115 s    3 runs
> 
>     # built from: 2e36527f23 + minor patch (from above)
>     Benchmark #2: ./bin-wrappers/git -C git.git gc
>           Time (mean ± σ):     12.586 s ±  0.031 s    [User: 11.462 s, System: 1.365 s]
>           Range (min … max):   12.553 s … 12.616 s    3 runs
> 
>     Summary:
>           'Benchmark #2' ran 10.78 ± 0.06 times faster than 'Benchmark #1'

It's still quite a bit slower than a non-partial clone because the
traversal with --exclude-promisor-objects is slow. I think that's
because it has to open up all of the objects in the promisor pack to see
what they refer to. I don't know if we can do better (and it's largely
an orthogonal problem to what you're solving here, so it probably makes
sense to just punt on it for now).

-Peff

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-07 21:17   ` Jeff King
@ 2021-04-08  0:02     ` Jonathan Tan
  2021-04-08  0:35       ` Jeff King
  2021-04-12  7:09     ` Rafael Silva
  2021-04-12 21:36     ` SZEDER Gábor
  2 siblings, 1 reply; 46+ messages in thread
From: Jonathan Tan @ 2021-04-08  0:02 UTC (permalink / raw)
  To: peff; +Cc: rafaeloliveira.cs, jonathantanmy, szeder.dev, git

> On Mon, Apr 05, 2021 at 03:02:33AM +0200, Rafael Silva wrote:
> 
> > When I was digging into the code and adding trace2_region_*() calls, I
> > notice most of the time spent on the `git gc` (for the reported
> > situation) was in:
> > 
> >        # In builtin/pack-objects.c
> >        static void get_object_list(int ac, const char **av)
> >        {
> >                ...
> >                if (unpack_unreachable)
> >                        loosen_unused_packed_objects();
> >                ...
> >        }
> 
> Yeah, good find.

Agreed!

> This is my first time looking at the repacking strategy for partial
> clones. It looks like we run an initial pack-objects to cover all the
> promisor objects, and then do the "real" repack for everything else,
> with "--exclude-promisor-objects".

That is correct - we need two separate packs because one of them needs
to be accompanied by a separate ".promisor" file to show that those
objects are from the promisor remote.

> The purpose of loosen_unused_packed_objects() is to catch any objects
> that will be lost when our caller deletes all of the packs. But in this
> case, those promisor objects are in a pack which won't be deleted, so
> they should not be included.

Makes sense.

> > I'm not entirely sure about this (not this late in the day), but it seems to
> > me that we should simply skip the "missing" (promisor) files when
> > operating on a partial clone.
> > 
> > Perhaps something like:
> > 
> > --- >8 ---
> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > index 525c2d8552..fedf58323d 100644
> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -3468,6 +3468,8 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
> >  {
> >         if (!unpack_unreachable_expiration)
> >                 return 0;
> > +       if (exclude_promisor_objects && is_promisor_object(oid))
> > +               return 1;
> >         if (mtime > unpack_unreachable_expiration)
> >                 return 0;
> >         if (oid_array_lookup(&recent_objects, oid) >= 0)
> > --- >8 ---

I think this will work. The only thing we might need to watch out for is
that if we repack with --exclude-promisor-objects, some might say that
every excluded object should be loosened. I don't think that's true in
this case, because (1) the object is not unreachable and the argument
talks about loosening unreachable objects, (2) promisor objects can be
re-obtained from the promisor remote anyway. So I think we're fine.

But I think Peff suggested something better below.

> In the loop in loosen_unused_packed_objects(), we skip packs that are
> marked as "keep", so we'd skip the new promisor pack entirely. But we'd
> still see all these objects in the _old_ promisor pack. However, for
> each object there, we call has_sha1_pack_kept_or_nonlocal(), so that
> would likewise realize that each object is already being kept in the
> other pack.
> 
> Something like this seems to work, but I only lightly tested it, and it
> could probably use some refactoring to make it less horrible:
> 
> diff --git a/builtin/repack.c b/builtin/repack.c
> index fdee8e4578..457525953a 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -574,6 +574,23 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		repack_promisor_objects(&po_args, &names);
>  
>  		if (existing_packs.nr && delete_redundant) {
> +			/*
> +			 * tell pack-objects about our new promisor pack, which
> +			 * we will also be keeping
> +			 */
> +			for_each_string_list_item(item, &names) {
> +				/*
> +				 * yuck, we seem to only have the name with the
> +				 * packdir prefixed
> +				 */
> +				const char *prefix;
> +				if (!skip_prefix(packtmp, packdir, &prefix) ||
> +				    *prefix++ != '/')
> +					BUG("confused by packtmp");
> +				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
> +					     prefix, item->string);
> +			}
> +
>  			if (unpack_unreachable) {
>  				strvec_pushf(&cmd.args,
>  					     "--unpack-unreachable=%s",
> 
> Do you want to try to work with that?

It seems to me that this would work. The only part I was confused about
is "packtmp", but that is just the pack directory plus a specific
prefix, so "pack-objects" will indeed see that packfile as being part of
the repository - no problem.

repack_promisor_objects() might be able to be refactored to provide the
names in a format that we want, but looking at it, I don't think it's
possible (it just uses "packtmp", so we have the same "packtmp"
problem).

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-08  0:02     ` Jonathan Tan
@ 2021-04-08  0:35       ` Jeff King
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2021-04-08  0:35 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: rafaeloliveira.cs, szeder.dev, git

On Wed, Apr 07, 2021 at 05:02:41PM -0700, Jonathan Tan wrote:

> It seems to me that this would work. The only part I was confused about
> is "packtmp", but that is just the pack directory plus a specific
> prefix, so "pack-objects" will indeed see that packfile as being part of
> the repository - no problem.
> 
> repack_promisor_objects() might be able to be refactored to provide the
> names in a format that we want, but looking at it, I don't think it's
> possible (it just uses "packtmp", so we have the same "packtmp"
> problem).

Yeah, that was the refactoring I alluded to. I think the earlier code
should keep the ".tmp-%d" portion in a separate string, and then
construct packtmp from that. And then we don't have to try to recover it
from the concatenated string.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-05  1:02 ` Rafael Silva
  2021-04-07 21:17   ` Jeff King
@ 2021-04-11 10:59   ` SZEDER Gábor
  2021-04-12  7:53     ` Rafael Silva
  1 sibling, 1 reply; 46+ messages in thread
From: SZEDER Gábor @ 2021-04-11 10:59 UTC (permalink / raw)
  To: Rafael Silva; +Cc: Git Mailing List

On Mon, Apr 05, 2021 at 03:02:33AM +0200, Rafael Silva wrote:
> > here are trace timings of running 'git gc' in a "normal" and in a
> > 'blob:none' partial clone:
> >
> >   $ git clone --bare https://github.com/git/git git-full.git
> >   $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-full.git/ gc
[...]
> >   elapsed: 0:09.51  max RSS: 358964k
> >
> >   $ git clone --bare --filter=blob:none https://github.com/git/git git-partial.git
> >   $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-partial.git/ gc
> >   10:35:47.637735 trace.c:487             performance: 0.000872539 s: git command: /usr/local/libexec/git-core/git pack-refs --all --prune
> >   10:35:47.675498 trace.c:487             performance: 0.036246403 s: git command: /usr/local/libexec/git-core/git reflog expire --all
> >   Enumerating objects: 188205, done.
> >   Counting objects: 100% (188205/188205), done.
> >   Delta compression using up to 4 threads
> >   Compressing objects: 100% (66520/66520), done.
> >   Writing objects: 100% (188205/188205), done.
> >   Total 188205 (delta 119967), reused 188205 (delta 119967), pack-reused 0
> >   10:35:50.081709 trace.c:487             performance: 2.402625839 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack
> >   10:35:50.100131 read-cache.c:2315       performance: 0.000009979 s:  read cache ./index
> >   10:37:04.973541 trace.c:487             performance: 74.885793630 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack --keep-true-parents --honor-pack-keep --non-empty --all --reflog --indexed-objects --exclude-promisor-objects --unpack-unreachable=2.weeks.ago
> >   Removing duplicate objects: 100% (256/256), done.
> >   10:37:07.482791 trace.c:487             performance: 79.804973525 s: git command: /usr/local/libexec/git-core/git repack -d -l -A --unpack-unreachable=2.weeks.ago
> >   10:37:07.549333 trace.c:487             performance: 0.008025426 s: git command: /usr/local/libexec/git-core/git prune --expire 2.weeks.ago --exclude-promisor-objects
> >   10:37:07.552499 trace.c:487             performance: 0.000362981 s: git command: /usr/local/libexec/git-core/git worktree prune --expire 3.months.ago
> >   10:37:07.554521 trace.c:487             performance: 0.000273834 s: git command: /usr/local/libexec/git-core/git rerere gc
> >   10:37:10.168233 trace.c:487             performance: 82.533331484 s: git command: git -C git-partial.git/ gc
> >   elapsed: 1:22.54  max RSS: 1891832k
> >
> > Notice the ~9s vs. 82s runtime and ~350M vs. 1.9G memory consumption
> > increase.  What's going on here?
> >
> > Also note that that second 'git pack-objects' invocation doesn't show
> > any progress for ~75s.
> >
> > FWIW, doing the same in a 'tree:0' partial clone is fast.
> 
> I'm not expert on the area - by "area": the entire git code base :).
> However, I was intrigued by this performance numbers and decided to give
> it a try on the investigation, mostly for learning.

That's the spirit!

> While I'm not sure
> about the solution of the problem, I decided to share it here with the
> hope that at least I'll be saving someone else time.
> 
> When I was digging into the code and adding trace2_region_*() calls, I
> notice most of the time spent on the `git gc` (for the reported
> situation) was in:
> 
>        # In builtin/pack-objects.c
>        static void get_object_list(int ac, const char **av)
>        {
>                ...
>                if (unpack_unreachable)
>                        loosen_unused_packed_objects();
>                ...
>        }
> 
> The loosen_unused_packed_objects() will unpack unreachable objects as
> loose objects, and given that the partial cloned .pack file is
> incomplete, this result in writing a lot of loose objects in $GIT_DIR
> increasing the execution time and memory consumption. This can be seen
> by watching the $GIT_DIR/objects/ during the `git gc` execution on the
> partial cloned repo.

Indeed, that 'blob:none' partial clone grew in size temporarily to
over 1.3GB during repacking, but luckily all those unnecessarily
loosened objects were removed at the end.  I first noticed this issue
while attempting to repack a considerably larger partial-cloned
repository, which I aborted because it ate up all the memory...  I
suppose that even if it didn't use that much memory, it would
eventually run out of available disk space for all those loose objects
anyway...


> I'm not entirely sure about this (not this late in the day), but it seems to
> me that we should simply skip the "missing" (promisor) files when
> operating on a partial clone.
> 
> Perhaps something like:
> 
> --- >8 ---
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 525c2d8552..fedf58323d 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3468,6 +3468,8 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
>  {
>         if (!unpack_unreachable_expiration)
>                 return 0;
> +       if (exclude_promisor_objects && is_promisor_object(oid))
> +               return 1;
>         if (mtime > unpack_unreachable_expiration)
>                 return 0;
>         if (oid_array_lookup(&recent_objects, oid) >= 0)
> --- >8 ---
> 
> I'll try to prepare a patch for this change with proper testing, if this
> turns out to be proper way to handle partial clone repository.
> 
> A quick benchmark did show some promising result:
> 
>     # built from: 2e36527f23 (The sixth batch, 2021-04-02)
>     Benchmark #1: ./bin-wrappers/git -C git.git gc
>           Time (mean ± σ):     135.669 s ±  0.665 s    [User: 42.789 s, System: 91.332 s]
>           Range (min … max):   134.905 s … 136.115 s    3 runs
> 
>     # built from: 2e36527f23 + minor patch (from above)
>     Benchmark #2: ./bin-wrappers/git -C git.git gc
>           Time (mean ± σ):     12.586 s ±  0.031 s    [User: 11.462 s, System: 1.365 s]
>           Range (min … max):   12.553 s … 12.616 s    3 runs
> 
>     Summary:
>           'Benchmark #2' ran 10.78 ± 0.06 times faster than 'Benchmark #1'

I can confirm that you change speeds up repacking considerably and it
completely eliminates that temporary repo size explision due to
unpacked objects, but, alas, it doesn't seem to reduce the memory
usage.

Thanks,
Gábor


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-07 21:17   ` Jeff King
  2021-04-08  0:02     ` Jonathan Tan
@ 2021-04-12  7:09     ` Rafael Silva
  2021-04-12 21:36     ` SZEDER Gábor
  2 siblings, 0 replies; 46+ messages in thread
From: Rafael Silva @ 2021-04-12  7:09 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, SZEDER Gábor, Git Mailing List


Jeff King <peff@peff.net> writes:

> On Mon, Apr 05, 2021 at 03:02:33AM +0200, Rafael Silva wrote:
>
>> I'm not entirely sure about this (not this late in the day), but it seems to
>> me that we should simply skip the "missing" (promisor) files when
>> operating on a partial clone.
>> 
>> Perhaps something like:
>> 
>> --- >8 ---
>> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
>> index 525c2d8552..fedf58323d 100644
>> --- a/builtin/pack-objects.c
>> +++ b/builtin/pack-objects.c
>> @@ -3468,6 +3468,8 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
>>  {
>>         if (!unpack_unreachable_expiration)
>>                 return 0;
>> +       if (exclude_promisor_objects && is_promisor_object(oid))
>> +               return 1;
>>         if (mtime > unpack_unreachable_expiration)
>>                 return 0;
>>         if (oid_array_lookup(&recent_objects, oid) >= 0)
>> --- >8 ---
>
> you are avoiding writing out the file. But we should realize much
> earlier that it is not something we need to even consider loosening.
>
> In the loop in loosen_unused_packed_objects(), we skip packs that are
> marked as "keep", so we'd skip the new promisor pack entirely. But we'd
> still see all these objects in the _old_ promisor pack. However, for
> each object there, we call has_sha1_pack_kept_or_nonlocal(), so that
> would likewise realize that each object is already being kept in the
> other pack.
>

Agreed. Realizing sooner that we shouldn't even consider loosening the
objects from the packfile it's better solution.

> Something like this seems to work, but I only lightly tested it, and it
> could probably use some refactoring to make it less horrible:
>
> diff --git a/builtin/repack.c b/builtin/repack.c
> index fdee8e4578..457525953a 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -574,6 +574,23 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		repack_promisor_objects(&po_args, &names);
>  
>  		if (existing_packs.nr && delete_redundant) {
> +			/*
> +			 * tell pack-objects about our new promisor pack, which
> +			 * we will also be keeping
> +			 */
> +			for_each_string_list_item(item, &names) {
> +				/*
> +				 * yuck, we seem to only have the name with the
> +				 * packdir prefixed
> +				 */
> +				const char *prefix;
> +				if (!skip_prefix(packtmp, packdir, &prefix) ||
> +				    *prefix++ != '/')
> +					BUG("confused by packtmp");
> +				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
> +					     prefix, item->string);
> +			}
> +
>  			if (unpack_unreachable) {
>  				strvec_pushf(&cmd.args,
>  					     "--unpack-unreachable=%s",
>
> Do you want to try to work with that?
>

Yes, I'll try to work with that, together with refactoring that you
mentioned in the code and the other replies.

Thanks for the suggestion.

>> A quick benchmark did show some promising result:
>> 
>>     # built from: 2e36527f23 (The sixth batch, 2021-04-02)
>>     Benchmark #1: ./bin-wrappers/git -C git.git gc
>>           Time (mean ± σ):     135.669 s ±  0.665 s    [User: 42.789 s, System: 91.332 s]
>>           Range (min … max):   134.905 s … 136.115 s    3 runs
>> 
>>     # built from: 2e36527f23 + minor patch (from above)
>>     Benchmark #2: ./bin-wrappers/git -C git.git gc
>>           Time (mean ± σ):     12.586 s ±  0.031 s    [User: 11.462 s, System: 1.365 s]
>>           Range (min … max):   12.553 s … 12.616 s    3 runs
>> 
>>     Summary:
>>           'Benchmark #2' ran 10.78 ± 0.06 times faster than 'Benchmark #1'
>
> It's still quite a bit slower than a non-partial clone because the
> traversal with --exclude-promisor-objects is slow. I think that's
> because it has to open up all of the objects in the promisor pack to see
> what they refer to. I don't know if we can do better (and it's largely
> an orthogonal problem to what you're solving here, so it probably makes
> sense to just punt on it for now).
>
> -Peff

Make sense.

-- 
Thanks
Rafael

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-11 10:59   ` SZEDER Gábor
@ 2021-04-12  7:53     ` Rafael Silva
  0 siblings, 0 replies; 46+ messages in thread
From: Rafael Silva @ 2021-04-12  7:53 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Git Mailing List


SZEDER Gábor <szeder.dev@gmail.com> writes:

> On Mon, Apr 05, 2021 at 03:02:33AM +0200, Rafael Silva wrote:
>> > here are trace timings of running 'git gc' in a "normal" and in a
>> > 'blob:none' partial clone:
>> >
>> >   $ git clone --bare https://github.com/git/git git-full.git
>> >   $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-full.git/ gc
> [...]
>> >   elapsed: 0:09.51  max RSS: 358964k
>> >
>> >   $ git clone --bare --filter=blob:none https://github.com/git/git git-partial.git
>> >   $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-partial.git/ gc
>> >   10:35:47.637735 trace.c:487             performance: 0.000872539 s: git command: /usr/local/libexec/git-core/git pack-refs --all --prune
>> >   10:35:47.675498 trace.c:487             performance: 0.036246403 s: git command: /usr/local/libexec/git-core/git reflog expire --all
>> >   Enumerating objects: 188205, done.
>> >   Counting objects: 100% (188205/188205), done.
>> >   Delta compression using up to 4 threads
>> >   Compressing objects: 100% (66520/66520), done.
>> >   Writing objects: 100% (188205/188205), done.
>> >   Total 188205 (delta 119967), reused 188205 (delta 119967), pack-reused 0
>> >   10:35:50.081709 trace.c:487             performance: 2.402625839 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack
>> >   10:35:50.100131 read-cache.c:2315       performance: 0.000009979 s:  read cache ./index
>> >   10:37:04.973541 trace.c:487             performance: 74.885793630 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack --keep-true-parents --honor-pack-keep --non-empty --all --reflog --indexed-objects --exclude-promisor-objects --unpack-unreachable=2.weeks.ago
>> >   Removing duplicate objects: 100% (256/256), done.
>> >   10:37:07.482791 trace.c:487             performance: 79.804973525 s: git command: /usr/local/libexec/git-core/git repack -d -l -A --unpack-unreachable=2.weeks.ago
>> >   10:37:07.549333 trace.c:487             performance: 0.008025426 s: git command: /usr/local/libexec/git-core/git prune --expire 2.weeks.ago --exclude-promisor-objects
>> >   10:37:07.552499 trace.c:487             performance: 0.000362981 s: git command: /usr/local/libexec/git-core/git worktree prune --expire 3.months.ago
>> >   10:37:07.554521 trace.c:487             performance: 0.000273834 s: git command: /usr/local/libexec/git-core/git rerere gc
>> >   10:37:10.168233 trace.c:487             performance: 82.533331484 s: git command: git -C git-partial.git/ gc
>> >   elapsed: 1:22.54  max RSS: 1891832k
>> >
>> > Notice the ~9s vs. 82s runtime and ~350M vs. 1.9G memory consumption
>> > increase.  What's going on here?
>> >
>> > Also note that that second 'git pack-objects' invocation doesn't show
>> > any progress for ~75s.
>> >
>> > FWIW, doing the same in a 'tree:0' partial clone is fast.
>> 
>> I'm not expert on the area - by "area": the entire git code base :).
>> However, I was intrigued by this performance numbers and decided to give
>> it a try on the investigation, mostly for learning.
>
> That's the spirit!
>

:)

>> While I'm not sure
>> about the solution of the problem, I decided to share it here with the
>> hope that at least I'll be saving someone else time.
>> 
>> When I was digging into the code and adding trace2_region_*() calls, I
>> notice most of the time spent on the `git gc` (for the reported
>> situation) was in:
>> 
>>        # In builtin/pack-objects.c
>>        static void get_object_list(int ac, const char **av)
>>        {
>>                ...
>>                if (unpack_unreachable)
>>                        loosen_unused_packed_objects();
>>                ...
>>        }
>> 
>> The loosen_unused_packed_objects() will unpack unreachable objects as
>> loose objects, and given that the partial cloned .pack file is
>> incomplete, this result in writing a lot of loose objects in $GIT_DIR
>> increasing the execution time and memory consumption. This can be seen
>> by watching the $GIT_DIR/objects/ during the `git gc` execution on the
>> partial cloned repo.
>
> Indeed, that 'blob:none' partial clone grew in size temporarily to
> over 1.3GB during repacking, but luckily all those unnecessarily
> loosened objects were removed at the end.  I first noticed this issue
> while attempting to repack a considerably larger partial-cloned
> repository, which I aborted because it ate up all the memory...  I
> suppose that even if it didn't use that much memory, it would
> eventually run out of available disk space for all those loose objects
> anyway...
>
>
>> I'm not entirely sure about this (not this late in the day), but it seems to
>> me that we should simply skip the "missing" (promisor) files when
>> operating on a partial clone.
>> 
>> Perhaps something like:
>> 
>> --- >8 ---
>> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
>> index 525c2d8552..fedf58323d 100644
>> --- a/builtin/pack-objects.c
>> +++ b/builtin/pack-objects.c
>> @@ -3468,6 +3468,8 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
>>  {
>>         if (!unpack_unreachable_expiration)
>>                 return 0;
>> +       if (exclude_promisor_objects && is_promisor_object(oid))
>> +               return 1;
>>         if (mtime > unpack_unreachable_expiration)
>>                 return 0;
>>         if (oid_array_lookup(&recent_objects, oid) >= 0)
>> --- >8 ---
>> 
>> I'll try to prepare a patch for this change with proper testing, if this
>> turns out to be proper way to handle partial clone repository.
>> 
>> A quick benchmark did show some promising result:
>> 
>>     # built from: 2e36527f23 (The sixth batch, 2021-04-02)
>>     Benchmark #1: ./bin-wrappers/git -C git.git gc
>>           Time (mean ± σ):     135.669 s ±  0.665 s    [User: 42.789 s, System: 91.332 s]
>>           Range (min … max):   134.905 s … 136.115 s    3 runs
>> 
>>     # built from: 2e36527f23 + minor patch (from above)
>>     Benchmark #2: ./bin-wrappers/git -C git.git gc
>>           Time (mean ± σ):     12.586 s ±  0.031 s    [User: 11.462 s, System: 1.365 s]
>>           Range (min … max):   12.553 s … 12.616 s    3 runs
>> 
>>     Summary:
>>           'Benchmark #2' ran 10.78 ± 0.06 times faster than 'Benchmark #1'
>
> I can confirm that you change speeds up repacking considerably and it
> completely eliminates that temporary repo size explision due to
> unpacked objects, but, alas, it doesn't seem to reduce the memory
> usage.
>
> Thanks,
> Gábor

Thanks for confirming it. The memory usage, indeed, is almost the same
after the changes.  Nevertheless, Peff suggested a better approach to
addressing this issue, by realizing earlier that we should even
consider loosening the objects from the pack.

I'll try to work with that approach and see how this apply to memory
usage.

-- 
Thanks
Rafael

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-07 21:17   ` Jeff King
  2021-04-08  0:02     ` Jonathan Tan
  2021-04-12  7:09     ` Rafael Silva
@ 2021-04-12 21:36     ` SZEDER Gábor
  2021-04-12 21:49       ` Bryan Turner
  2021-04-12 23:47       ` Jeff King
  2 siblings, 2 replies; 46+ messages in thread
From: SZEDER Gábor @ 2021-04-12 21:36 UTC (permalink / raw)
  To: Jeff King; +Cc: Rafael Silva, Jonathan Tan, Git Mailing List

On Wed, Apr 07, 2021 at 05:17:50PM -0400, Jeff King wrote:
> On Mon, Apr 05, 2021 at 03:02:33AM +0200, Rafael Silva wrote:
> 
> > When I was digging into the code and adding trace2_region_*() calls, I
> > notice most of the time spent on the `git gc` (for the reported
> > situation) was in:
> > 
> >        # In builtin/pack-objects.c
> >        static void get_object_list(int ac, const char **av)
> >        {
> >                ...
> >                if (unpack_unreachable)
> >                        loosen_unused_packed_objects();
> >                ...
> >        }
> 
> Yeah, good find.
> 
> This is my first time looking at the repacking strategy for partial
> clones. It looks like we run an initial pack-objects to cover all the
> promisor objects, and then do the "real" repack for everything else,
> with "--exclude-promisor-objects".
> 
> The purpose of loosen_unused_packed_objects() is to catch any objects
> that will be lost when our caller deletes all of the packs. But in this
> case, those promisor objects are in a pack which won't be deleted, so
> they should not be included.
> 
> > Another interesting thing is, the loosen_unused_packed_objects()
> > function is being called twice because the function loads all packs
> > files, via get_all_packs(), which will return the .temp-*pack file that
> > is created by the `git pack-objects` child process from `git gc`:
> > 
> >     git pack-objects ... --delta-base-offset objects/pack/.tmp-82853-pack ...
> 
> Yes, this is the "promisor" pack created by git-repack. It seems like
> git-repack should tell pack-objects about the new pack with --keep-pack,
> so that we know it is not going to be deleted.
> 
> That would also solve the rest of the problem, I _think_. In your
> suggestion here:
> 
> > I'm not entirely sure about this (not this late in the day), but it seems to
> > me that we should simply skip the "missing" (promisor) files when
> > operating on a partial clone.
> > 
> > Perhaps something like:
> > 
> > --- >8 ---
> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > index 525c2d8552..fedf58323d 100644
> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -3468,6 +3468,8 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
> >  {
> >         if (!unpack_unreachable_expiration)
> >                 return 0;
> > +       if (exclude_promisor_objects && is_promisor_object(oid))
> > +               return 1;
> >         if (mtime > unpack_unreachable_expiration)
> >                 return 0;
> >         if (oid_array_lookup(&recent_objects, oid) >= 0)
> > --- >8 ---
> 
> you are avoiding writing out the file. But we should realize much
> earlier that it is not something we need to even consider loosening.
> 
> In the loop in loosen_unused_packed_objects(), we skip packs that are
> marked as "keep", so we'd skip the new promisor pack entirely. But we'd
> still see all these objects in the _old_ promisor pack. However, for
> each object there, we call has_sha1_pack_kept_or_nonlocal(), so that
> would likewise realize that each object is already being kept in the
> other pack.
> 
> Something like this seems to work, but I only lightly tested it, and it
> could probably use some refactoring to make it less horrible:
> 
> diff --git a/builtin/repack.c b/builtin/repack.c
> index fdee8e4578..457525953a 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -574,6 +574,23 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		repack_promisor_objects(&po_args, &names);
>  
>  		if (existing_packs.nr && delete_redundant) {
> +			/*
> +			 * tell pack-objects about our new promisor pack, which
> +			 * we will also be keeping
> +			 */
> +			for_each_string_list_item(item, &names) {
> +				/*
> +				 * yuck, we seem to only have the name with the
> +				 * packdir prefixed
> +				 */
> +				const char *prefix;
> +				if (!skip_prefix(packtmp, packdir, &prefix) ||
> +				    *prefix++ != '/')
> +					BUG("confused by packtmp");
> +				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
> +					     prefix, item->string);
> +			}
> +
>  			if (unpack_unreachable) {
>  				strvec_pushf(&cmd.args,
>  					     "--unpack-unreachable=%s",
> 

All what you wrote above makes sense to me, but I've never looked at
how partial clones work, so it doesn't mean anything...  In any case
your patch brings great speedups, but, unfortunately, the memory usage
remains as high as it was:

  $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk /home/szeder/src/git/bin-wrappers/git -C git-partial.git/ gc
  Enumerating objects: 188450, done.
  Counting objects: 100% (188450/188450), done.
  Delta compression using up to 4 threads
  Compressing objects: 100% (66623/66623), done.
  Writing objects: 100% (188450/188450), done.
  Total 188450 (delta 120109), reused 188450 (delta 120109), pack-reused 0
  elapsed: 0:15.18  max RSS: 1888332k

And git.git is not all that large, I wonder how much memory would be
necessary to 'gc' a 'blob:none' clone of e.g. chromium?! :)

BTW, this high memory usage in a partial clone is not specific to
'repack', 'fsck' suffers just as much:

  $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk git -C git-full.git/ fsck
  Checking object directories: 100% (256/256), done.
  warning in tag d6602ec5194c87b0fc87103ca4d67251c76f233a: missingTaggerEntry: invalid format - expected 'tagger' line
  Checking objects: 100% (305662/305662), done.
  Checking connectivity: 305662, done.
  Verifying commits in commit graph: 100% (65031/65031), done.
  elapsed: 0:33.99  max RSS: 281936k
  $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk git -C git-partial.git/ fsck
  Checking object directories: 100% (256/256), done.
  warning in tag d6602ec5194c87b0fc87103ca4d67251c76f233a: missingTaggerEntry: invalid format - expected 'tagger' line
  Checking objects: 100% (188450/188450), done.
  Verifying commits in commit graph: 100% (65031/65031), done.
  elapsed: 0:29.96  max RSS: 1877340k


And a somewhat related issue: when the server doesn't support filters,
then 'git clone --filter=...' prints a warning and proceeds to clone
the full repo.  Reading ba95710a3b ({fetch,upload}-pack: support
filter in protocol v2, 2018-05-03) this seems to be intentional and I
tend to think that it makes sense (though I managed to overlook that
warning twice today...  I surely wouldn't have overlooked a hard
error, but that would perhaps be too harsh in this case, dunno).
However, the resulting full clone is still marked as partial:

  $ git clone --bare --filter=blob:none https://git.kernel.org/pub/scm/git/git.git git-not-really-partial.git
  Cloning into bare repository 'git-not-really-partial.git'...
  warning: filtering not recognized by server, ignoring
  remote: Enumerating objects: 591, done.
  remote: Counting objects: 100% (591/591), done.
  remote: Compressing objects: 100% (293/293), done.
  remote: Total 305662 (delta 372), reused 393 (delta 298), pack-reused 305071
  Receiving objects: 100% (305662/305662), 96.83 MiB | 2.10 MiB/s, done.
  Resolving deltas: 100% (228123/228123), done.
  $ ls -l git-not-really-partial.git/objects/pack/
  total 107568
  -r--r--r-- 1 szeder szeder   8559608 Apr 12 21:13 pack-53f3ee0dfeaa8cea65c78473cd5904bf5ddfaa20.idx
  -r--r--r-- 1 szeder szeder 101535430 Apr 12 21:13 pack-53f3ee0dfeaa8cea65c78473cd5904bf5ddfaa20.pack
  -rw------- 1 szeder szeder     49012 Apr 12 21:13 pack-53f3ee0dfeaa8cea65c78473cd5904bf5ddfaa20.promisor
  $ cat git-not-really-partial.git/config 
  [core]
  	repositoryformatversion = 1
  	filemode = true
  	bare = true
  [remote "origin"]
  	url = https://git.kernel.org/pub/scm/git/git.git
  	promisor = true
  	partialclonefilter = blob:none

I wonder whether this is intentional, or that it is really the desired
behavior, considering that 'gc/repack/fsck' still treat it as a
partial clone, and, consequently, are affected by this slowness and
much higher memory usage, and since the repo now contains a lot more
objects than expected (all the blobs as well), they are much slower:

  $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk git -C git-not-really-partial.git/ gc
  Enumerating objects: 305662, done.
  Counting objects: 100% (305662/305662), done.
  Delta compression using up to 4 threads
  Compressing objects: 100% (75200/75200), done.
  Writing objects: 100% (305662/305662), done.
  Total 305662 (delta 228123), reused 305662 (delta 228123), pack-reused 0
  Removing duplicate objects: 100% (256/256), done.
  elapsed: 4:28.96  max RSS: 1985100k
  # with Peff's patch above:
  $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk /home/szeder/src/git/bin-wrappers/git -C git-not-really-partial.git/ gc
  Enumerating objects: 305662, done.
  Counting objects: 100% (305662/305662), done.
  Delta compression using up to 4 threads
  Compressing objects: 100% (75200/75200), done.
  Writing objects: 100% (305662/305662), done.
  Total 305662 (delta 228123), reused 305662 (delta 228123), pack-reused 0
  elapsed: 1:21.83  max RSS: 1959740k


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-12 21:36     ` SZEDER Gábor
@ 2021-04-12 21:49       ` Bryan Turner
  2021-04-12 23:51         ` Jeff King
  2021-04-12 23:47       ` Jeff King
  1 sibling, 1 reply; 46+ messages in thread
From: Bryan Turner @ 2021-04-12 21:49 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Jeff King, Rafael Silva, Jonathan Tan, Git Mailing List

On Mon, Apr 12, 2021 at 2:37 PM SZEDER Gábor <szeder.dev@gmail.com> wrote:
>
> And a somewhat related issue: when the server doesn't support filters,
> then 'git clone --filter=...' prints a warning and proceeds to clone
> the full repo.  Reading ba95710a3b ({fetch,upload}-pack: support
> filter in protocol v2, 2018-05-03) this seems to be intentional and I
> tend to think that it makes sense (though I managed to overlook that
> warning twice today...  I surely wouldn't have overlooked a hard
> error, but that would perhaps be too harsh in this case, dunno).
> However, the resulting full clone is still marked as partial:
>
>   $ git clone --bare --filter=blob:none https://git.kernel.org/pub/scm/git/git.git git-not-really-partial.git
>   Cloning into bare repository 'git-not-really-partial.git'...
>   warning: filtering not recognized by server, ignoring
>   remote: Enumerating objects: 591, done.
>   remote: Counting objects: 100% (591/591), done.
>   remote: Compressing objects: 100% (293/293), done.
>   remote: Total 305662 (delta 372), reused 393 (delta 298), pack-reused 305071
>   Receiving objects: 100% (305662/305662), 96.83 MiB | 2.10 MiB/s, done.
>   Resolving deltas: 100% (228123/228123), done.
>   $ ls -l git-not-really-partial.git/objects/pack/
>   total 107568
>   -r--r--r-- 1 szeder szeder   8559608 Apr 12 21:13 pack-53f3ee0dfeaa8cea65c78473cd5904bf5ddfaa20.idx
>   -r--r--r-- 1 szeder szeder 101535430 Apr 12 21:13 pack-53f3ee0dfeaa8cea65c78473cd5904bf5ddfaa20.pack
>   -rw------- 1 szeder szeder     49012 Apr 12 21:13 pack-53f3ee0dfeaa8cea65c78473cd5904bf5ddfaa20.promisor
>   $ cat git-not-really-partial.git/config
>   [core]
>         repositoryformatversion = 1
>         filemode = true
>         bare = true
>   [remote "origin"]
>         url = https://git.kernel.org/pub/scm/git/git.git
>         promisor = true
>         partialclonefilter = blob:none

I ran into this same surprising behavior recently, too. I was adding
some automated testing to Bitbucket for partial clones and initially
tried to use whether the repository was configured with a partial
clone filter as one of my checks, only to find that even when filters
weren't supported it was still set. The only way I could find to
detect that a partial clone that was requested didn't actually happen
was to parse the git clone output and look for the warning.

>
> I wonder whether this is intentional, or that it is really the desired
> behavior, considering that 'gc/repack/fsck' still treat it as a
> partial clone, and, consequently, are affected by this slowness and
> much higher memory usage, and since the repo now contains a lot more
> objects than expected (all the blobs as well), they are much slower:
>
>   $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk git -C git-not-really-partial.git/ gc
>   Enumerating objects: 305662, done.
>   Counting objects: 100% (305662/305662), done.
>   Delta compression using up to 4 threads
>   Compressing objects: 100% (75200/75200), done.
>   Writing objects: 100% (305662/305662), done.
>   Total 305662 (delta 228123), reused 305662 (delta 228123), pack-reused 0
>   Removing duplicate objects: 100% (256/256), done.
>   elapsed: 4:28.96  max RSS: 1985100k
>   # with Peff's patch above:
>   $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk /home/szeder/src/git/bin-wrappers/git -C git-not-really-partial.git/ gc
>   Enumerating objects: 305662, done.
>   Counting objects: 100% (305662/305662), done.
>   Delta compression using up to 4 threads
>   Compressing objects: 100% (75200/75200), done.
>   Writing objects: 100% (305662/305662), done.
>   Total 305662 (delta 228123), reused 305662 (delta 228123), pack-reused 0
>   elapsed: 1:21.83  max RSS: 1959740k
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-12 21:36     ` SZEDER Gábor
  2021-04-12 21:49       ` Bryan Turner
@ 2021-04-12 23:47       ` Jeff King
  2021-04-13  7:12         ` [PATCH 0/3] low-hanging performance fruit with promisor packs Jeff King
  2021-04-13 18:05         ` rather slow 'git repack' in 'blob:none' partial clones SZEDER Gábor
  1 sibling, 2 replies; 46+ messages in thread
From: Jeff King @ 2021-04-12 23:47 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Rafael Silva, Jonathan Tan, Git Mailing List

On Mon, Apr 12, 2021 at 11:36:53PM +0200, SZEDER Gábor wrote:

> All what you wrote above makes sense to me, but I've never looked at
> how partial clones work, so it doesn't mean anything...  In any case
> your patch brings great speedups, but, unfortunately, the memory usage
> remains as high as it was:
> 
>   $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk /home/szeder/src/git/bin-wrappers/git -C git-partial.git/ gc
>   Enumerating objects: 188450, done.
>   Counting objects: 100% (188450/188450), done.
>   Delta compression using up to 4 threads
>   Compressing objects: 100% (66623/66623), done.
>   Writing objects: 100% (188450/188450), done.
>   Total 188450 (delta 120109), reused 188450 (delta 120109), pack-reused 0
>   elapsed: 0:15.18  max RSS: 1888332k
> 
> And git.git is not all that large, I wonder how much memory would be
> necessary to 'gc' a 'blob:none' clone of e.g. chromium?! :)
> 
> BTW, this high memory usage in a partial clone is not specific to
> 'repack', 'fsck' suffers just as much:

I think the issue is in the exclude-promisor-object code paths of the
traversal. Try this:

  [two clones of git.git, one full and one partial]
  $ git clone --no-local --bare /path/to/git full.git
  $ git clone --no-local --bare --filter=blob:none /path/to/git partial.git

  [full clone is quick to traverse all objects]
  $ time git -C full.git rev-list --count --all
  63215
  real	0m0.369s
  user	0m0.365s
  sys	0m0.004s

  [partial is, too; it's the same amount of work because we're just
   looking at the commits here]
  $ time git -C partial.git rev-list --count --all
  63215
  real	0m0.373s
  user	0m0.364s
  sys	0m0.009s

  [but now ask it to exclude promisor objects, and it's much slower;
  this is because is_promisor_object() opens up each tree in the pack in
  order to see which "promised" objects it mentions]
  $ time git -C partial.git rev-list --exclude-promisor-objects --count --all
  0
  real	0m11.723s
  user	0m11.354s
  sys	0m0.369s

And I think that is the source for the memory use, too. Without that
option, we peak at 11MB heap. With it, 1.6GB. Oops. Looks like there
might be some "leaks" when we parse tree objects and then leave the
buffers connected to the structs (technically not a leak because we
still have the pointers, but obviously having every tree in memory at
once is not good).

The patch below drops the peak heap to 165MB. Still quite a bit more,
but I think it's a combination of delta-base cache (96MB) plus extra
structs for all the non-commit objects whose flags we marked.

It does seem like there's probably a good space/time tradeoff to be made
here (e.g., caching the list of "promisor" object ids for the pack
instead of inflating and reading all of the trees on the fly).

diff --git a/packfile.c b/packfile.c
index 8668345d93..b79cbc8cd4 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2247,6 +2247,7 @@ static int add_promisor_object(const struct object_id *oid,
 			return 0;
 		while (tree_entry_gently(&desc, &entry))
 			oidset_insert(set, &entry.oid);
+		free_tree_buffer(tree);
 	} else if (obj->type == OBJ_COMMIT) {
 		struct commit *commit = (struct commit *) obj;
 		struct commit_list *parents = commit->parents;
diff --git a/revision.c b/revision.c
index 553c0faa9b..fac2577748 100644
--- a/revision.c
+++ b/revision.c
@@ -3271,8 +3271,15 @@ static int mark_uninteresting(const struct object_id *oid,
 			      void *cb)
 {
 	struct rev_info *revs = cb;
+	/*
+	 * yikes, do we really need to parse here? maybe
+	 * lookup_unknown_object() would be sufficient, or
+	 * even oid_object_info() followed by the correct type
+	 */
 	struct object *o = parse_object(revs->repo, oid);
 	o->flags |= UNINTERESTING | SEEN;
+	if (o->type == OBJ_TREE)
+		free_tree_buffer((struct tree *)o);
 	return 0;
 }
 

-Peff

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-12 21:49       ` Bryan Turner
@ 2021-04-12 23:51         ` Jeff King
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2021-04-12 23:51 UTC (permalink / raw)
  To: Bryan Turner
  Cc: SZEDER Gábor, Rafael Silva, Jonathan Tan, Git Mailing List

On Mon, Apr 12, 2021 at 02:49:00PM -0700, Bryan Turner wrote:

> I ran into this same surprising behavior recently, too. I was adding
> some automated testing to Bitbucket for partial clones and initially
> tried to use whether the repository was configured with a partial
> clone filter as one of my checks, only to find that even when filters
> weren't supported it was still set. The only way I could find to
> detect that a partial clone that was requested didn't actually happen
> was to parse the git clone output and look for the warning.

I think the state of "we have all the objects, but things are marked as
partial" is some place you could actually get into naturally: you start
with a partial clone, and then later fetch the objects you need, and you
just happen to have all the objects). Or you could even start there if
the filter happens not to exclude any objects. So I don't think that
state is invalid in any way.

But I do agree that if the client _knows_ that the filter was not used
(because the other side did not advertise filters and so we did not even
send it), then it is silly to create the client-side config marking us
as partial. It is misleading at best, and makes things slower at worst.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 0/3] low-hanging performance fruit with promisor packs
  2021-04-12 23:47       ` Jeff King
@ 2021-04-13  7:12         ` Jeff King
  2021-04-13  7:15           ` [PATCH 1/3] is_promisor_object(): free tree buffer after parsing Jeff King
                             ` (5 more replies)
  2021-04-13 18:05         ` rather slow 'git repack' in 'blob:none' partial clones SZEDER Gábor
  1 sibling, 6 replies; 46+ messages in thread
From: Jeff King @ 2021-04-13  7:12 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Rafael Silva, Jonathan Tan, Git Mailing List

On Mon, Apr 12, 2021 at 07:47:41PM -0400, Jeff King wrote:

> The patch below drops the peak heap to 165MB. Still quite a bit more,
> but I think it's a combination of delta-base cache (96MB) plus extra
> structs for all the non-commit objects whose flags we marked.

I think we can do even better than that, after looking into the "do we
really need to parse the objects?" comment I left (spoiler: the answer
is no, we do not need to, at least for that caller).

Here are some cleaned-up patches that I think improve the situation
quite a bit. This is just the low-hanging fruit from this part of the
discussion; I'm sure there's more to do to make using partial clones
pleasant. In particular:

  - this does nothing for the "oops, we turned all of the promisor
    objects loose and then deleted them" problem. Hopefully Rafael
    will produce a nice patch for that

  - In is_promisor_object(), we still call parse_object(), because it
    really does look at the contents. But doing so for blobs is wasteful
    (it's a lot of bytes we push through sha1, and we don't even look at
    them). It might be worth using oid_object_info() to avoid this. This
    introduces a little overhead, but I think would be a net win (it
    would be really nice if we could amortize the object lookup work;
    i.e., if there was a way to call oid_object_info_extended() and say
    "open the object and look at the type; only return the contents if
    it's a non-blob").

  - I still think it's probably worth having a mode where we store the
    set of pointed-to objects we don't have, rather than parsing on the
    fly. This could easily be made optional (it's a good thing if you
    _have_ a lot of objects that point to a small or moderate number of
    missing objects, like a blob:limit or blob:none filter; it's a bad
    thing if you have very few objects but they point to a very large
    number).

I didn't explore any of those here, and I don't plan to look into them
anytime soon. I'm just documenting my findings for later.

Anyway, here are the patches.

  [1/3]: is_promisor_object(): free tree buffer after parsing
  [2/3]: lookup_unknown_object(): take a repository argument
  [3/3]: revision: avoid parsing with --exclude-promisor-objects

 builtin/fsck.c                   |  2 +-
 builtin/pack-objects.c           |  2 +-
 http-push.c                      |  2 +-
 object.c                         |  7 +++----
 object.h                         |  2 +-
 packfile.c                       |  1 +
 refs.c                           |  2 +-
 revision.c                       |  2 +-
 t/helper/test-example-decorate.c |  6 +++---
 t/perf/p5600-partial-clone.sh    | 12 ++++++++++++
 upload-pack.c                    |  2 +-
 walker.c                         |  2 +-
 12 files changed, 27 insertions(+), 15 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 1/3] is_promisor_object(): free tree buffer after parsing
  2021-04-13  7:12         ` [PATCH 0/3] low-hanging performance fruit with promisor packs Jeff King
@ 2021-04-13  7:15           ` Jeff King
  2021-04-13 20:17             ` Junio C Hamano
  2021-04-13  7:16           ` [PATCH 2/3] lookup_unknown_object(): take a repository argument Jeff King
                             ` (4 subsequent siblings)
  5 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2021-04-13  7:15 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Rafael Silva, Jonathan Tan, Git Mailing List

To get the list of all promisor objects, we not only include all objects
in promisor packs, but also parse each of those objects to see which
objects they reference. After parsing a tree object, the tree->buffer
field will remain populated until we explicitly free it. So in a partial
clone of blob:none, for example, we are essentially reading every tree
in the repository (since they're all in the initial promisor pack), and
keeping all of their uncompressed contents in memory at once.

This patch frees the tree buffers after we've finished marking all of
their reachable objects. We shouldn't need to do this for any other
object type. While we are using some extra memory to store the structs,
no other object type stores the whole contents in its parsed form (we do
sometimes hold on to commit buffers, but less so these days due to
commit graphs, plus most commands which care about promisor objects turn
off the save_commit_buffer global).

Even for a moderate-sized repository like git.git, this patch drops the
peak heap (as measured by massif) for git-fsck from ~1.7GB to ~138MB.
Fsck is a good candidate for measuring here because it doesn't interact
with the promisor code except to call is_promisor_object(), so we can
isolate just this problem.

The added perf test shows only a tiny improvement on my machine for
git.git, since 1.7GB isn't enough to cause any real memory pressure:

  Test                                 HEAD^               HEAD
  --------------------------------------------------------------------------------
  5600.4: fsck                         21.26(20.90+0.35)   20.84(20.79+0.04) -2.0%

With linux.git the absolute change is a bit bigger, though still a small
percentage:

  Test                          HEAD^                 HEAD
  -----------------------------------------------------------------------------
  5600.4: fsck                  262.26(259.13+3.12)   254.92(254.62+0.29) -2.8%

I didn't have the patience to run it under massif with linux.git, but
it's probably on the order of about 14GB improvement, since that's the
sum of the sizes of all of the uncompressed trees (but still isn't
enough to create memory pressure on this particular machine, which has
64GB of RAM). Smaller machines would probably see a bigger effect on
runtime (and sadly our perf suite does not measure peak heap).

Signed-off-by: Jeff King <peff@peff.net>
---
 packfile.c                    | 1 +
 t/perf/p5600-partial-clone.sh | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/packfile.c b/packfile.c
index 8668345d93..b79cbc8cd4 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2247,6 +2247,7 @@ static int add_promisor_object(const struct object_id *oid,
 			return 0;
 		while (tree_entry_gently(&desc, &entry))
 			oidset_insert(set, &entry.oid);
+		free_tree_buffer(tree);
 	} else if (obj->type == OBJ_COMMIT) {
 		struct commit *commit = (struct commit *) obj;
 		struct commit_list *parents = commit->parents;
diff --git a/t/perf/p5600-partial-clone.sh b/t/perf/p5600-partial-clone.sh
index 3e04bd2ae1..754aaec3dc 100755
--- a/t/perf/p5600-partial-clone.sh
+++ b/t/perf/p5600-partial-clone.sh
@@ -23,4 +23,8 @@ test_perf 'checkout of result' '
 	git -C worktree checkout -f
 '
 
+test_perf 'fsck' '
+	git -C bare.git fsck
+'
+
 test_done
-- 
2.31.1.659.g9b9913af63


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 2/3] lookup_unknown_object(): take a repository argument
  2021-04-13  7:12         ` [PATCH 0/3] low-hanging performance fruit with promisor packs Jeff King
  2021-04-13  7:15           ` [PATCH 1/3] is_promisor_object(): free tree buffer after parsing Jeff King
@ 2021-04-13  7:16           ` Jeff King
  2021-04-13  7:17           ` [PATCH 3/3] revision: avoid parsing with --exclude-promisor-objects Jeff King
                             ` (3 subsequent siblings)
  5 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2021-04-13  7:16 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Rafael Silva, Jonathan Tan, Git Mailing List

All of the other lookup_foo() functions take a repository argument, but
lookup_unknown_object() was never converted, and it uses the_repository
internally. Let's fix that.

We could leave a wrapper that uses the_repository, but there aren't that
many calls, so we'll just convert them all. I looked briefly at each
site to see if we had a repository struct (besides the_repository) we
could pass, but none of them do (so this conversion to pass
the_repository is a pure noop in each case, though it does take us one
step closer to eventually getting rid of the_repository).

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fsck.c                   | 2 +-
 builtin/pack-objects.c           | 2 +-
 http-push.c                      | 2 +-
 object.c                         | 7 +++----
 object.h                         | 2 +-
 refs.c                           | 2 +-
 t/helper/test-example-decorate.c | 6 +++---
 upload-pack.c                    | 2 +-
 walker.c                         | 2 +-
 9 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 70ff95837a..e6a80e5404 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -725,7 +725,7 @@ static int fsck_cache_tree(struct cache_tree *it)
 
 static void mark_object_for_connectivity(const struct object_id *oid)
 {
-	struct object *obj = lookup_unknown_object(oid);
+	struct object *obj = lookup_unknown_object(the_repository, oid);
 	obj->flags |= HAS_OBJ;
 }
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 525c2d8552..c1186f50a3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3386,7 +3386,7 @@ static void add_objects_in_unpacked_packs(void)
 
 		for (i = 0; i < p->num_objects; i++) {
 			nth_packed_object_id(&oid, p, i);
-			o = lookup_unknown_object(&oid);
+			o = lookup_unknown_object(the_repository, &oid);
 			if (!(o->flags & OBJECT_ADDED))
 				mark_in_pack_object(o, p, &in_pack);
 			o->flags |= OBJECT_ADDED;
diff --git a/http-push.c b/http-push.c
index b60d5fcc85..813123242e 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1436,7 +1436,7 @@ static void one_remote_ref(const char *refname)
 	 * may be required for updating server info later.
 	 */
 	if (repo->can_update_info_refs && !has_object_file(&ref->old_oid)) {
-		obj = lookup_unknown_object(&ref->old_oid);
+		obj = lookup_unknown_object(the_repository, &ref->old_oid);
 		fprintf(stderr,	"  fetch %s for %s\n",
 			oid_to_hex(&ref->old_oid), refname);
 		add_fetch_request(obj);
diff --git a/object.c b/object.c
index 78343781ae..14188453c5 100644
--- a/object.c
+++ b/object.c
@@ -177,12 +177,11 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet)
 	}
 }
 
-struct object *lookup_unknown_object(const struct object_id *oid)
+struct object *lookup_unknown_object(struct repository *r, const struct object_id *oid)
 {
-	struct object *obj = lookup_object(the_repository, oid);
+	struct object *obj = lookup_object(r, oid);
 	if (!obj)
-		obj = create_object(the_repository, oid,
-				    alloc_object_node(the_repository));
+		obj = create_object(r, oid, alloc_object_node(r));
 	return obj;
 }
 
diff --git a/object.h b/object.h
index 59daadce21..87a6da47c8 100644
--- a/object.h
+++ b/object.h
@@ -145,7 +145,7 @@ struct object *parse_object_or_die(const struct object_id *oid, const char *name
 struct object *parse_object_buffer(struct repository *r, const struct object_id *oid, enum object_type type, unsigned long size, void *buffer, int *eaten_p);
 
 /** Returns the object, with potentially excess memory allocated. **/
-struct object *lookup_unknown_object(const struct object_id *oid);
+struct object *lookup_unknown_object(struct repository *r, const struct object_id *oid);
 
 struct object_list *object_list_insert(struct object *item,
 				       struct object_list **list_p);
diff --git a/refs.c b/refs.c
index 261fd82beb..1616c7554a 100644
--- a/refs.c
+++ b/refs.c
@@ -337,7 +337,7 @@ static int filter_refs(const char *refname, const struct object_id *oid,
 
 enum peel_status peel_object(const struct object_id *name, struct object_id *oid)
 {
-	struct object *o = lookup_unknown_object(name);
+	struct object *o = lookup_unknown_object(the_repository, name);
 
 	if (o->type == OBJ_NONE) {
 		int type = oid_object_info(the_repository, name, NULL);
diff --git a/t/helper/test-example-decorate.c b/t/helper/test-example-decorate.c
index c8a1cde7d2..b9d1200eb9 100644
--- a/t/helper/test-example-decorate.c
+++ b/t/helper/test-example-decorate.c
@@ -26,8 +26,8 @@ int cmd__example_decorate(int argc, const char **argv)
 	 * Add 2 objects, one with a non-NULL decoration and one with a NULL
 	 * decoration.
 	 */
-	one = lookup_unknown_object(&one_oid);
-	two = lookup_unknown_object(&two_oid);
+	one = lookup_unknown_object(the_repository, &one_oid);
+	two = lookup_unknown_object(the_repository, &two_oid);
 	ret = add_decoration(&n, one, &decoration_a);
 	if (ret)
 		BUG("when adding a brand-new object, NULL should be returned");
@@ -56,7 +56,7 @@ int cmd__example_decorate(int argc, const char **argv)
 	ret = lookup_decoration(&n, two);
 	if (ret != &decoration_b)
 		BUG("lookup should return added declaration");
-	three = lookup_unknown_object(&three_oid);
+	three = lookup_unknown_object(the_repository, &three_oid);
 	ret = lookup_decoration(&n, three);
 	if (ret)
 		BUG("lookup for unknown object should return NULL");
diff --git a/upload-pack.c b/upload-pack.c
index e19583ae0f..5c1cd19612 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -1153,7 +1153,7 @@ static void receive_needs(struct upload_pack_data *data,
 static int mark_our_ref(const char *refname, const char *refname_full,
 			const struct object_id *oid)
 {
-	struct object *o = lookup_unknown_object(oid);
+	struct object *o = lookup_unknown_object(the_repository, oid);
 
 	if (ref_is_hidden(refname, refname_full)) {
 		o->flags |= HIDDEN_REF;
diff --git a/walker.c b/walker.c
index 4984bf8b3d..c5e2921979 100644
--- a/walker.c
+++ b/walker.c
@@ -298,7 +298,7 @@ int walker_fetch(struct walker *walker, int targets, char **target,
 			error("Could not interpret response from server '%s' as something to pull", target[i]);
 			goto done;
 		}
-		if (process(walker, lookup_unknown_object(&oids[i])))
+		if (process(walker, lookup_unknown_object(the_repository, &oids[i])))
 			goto done;
 	}
 
-- 
2.31.1.659.g9b9913af63


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 3/3] revision: avoid parsing with --exclude-promisor-objects
  2021-04-13  7:12         ` [PATCH 0/3] low-hanging performance fruit with promisor packs Jeff King
  2021-04-13  7:15           ` [PATCH 1/3] is_promisor_object(): free tree buffer after parsing Jeff King
  2021-04-13  7:16           ` [PATCH 2/3] lookup_unknown_object(): take a repository argument Jeff King
@ 2021-04-13  7:17           ` Jeff King
  2021-04-13 20:22             ` Junio C Hamano
  2021-04-13 18:10           ` [PATCH 0/3] low-hanging performance fruit with promisor packs SZEDER Gábor
                             ` (2 subsequent siblings)
  5 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2021-04-13  7:17 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Rafael Silva, Jonathan Tan, Git Mailing List

When --exclude-promisor-objects is given, before traversing any objects
we iterate over all of the objects in any promisor packs, marking them
as UNINTERESTING and SEEN. We turn the oid we get from iterating the
pack into an object with parse_object(), but this has two problems:

  - it's slow; we are zlib inflating (and reconstructing from deltas)
    every byte of every object in the packfile

  - it leaves the tree buffers attached to their structs, which means
    our heap usage will grow to store every uncompressed tree
    simultaneously. This can be gigabytes.

We can obviously fix the second by freeing the tree buffers after we've
parsed them. But we can observe that the function doesn't look at the
object contents at all! The only reason we call parse_object() is that
we need a "struct object" on which to set the flags. There are two
options here:

  - we can look up just the object type via oid_object_info(), and then
    call the appropriate lookup_foo() function

  - we can call lookup_unknown_object(), which gives us an OBJ_NONE
    struct (which will get auto-converted later by object_as_type() via
    calls to lookup_commit(), etc).

The first one is closer to the current code, but we do pay the price to
look up the type for each object. The latter should be more efficient in
CPU, though it wastes a little bit of memory (the "unknown" object
structs are a union of all object types, so some of the structs are
bigger than they need to be). It also runs the risk of triggering a
latent bug in code that calls lookup_object() directly but isn't ready
to handle OBJ_NONE (such code would already be buggy, but we use
lookup_unknown_object() infrequently enough that it might be hiding).

I went with the second option here. I don't think the risk is high (and
we'd want to find and fix any such bugs anyway), and it should be more
efficient overall.

The new tests in p5600 show off the improvement (this is on git.git):

  Test                                 HEAD^               HEAD
  -------------------------------------------------------------------------------
  5600.5: count commits                0.37(0.37+0.00)     0.38(0.38+0.00) +2.7%
  5600.6: count non-promisor commits   11.74(11.37+0.37)   0.04(0.03+0.00) -99.7%

The improvement is particularly big in this script because _every_
object in the newly-cloned partial repo is a promisor object. So after
marking them all, there's nothing left to traverse.

Signed-off-by: Jeff King <peff@peff.net>
---
 revision.c                    | 2 +-
 t/perf/p5600-partial-clone.sh | 8 ++++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/revision.c b/revision.c
index 553c0faa9b..7e73dafd96 100644
--- a/revision.c
+++ b/revision.c
@@ -3271,7 +3271,7 @@ static int mark_uninteresting(const struct object_id *oid,
 			      void *cb)
 {
 	struct rev_info *revs = cb;
-	struct object *o = parse_object(revs->repo, oid);
+	struct object *o = lookup_unknown_object(revs->repo, oid);
 	o->flags |= UNINTERESTING | SEEN;
 	return 0;
 }
diff --git a/t/perf/p5600-partial-clone.sh b/t/perf/p5600-partial-clone.sh
index 754aaec3dc..ca785a3341 100755
--- a/t/perf/p5600-partial-clone.sh
+++ b/t/perf/p5600-partial-clone.sh
@@ -27,4 +27,12 @@ test_perf 'fsck' '
 	git -C bare.git fsck
 '
 
+test_perf 'count commits' '
+	git -C bare.git rev-list --all --count
+'
+
+test_perf 'count non-promisor commits' '
+	git -C bare.git rev-list --all --count --exclude-promisor-objects
+'
+
 test_done
-- 
2.31.1.659.g9b9913af63

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-12 23:47       ` Jeff King
  2021-04-13  7:12         ` [PATCH 0/3] low-hanging performance fruit with promisor packs Jeff King
@ 2021-04-13 18:05         ` SZEDER Gábor
  2021-04-14  5:14           ` Jeff King
  1 sibling, 1 reply; 46+ messages in thread
From: SZEDER Gábor @ 2021-04-13 18:05 UTC (permalink / raw)
  To: Jeff King; +Cc: Rafael Silva, Jonathan Tan, Git Mailing List

On Mon, Apr 12, 2021 at 07:47:41PM -0400, Jeff King wrote:
> On Mon, Apr 12, 2021 at 11:36:53PM +0200, SZEDER Gábor wrote:
> 
> > All what you wrote above makes sense to me, but I've never looked at
> > how partial clones work, so it doesn't mean anything...  In any case
> > your patch brings great speedups, but, unfortunately, the memory usage
> > remains as high as it was:
> > 
> >   $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk /home/szeder/src/git/bin-wrappers/git -C git-partial.git/ gc
> >   Enumerating objects: 188450, done.
> >   Counting objects: 100% (188450/188450), done.
> >   Delta compression using up to 4 threads
> >   Compressing objects: 100% (66623/66623), done.
> >   Writing objects: 100% (188450/188450), done.
> >   Total 188450 (delta 120109), reused 188450 (delta 120109), pack-reused 0
> >   elapsed: 0:15.18  max RSS: 1888332k
> > 
> > And git.git is not all that large, I wonder how much memory would be
> > necessary to 'gc' a 'blob:none' clone of e.g. chromium?! :)
> > 
> > BTW, this high memory usage in a partial clone is not specific to
> > 'repack', 'fsck' suffers just as much:
> 
> I think the issue is in the exclude-promisor-object code paths of the
> traversal. Try this:
> 
>   [two clones of git.git, one full and one partial]
>   $ git clone --no-local --bare /path/to/git full.git
>   $ git clone --no-local --bare --filter=blob:none /path/to/git partial.git
> 
>   [full clone is quick to traverse all objects]
>   $ time git -C full.git rev-list --count --all
>   63215
>   real	0m0.369s
>   user	0m0.365s
>   sys	0m0.004s
> 
>   [partial is, too; it's the same amount of work because we're just
>    looking at the commits here]
>   $ time git -C partial.git rev-list --count --all
>   63215
>   real	0m0.373s
>   user	0m0.364s
>   sys	0m0.009s
> 
>   [but now ask it to exclude promisor objects, and it's much slower;
>   this is because is_promisor_object() opens up each tree in the pack in
>   order to see which "promised" objects it mentions]

I don't understand this: 'git rev-list --count --all' only counts
commit objects, so why should it open any trees at all?

>   $ time git -C partial.git rev-list --exclude-promisor-objects --count --all
>   0
>   real	0m11.723s
>   user	0m11.354s
>   sys	0m0.369s
> 
> And I think that is the source for the memory use, too. Without that
> option, we peak at 11MB heap. With it, 1.6GB. Oops. Looks like there
> might be some "leaks" when we parse tree objects and then leave the
> buffers connected to the structs (technically not a leak because we
> still have the pointers, but obviously having every tree in memory at
> once is not good).
> 
> The patch below drops the peak heap to 165MB. Still quite a bit more,
> but I think it's a combination of delta-base cache (96MB) plus extra
> structs for all the non-commit objects whose flags we marked.
> 
> It does seem like there's probably a good space/time tradeoff to be made
> here (e.g., caching the list of "promisor" object ids for the pack
> instead of inflating and reading all of the trees on the fly).
> 
> diff --git a/packfile.c b/packfile.c
> index 8668345d93..b79cbc8cd4 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -2247,6 +2247,7 @@ static int add_promisor_object(const struct object_id *oid,
>  			return 0;
>  		while (tree_entry_gently(&desc, &entry))
>  			oidset_insert(set, &entry.oid);
> +		free_tree_buffer(tree);
>  	} else if (obj->type == OBJ_COMMIT) {
>  		struct commit *commit = (struct commit *) obj;
>  		struct commit_list *parents = commit->parents;
> diff --git a/revision.c b/revision.c
> index 553c0faa9b..fac2577748 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -3271,8 +3271,15 @@ static int mark_uninteresting(const struct object_id *oid,
>  			      void *cb)
>  {
>  	struct rev_info *revs = cb;
> +	/*
> +	 * yikes, do we really need to parse here? maybe

Heh, a "yikes" here and a "yuck" in your previous patch...  This issue
was worth reporting :)

> +	 * lookup_unknown_object() would be sufficient, or
> +	 * even oid_object_info() followed by the correct type
> +	 */
>  	struct object *o = parse_object(revs->repo, oid);
>  	o->flags |= UNINTERESTING | SEEN;
> +	if (o->type == OBJ_TREE)
> +		free_tree_buffer((struct tree *)o);
>  	return 0;
>  }
>  
> 
> -Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] low-hanging performance fruit with promisor packs
  2021-04-13  7:12         ` [PATCH 0/3] low-hanging performance fruit with promisor packs Jeff King
                             ` (2 preceding siblings ...)
  2021-04-13  7:17           ` [PATCH 3/3] revision: avoid parsing with --exclude-promisor-objects Jeff King
@ 2021-04-13 18:10           ` SZEDER Gábor
  2021-04-14 17:14           ` Jonathan Tan
  2021-04-14 19:22           ` Rafael Silva
  5 siblings, 0 replies; 46+ messages in thread
From: SZEDER Gábor @ 2021-04-13 18:10 UTC (permalink / raw)
  To: Jeff King; +Cc: Rafael Silva, Jonathan Tan, Git Mailing List

On Tue, Apr 13, 2021 at 03:12:54AM -0400, Jeff King wrote:
> On Mon, Apr 12, 2021 at 07:47:41PM -0400, Jeff King wrote:
> 
> > The patch below drops the peak heap to 165MB. Still quite a bit more,
> > but I think it's a combination of delta-base cache (96MB) plus extra
> > structs for all the non-commit objects whose flags we marked.
> 
> I think we can do even better than that, after looking into the "do we
> really need to parse the objects?" comment I left (spoiler: the answer
> is no, we do not need to, at least for that caller).
> 
> Here are some cleaned-up patches that I think improve the situation
> quite a bit.

> Anyway, here are the patches.
> 
>   [1/3]: is_promisor_object(): free tree buffer after parsing
>   [2/3]: lookup_unknown_object(): take a repository argument
>   [3/3]: revision: avoid parsing with --exclude-promisor-objects

I tried these patches together with your first in this thread [1], and
then could finally 'gc' my 'blob:none' chromium clone in:

  elapsed: 3:23.64  max RSS: 3206552k

Thanks.


[1] https://public-inbox.org/git/YG4hfge2y%2FAmcklZ@coredump.intra.peff.net/


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/3] is_promisor_object(): free tree buffer after parsing
  2021-04-13  7:15           ` [PATCH 1/3] is_promisor_object(): free tree buffer after parsing Jeff King
@ 2021-04-13 20:17             ` Junio C Hamano
  2021-04-14  5:18               ` Jeff King
  0 siblings, 1 reply; 46+ messages in thread
From: Junio C Hamano @ 2021-04-13 20:17 UTC (permalink / raw)
  To: Jeff King; +Cc: SZEDER Gábor, Rafael Silva, Jonathan Tan, Git Mailing List

Jeff King <peff@peff.net> writes:

> The added perf test shows only a tiny improvement on my machine for
> git.git, since 1.7GB isn't enough to cause any real memory pressure:
>
>   Test                                 HEAD^               HEAD
>   --------------------------------------------------------------------------------
>   5600.4: fsck                         21.26(20.90+0.35)   20.84(20.79+0.04) -2.0%
>
> With linux.git the absolute change is a bit bigger, though still a small
> percentage:
>
>   Test                          HEAD^                 HEAD
>   -----------------------------------------------------------------------------
>   5600.4: fsck                  262.26(259.13+3.12)   254.92(254.62+0.29) -2.8%
>
> I didn't have the patience to run it under massif with linux.git, but
> it's probably on the order of about 14GB improvement, since that's the
> sum of the sizes of all of the uncompressed trees (but still isn't
> enough to create memory pressure on this particular machine, which has
> 64GB of RAM). Smaller machines would probably see a bigger effect on
> runtime (and sadly our perf suite does not measure peak heap).
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  packfile.c                    | 1 +
>  t/perf/p5600-partial-clone.sh | 4 ++++
>  2 files changed, 5 insertions(+)
>
> diff --git a/packfile.c b/packfile.c
> index 8668345d93..b79cbc8cd4 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -2247,6 +2247,7 @@ static int add_promisor_object(const struct object_id *oid,
>  			return 0;
>  		while (tree_entry_gently(&desc, &entry))
>  			oidset_insert(set, &entry.oid);
> +		free_tree_buffer(tree);
>  	} else if (obj->type == OBJ_COMMIT) {
>  		struct commit *commit = (struct commit *) obj;
>  		struct commit_list *parents = commit->parents;

Hmph, does an added free() without removing one later mean we've
been leaking?

Nicely done.  Thanks.


> diff --git a/t/perf/p5600-partial-clone.sh b/t/perf/p5600-partial-clone.sh
> index 3e04bd2ae1..754aaec3dc 100755
> --- a/t/perf/p5600-partial-clone.sh
> +++ b/t/perf/p5600-partial-clone.sh
> @@ -23,4 +23,8 @@ test_perf 'checkout of result' '
>  	git -C worktree checkout -f
>  '
>  
> +test_perf 'fsck' '
> +	git -C bare.git fsck
> +'
> +
>  test_done

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] revision: avoid parsing with --exclude-promisor-objects
  2021-04-13  7:17           ` [PATCH 3/3] revision: avoid parsing with --exclude-promisor-objects Jeff King
@ 2021-04-13 20:22             ` Junio C Hamano
  0 siblings, 0 replies; 46+ messages in thread
From: Junio C Hamano @ 2021-04-13 20:22 UTC (permalink / raw)
  To: Jeff King; +Cc: SZEDER Gábor, Rafael Silva, Jonathan Tan, Git Mailing List

Jeff King <peff@peff.net> writes:

> ... The only reason we call parse_object() is that
> we need a "struct object" on which to set the flags. There are two
> options here:
>
>   - we can look up just the object type via oid_object_info(), and then
>     call the appropriate lookup_foo() function
>
>   - we can call lookup_unknown_object(), which gives us an OBJ_NONE
>     struct (which will get auto-converted later by object_as_type() via
>     calls to lookup_commit(), etc).
>
> The first one is closer to the current code, but we do pay the price to
> look up the type for each object. The latter should be more efficient in
> CPU, though it wastes a little bit of memory (the "unknown" object

That's clever.  I like it.

>   5600.5: count commits                0.37(0.37+0.00)     0.38(0.38+0.00) +2.7%
>   5600.6: count non-promisor commits   11.74(11.37+0.37)   0.04(0.03+0.00) -99.7%
>
> The improvement is particularly big in this script because _every_
> object in the newly-cloned partial repo is a promisor object. So after
> marking them all, there's nothing left to traverse.

;-).

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-13 18:05         ` rather slow 'git repack' in 'blob:none' partial clones SZEDER Gábor
@ 2021-04-14  5:14           ` Jeff King
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2021-04-14  5:14 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Rafael Silva, Jonathan Tan, Git Mailing List

On Tue, Apr 13, 2021 at 08:05:52PM +0200, SZEDER Gábor wrote:

> >   [but now ask it to exclude promisor objects, and it's much slower;
> >   this is because is_promisor_object() opens up each tree in the pack in
> >   order to see which "promised" objects it mentions]
> 
> I don't understand this: 'git rev-list --count --all' only counts
> commit objects, so why should it open any trees at all?

Because the promisor code is a bit over-eager. There's actually one
small error in what I wrote above. In that particular rev-list, we're
not calling is_promisor_object() at all, because we'll already have
excluded all the promisor objects by marking them UNINTERESTING and
SEEN.

So:

  - in is_promisor_object(), we load the _whole_ list of promisor
    objects, which requires opening trees to find out about referenced
    blobs. In theory it could know that we only care about commits, but
    it's not connected to a particular traversal, so it gets the whole
    list.

  - mark_uninteresting() is the code where we pre-mark the objects from
    the promisor pack as UNINTERESTING, and that was loading all of the
    trees in this case. And it could know that we are not looking at
    --objects, so there's no need to touch trees. But after my patches,
    we do not load the contents of _any_ objects at all in that
    function. We could avoid even creating "struct object" for
    non-commits there, too, but that would imply looking up the type of
    each object (so more CPU, though it would save us some memory when
    we only care about commits). I suspect in practice that most callers
    would generally pass --objects anyway, though (e.g., your original
    pack-objects that started this thread certainly cares about
    non-commits).

> > +	/*
> > +	 * yikes, do we really need to parse here? maybe
> 
> Heh, a "yikes" here and a "yuck" in your previous patch...  This issue
> was worth reporting :)

Yeah. I think the client side of a lot of this partial-clone / promisor
stuff is not very mature. It's waiting on people to start using it and
finding all of these rough edges.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/3] is_promisor_object(): free tree buffer after parsing
  2021-04-13 20:17             ` Junio C Hamano
@ 2021-04-14  5:18               ` Jeff King
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2021-04-14  5:18 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: SZEDER Gábor, Rafael Silva, Jonathan Tan, Git Mailing List

On Tue, Apr 13, 2021 at 01:17:55PM -0700, Junio C Hamano wrote:

> > diff --git a/packfile.c b/packfile.c
> > index 8668345d93..b79cbc8cd4 100644
> > --- a/packfile.c
> > +++ b/packfile.c
> > @@ -2247,6 +2247,7 @@ static int add_promisor_object(const struct object_id *oid,
> >  			return 0;
> >  		while (tree_entry_gently(&desc, &entry))
> >  			oidset_insert(set, &entry.oid);
> > +		free_tree_buffer(tree);
> >  	} else if (obj->type == OBJ_COMMIT) {
> >  		struct commit *commit = (struct commit *) obj;
> >  		struct commit_list *parents = commit->parents;
> 
> Hmph, does an added free() without removing one later mean we've
> been leaking?

Yes. Though perhaps not technically a leak, in that we are still holding
on to the "struct tree" entries via the obj_hash table. But nobody was
freeing them at all until the end of the program.

I actually think it may be a mistake for "struct tree" to have
buffer/len fields at all. It is a slight convenience to be able to pass
them around with the struct, but it makes the expected lifetime much
more confusing. In practice, all code wants to deal with one tree at a
time, then drop the buffer when it's done (we might hold several when
recursing through subtrees, but we'd never hold more than the distance
from the leaf to the root, and each recursive invocation of something
like process_tree() is holding exactly one tree buffer).

It may not be worth the trouble to try to clean it up at this point,
though.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] low-hanging performance fruit with promisor packs
  2021-04-13  7:12         ` [PATCH 0/3] low-hanging performance fruit with promisor packs Jeff King
                             ` (3 preceding siblings ...)
  2021-04-13 18:10           ` [PATCH 0/3] low-hanging performance fruit with promisor packs SZEDER Gábor
@ 2021-04-14 17:14           ` Jonathan Tan
  2021-04-14 19:22           ` Rafael Silva
  5 siblings, 0 replies; 46+ messages in thread
From: Jonathan Tan @ 2021-04-14 17:14 UTC (permalink / raw)
  To: peff; +Cc: szeder.dev, rafaeloliveira.cs, jonathantanmy, git

> Here are some cleaned-up patches that I think improve the situation
> quite a bit. This is just the low-hanging fruit from this part of the
> discussion; I'm sure there's more to do to make using partial clones
> pleasant. In particular:

[snip]

Thanks, Peff. These patches look good to me.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 0/2] prevent `repack` to unpack and delete promisor objects
  2021-04-03  9:04 rather slow 'git repack' in 'blob:none' partial clones SZEDER Gábor
  2021-04-05  1:02 ` Rafael Silva
@ 2021-04-14 19:14 ` Rafael Silva
  2021-04-14 19:14   ` [PATCH 1/2] repack: teach --no-prune-packed to skip `git prune-packed` Rafael Silva
                     ` (4 more replies)
  1 sibling, 5 replies; 46+ messages in thread
From: Rafael Silva @ 2021-04-14 19:14 UTC (permalink / raw)
  To: git; +Cc: SZEDER Gábor, Jeff King, Jonathan Tan, Rafael Silva

This series is built on top of jk/promisor-optim. It conflicts with
changes on p5600 otherwise.

The following patches fixes the issue where we unnecessarily turn loose
all the promisor objects and deletes them right after when running
`repack -A -d ..` (via `git gc) for a partial repository. 

Special thanks to Peff, for proposing a better approach for managing
the situation and for Jonathan Tan for earlier interaction on the
solution. Previously, I thought we should skip the promisor objects
by just adding a check in loosened_object_can_be_discarded(). However,
Peff pointed out that we can do better by realizing much sooner that
we should not even consider loosening the objects for the _old_ promisor
packs.

It took me a bit to come up with the test because it seems `repack`
doesn't offer an option to skip the "deletion of unpacked objects",
so this series adds a new option to `repack` for skip the
`git prune-packed` execution thus allowing us to easily inspect the
unpacked objects before they are removed and simplification of our
test suite. Furthermore, The test will now test the `repack` code
path instead of performing the operations by calling
`pack-objects`.

Rafael Silva (2):
  repack: teach --no-prune-packed to skip `git prune-packed`
  repack: avoid loosening promisor pack objects in partial clones

 Documentation/git-repack.txt  |  5 +++++
 builtin/repack.c              | 15 ++++++++++++---
 t/perf/p5600-partial-clone.sh |  4 ++++
 t/t5616-partial-clone.sh      |  9 +++++++++
 t/t7700-repack.sh             | 23 +++++++++++------------
 5 files changed, 41 insertions(+), 15 deletions(-)

-- 
2.31.0.565.gcc42f43761


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 1/2] repack: teach --no-prune-packed to skip `git prune-packed`
  2021-04-14 19:14 ` [PATCH 0/2] prevent `repack` to unpack and delete promisor objects Rafael Silva
@ 2021-04-14 19:14   ` Rafael Silva
  2021-04-14 23:50     ` Jonathan Tan
  2021-04-14 19:14   ` [PATCH 2/2] repack: avoid loosening promisor pack objects in partial clones Rafael Silva
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 46+ messages in thread
From: Rafael Silva @ 2021-04-14 19:14 UTC (permalink / raw)
  To: git; +Cc: SZEDER Gábor, Jeff King, Jonathan Tan, Rafael Silva

The `git repack -d` command will remove any packfile that becomes
redundant after repacking, and then call  `git pruned-packed` for
pruning any unpacked objects. The command, however, does not offer
an option to skip the `pruned-packed` execution in order to allow
inspecting the objects before they are removed, forcing developers
to either make some temporary code changes or to manually craft
the `pack-objects` command in order to skip their deletion:

    git pack-objects --honor-pack-keep --non-empty --all --reflog \
    	--unpack-unreachable

Let’s teach `repack -d` to take --no-pruned-packed option that will
simply skip the call to `git prune-packed`, thus providing an easy
way to debug the `repack` command. This also allows us to simplify
our test suite by replacing calls to `pack-objects` with `repack`:

    git repack -A -d --no-prune-packed

Moreover, using the repack command will actually test its code path
instead of just testing the real repack operation by calling
`git pack-objects` directly. Let's refactor two tests in t7700 with
the new option.

Aside from improving our test suite, this new option will be used in
an upcoming commit for testing the behavior of `repack` with partial
clone repositories.

Signed-off-by: Rafael Silva <rafaeloliveira.cs@gmail.com>
---
 Documentation/git-repack.txt |  5 +++++
 builtin/repack.c             |  6 +++++-
 t/t7700-repack.sh            | 23 +++++++++++------------
 3 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 317d63cf0d..6ff7a1be16 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -86,6 +86,11 @@ to the new separate pack will be written.
 	this repository (or a direct copy of it)
 	over HTTP or FTP.  See linkgit:git-update-server-info[1].
 
+--no-prune-packed::
+	Only useful with `-d`.  Do not remove redundant loose object
+	files by skipping the execution of `git prune-packed`.
+	See linkgit:git-prune-packed[1].
+
 --window=<n>::
 --depth=<n>::
 	These two options affect how the objects contained in the pack are
diff --git a/builtin/repack.c b/builtin/repack.c
index 2847fdfbab..6baaeb979c 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -452,6 +452,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int keep_unreachable = 0;
 	struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
 	int no_update_server_info = 0;
+	int no_prune_packed = 0;
 	struct pack_objects_args po_args = {NULL};
 	int geometric_factor = 0;
 
@@ -469,6 +470,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("pass --no-reuse-object to git-pack-objects")),
 		OPT_BOOL('n', NULL, &no_update_server_info,
 				N_("do not run git-update-server-info")),
+		OPT_BOOL(0, "no-prune-packed", &no_prune_packed,
+				N_("do not run git-prune-packed")),
 		OPT__QUIET(&po_args.quiet, N_("be quiet")),
 		OPT_BOOL('l', "local", &po_args.local,
 				N_("pass --local to git-pack-objects")),
@@ -707,7 +710,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		}
 		if (!po_args.quiet && isatty(2))
 			opts |= PRUNE_PACKED_VERBOSE;
-		prune_packed_objects(opts);
+		if (!no_prune_packed)
+			prune_packed_objects(opts);
 
 		if (!keep_unreachable &&
 		    (!(pack_everything & LOOSEN_UNREACHABLE) ||
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 25b235c063..728a16ad97 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -127,12 +127,7 @@ test_expect_success 'packed unreachable obs in alternate ODB are not loosened' '
 	git reset --hard HEAD^ &&
 	test_tick &&
 	git reflog expire --expire=$test_tick --expire-unreachable=$test_tick --all &&
-	# The pack-objects call on the next line is equivalent to
-	# git repack -A -d without the call to prune-packed
-	git pack-objects --honor-pack-keep --non-empty --all --reflog \
-	    --unpack-unreachable </dev/null pack &&
-	rm -f .git/objects/pack/* &&
-	mv pack-* .git/objects/pack/ &&
+	git repack -A -d --no-prune-packed &&
 	git verify-pack -v -- .git/objects/pack/*.idx >packlist &&
 	! grep "^$coid " packlist &&
 	echo >.git/objects/info/alternates &&
@@ -144,18 +139,22 @@ test_expect_success 'local packed unreachable obs that exist in alternate ODB ar
 	echo "$coid" | git pack-objects --non-empty --all --reflog pack &&
 	rm -f .git/objects/pack/* &&
 	mv pack-* .git/objects/pack/ &&
-	# The pack-objects call on the next line is equivalent to
-	# git repack -A -d without the call to prune-packed
-	git pack-objects --honor-pack-keep --non-empty --all --reflog \
-	    --unpack-unreachable </dev/null pack &&
-	rm -f .git/objects/pack/* &&
-	mv pack-* .git/objects/pack/ &&
+	git repack -A -d --no-prune-packed &&
 	git verify-pack -v -- .git/objects/pack/*.idx >packlist &&
 	! grep "^$coid " &&
 	echo >.git/objects/info/alternates &&
 	test_must_fail git show $coid
 '
 
+test_expect_success '-A -d and --no-prune-packed do not remove loose objects' '
+	test_create_repo repo &&
+	test_when_finished "rm -rf repo" &&
+	test_commit -C repo commit &&
+	git -C repo repack -A -d --no-prune-packed &&
+	git -C repo count-objects -v >out &&
+	grep "^prune-packable: 3" out
+'
+
 test_expect_success 'objects made unreachable by grafts only are kept' '
 	test_tick &&
 	git commit --allow-empty -m "commit 4" &&
-- 
2.31.0.565.gcc42f43761


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 2/2] repack: avoid loosening promisor pack objects in partial clones
  2021-04-14 19:14 ` [PATCH 0/2] prevent `repack` to unpack and delete promisor objects Rafael Silva
  2021-04-14 19:14   ` [PATCH 1/2] repack: teach --no-prune-packed to skip `git prune-packed` Rafael Silva
@ 2021-04-14 19:14   ` Rafael Silva
  2021-04-15  1:04     ` Jonathan Tan
  2021-04-15 18:06     ` Junio C Hamano
  2021-04-14 22:10   ` [PATCH 0/2] prevent `repack` to unpack and delete promisor objects Junio C Hamano
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 46+ messages in thread
From: Rafael Silva @ 2021-04-14 19:14 UTC (permalink / raw)
  To: git; +Cc: SZEDER Gábor, Jeff King, Jonathan Tan, Rafael Silva

When `-A` and `-d` are used together, besides packing all objects (-A)
and removing redundant packs (-d), it also unpack all unreachable
objects and deletes them by calling `git pruned-packed`. For a partial
clone, that contains unreferenced objects, this results in unpacking
all "promisor" objects and deleting them right after, which
unnecessarily increases the `repack` execution time and disk usage
during the unpacking of the objects.

For instance, a partially cloned repository that filters all the blob
objects (e.g. "--filter=blob:none"), `repack` ends up unpacking all
blobs into the filesystem that, depending on the repo size, makes
nearly impossible to repack the operation before running out of disk.

For a partial clone, `git repack` calls `git pack-objects` twice: (1)
for handle the "promisor" objects and (2) for performing the repack
with --exclude-promisor-objects option, that results in unpacking and
deleting of the objects. Given that we actually should keep the
promisor objects, let's teach `repack` to tell `pack-objects` to
--keep the old "promisor" pack file.

The --keep-pack option takes only a packfile name, but we concatenate
both the path and the name in a single string. Instead, let's split
them into separate string in order to easily pass the packfile name
later.

Additionally, add a new perf test to evaluate the performance
impact made by this changes (tested on git.git):

    Test            HEAD^                 HEAD
    ------------------------------------------------------------
    5600.5: gc      137.67(42.48+93.64)   8.08(6.91+1.45) -94.1%

In this particular script, the improvement is big because every
object in the newly-cloned partial repository is a promisor object.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Rafael Silva <rafaeloliveira.cs@gmail.com>
---
 builtin/repack.c              | 9 +++++++--
 t/perf/p5600-partial-clone.sh | 4 ++++
 t/t5616-partial-clone.sh      | 9 +++++++++
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 6baaeb979c..0ecd76b79c 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -20,7 +20,7 @@ static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
 static int write_bitmaps = -1;
 static int use_delta_islands;
-static char *packdir, *packtmp;
+static char *packdir, *packtmp_name, *packtmp;
 
 static const char *const git_repack_usage[] = {
 	N_("git repack [<options>]"),
@@ -533,7 +533,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 
 	packdir = mkpathdup("%s/pack", get_object_directory());
-	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
+	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
+	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
 
 	sigchain_push_common(remove_pack_on_signal);
 
@@ -576,6 +577,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		repack_promisor_objects(&po_args, &names);
 
 		if (existing_packs.nr && delete_redundant) {
+			for_each_string_list_item(item, &names) {
+				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
+					     packtmp_name, item->string);
+			}
 			if (unpack_unreachable) {
 				strvec_pushf(&cmd.args,
 					     "--unpack-unreachable=%s",
diff --git a/t/perf/p5600-partial-clone.sh b/t/perf/p5600-partial-clone.sh
index ca785a3341..a965f2c4d6 100755
--- a/t/perf/p5600-partial-clone.sh
+++ b/t/perf/p5600-partial-clone.sh
@@ -35,4 +35,8 @@ test_perf 'count non-promisor commits' '
 	git -C bare.git rev-list --all --count --exclude-promisor-objects
 '
 
+test_perf 'gc' '
+	git -C bare.git gc
+'
+
 test_done
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index 5cb415386e..de77822735 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -548,6 +548,15 @@ test_expect_success 'fetch from a partial clone, protocol v2' '
 	grep "version 2" trace
 '
 
+test_expect_success 'repack does not loose all objects' '
+	rm -rf client &&
+	git clone --bare --filter=blob:none "file://$(pwd)/srv.bare" client &&
+	test_when_finished "rm -rf client" &&
+	git -C client repack -A -l -d --no-prune-packed &&
+	git -C client count-objects -v >object-count &&
+	grep "^prune-packable: 0" object-count
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
-- 
2.31.0.565.gcc42f43761


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] low-hanging performance fruit with promisor packs
  2021-04-13  7:12         ` [PATCH 0/3] low-hanging performance fruit with promisor packs Jeff King
                             ` (4 preceding siblings ...)
  2021-04-14 17:14           ` Jonathan Tan
@ 2021-04-14 19:22           ` Rafael Silva
  5 siblings, 0 replies; 46+ messages in thread
From: Rafael Silva @ 2021-04-14 19:22 UTC (permalink / raw)
  To: Jeff King; +Cc: SZEDER Gábor, Jonathan Tan, Git Mailing List


Jeff King <peff@peff.net> writes:

>
> I didn't explore any of those here, and I don't plan to look into them
> anytime soon. I'm just documenting my findings for later.
>
> Anyway, here are the patches.
>
>   [1/3]: is_promisor_object(): free tree buffer after parsing
>   [2/3]: lookup_unknown_object(): take a repository argument
>   [3/3]: revision: avoid parsing with --exclude-promisor-objects
>
>  builtin/fsck.c                   |  2 +-
>  builtin/pack-objects.c           |  2 +-
>  http-push.c                      |  2 +-
>  object.c                         |  7 +++----
>  object.h                         |  2 +-
>  packfile.c                       |  1 +
>  refs.c                           |  2 +-
>  revision.c                       |  2 +-
>  t/helper/test-example-decorate.c |  6 +++---
>  t/perf/p5600-partial-clone.sh    | 12 ++++++++++++
>  upload-pack.c                    |  2 +-
>  walker.c                         |  2 +-
>  12 files changed, 27 insertions(+), 15 deletions(-)
>
> -Peff

I took look on this series and tested as well, together with the
fix for the "unpacking and deleting" promisor objects situation.

It looks good to me.

-- 
Thanks
Rafael

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/2] prevent `repack` to unpack and delete promisor objects
  2021-04-14 19:14 ` [PATCH 0/2] prevent `repack` to unpack and delete promisor objects Rafael Silva
  2021-04-14 19:14   ` [PATCH 1/2] repack: teach --no-prune-packed to skip `git prune-packed` Rafael Silva
  2021-04-14 19:14   ` [PATCH 2/2] repack: avoid loosening promisor pack objects in partial clones Rafael Silva
@ 2021-04-14 22:10   ` Junio C Hamano
  2021-04-15  9:15   ` Jeff King
  2021-04-18 13:57   ` [PATCH v2 0/1] " Rafael Silva
  4 siblings, 0 replies; 46+ messages in thread
From: Junio C Hamano @ 2021-04-14 22:10 UTC (permalink / raw)
  To: Rafael Silva; +Cc: git, SZEDER Gábor, Jeff King, Jonathan Tan

Rafael Silva <rafaeloliveira.cs@gmail.com> writes:

> This series is built on top of jk/promisor-optim. It conflicts with
> changes on p5600 otherwise.
>
> The following patches fixes the issue where we unnecessarily turn loose
> all the promisor objects and deletes them right after when running
> `repack -A -d ..` (via `git gc) for a partial repository. 
>
> Special thanks to Peff, for proposing a better approach for managing
> the situation and for Jonathan Tan for earlier interaction on the
> solution. Previously, I thought we should skip the promisor objects
> by just adding a check in loosened_object_can_be_discarded(). However,
> Peff pointed out that we can do better by realizing much sooner that
> we should not even consider loosening the objects for the _old_ promisor
> packs.
>
> It took me a bit to come up with the test because it seems `repack`
> doesn't offer an option to skip the "deletion of unpacked objects",
> so this series adds a new option to `repack` for skip the
> `git prune-packed` execution thus allowing us to easily inspect the
> unpacked objects before they are removed and simplification of our
> test suite. Furthermore, The test will now test the `repack` code
> path instead of performing the operations by calling
> `pack-objects`.

Beautiful.  Thanks for working so well together, all of you.


> Rafael Silva (2):
>   repack: teach --no-prune-packed to skip `git prune-packed`
>   repack: avoid loosening promisor pack objects in partial clones
>
>  Documentation/git-repack.txt  |  5 +++++
>  builtin/repack.c              | 15 ++++++++++++---
>  t/perf/p5600-partial-clone.sh |  4 ++++
>  t/t5616-partial-clone.sh      |  9 +++++++++
>  t/t7700-repack.sh             | 23 +++++++++++------------
>  5 files changed, 41 insertions(+), 15 deletions(-)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] repack: teach --no-prune-packed to skip `git prune-packed`
  2021-04-14 19:14   ` [PATCH 1/2] repack: teach --no-prune-packed to skip `git prune-packed` Rafael Silva
@ 2021-04-14 23:50     ` Jonathan Tan
  2021-04-18 14:15       ` Rafael Silva
  0 siblings, 1 reply; 46+ messages in thread
From: Jonathan Tan @ 2021-04-14 23:50 UTC (permalink / raw)
  To: rafaeloliveira.cs; +Cc: git, szeder.dev, peff, jonathantanmy

> The `git repack -d` command will remove any packfile that becomes
> redundant after repacking, and then call  `git pruned-packed` for
> pruning any unpacked objects.

s/pruned-packed/prune-packed/ (note that there is no "d" after "prune")
throughout this commit message.

Also, if there are any objects pruned, they are packed objects, not
unpacked objects. Maybe better to say "...for pruning any objects
already in packs".

> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
> index 25b235c063..728a16ad97 100755
> --- a/t/t7700-repack.sh
> +++ b/t/t7700-repack.sh
> @@ -127,12 +127,7 @@ test_expect_success 'packed unreachable obs in alternate ODB are not loosened' '
>  	git reset --hard HEAD^ &&
>  	test_tick &&
>  	git reflog expire --expire=$test_tick --expire-unreachable=$test_tick --all &&
> -	# The pack-objects call on the next line is equivalent to
> -	# git repack -A -d without the call to prune-packed
> -	git pack-objects --honor-pack-keep --non-empty --all --reflog \
> -	    --unpack-unreachable </dev/null pack &&
> -	rm -f .git/objects/pack/* &&
> -	mv pack-* .git/objects/pack/ &&
> +	git repack -A -d --no-prune-packed &&

This is great!

> +test_expect_success '-A -d and --no-prune-packed do not remove loose objects' '
> +	test_create_repo repo &&
> +	test_when_finished "rm -rf repo" &&
> +	test_commit -C repo commit &&
> +	git -C repo repack -A -d --no-prune-packed &&
> +	git -C repo count-objects -v >out &&
> +	grep "^prune-packable: 3" out
> +'

As for the test description, I would prefer "...do not remove loose
objects already in packs".

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] repack: avoid loosening promisor pack objects in partial clones
  2021-04-14 19:14   ` [PATCH 2/2] repack: avoid loosening promisor pack objects in partial clones Rafael Silva
@ 2021-04-15  1:04     ` Jonathan Tan
  2021-04-15  3:51       ` Junio C Hamano
                         ` (2 more replies)
  2021-04-15 18:06     ` Junio C Hamano
  1 sibling, 3 replies; 46+ messages in thread
From: Jonathan Tan @ 2021-04-15  1:04 UTC (permalink / raw)
  To: rafaeloliveira.cs; +Cc: git, szeder.dev, peff, jonathantanmy

> When `-A` and `-d` are used together, besides packing all objects (-A)
> and removing redundant packs (-d), it also unpack all unreachable
> objects and deletes them by calling `git pruned-packed`.

I still think of these objects as not unreachable, even though I know
that pack-objects calls them that (the argument is called
--unpack-unreachable). So I would say "it also loosens all objects that
were previously packed but did not go into the new pack", but perhaps
this is OK too.

> For a partial
> clone, that contains unreferenced objects, this results in unpacking
> all "promisor" objects and deleting them right after, which
> unnecessarily increases the `repack` execution time and disk usage
> during the unpacking of the objects.

I think that the commit message also needs to explain that we're
deleting the promisor objects immediately because they happen to be in a
promisor pack. So perhaps this whole part could be written as follows:

  When "git repack -Ad" is run in a partial clone, "pack-objects" is
  invoked twice: once to repack all promisor objects, and once to repack
  all non-promisor objects. The latter "pack-objects" invocation is with
  --exclude-promisor-objects and --unpack-unreachable, which loosens all
  unused objects. Unfortunately, this includes promisor objects.

  Because the "-d" argument to "git repack" subsequently deletes all
  loose objects also in packs, these just-loosened promisor objects will
  be immediately deleted. But this extra disk churn is unnecessary in
  the first place.

> For instance, a partially cloned repository that filters all the blob
> objects (e.g. "--filter=blob:none"), `repack` ends up unpacking all
> blobs into the filesystem that, depending on the repo size, makes
> nearly impossible to repack the operation before running out of disk.
> 
> For a partial clone, `git repack` calls `git pack-objects` twice: (1)
> for handle the "promisor" objects and (2) for performing the repack
> with --exclude-promisor-objects option, that results in unpacking and
> deleting of the objects. Given that we actually should keep the
> promisor objects, let's teach `repack` to tell `pack-objects` to
> --keep the old "promisor" pack file.

It's not this call (2) that results in any deleting of the objects, but
the later call to prune_packed_objects(). Also, promisor objects are
kept regardless of what we pass to "pack-objects" here (the keeping is
done separately). Maybe write (continuation from my suggestion above):

  In order to avoid this extra disk churn, pass the names of the
  promisor packfiles as "--keep-pack" arguments to this
  second invocation of "pack-objects". This informs "pack-objects" that
  the promisor objects are already in a safe packfile and, therefore, do
  not need to be loosened.

> @@ -533,7 +533,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	}
>  
>  	packdir = mkpathdup("%s/pack", get_object_directory());
> -	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
> +	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
> +	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
>  
>  	sigchain_push_common(remove_pack_on_signal);
>  

Normally I would be concerned that packtmp_name is not freed, but in
this case, it's a static variable (same as packtmp).

> @@ -576,6 +577,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		repack_promisor_objects(&po_args, &names);
>  
>  		if (existing_packs.nr && delete_redundant) {
> +			for_each_string_list_item(item, &names) {
> +				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
> +					     packtmp_name, item->string);
> +			}

Git style is to not have braces for single-statement loops.

> +test_expect_success 'repack does not loose all objects' '
> +	rm -rf client &&
> +	git clone --bare --filter=blob:none "file://$(pwd)/srv.bare" client &&
> +	test_when_finished "rm -rf client" &&
> +	git -C client repack -A -l -d --no-prune-packed &&
> +	git -C client count-objects -v >object-count &&
> +	grep "^prune-packable: 0" object-count
> +'

s/loose all objects/loosen promisor objects/

Also, add a comment describing why we have "--no-prune-packed" there
(probably something about not pruning any loose objects that are already
in packs, so that we can verify that no redundant loose objects are
being created in the first place).

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] repack: avoid loosening promisor pack objects in partial clones
  2021-04-15  1:04     ` Jonathan Tan
@ 2021-04-15  3:51       ` Junio C Hamano
  2021-04-15  9:03         ` Jeff King
  2021-04-15  9:05       ` Jeff King
  2021-04-18  7:12       ` Rafael Silva
  2 siblings, 1 reply; 46+ messages in thread
From: Junio C Hamano @ 2021-04-15  3:51 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: rafaeloliveira.cs, git, szeder.dev, peff

Jonathan Tan <jonathantanmy@google.com> writes:

>> When `-A` and `-d` are used together, besides packing all objects (-A)
>> and removing redundant packs (-d), it also unpack all unreachable
>> objects and deletes them by calling `git pruned-packed`.
>
> I still think of these objects as not unreachable, even though I know
> that pack-objects calls them that (the argument is called
> --unpack-unreachable). So I would say "it also loosens all objects that
> were previously packed but did not go into the new pack", but perhaps
> this is OK too.

Hmph, that is puzzling.  I understand that the operation about

 (1) finding all the objects that are still reachable and send them
     into a newly created pack, and

 (2) among the objects that were previously in the packs, eject
     those that weren't made into the new pack with the previous
     point.

Where did I get it wrong?  If all the reachable ones are dealt with
with the first point, what is leftover is not reachable, no?



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] repack: avoid loosening promisor pack objects in partial clones
  2021-04-15  3:51       ` Junio C Hamano
@ 2021-04-15  9:03         ` Jeff King
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2021-04-15  9:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, rafaeloliveira.cs, git, szeder.dev

On Wed, Apr 14, 2021 at 08:51:02PM -0700, Junio C Hamano wrote:

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> >> When `-A` and `-d` are used together, besides packing all objects (-A)
> >> and removing redundant packs (-d), it also unpack all unreachable
> >> objects and deletes them by calling `git pruned-packed`.
> >
> > I still think of these objects as not unreachable, even though I know
> > that pack-objects calls them that (the argument is called
> > --unpack-unreachable). So I would say "it also loosens all objects that
> > were previously packed but did not go into the new pack", but perhaps
> > this is OK too.
> 
> Hmph, that is puzzling.  I understand that the operation about
> 
>  (1) finding all the objects that are still reachable and send them
>      into a newly created pack, and
> 
>  (2) among the objects that were previously in the packs, eject
>      those that weren't made into the new pack with the previous
>      point.
> 
> Where did I get it wrong?  If all the reachable ones are dealt with
> with the first point, what is leftover is not reachable, no?

Right. I think your understanding is correct, and the commit message is
a bit confused. Normally after we eject loose objects, they'd stay there
(a follow-up git-gc may run git-prune and delete them, though if they
were recent enough not to just drop completely during the repack, then
git-prune would likewise leave them be). Talking about prune-packed here
is misleading, because it usually has nothing to do with these objects.

What makes the partial-clone situation under discussion interesting is
that the objects _are_ reachable. They are excluded from the new pack
because we put them in a separate promisor pack. But we erroneously turn
them loose, rather than realizing that they were excluded for a
different reason.

So the fundamental bug is that we turn them loose at all. What makes the
bug trickier to see is that when we run prune-packed afterwards, we then
clean up the evidence of the bug (so it looks more like a performance
problem than a correctness one).

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] repack: avoid loosening promisor pack objects in partial clones
  2021-04-15  1:04     ` Jonathan Tan
  2021-04-15  3:51       ` Junio C Hamano
@ 2021-04-15  9:05       ` Jeff King
  2021-04-18  7:12       ` Rafael Silva
  2 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2021-04-15  9:05 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: rafaeloliveira.cs, git, szeder.dev

On Wed, Apr 14, 2021 at 06:04:54PM -0700, Jonathan Tan wrote:

> > @@ -576,6 +577,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
> >  		repack_promisor_objects(&po_args, &names);
> >  
> >  		if (existing_packs.nr && delete_redundant) {
> > +			for_each_string_list_item(item, &names) {
> > +				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
> > +					     packtmp_name, item->string);
> > +			}
> 
> Git style is to not have braces for single-statement loops.

It is, though given that for_each_string_list_item() is a weird macro
instead of a regular for-loop, IMHO it makes things more obvious to have
the braces.

(All the rest of your comments seemed quite good to me).

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/2] prevent `repack` to unpack and delete promisor objects
  2021-04-14 19:14 ` [PATCH 0/2] prevent `repack` to unpack and delete promisor objects Rafael Silva
                     ` (2 preceding siblings ...)
  2021-04-14 22:10   ` [PATCH 0/2] prevent `repack` to unpack and delete promisor objects Junio C Hamano
@ 2021-04-15  9:15   ` Jeff King
  2021-04-18  8:20     ` Rafael Silva
  2021-04-18 13:57   ` [PATCH v2 0/1] " Rafael Silva
  4 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2021-04-15  9:15 UTC (permalink / raw)
  To: Rafael Silva; +Cc: git, SZEDER Gábor, Jonathan Tan

On Wed, Apr 14, 2021 at 09:14:01PM +0200, Rafael Silva wrote:

> It took me a bit to come up with the test because it seems `repack`
> doesn't offer an option to skip the "deletion of unpacked objects",
> so this series adds a new option to `repack` for skip the
> `git prune-packed` execution thus allowing us to easily inspect the
> unpacked objects before they are removed and simplification of our
> test suite. Furthermore, The test will now test the `repack` code
> path instead of performing the operations by calling
> `pack-objects`.

Thanks for working on this. Overall the patches seem sane, though I
think Jonathan's comments (especially about the confusion in the commit
message of 2/2) are worth addressing.

I have mixed feelings on the "--no-prune-packed" option, just because
it's user-visible and I don't think it's something a normal user would
ever really want.

In the new test (and I think in the old ones you modified, though I
didn't look carefully) the main thing we care about is whether we write
out loose objects. So another solution would be to improve the debug
logging inside pack-objects to tell us more about what it's doing.

The fork of Git we use at GitHub has something similar; when we discard
objects or force them loose, we write their sha1 values to a log file.
This has come in handy for a lot of after-the-fact debugging ("oops,
this repo is corrupted; did we intentionally delete object X?").

I wonder if we could do something similar with the trace2 facility. I
know it can be turned on via config, but I don't know how good the
support is for enabling just one segment of data (and this may generate
a lot of entries, so people using trace2 for telemetry probably wouldn't
want it on).

For the purposes of the tests, though just a normal GIT_TRACE_PACK_DEBUG
would be plenty. I dunno. I don't want to open up a can of worms on
logging that would hold up getting this quite-substantial fix in place.
But once we add --no-prune-packed, it will be hard to take away.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] repack: avoid loosening promisor pack objects in partial clones
  2021-04-14 19:14   ` [PATCH 2/2] repack: avoid loosening promisor pack objects in partial clones Rafael Silva
  2021-04-15  1:04     ` Jonathan Tan
@ 2021-04-15 18:06     ` Junio C Hamano
  2021-04-18  8:40       ` Rafael Silva
  1 sibling, 1 reply; 46+ messages in thread
From: Junio C Hamano @ 2021-04-15 18:06 UTC (permalink / raw)
  To: Rafael Silva; +Cc: git, SZEDER Gábor, Jeff King, Jonathan Tan

Rafael Silva <rafaeloliveira.cs@gmail.com> writes:

> For instance, a partially cloned repository that filters all the blob
> objects (e.g. "--filter=blob:none"), `repack` ends up unpacking all
> blobs into the filesystem that, depending on the repo size, makes
> nearly impossible to repack the operation before running out of disk.

Could you clarify this paragraph a bit more?  It is unclear why the
repository has "all the blob objects" that it can loosen with repack
in the first place, if it was cloned without any blob.  Do you mean
that repack does not stop at the promisor pack boundary and instead
lazily download blobs "on-demand", which ends up as loose objects?

Thanks.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] repack: avoid loosening promisor pack objects in partial clones
  2021-04-15  1:04     ` Jonathan Tan
  2021-04-15  3:51       ` Junio C Hamano
  2021-04-15  9:05       ` Jeff King
@ 2021-04-18  7:12       ` Rafael Silva
  2 siblings, 0 replies; 46+ messages in thread
From: Rafael Silva @ 2021-04-18  7:12 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, szeder.dev, peff, Junio C Hamano


Jonathan Tan <jonathantanmy@google.com> writes:

>> For a partial
>> clone, that contains unreferenced objects, this results in unpacking
>> all "promisor" objects and deleting them right after, which
>> unnecessarily increases the `repack` execution time and disk usage
>> during the unpacking of the objects.
>
> I think that the commit message also needs to explain that we're
> deleting the promisor objects immediately because they happen to be in a
> promisor pack. So perhaps this whole part could be written as follows:
>
>   When "git repack -Ad" is run in a partial clone, "pack-objects" is
>   invoked twice: once to repack all promisor objects, and once to repack
>   all non-promisor objects. The latter "pack-objects" invocation is with
>   --exclude-promisor-objects and --unpack-unreachable, which loosens all
>   unused objects. Unfortunately, this includes promisor objects.
>
>   Because the "-d" argument to "git repack" subsequently deletes all
>   loose objects also in packs, these just-loosened promisor objects will
>   be immediately deleted. But this extra disk churn is unnecessary in
>   the first place.
>

Thanks for suggesting this message, obviously it's much better than
mine specially because removes the confusion that I made when writing
the message about the `pack-objects` and `prune-packed`.

I'll update patch's message on the next revision.

>> For instance, a partially cloned repository that filters all the blob
>> objects (e.g. "--filter=blob:none"), `repack` ends up unpacking all
>> blobs into the filesystem that, depending on the repo size, makes
>> nearly impossible to repack the operation before running out of disk.
>> 
>> For a partial clone, `git repack` calls `git pack-objects` twice: (1)
>> for handle the "promisor" objects and (2) for performing the repack
>> with --exclude-promisor-objects option, that results in unpacking and
>> deleting of the objects. Given that we actually should keep the
>> promisor objects, let's teach `repack` to tell `pack-objects` to
>> --keep the old "promisor" pack file.
>
> It's not this call (2) that results in any deleting of the objects, but
> the later call to prune_packed_objects(). Also, promisor objects are
> kept regardless of what we pass to "pack-objects" here (the keeping is
> done separately). Maybe write (continuation from my suggestion above):
>
>   In order to avoid this extra disk churn, pass the names of the
>   promisor packfiles as "--keep-pack" arguments to this
>   second invocation of "pack-objects". This informs "pack-objects" that
>   the promisor objects are already in a safe packfile and, therefore, do
>   not need to be loosened.
>

You're right, the second call only loosens the object and the deleting
comes after it, which makes my message misleading. 

The suggested paragraph explain better the situation, thanks for
suggesting it and will replace mine in the next revision.

>> @@ -533,7 +533,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>>  	}
>>  
>>  	packdir = mkpathdup("%s/pack", get_object_directory());
>> -	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
>> +	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
>> +	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
>>  
>>  	sigchain_push_common(remove_pack_on_signal);
>>  
>
> Normally I would be concerned that packtmp_name is not freed, but in
> this case, it's a static variable (same as packtmp).
>

Indeed.

>> @@ -576,6 +577,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>>  		repack_promisor_objects(&po_args, &names);
>>  
>>  		if (existing_packs.nr && delete_redundant) {
>> +			for_each_string_list_item(item, &names) {
>> +				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
>> +					     packtmp_name, item->string);
>> +			}
>
> Git style is to not have braces for single-statement loops.
>

I preferred to keep the braces simply because the
for_each_string_list_item() is macro instead of a normal for-loop
statement. 

>> +test_expect_success 'repack does not loose all objects' '
>> +	rm -rf client &&
>> +	git clone --bare --filter=blob:none "file://$(pwd)/srv.bare" client &&
>> +	test_when_finished "rm -rf client" &&
>> +	git -C client repack -A -l -d --no-prune-packed &&
>> +	git -C client count-objects -v >object-count &&
>> +	grep "^prune-packable: 0" object-count
>> +'
>
> s/loose all objects/loosen promisor objects/
>
> Also, add a comment describing why we have "--no-prune-packed" there
> (probably something about not pruning any loose objects that are already
> in packs, so that we can verify that no redundant loose objects are
> being created in the first place).

Thanks for the reviewing and helping clarify the patch intentions. I'll
address this comments on the next revision.

-- 
Thanks
Rafael

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/2] prevent `repack` to unpack and delete promisor objects
  2021-04-15  9:15   ` Jeff King
@ 2021-04-18  8:20     ` Rafael Silva
  0 siblings, 0 replies; 46+ messages in thread
From: Rafael Silva @ 2021-04-18  8:20 UTC (permalink / raw)
  To: Jeff King; +Cc: git, SZEDER Gábor, Jonathan Tan, Junio C Hamano


Jeff King <peff@peff.net> writes:

> On Wed, Apr 14, 2021 at 09:14:01PM +0200, Rafael Silva wrote:
>
>> It took me a bit to come up with the test because it seems `repack`
>> doesn't offer an option to skip the "deletion of unpacked objects",
>> so this series adds a new option to `repack` for skip the
>> `git prune-packed` execution thus allowing us to easily inspect the
>> unpacked objects before they are removed and simplification of our
>> test suite. Furthermore, The test will now test the `repack` code
>> path instead of performing the operations by calling
>> `pack-objects`.
>
> Thanks for working on this. Overall the patches seem sane, though I
> think Jonathan's comments (especially about the confusion in the commit
> message of 2/2) are worth addressing.
>

Thanks for the review. Indeed, I will address Jonathan's comments on
the next revision to remove the confused and misleading commit message.

Sorry for the confusion.

>
> I have mixed feelings on the "--no-prune-packed" option, just because
> it's user-visible and I don't think it's something a normal user would
> ever really want.
>
> In the new test (and I think in the old ones you modified, though I
> didn't look carefully) the main thing we care about is whether we write
> out loose objects. So another solution would be to improve the debug
> logging inside pack-objects to tell us more about what it's doing.
>

Honestly, I'm also not happy adding an end user-visible variable just
for the sake of testing it, specially because it's unclear whether the
user will actually use this option.

Initially, what convinced myself for proposing the changes was the
additional cleanup in our test suite, but giving a second thought I'm
not sure now whether this is a strong argument either.

> The fork of Git we use at GitHub has something similar; when we discard
> objects or force them loose, we write their sha1 values to a log file.
> This has come in handy for a lot of after-the-fact debugging ("oops,
> this repo is corrupted; did we intentionally delete object X?").
>
> I wonder if we could do something similar with the trace2 facility. I
> know it can be turned on via config, but I don't know how good the
> support is for enabling just one segment of data (and this may generate
> a lot of entries, so people using trace2 for telemetry probably wouldn't
> want it on).
>
> For the purposes of the tests, though just a normal GIT_TRACE_PACK_DEBUG
> would be plenty. I dunno. I don't want to open up a can of worms on
> logging that would hold up getting this quite-substantial fix in place.
> But once we add --no-prune-packed, it will be hard to take away.
>
> -Peff

After reading this comment and investigating a bit more, I believe
increasing the debug logging of `pack-objects` will help drop the first
patch, at least for now, and allowing the fix to progress without the
"controversial" user option.  Later, we can revisit adding the
`--no-prune-packed` (or a better named) option in case we think this
will be useful for the users.

I'll be addressing this in v2.

-- 
Thanks
Rafael

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] repack: avoid loosening promisor pack objects in partial clones
  2021-04-15 18:06     ` Junio C Hamano
@ 2021-04-18  8:40       ` Rafael Silva
  0 siblings, 0 replies; 46+ messages in thread
From: Rafael Silva @ 2021-04-18  8:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, SZEDER Gábor, Jeff King, Jonathan Tan


Junio C Hamano <gitster@pobox.com> writes:

> Rafael Silva <rafaeloliveira.cs@gmail.com> writes:
>
>> For instance, a partially cloned repository that filters all the blob
>> objects (e.g. "--filter=blob:none"), `repack` ends up unpacking all
>> blobs into the filesystem that, depending on the repo size, makes
>> nearly impossible to repack the operation before running out of disk.
>
> Could you clarify this paragraph a bit more?  It is unclear why the
> repository has "all the blob objects" that it can loosen with repack
> in the first place, if it was cloned without any blob.  Do you mean
> that repack does not stop at the promisor pack boundary and instead
> lazily download blobs "on-demand", which ends up as loose objects?
>
> Thanks.

I should have written that we "unpack (or better turn loose) the 
promisor objects" not the `blob` objects. Instead, I should have
written something like:

    ... ends up loosening all promisor objects, on this case all
    the `trees` and `commits` objects, into the filesystem ...

Apologise for the confusion and this misleading message. I'll clarify
this in the v2. 

-- 
Thanks
Rafael

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v2 0/1] prevent `repack` to unpack and delete promisor objects
  2021-04-14 19:14 ` [PATCH 0/2] prevent `repack` to unpack and delete promisor objects Rafael Silva
                     ` (3 preceding siblings ...)
  2021-04-15  9:15   ` Jeff King
@ 2021-04-18 13:57   ` Rafael Silva
  2021-04-18 13:57     ` [PATCH v2 1/1] repack: avoid loosening promisor objects in partial clones Rafael Silva
  2021-04-21 19:32     ` [PATCH v3] " Rafael Silva
  4 siblings, 2 replies; 46+ messages in thread
From: Rafael Silva @ 2021-04-18 13:57 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Jonathan Tan, SZEDER Gábor, Junio C Hamano, Rafael Silva

Here's the v2, sorry for the delay.

This series is built on top of jk/promisor-optim (now graduated to next). It
conflicts with changes on p5600 otherwise.

The following patches fixes the issue where we unnecessarily turn loose
all the promisor objects and deletes them right after when running
`git repack -A -d ..` (via `git gc) for a partial repository. 

Special thanks to Peff, for proposing a better approach for managing
the situation and for Jonathan Tan for earlier interaction on the
solution. Previously, I thought we should skip the promisor objects
by just adding a check in loosened_object_can_be_discarded(). However,
Peff pointed out that we can do better by realizing much sooner that
we should not even consider loosening the objects for the _old_ promisor
packs.

===============
Changes from v1:

    * v2 contains only patch instead of two from the previous round.

    * I include Jonathan's suggestion throughout the entire patch, most
      notably the commit message that was confused and a misleading. Sorry
      for that. Special thanks to him for helping suggesting a _much better_
      commit message among other suggested improvements and corrections.

    * The [Patch 1/2] from v1 is dropped. It was adding a user-visible
      option to `repack` in order to skip the call to `prune-packed` and
      prevent destroying the evidence of the bug so it can be tested.
      However, Peff raised some concerns about adding a user-visible option
      (that is unclear whether the user will ever want) just for the sake
      of testing it - honestly, I wasn't too happy with this either. In v2, we
      now teach `pack-objects` to count the objects of interest
      (loosened objects) and emit this information via trace2, which allows
      checking the debugging logging for the evidence.

    * Test is modified to rely on the the added trace2 event information.

    * Updates on the performance numbers, including adding one execution
      for a bigger repository (linux.git).

Even though the v2 is 1-patch long, I thought it make thing clear to read if
a re-send the cover-letter as the [Patch v2 1/1] is already a bit lengthy,
and explaining the whole story here.

Rafael Silva (1):
  repack: avoid loosening promisor objects in partial clones

 builtin/pack-objects.c        | 8 +++++++-
 builtin/repack.c              | 9 +++++++--
 t/perf/p5600-partial-clone.sh | 4 ++++
 t/t5616-partial-clone.sh      | 8 ++++++++
 4 files changed, 26 insertions(+), 3 deletions(-)

Range-diff against v1:
1:  2431f8b75d < -:  ---------- repack: teach --no-prune-packed to skip `git prune-packed`
2:  1331049a86 ! 1:  9d996393c9 repack: avoid loosening promisor pack objects in partial clones
    @@ Metadata
     Author: Rafael Silva <rafaeloliveira.cs@gmail.com>
     
      ## Commit message ##
    -    repack: avoid loosening promisor pack objects in partial clones
    -
    -    When `-A` and `-d` are used together, besides packing all objects (-A)
    -    and removing redundant packs (-d), it also unpack all unreachable
    -    objects and deletes them by calling `git pruned-packed`. For a partial
    -    clone, that contains unreferenced objects, this results in unpacking
    -    all "promisor" objects and deleting them right after, which
    -    unnecessarily increases the `repack` execution time and disk usage
    -    during the unpacking of the objects.
    -
    -    For instance, a partially cloned repository that filters all the blob
    -    objects (e.g. "--filter=blob:none"), `repack` ends up unpacking all
    -    blobs into the filesystem that, depending on the repo size, makes
    -    nearly impossible to repack the operation before running out of disk.
    -
    -    For a partial clone, `git repack` calls `git pack-objects` twice: (1)
    -    for handle the "promisor" objects and (2) for performing the repack
    -    with --exclude-promisor-objects option, that results in unpacking and
    -    deleting of the objects. Given that we actually should keep the
    -    promisor objects, let's teach `repack` to tell `pack-objects` to
    -    --keep the old "promisor" pack file.
    -
    -    The --keep-pack option takes only a packfile name, but we concatenate
    -    both the path and the name in a single string. Instead, let's split
    -    them into separate string in order to easily pass the packfile name
    -    later.
    -
    -    Additionally, add a new perf test to evaluate the performance
    -    impact made by this changes (tested on git.git):
    -
    -        Test            HEAD^                 HEAD
    -        ------------------------------------------------------------
    -        5600.5: gc      137.67(42.48+93.64)   8.08(6.91+1.45) -94.1%
    -
    -    In this particular script, the improvement is big because every
    -    object in the newly-cloned partial repository is a promisor object.
    +    repack: avoid loosening promisor objects in partial clones
    +
    +    When `git repack -A -d` is run in a partial clone, `pack-objects`
    +    is invoked twice: once to repack all promisor objects, and once to
    +    repack all non-promisor objects. The latter `pack-objects` invocation
    +    is with --exclude-promisor-objects and --unpack-unreachable, which
    +    loosens all unused objects. Unfortunately, this includes promisor
    +    objects.
    +
    +    Because the -d argument to `git repack` subsequently deletes all loose
    +    objects also in packs, these just-loosened promisor objects will be
    +    immediately deleted. However, this extra disk churn is unnecessary in
    +    the first place.  For example, a newly-clone partial repo that filters
    +    all blob objects (e.g. `--filter=blob:none`), `repack` ends up
    +    unpacking all trees and commits into the filesystem because every
    +    object, in this particular case, is a promisor object. Depending on
    +    the repo size, this increases the disk usage considerably: In my copy
    +    of the linux.git, the object directory peaked 26GB of more disk usage.
    +
    +    In order to avoid this extra disk churn, pass the names of the promisor
    +    packfiles as --keep-pack arguments to the second invocation of
    +    `pack-objects`. This informs `pack-objects` that the promisor objects
    +    are already in a safe packfile and, therefore, do not need to be
    +    loosened. The --keep-pack option takes only a packfile name, but we
    +    concatenate both the path and the name in a single string. Instead,
    +    let's split them into separate string in order to easily pass the
    +    packfile name later.
    +
    +    For testing, we need to validate whether any object was loosened.
    +    However, the "evidence" (loosened objects) is deleted during the
    +    process which prevents us from inspecting the object directory.
    +    Instead, let's teach `pack-objects` to count loosened objects and
    +    emit via trace2 thus allowing inspecting the debug events after the
    +    process is finished. This new event is used on the added regression
    +    test.
    +
    +    Lastly, add a new perf test to evaluate the performance impact
    +    made by this changes (tested on git.git):
    +
    +         Test          HEAD^                 HEAD
    +         ----------------------------------------------------------
    +         5600.3: gc    134.38(41.93+90.95)   7.80(6.72+1.35) -94.2%
    +
    +    For a bigger repository, such as linux.git, the improvement is
    +    even bigger:
    +
    +         Test          HEAD^                     HEAD
    +         -------------------------------------------------------------------
    +         5600.3: gc    6833.00(918.07+3162.74)   268.79(227.02+39.18) -96.1%
    +
    +    These improvements are particular big because every object in the
    +    newly-cloned partial repository is a promisor object.
     
         Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
         Helped-by: Jeff King <peff@peff.net>
    +    Helped-by: Jonathan Tan <jonathantanmy@google.com>
         Signed-off-by: Rafael Silva <rafaeloliveira.cs@gmail.com>
    -    Signed-off-by: Junio C Hamano <gitster@pobox.com>
    +
    + ## builtin/pack-objects.c ##
    +@@ builtin/pack-objects.c: static void loosen_unused_packed_objects(void)
    + {
    + 	struct packed_git *p;
    + 	uint32_t i;
    ++	uint32_t loosened_objects_nr = 0;
    + 	struct object_id oid;
    + 
    + 	for (p = get_all_packs(the_repository); p; p = p->next) {
    +@@ builtin/pack-objects.c: static void loosen_unused_packed_objects(void)
    + 			nth_packed_object_id(&oid, p, i);
    + 			if (!packlist_find(&to_pack, &oid) &&
    + 			    !has_sha1_pack_kept_or_nonlocal(&oid) &&
    +-			    !loosened_object_can_be_discarded(&oid, p->mtime))
    ++			    !loosened_object_can_be_discarded(&oid, p->mtime)) {
    + 				if (force_object_loose(&oid, p->mtime))
    + 					die(_("unable to force loose object"));
    ++				loosened_objects_nr++;
    ++			}
    + 		}
    + 	}
    ++
    ++	trace2_data_intmax("pack-objects", the_repository,
    ++			   "loosen_unused_packed_objects/loosened", loosened_objects_nr);
    + }
    + 
    + /*
     
      ## builtin/repack.c ##
     @@ builtin/repack.c: static int delta_base_offset = 1;
    @@ t/t5616-partial-clone.sh: test_expect_success 'fetch from a partial clone, proto
      	grep "version 2" trace
      '
      
    -+test_expect_success 'repack does not loose all objects' '
    -+	rm -rf client &&
    ++test_expect_success 'repack does not loosen promisor objects' '
    ++	rm -rf client trace &&
     +	git clone --bare --filter=blob:none "file://$(pwd)/srv.bare" client &&
    -+	test_when_finished "rm -rf client" &&
    -+	git -C client repack -A -l -d --no-prune-packed &&
    -+	git -C client count-objects -v >object-count &&
    -+	grep "^prune-packable: 0" object-count
    ++	test_when_finished "rm -rf client trace" &&
    ++	GIT_TRACE2_PERF="$(pwd)/trace" git -C client repack -A -d &&
    ++	grep "loosen_unused_packed_objects/loosened:0" trace
     +'
     +
      . "$TEST_DIRECTORY"/lib-httpd.sh
-- 
2.31.0.699.g8849f49b87


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v2 1/1] repack: avoid loosening promisor objects in partial clones
  2021-04-18 13:57   ` [PATCH v2 0/1] " Rafael Silva
@ 2021-04-18 13:57     ` Rafael Silva
  2021-04-19 19:15       ` Jonathan Tan
  2021-04-19 23:09       ` Junio C Hamano
  2021-04-21 19:32     ` [PATCH v3] " Rafael Silva
  1 sibling, 2 replies; 46+ messages in thread
From: Rafael Silva @ 2021-04-18 13:57 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Jonathan Tan, SZEDER Gábor, Junio C Hamano, Rafael Silva

When `git repack -A -d` is run in a partial clone, `pack-objects`
is invoked twice: once to repack all promisor objects, and once to
repack all non-promisor objects. The latter `pack-objects` invocation
is with --exclude-promisor-objects and --unpack-unreachable, which
loosens all unused objects. Unfortunately, this includes promisor
objects.

Because the -d argument to `git repack` subsequently deletes all loose
objects also in packs, these just-loosened promisor objects will be
immediately deleted. However, this extra disk churn is unnecessary in
the first place.  For example, a newly-clone partial repo that filters
all blob objects (e.g. `--filter=blob:none`), `repack` ends up
unpacking all trees and commits into the filesystem because every
object, in this particular case, is a promisor object. Depending on
the repo size, this increases the disk usage considerably: In my copy
of the linux.git, the object directory peaked 26GB of more disk usage.

In order to avoid this extra disk churn, pass the names of the promisor
packfiles as --keep-pack arguments to the second invocation of
`pack-objects`. This informs `pack-objects` that the promisor objects
are already in a safe packfile and, therefore, do not need to be
loosened. The --keep-pack option takes only a packfile name, but we
concatenate both the path and the name in a single string. Instead,
let's split them into separate string in order to easily pass the
packfile name later.

For testing, we need to validate whether any object was loosened.
However, the "evidence" (loosened objects) is deleted during the
process which prevents us from inspecting the object directory.
Instead, let's teach `pack-objects` to count loosened objects and
emit via trace2 thus allowing inspecting the debug events after the
process is finished. This new event is used on the added regression
test.

Lastly, add a new perf test to evaluate the performance impact
made by this changes (tested on git.git):

     Test          HEAD^                 HEAD
     ----------------------------------------------------------
     5600.3: gc    134.38(41.93+90.95)   7.80(6.72+1.35) -94.2%

For a bigger repository, such as linux.git, the improvement is
even bigger:

     Test          HEAD^                     HEAD
     -------------------------------------------------------------------
     5600.3: gc    6833.00(918.07+3162.74)   268.79(227.02+39.18) -96.1%

These improvements are particular big because every object in the
newly-cloned partial repository is a promisor object.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Rafael Silva <rafaeloliveira.cs@gmail.com>
---
 builtin/pack-objects.c        | 8 +++++++-
 builtin/repack.c              | 9 +++++++--
 t/perf/p5600-partial-clone.sh | 4 ++++
 t/t5616-partial-clone.sh      | 8 ++++++++
 4 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 40ee6fa19f..73889cec95 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3479,6 +3479,7 @@ static void loosen_unused_packed_objects(void)
 {
 	struct packed_git *p;
 	uint32_t i;
+	uint32_t loosened_objects_nr = 0;
 	struct object_id oid;
 
 	for (p = get_all_packs(the_repository); p; p = p->next) {
@@ -3492,11 +3493,16 @@ static void loosen_unused_packed_objects(void)
 			nth_packed_object_id(&oid, p, i);
 			if (!packlist_find(&to_pack, &oid) &&
 			    !has_sha1_pack_kept_or_nonlocal(&oid) &&
-			    !loosened_object_can_be_discarded(&oid, p->mtime))
+			    !loosened_object_can_be_discarded(&oid, p->mtime)) {
 				if (force_object_loose(&oid, p->mtime))
 					die(_("unable to force loose object"));
+				loosened_objects_nr++;
+			}
 		}
 	}
+
+	trace2_data_intmax("pack-objects", the_repository,
+			   "loosen_unused_packed_objects/loosened", loosened_objects_nr);
 }
 
 /*
diff --git a/builtin/repack.c b/builtin/repack.c
index 2847fdfbab..5f9bc74adc 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -20,7 +20,7 @@ static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
 static int write_bitmaps = -1;
 static int use_delta_islands;
-static char *packdir, *packtmp;
+static char *packdir, *packtmp_name, *packtmp;
 
 static const char *const git_repack_usage[] = {
 	N_("git repack [<options>]"),
@@ -530,7 +530,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 
 	packdir = mkpathdup("%s/pack", get_object_directory());
-	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
+	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
+	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
 
 	sigchain_push_common(remove_pack_on_signal);
 
@@ -573,6 +574,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		repack_promisor_objects(&po_args, &names);
 
 		if (existing_packs.nr && delete_redundant) {
+			for_each_string_list_item(item, &names) {
+				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
+					     packtmp_name, item->string);
+			}
 			if (unpack_unreachable) {
 				strvec_pushf(&cmd.args,
 					     "--unpack-unreachable=%s",
diff --git a/t/perf/p5600-partial-clone.sh b/t/perf/p5600-partial-clone.sh
index ca785a3341..a965f2c4d6 100755
--- a/t/perf/p5600-partial-clone.sh
+++ b/t/perf/p5600-partial-clone.sh
@@ -35,4 +35,8 @@ test_perf 'count non-promisor commits' '
 	git -C bare.git rev-list --all --count --exclude-promisor-objects
 '
 
+test_perf 'gc' '
+	git -C bare.git gc
+'
+
 test_done
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index 5cb415386e..6e3e7565d0 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -548,6 +548,14 @@ test_expect_success 'fetch from a partial clone, protocol v2' '
 	grep "version 2" trace
 '
 
+test_expect_success 'repack does not loosen promisor objects' '
+	rm -rf client trace &&
+	git clone --bare --filter=blob:none "file://$(pwd)/srv.bare" client &&
+	test_when_finished "rm -rf client trace" &&
+	GIT_TRACE2_PERF="$(pwd)/trace" git -C client repack -A -d &&
+	grep "loosen_unused_packed_objects/loosened:0" trace
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
-- 
2.31.0.699.g8849f49b87


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] repack: teach --no-prune-packed to skip `git prune-packed`
  2021-04-14 23:50     ` Jonathan Tan
@ 2021-04-18 14:15       ` Rafael Silva
  0 siblings, 0 replies; 46+ messages in thread
From: Rafael Silva @ 2021-04-18 14:15 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, szeder.dev, peff, Junio C Hamano


Thanks for reviewing this patch. really appreciate it.

However, in the v2, I decided to drop this in favor of just adding a
trace2 event that emits all the loosened objects. This is to avoid
adding user-visible option just for the sake of testing it.

Jonathan Tan <jonathantanmy@google.com> writes:

>> The `git repack -d` command will remove any packfile that becomes
>> redundant after repacking, and then call  `git pruned-packed` for
>> pruning any unpacked objects.
>
> s/pruned-packed/prune-packed/ (note that there is no "d" after "prune")
> throughout this commit message.
>
> Also, if there are any objects pruned, they are packed objects, not
> unpacked objects. Maybe better to say "...for pruning any objects
> already in packs".
>

Thanks for catching the typo and the clarification.

>> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
>> index 25b235c063..728a16ad97 100755
>> --- a/t/t7700-repack.sh
>> +++ b/t/t7700-repack.sh
>> @@ -127,12 +127,7 @@ test_expect_success 'packed unreachable obs in alternate ODB are not loosened' '
>>  	git reset --hard HEAD^ &&
>>  	test_tick &&
>>  	git reflog expire --expire=$test_tick --expire-unreachable=$test_tick --all &&
>> -	# The pack-objects call on the next line is equivalent to
>> -	# git repack -A -d without the call to prune-packed
>> -	git pack-objects --honor-pack-keep --non-empty --all --reflog \
>> -	    --unpack-unreachable </dev/null pack &&
>> -	rm -f .git/objects/pack/* &&
>> -	mv pack-* .git/objects/pack/ &&
>> +	git repack -A -d --no-prune-packed &&
>
> This is great!
>

Unfortunately, with the drop of this patch in v2, we no longer refactor
these tests. However, this still might be possible with the trace2 event
information that is now-added in the v2. Of course, if this doesn't
impact or weakens the test in anyway.

>> +test_expect_success '-A -d and --no-prune-packed do not remove loose objects' '
>> +	test_create_repo repo &&
>> +	test_when_finished "rm -rf repo" &&
>> +	test_commit -C repo commit &&
>> +	git -C repo repack -A -d --no-prune-packed &&
>> +	git -C repo count-objects -v >out &&
>> +	grep "^prune-packable: 3" out
>> +'
>
> As for the test description, I would prefer "...do not remove loose
> objects already in packs".

Yes, this is an improvement of the test description. 

-- 
Thanks
Rafael

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 1/1] repack: avoid loosening promisor objects in partial clones
  2021-04-18 13:57     ` [PATCH v2 1/1] repack: avoid loosening promisor objects in partial clones Rafael Silva
@ 2021-04-19 19:15       ` Jonathan Tan
  2021-04-21 18:54         ` Rafael Silva
  2021-04-19 23:09       ` Junio C Hamano
  1 sibling, 1 reply; 46+ messages in thread
From: Jonathan Tan @ 2021-04-19 19:15 UTC (permalink / raw)
  To: rafaeloliveira.cs; +Cc: git, peff, jonathantanmy, szeder.dev, gitster

> When `git repack -A -d` is run in a partial clone, `pack-objects`
> is invoked twice: once to repack all promisor objects, and once to
> repack all non-promisor objects. The latter `pack-objects` invocation
> is with --exclude-promisor-objects and --unpack-unreachable, which
> loosens all unused objects. Unfortunately, this includes promisor
> objects.

s/loosens all unused objects/loosens all objects unused during this invocation/

> [snip] The --keep-pack option takes only a packfile name, but we
> concatenate both the path and the name in a single string. Instead,
> let's split them into separate string in order to easily pass the
> packfile name later.

I think mentioning this part is unnecessary in the commit message.

With or without these changes, this patch looks good to me.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 1/1] repack: avoid loosening promisor objects in partial clones
  2021-04-18 13:57     ` [PATCH v2 1/1] repack: avoid loosening promisor objects in partial clones Rafael Silva
  2021-04-19 19:15       ` Jonathan Tan
@ 2021-04-19 23:09       ` Junio C Hamano
  2021-04-21 19:25         ` Rafael Silva
  1 sibling, 1 reply; 46+ messages in thread
From: Junio C Hamano @ 2021-04-19 23:09 UTC (permalink / raw)
  To: Rafael Silva; +Cc: git, Jeff King, Jonathan Tan, SZEDER Gábor

Rafael Silva <rafaeloliveira.cs@gmail.com> writes:

> When `git repack -A -d` is run in a partial clone, `pack-objects`
> is invoked twice: once to repack all promisor objects, and once to
> repack all non-promisor objects. The latter `pack-objects` invocation
> is with --exclude-promisor-objects and --unpack-unreachable, which
> loosens all unused objects. Unfortunately, this includes promisor
> objects.
>
> Because the -d argument to `git repack` subsequently deletes all loose
> objects also in packs, these just-loosened promisor objects will be
> immediately deleted. However, this extra disk churn is unnecessary in
> the first place.  For example, a newly-clone partial repo that filters

"in a newly-cloned partial repo", I'd think.

> For testing, we need to validate whether any object was loosened.
> However, the "evidence" (loosened objects) is deleted during the
> process which prevents us from inspecting the object directory.
> Instead, let's teach `pack-objects` to count loosened objects and
> emit via trace2 thus allowing inspecting the debug events after the
> process is finished. This new event is used on the added regression
> test.

Nicely designed.

> +	uint32_t loosened_objects_nr = 0;
>  	struct object_id oid;
>  
>  	for (p = get_all_packs(the_repository); p; p = p->next) {
> @@ -3492,11 +3493,16 @@ static void loosen_unused_packed_objects(void)
>  			nth_packed_object_id(&oid, p, i);
>  			if (!packlist_find(&to_pack, &oid) &&
>  			    !has_sha1_pack_kept_or_nonlocal(&oid) &&
> -			    !loosened_object_can_be_discarded(&oid, p->mtime))
> +			    !loosened_object_can_be_discarded(&oid, p->mtime)) {
>  				if (force_object_loose(&oid, p->mtime))
>  					die(_("unable to force loose object"));
> +				loosened_objects_nr++;
> +			}
>  		}
>  	}
> +
> +	trace2_data_intmax("pack-objects", the_repository,
> +			   "loosen_unused_packed_objects/loosened", loosened_objects_nr);
>  }

OK, so this is just the "stats".

> diff --git a/builtin/repack.c b/builtin/repack.c
> index 2847fdfbab..5f9bc74adc 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -20,7 +20,7 @@ static int delta_base_offset = 1;
>  static int pack_kept_objects = -1;
>  static int write_bitmaps = -1;
>  static int use_delta_islands;
> -static char *packdir, *packtmp;
> +static char *packdir, *packtmp_name, *packtmp;
>  
>  static const char *const git_repack_usage[] = {
>  	N_("git repack [<options>]"),
> @@ -530,7 +530,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	}
>  
>  	packdir = mkpathdup("%s/pack", get_object_directory());
> -	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
> +	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
> +	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);

Just a mental note, but we should move away from ".tmp-$$" that is a
remnant from the days back when this was a shell script, and use the
tempfile.h API (#leftoverbits).  Such a change must not be part of
this topic, of course.

Thanks.  Will queue and see what others say.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 1/1] repack: avoid loosening promisor objects in partial clones
  2021-04-19 19:15       ` Jonathan Tan
@ 2021-04-21 18:54         ` Rafael Silva
  0 siblings, 0 replies; 46+ messages in thread
From: Rafael Silva @ 2021-04-21 18:54 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff, szeder.dev, gitster


Jonathan Tan <jonathantanmy@google.com> writes:

>> When `git repack -A -d` is run in a partial clone, `pack-objects`
>> is invoked twice: once to repack all promisor objects, and once to
>> repack all non-promisor objects. The latter `pack-objects` invocation
>> is with --exclude-promisor-objects and --unpack-unreachable, which
>> loosens all unused objects. Unfortunately, this includes promisor
>> objects.
>
> s/loosens all unused objects/loosens all objects unused during this invocation/
>

Thanks, will include this change in v3.

>> [snip] The --keep-pack option takes only a packfile name, but we
>> concatenate both the path and the name in a single string. Instead,
>> let's split them into separate string in order to easily pass the
>> packfile name later.
>
> I think mentioning this part is unnecessary in the commit message.
>

make sense. I'll remove then to reduce the commit message.

> With or without these changes, this patch looks good to me.

Thanks for the review.

-- 
Thanks
Rafael

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v2 1/1] repack: avoid loosening promisor objects in partial clones
  2021-04-19 23:09       ` Junio C Hamano
@ 2021-04-21 19:25         ` Rafael Silva
  0 siblings, 0 replies; 46+ messages in thread
From: Rafael Silva @ 2021-04-21 19:25 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Jonathan Tan, SZEDER Gábor


Junio C Hamano <gitster@pobox.com> writes:

> Rafael Silva <rafaeloliveira.cs@gmail.com> writes:
>
>> When `git repack -A -d` is run in a partial clone, `pack-objects`
>> is invoked twice: once to repack all promisor objects, and once to
>> repack all non-promisor objects. The latter `pack-objects` invocation
>> is with --exclude-promisor-objects and --unpack-unreachable, which
>> loosens all unused objects. Unfortunately, this includes promisor
>> objects.
>>
>> Because the -d argument to `git repack` subsequently deletes all loose
>> objects also in packs, these just-loosened promisor objects will be
>> immediately deleted. However, this extra disk churn is unnecessary in
>> the first place.  For example, a newly-clone partial repo that filters
>
> "in a newly-cloned partial repo", I'd think.
>

Thanks, will fix on the next revision.

>> For testing, we need to validate whether any object was loosened.
>> However, the "evidence" (loosened objects) is deleted during the
>> process which prevents us from inspecting the object directory.
>> Instead, let's teach `pack-objects` to count loosened objects and
>> emit via trace2 thus allowing inspecting the debug events after the
>> process is finished. This new event is used on the added regression
>> test.
>
> Nicely designed.
>

Thanks :)

>> +	uint32_t loosened_objects_nr = 0;
>>  	struct object_id oid;
>>  
>>  	for (p = get_all_packs(the_repository); p; p = p->next) {
>> @@ -3492,11 +3493,16 @@ static void loosen_unused_packed_objects(void)
>>  			nth_packed_object_id(&oid, p, i);
>>  			if (!packlist_find(&to_pack, &oid) &&
>>  			    !has_sha1_pack_kept_or_nonlocal(&oid) &&
>> -			    !loosened_object_can_be_discarded(&oid, p->mtime))
>> +			    !loosened_object_can_be_discarded(&oid, p->mtime)) {
>>  				if (force_object_loose(&oid, p->mtime))
>>  					die(_("unable to force loose object"));
>> +				loosened_objects_nr++;
>> +			}
>>  		}
>>  	}
>> +
>> +	trace2_data_intmax("pack-objects", the_repository,
>> +			   "loosen_unused_packed_objects/loosened", loosened_objects_nr);
>>  }
>
> OK, so this is just the "stats".
>
>> diff --git a/builtin/repack.c b/builtin/repack.c
>> index 2847fdfbab..5f9bc74adc 100644
>> --- a/builtin/repack.c
>> +++ b/builtin/repack.c
>> @@ -20,7 +20,7 @@ static int delta_base_offset = 1;
>>  static int pack_kept_objects = -1;
>>  static int write_bitmaps = -1;
>>  static int use_delta_islands;
>> -static char *packdir, *packtmp;
>> +static char *packdir, *packtmp_name, *packtmp;
>>  
>>  static const char *const git_repack_usage[] = {
>>  	N_("git repack [<options>]"),
>> @@ -530,7 +530,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>>  	}
>>  
>>  	packdir = mkpathdup("%s/pack", get_object_directory());
>> -	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
>> +	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
>> +	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
>
> Just a mental note, but we should move away from ".tmp-$$" that is a
> remnant from the days back when this was a shell script, and use the
> tempfile.h API (#leftoverbits).  Such a change must not be part of
> this topic, of course.
>

Indeed. This should be move tempfile.h API.

>
> Thanks.  Will queue and see what others say.

Thanks for reviewing it.

-- 
Thanks
Rafael

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v3] repack: avoid loosening promisor objects in partial clones
  2021-04-18 13:57   ` [PATCH v2 0/1] " Rafael Silva
  2021-04-18 13:57     ` [PATCH v2 1/1] repack: avoid loosening promisor objects in partial clones Rafael Silva
@ 2021-04-21 19:32     ` Rafael Silva
  1 sibling, 0 replies; 46+ messages in thread
From: Rafael Silva @ 2021-04-21 19:32 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Jonathan Tan, Junio C Hamano, SZEDER Gábor, Rafael Silva

When `git repack -A -d` is run in a partial clone, `pack-objects`
is invoked twice: once to repack all promisor objects, and once to
repack all non-promisor objects. The latter `pack-objects` invocation
is with --exclude-promisor-objects and --unpack-unreachable, which
loosens all objects unused during this invocation. Unfortunately,
this includes promisor objects.

Because the -d argument to `git repack` subsequently deletes all loose
objects also in packs, these just-loosened promisor objects will be
immediately deleted. However, this extra disk churn is unnecessary in
the first place.  For example, in a newly-cloned partial repo that
filters all blob objects (e.g. `--filter=blob:none`), `repack` ends up
unpacking all trees and commits into the filesystem because every
object, in this particular case, is a promisor object. Depending on
the repo size, this increases the disk usage considerably: In my copy
of the linux.git, the object directory peaked 26GB of more disk usage.

In order to avoid this extra disk churn, pass the names of the promisor
packfiles as --keep-pack arguments to the second invocation of
`pack-objects`. This informs `pack-objects` that the promisor objects
are already in a safe packfile and, therefore, do not need to be
loosened.

For testing, we need to validate whether any object was loosened.
However, the "evidence" (loosened objects) is deleted during the
process which prevents us from inspecting the object directory.
Instead, let's teach `pack-objects` to count loosened objects and
emit via trace2 thus allowing inspecting the debug events after the
process is finished. This new event is used on the added regression
test.

Lastly, add a new perf test to evaluate the performance impact
made by this changes (tested on git.git):

     Test          HEAD^                 HEAD
     ----------------------------------------------------------
     5600.3: gc    134.38(41.93+90.95)   7.80(6.72+1.35) -94.2%

For a bigger repository, such as linux.git, the improvement is
even bigger:

     Test          HEAD^                     HEAD
     -------------------------------------------------------------------
     5600.3: gc    6833.00(918.07+3162.74)   268.79(227.02+39.18) -96.1%

These improvements are particular big because every object in the
newly-cloned partial repository is a promisor object.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Rafael Silva <rafaeloliveira.cs@gmail.com>
---

This series is built on top of jk/promisor-optim (now graduated to next). It
conflicts with changes on p5600 otherwise.

I believe the changes is pretty clear from the `range-diff` (only changes
on the commit message).

Range-diff against v2:
1:  7380e3e724 ! 1:  6d94907ca3 repack: avoid loosening promisor objects in partial clones
    @@ Commit message
         is invoked twice: once to repack all promisor objects, and once to
         repack all non-promisor objects. The latter `pack-objects` invocation
         is with --exclude-promisor-objects and --unpack-unreachable, which
    -    loosens all unused objects. Unfortunately, this includes promisor
    -    objects.
    +    loosens all objects unused during this invocation. Unfortunately,
    +    this includes promisor objects.
     
         Because the -d argument to `git repack` subsequently deletes all loose
         objects also in packs, these just-loosened promisor objects will be
         immediately deleted. However, this extra disk churn is unnecessary in
    -    the first place.  For example, a newly-clone partial repo that filters
    -    all blob objects (e.g. `--filter=blob:none`), `repack` ends up
    +    the first place.  For example, in a newly-cloned partial repo that
    +    filters all blob objects (e.g. `--filter=blob:none`), `repack` ends up
         unpacking all trees and commits into the filesystem because every
         object, in this particular case, is a promisor object. Depending on
         the repo size, this increases the disk usage considerably: In my copy
    @@ Commit message
         packfiles as --keep-pack arguments to the second invocation of
         `pack-objects`. This informs `pack-objects` that the promisor objects
         are already in a safe packfile and, therefore, do not need to be
    -    loosened. The --keep-pack option takes only a packfile name, but we
    -    concatenate both the path and the name in a single string. Instead,
    -    let's split them into separate string in order to easily pass the
    -    packfile name later.
    +    loosened.
     
         For testing, we need to validate whether any object was loosened.
         However, the "evidence" (loosened objects) is deleted during the

 builtin/pack-objects.c        | 8 +++++++-
 builtin/repack.c              | 9 +++++++--
 t/perf/p5600-partial-clone.sh | 4 ++++
 t/t5616-partial-clone.sh      | 8 ++++++++
 4 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c1186f50a3..208cce228d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3479,6 +3479,7 @@ static void loosen_unused_packed_objects(void)
 {
 	struct packed_git *p;
 	uint32_t i;
+	uint32_t loosened_objects_nr = 0;
 	struct object_id oid;
 
 	for (p = get_all_packs(the_repository); p; p = p->next) {
@@ -3492,11 +3493,16 @@ static void loosen_unused_packed_objects(void)
 			nth_packed_object_id(&oid, p, i);
 			if (!packlist_find(&to_pack, &oid) &&
 			    !has_sha1_pack_kept_or_nonlocal(&oid) &&
-			    !loosened_object_can_be_discarded(&oid, p->mtime))
+			    !loosened_object_can_be_discarded(&oid, p->mtime)) {
 				if (force_object_loose(&oid, p->mtime))
 					die(_("unable to force loose object"));
+				loosened_objects_nr++;
+			}
 		}
 	}
+
+	trace2_data_intmax("pack-objects", the_repository,
+			   "loosen_unused_packed_objects/loosened", loosened_objects_nr);
 }
 
 /*
diff --git a/builtin/repack.c b/builtin/repack.c
index 2847fdfbab..5f9bc74adc 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -20,7 +20,7 @@ static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
 static int write_bitmaps = -1;
 static int use_delta_islands;
-static char *packdir, *packtmp;
+static char *packdir, *packtmp_name, *packtmp;
 
 static const char *const git_repack_usage[] = {
 	N_("git repack [<options>]"),
@@ -530,7 +530,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 
 	packdir = mkpathdup("%s/pack", get_object_directory());
-	packtmp = mkpathdup("%s/.tmp-%d-pack", packdir, (int)getpid());
+	packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
+	packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
 
 	sigchain_push_common(remove_pack_on_signal);
 
@@ -573,6 +574,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		repack_promisor_objects(&po_args, &names);
 
 		if (existing_packs.nr && delete_redundant) {
+			for_each_string_list_item(item, &names) {
+				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
+					     packtmp_name, item->string);
+			}
 			if (unpack_unreachable) {
 				strvec_pushf(&cmd.args,
 					     "--unpack-unreachable=%s",
diff --git a/t/perf/p5600-partial-clone.sh b/t/perf/p5600-partial-clone.sh
index ca785a3341..a965f2c4d6 100755
--- a/t/perf/p5600-partial-clone.sh
+++ b/t/perf/p5600-partial-clone.sh
@@ -35,4 +35,8 @@ test_perf 'count non-promisor commits' '
 	git -C bare.git rev-list --all --count --exclude-promisor-objects
 '
 
+test_perf 'gc' '
+	git -C bare.git gc
+'
+
 test_done
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index 5cb415386e..cf3e82bdf5 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -548,6 +548,14 @@ test_expect_success 'fetch from a partial clone, protocol v2' '
 	grep "version 2" trace
 '
 
+test_expect_success 'repack does not loosen promisor objects' '
+	rm -rf client trace &&
+	git clone --bare --filter=blob:none "file://$(pwd)/srv.bare" client &&
+	test_when_finished "rm -rf client trace" &&
+	GIT_TRACE2_PERF="$(pwd)/trace" git -C client repack -A -d &&
+	grep "loosen_unused_packed_objects/loosened:0" trace
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
-- 
2.31.0.715.g22a2752fc5


^ permalink raw reply related	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2021-04-21 19:35 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-03  9:04 rather slow 'git repack' in 'blob:none' partial clones SZEDER Gábor
2021-04-05  1:02 ` Rafael Silva
2021-04-07 21:17   ` Jeff King
2021-04-08  0:02     ` Jonathan Tan
2021-04-08  0:35       ` Jeff King
2021-04-12  7:09     ` Rafael Silva
2021-04-12 21:36     ` SZEDER Gábor
2021-04-12 21:49       ` Bryan Turner
2021-04-12 23:51         ` Jeff King
2021-04-12 23:47       ` Jeff King
2021-04-13  7:12         ` [PATCH 0/3] low-hanging performance fruit with promisor packs Jeff King
2021-04-13  7:15           ` [PATCH 1/3] is_promisor_object(): free tree buffer after parsing Jeff King
2021-04-13 20:17             ` Junio C Hamano
2021-04-14  5:18               ` Jeff King
2021-04-13  7:16           ` [PATCH 2/3] lookup_unknown_object(): take a repository argument Jeff King
2021-04-13  7:17           ` [PATCH 3/3] revision: avoid parsing with --exclude-promisor-objects Jeff King
2021-04-13 20:22             ` Junio C Hamano
2021-04-13 18:10           ` [PATCH 0/3] low-hanging performance fruit with promisor packs SZEDER Gábor
2021-04-14 17:14           ` Jonathan Tan
2021-04-14 19:22           ` Rafael Silva
2021-04-13 18:05         ` rather slow 'git repack' in 'blob:none' partial clones SZEDER Gábor
2021-04-14  5:14           ` Jeff King
2021-04-11 10:59   ` SZEDER Gábor
2021-04-12  7:53     ` Rafael Silva
2021-04-14 19:14 ` [PATCH 0/2] prevent `repack` to unpack and delete promisor objects Rafael Silva
2021-04-14 19:14   ` [PATCH 1/2] repack: teach --no-prune-packed to skip `git prune-packed` Rafael Silva
2021-04-14 23:50     ` Jonathan Tan
2021-04-18 14:15       ` Rafael Silva
2021-04-14 19:14   ` [PATCH 2/2] repack: avoid loosening promisor pack objects in partial clones Rafael Silva
2021-04-15  1:04     ` Jonathan Tan
2021-04-15  3:51       ` Junio C Hamano
2021-04-15  9:03         ` Jeff King
2021-04-15  9:05       ` Jeff King
2021-04-18  7:12       ` Rafael Silva
2021-04-15 18:06     ` Junio C Hamano
2021-04-18  8:40       ` Rafael Silva
2021-04-14 22:10   ` [PATCH 0/2] prevent `repack` to unpack and delete promisor objects Junio C Hamano
2021-04-15  9:15   ` Jeff King
2021-04-18  8:20     ` Rafael Silva
2021-04-18 13:57   ` [PATCH v2 0/1] " Rafael Silva
2021-04-18 13:57     ` [PATCH v2 1/1] repack: avoid loosening promisor objects in partial clones Rafael Silva
2021-04-19 19:15       ` Jonathan Tan
2021-04-21 18:54         ` Rafael Silva
2021-04-19 23:09       ` Junio C Hamano
2021-04-21 19:25         ` Rafael Silva
2021-04-21 19:32     ` [PATCH v3] " Rafael Silva

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).