All of lore.kernel.org
 help / color / mirror / Atom feed
From: Derrick Stolee <derrickstolee@github.com>
To: Richard Oliver <roliver@roku.com>, git@vger.kernel.org
Cc: jonathantanmy@google.com
Subject: Re: [PATCH] mktree: learn about promised objects
Date: Tue, 14 Jun 2022 13:27:18 -0400	[thread overview]
Message-ID: <574dc4a9-b3c7-1fd3-8c0e-39071117c7f0@github.com> (raw)
In-Reply-To: <797af8c8-229f-538b-d122-8ea48067cc19@roku.com>

On 6/14/2022 12:33 PM, Richard Oliver wrote:
> On 14/06/2022 15:14, Derrick Stolee wrote:
>> On 6/14/2022 9:36 AM, Richard Oliver wrote:
>>> Do not use oid_object_info() to determine object type in mktree_line()
>>> as this can cause promised objects to be dynamically faulted-in one at a
>>> time which has poor performance. Instead, use a combination of
>>> oid_object_info_extended() (with OBJECT_INFO_SKIP_FETCH_OBJECT option),
>>> and the newly introduced promisor_object_type() to determine object type
>>> before defaulting to fetch from remote.
>>
>> Have you run some performance tests on this? It seems like this will
>> scan all packed objects, which is probably much slower than just asking
>> the remote for the object in most cases.
>>
>> Thanks,
>> -Stolee
> 
> 
> Hi Stolee,
> 
> I've put together a synthetic experiment below (adding a new blob to anexisting tree) to show you the behaviour that we've been seeing.  Our
> actual use-case (where we first encountered this behaviour) is updating
> submodules to a known hash. As you can see, the round-trip time of fetching
> objects one-by-one is very expensive.
> 
> Before, using system git (git version 2.32.0 (Apple Git-132)):
> 
>> $ git init
>> # Fetch a recent tree
>> $ git fetch --filter=tree:0 --depth 1 https://github.com/git/git cdb48006b0ec7fe19794daf7b5363ab42d9d9371
>> remote: Enumerating objects: 1, done.
>> remote: Counting objects: 100% (1/1), done.
>> remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
>> Receiving objects: 100% (1/1), 13.12 KiB | 13.12 MiB/s, done.
>> From https://github.com/git/git
>>  * branch            cdb48006b0ec7fe19794daf7b5363ab42d9d9371 -> FETCH_HEAD
>>
>> $ NEW_BLOB=$(echo zzz | git hash-object --stdin -w)
>>
>> $ cat <(git ls-tree FETCH_HEAD) <(printf "100644 blob ${NEW_BLOB}\tzzz") | time git mktree
>> remote: Enumerating objects: 1, done.
>> remote: Counting objects: 100% (1/1), done.
>> remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
>> Receiving objects: 100% (1/1), 334 bytes | 334.00 KiB/s, done.
>> remote: Enumerating objects: 1, done.
>> remote: Counting objects: 100% (1/1), done.
>> remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
>> Receiving objects: 100% (1/1), 2.01 KiB | 2.01 MiB/s, done.
>> remote: Enumerating objects: 1, done.
>> remote: Counting objects: 100% (1/1), done.
>> remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
>> Receiving objects: 100% (1/1), 256 bytes | 256.00 KiB/s, done.
>> # ...
>> # SOME TIME LATER
>> # ...
>> e26c7ce7357b1649da7b4200d4e80d0b668db8d4
>>       286.49 real        15.66 user        15.59 sys

I see. The problem here is that we are making _many requests_ for the same
tree, so maybe it would be better to introduce a batched download for the
list of missing objects. This would require iterating over the objects for
the tree to check existence (in quick mode) and adding the missing ones in
a list, then requesting that set altogether in a single request.

That probably won't be as fast as your modified mktree experiment below,
but would match my expectations of "probably faster assuming the repo is
big enough".

> Repeated experiment, but using modified mktree:
> 
>> $ cat <(git ls-tree FETCH_HEAD) <(printf "100644 blob ${NEW_BLOB}\tzzz") | time git mktree
>> e26c7ce7357b1649da7b4200d4e80d0b668db8d4
>>         0.06 real         0.01 user         0.03 sys
> 
> Did you have any other sort of performance test in mind? The remotes we
> typically deal with are geographically far away and deal with a high volume
> of traffic so we're keen to move behaviour to the client where it makes sense
> to do so.

I guess I wonder how large your promisor pack-files are in this test,
since your implementation depends on for_each_packed_object(), which
should be really inefficient if you're actually dealing with a large
partial clone.

Thanks,
-Stolee

  reply	other threads:[~2022-06-14 17:27 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-14 13:36 [PATCH] mktree: learn about promised objects Richard Oliver
2022-06-14 14:14 ` Derrick Stolee
2022-06-14 16:33   ` Richard Oliver
2022-06-14 17:27     ` Derrick Stolee [this message]
2022-06-15  0:35       ` Taylor Blau
2022-06-15  4:00         ` Jeff King
2022-06-15 17:40           ` Richard Oliver
2022-06-15 18:17             ` Derrick Stolee
2022-06-16  6:07               ` Jeff King
2022-06-16  6:54                 ` [PATCH] is_promisor_object(): walk promisor packs in pack-order Jeff King
2022-06-16 14:00                   ` Derrick Stolee
2022-06-17 19:50                   ` Jonathan Tan
2022-06-16 13:59                 ` [PATCH] mktree: learn about promised objects Derrick Stolee
2022-06-15 21:01             ` Junio C Hamano
2022-06-16  5:02               ` Jeff King
2022-06-16 15:46               ` [PATCH] mktree: Make '--missing' behave as documented Richard Oliver
2022-06-16 17:44                 ` Junio C Hamano
2022-06-21 13:59                   ` [PATCH] mktree: do not check type of remote objects Richard Oliver
2022-06-21 16:51                     ` Junio C Hamano
2022-06-21 17:48                     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=574dc4a9-b3c7-1fd3-8c0e-39071117c7f0@github.com \
    --to=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    --cc=roliver@roku.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.