git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ben Peart <peartben@gmail.com>
To: Duy Nguyen <pclouds@gmail.com>, Ben Peart <Ben.Peart@microsoft.com>
Cc: Git Mailing List <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>,
	Eric Sunshine <sunshine@sunshineco.com>
Subject: Re: [PATCH v2] checkout: optimize "git checkout -b <new_branch>"
Date: Thu, 2 Aug 2018 14:02:00 -0400	[thread overview]
Message-ID: <3900401c-4d7b-233c-2098-9771a06ec0dd@gmail.com> (raw)
In-Reply-To: <CACsJy8DMEMsDnKZc65K-0EJcm2udXZ7OKY=xoFmX4COM0dSH=g@mail.gmail.com>



On 8/1/2018 11:10 AM, Duy Nguyen wrote:
> On Tue, Jul 31, 2018 at 7:03 PM Ben Peart <Ben.Peart@microsoft.com> wrote:
>>
>> From: Ben Peart <Ben.Peart@microsoft.com>
>>
>> Skip merging the commit, updating the index and working directory if and
>> only if we are creating a new branch via "git checkout -b <new_branch>."
>> Any other checkout options will still go through the former code path.
> 
> I'd like to see this giant list of checks broken down and pushed down
> to smaller areas so that chances of new things being added but checks
> not updated become much smaller. And ideally there should just be no
> behavior change (I think with your change, "checkout -b" will not
> report local changes, but it's not mentioned in the config document;
> more things like that can easily slip).
> 

One trade off of pushing these optimizations down into the lower-level 
functions is that they have a greater potential to break other command 
if our assumptions are wrong.  Changing these low level functions is a 
much more invasive set of patches.

I didn't feel confident enough to pursue this path and instead, decided 
to do the single, high level optimization around the specific scenario. 
While it has its own drawbacks (the nasty set of conditions we're 
testing), the potential for breaking other commands is much smaller.

That said, I'm willing to look into the model of pushing the 
checks/optimizations down to smaller areas if we can 1) ensure we aren't 
breaking other commands and 2) we can get similar performance.

> So. I assume this reason for this patch is because on super large worktrees
> 
>   - 2-way merge is too slow
>   - applying spare checkout patterns on a huge worktree is also slow
>   - writing index is, again, slow
>   - show_local_changes() slow
> 

That is pretty close but here is some real data on a large repo.

"git checkout -b <new_branch>" with this patch takes 0.32 seconds.
"git checkout -b <new_branch>" without this patch takes 14.6 seconds.

Note, all numbers are with a hot disk cache - real world numbers for the 
unpatched case can be much worse as it has to do a lot of disk IO to 
read/write the 267 MB index, load 500K+ tree objects, etc:

Name                                      	Inc %	      Inc
  ||+ git!mainCRTStartup                   	 89.2	   13,380
  || + git!__tmainCRTStartup               	 89.2	   13,380
  ||  + git!cmd_main                       	 89.2	   13,380
  ||   + git!handle_builtin                	 89.2	   13,380
  ||    + git!cmd_checkout                 	 89.2	   13,380
  ||     + git!unpack_trees                	 71.5	   10,725
  ||     |+ git!traverse_trees             	 39.7	    5,956
  ||     |+ git!cache_tree_update          	 16.1	    2,408
  ||     |+ git!??unpack_callback          	 11.0	    1,649
  ||     |+ git!discard_index              	  2.8	      423
  ||     + git!write_locked_index          	  8.4	    1,257
  ||     + git!??cache_tree_invalidate_path	  5.1	      767
  ||     + git!read_index_preload          	  3.4	      514

> For 2-way merge, I believe we can detect inside unpack_trees() that
> it's a 2-way merge (fn == twoway_merge), from HEAD to HEAD (simple
> enough check), then from the 2-way merge table we know for sure
> nothing is going to change and we can just skip traverse_trees() call
> in unpack_trees().
> 

If we can skip the call to traverse_trees(), that will give us the bulk 
of the savings (39.7% + 11% = 50.7% if my math is correct).

> On the sparse checkout application. This only needs to be done when
> there are new files added, or the spare-checkout file has been updated
> since the last time it's been used. We can keep track of these things
> (sparse-checkout file change could be kept track with just stat info
> maybe as an index extension) then we can skip applying sparse checkout
> not for this particular case for but general checkouts as well. Spare
> checkout file rarely changes. Big win overall.
> 

With the current patch, we don't need to load or update the index at 
all.  Without the patch, we've already replaced the standard 
sparse-checkout logic with something significantly faster so in our 
particular case, I think it's safe to skip the additional complexity of 
keeping track of changes to the sparse-checkout file.

> And if all go according to plan, there will be no changes made in the
> index (by either 2-way merge or sparse checkout stuff) we should be
> able to just skip writing down the index, if we haven't done that
> already.
> 

That would be great as writing the index is 8.4% of the time spent.

> show_local_changes() should be sped up significantly with the new
> cache-tree optimization I'm working on in another thread.
> 

As you can see, updating the cache_tree is relatively expensive (16.1% + 
5.1%) so we would definitely benefit from any improvements there.

> If I have not made any mistake in my analysis so far, we achieve a big
> speedup without adding a new config knob and can still fall back to
> slower-but-same-behavior when things are not in the right condition. I
> know it will not be the same speedup as this patch because when facing
> thousands of items, even counting them takes time. But I think it's a
> reasonable speedup without making the code base more fragile.
> 

So my rough math says we can potentially save 50.7% + 8.4% + (x * 21.2%) 
(where x is the percentage savings with an optimized cache_tree).  If we 
assume x == 50%, that means we can save 69.7% of the overall time.

For comparison, that would put "git checkout -b <new_branch>" at:

0.3 seconds - with the current patch
4.4 seconds - with the proposed patch
14.6 seconds - with no patch

Am I missing anything?  Is my math wrong?  Any other ideas for how to 
improve the proposed patch?

Thanks,

Ben

  reply	other threads:[~2018-08-02 18:02 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-24 18:01 [PATCH v1] checkout: optionally speed up "git checkout -b foo" Ben Peart
2018-07-24 18:42 ` Eric Sunshine
2018-07-24 19:45   ` Ben Peart
2018-07-26 15:04     ` Junio C Hamano
2018-07-26 18:59       ` Eric Sunshine
2018-07-26 19:08         ` Eric Sunshine
2018-07-24 19:21 ` Junio C Hamano
2018-07-24 20:47   ` Ben Peart
2018-07-31 16:39 ` [PATCH v2] checkout: optimize "git checkout -b <new_branch>" Ben Peart
2018-07-31 20:01   ` Junio C Hamano
2018-08-01 15:10   ` Duy Nguyen
2018-08-02 18:02     ` Ben Peart [this message]
2018-08-03 15:58       ` Duy Nguyen
2018-08-06 14:25         ` Ben Peart
2018-08-15 21:05           ` Ben Peart
2018-08-05  8:57       ` Duy Nguyen
2018-08-16 18:27 ` [PATCH v3] " Ben Peart
2018-08-16 18:37   ` Duy Nguyen
2018-08-17 12:37     ` Ben Peart
2018-08-19  1:44       ` Elijah Newren
2018-08-20 13:40         ` Ben Peart
2018-08-20 18:16           ` Elijah Newren
2018-08-21 14:51             ` Duy Nguyen
2018-08-30 17:22               ` Elijah Newren
2018-09-04 16:46                 ` Duy Nguyen
2018-08-20 18:31         ` Junio C Hamano
2018-09-18  5:34   ` [PATCH] config doc: add missing list separator for checkout.optimizeNewBranch Ævar Arnfjörð Bjarmason
2018-09-18 16:57     ` Taylor Blau
2018-09-18 17:16       ` Jeff King
2018-09-18 17:20         ` Taylor Blau
2018-09-18 17:13     ` Jeff King
2018-09-19  4:41       ` Ævar Arnfjörð Bjarmason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3900401c-4d7b-233c-2098-9771a06ec0dd@gmail.com \
    --to=peartben@gmail.com \
    --cc=Ben.Peart@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    --cc=sunshine@sunshineco.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).