All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org
Subject: Re: [RFC PATCH] diff: only prefetch for certain output formats
Date: Wed, 29 Jan 2020 00:09:35 -0500	[thread overview]
Message-ID: <20200129050935.GA601903@coredump.intra.peff.net> (raw)
In-Reply-To: <20200128213508.31661-1-jonathantanmy@google.com>

On Tue, Jan 28, 2020 at 01:35:08PM -0800, Jonathan Tan wrote:

> Running "git status" on a partial clone that was cloned with
> "--no-checkout", like this:
> 
>   git clone --no-checkout --filter=blob:none <url> foo
>   git -C foo status
> 
> results in an unnecessary lazy fetch. This is because of commit
> 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08), which
> optimized "diff" by prefetching, but inadvertently affected at least one
> command ("status") that does not need prefetching. These 2 cases can be
> distinguished by looking at output_format; the former uses
> DIFF_FORMAT_PATCH and the latter uses DIFF_FORMAT_CALLBACK.

Sometimes "status" does need the actual blobs, if it's going to do
inexact rename detection. Likewise if break-detection is turned on,
though I don't think anything does it by default, and there's (I don't
think) any config option to enable it.

So something like "git log --name-status -B -M" would probably regress
from this (though only in speed, of course; I do think we can play a
little loose with heuristics since we'd generate the correct answer
either way).

You could get pretty specific by putting logic inside diffcore_rename(),
which would know if anything is left over after exact rename detection,
but I suspect just checking:

  (options->break_opt != -1 || options->detect_rename)

in diffcore_std() would be OK in practice.

> (Note that other callers that use DIFF_FORMAT_CALLBACK will also lose
> prefetching. I haven't investigated enough to see if this is a net
> benefit or drawback, but I think this will need to be done on a
> caller-by-caller basis and can be done in the future.)

We can't know what the caller is going to do with a
DIFF_FORMAT_CALLBACK, so I think it makes sense to avoid pre-fetching in
that case. It might be nice, though, if that pre-fetch loop in
diffcore_std() were pulled into a helper function, so they could just
call:

  diffcore_prefetch(&options);

or something. I'm OK to wait on that until one of the FORMAT_CALLBACK
users needs it, though.

> Points of discussion I can think of:
> 
>  - Is the whitelist of output_format constants the best?

I think it could be pared down a bit. For example, --raw doesn't
need the blobs (aside from renames, etc, above). I think the same is
true of --summary. You've already omitted --name-status and --name-only,
which makes sense.

I think --dirstat, even though it is only showing per-file info, still
relies on the line-level stat info. So it should stay.

>  - Should we just have the callers pass a prefetch boolean arg instead
>    of trying to guess it ourselves? I'm leaning towards no since I think
>    we should avoid options unless they are necessary.

If we can avoid it, I think we should. It makes sense to me to try to
tweak the heuristics as much as possible in this central code. If
there's a case that does the wrong thing, then we can decide whether the
heuristics can be improved, or if we need more information from the
caller.

-Peff

  reply	other threads:[~2020-01-29  5:09 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-28 21:35 [RFC PATCH] diff: only prefetch for certain output formats Jonathan Tan
2020-01-29  5:09 ` Jeff King [this message]
2020-01-30  1:39   ` Jonathan Tan
2020-01-30  5:51     ` Jeff King
2020-01-30 23:20       ` Jonathan Tan
2020-01-31  0:14         ` Jeff King
2020-01-31 18:08           ` Junio C Hamano
2020-02-01 11:29             ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200129050935.GA601903@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.