git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Jeff King <peff@peff.net>, Neeraj Singh <nksingh85@gmail.com>,
	Git Mailing List <git@vger.kernel.org>,
	Jonathan Nieder <jrnieder@gmail.com>
Subject: Re: What's cooking in git.git (Sep 2021, #08; Mon, 27)
Date: Fri, 1 Oct 2021 11:12:22 -0700	[thread overview]
Message-ID: <CABPp-BEhLQ3b8OCn3EbTa=g7uh_YXmkQuuGazn=OsT=bG3ot1w@mail.gmail.com> (raw)
In-Reply-To: <xmqq7dew8yq6.fsf@gitster.g>

On Fri, Oct 1, 2021 at 10:02 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > Said another way, I don't think anything should be writing a critical
> > file that needs to be durable when we're in the middle of a
> > "read-only" process like git-log.  The only things written should be
> > temporary stuff, like the automatic remerge calculation from
> > merge-ort, the textconv cache optimization stuff, or perhaps future
> > gitattributes transformation caching.  All that stuff can safely be
> > blown away at the completion of each merge.
>
> The textconv cache can be populated/written during "git log -p" into
> the object store to persist.  With or without "--remerge-diff", we
> can make design decision to either
>
>  - use temporary object store to discard everything we create at the
>    end in one-go, or
>
>  - write them into the object store to let later gc to handle the
>    crufts.
>
> The former will disable persistent write access to the cache.  It
> still allows accesses the cached data during the same process,
> though.  We so far deemed that textconv cache, when the user enables
> it, is valuable enough to make persistent.  Perhaps remerge-diff's
> tentative merge result may fall into the same category?  Some folks
> may want to cache, others may not.
>
> If we were to use the same notes-cache mechanism, we record the tree
> object (perhaps the object name) as the cached value for the merge
> commit in question.  Hopefully most of the merges are clean merges,
> and "caching" the results of the recreation of the merge would cost
> almost nothing.  We need objects to record the fact that "this merge
> has cached result" in the notes-cache, but the tree object that
> represents the cached result is already in the history the merge
> belongs to.

I'm not sure this performance caching for remerge-diff makes any
sense, for multiple reasons:

1) Does the cache become invalidated when the merge algorithm changes
or when config options change (--remerge-diff shows the difference
between what an automatic merge would do, but an automatic merge's
results depend on the current algorithm and various merge.*, diff.*,
etc. config options the user can set).  Do we re-cache each variant?
Do we expire the cache at some point?

2) The performance aspect seems suspect at best, and likely to backfire.  Badly:

2a) Let's say there's 100,000 merge commits in your history (so small
history roughly the size of linux.git).  `git log --remerge-diff` thus
repeats 100,000 merges, creating loose objects for them.  Perhaps only
1 in 10 merges needs any loose objects created (because the rest of
the merges were clean).  But those that do have conflicts will need to
create both new blobs and any necessary new trees that use those new
blobs.  Who knows how many blobs and trees are needed, but let's just
use 10 total blobs+trees per merge that has conflicts as a guesstimate
of the average.  Thus `git log --remerge-diff` will need to create
100,000 * 1/10 * 10 = 100,000 loose objects while it runs.  Preserving
all those files to cache the results slows things down
considerably...until they are gc'ed.  If we do the notes-cache thing,
then yes these objects would become packed.  Perhaps you're suggested
doing an automatic gc (or several) as part of running --remerge-diff,
but that's super slow itself.  When I implemented that, it felt worse
to me than just letting loose objects pile up.

2b) Note that my implementation didn't just clear out the tmp-objdir
at the _end_ of the remerge-diff (which was the easiest to implement),
but cleared it out after each merge, because that was much faster with
as many objects as can accumulate.

2c) Is --remerge-diff slow enough to even merit the complicated effort
of caching results?  --remerge-diff is faster than --cc in my testing,
so why is it even worth the pain to cache?



So with that out of the way, let's return to discussing the textconv
cache.  If the remerge-diff results aren't cached, isn't it unsafe to
allow the textconv cache to persist anything while remerge-diff is
running because it could create corruption?  We could still let the
textconv cache persist results when remerge-diff is not in use, but I
think we _want_ anything written by the textconv cache during
remerge-diff to be thrown away because there's a risk it references a
temporary object created by remerge-diff that will be deleted.

With all that in mind, to me, it seems like using the tmp-objdir as
the primary object store in combination with the refs quarantine is
actually the safest solution for this usecase.  Am I missing
something?

  parent reply	other threads:[~2021-10-01 18:12 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-28  0:52 What's cooking in git.git (Sep 2021, #08; Mon, 27) Junio C Hamano
2021-09-28  1:57 ` Ævar Arnfjörð Bjarmason
2021-09-28 20:52   ` Junio C Hamano
2021-09-28  6:46 ` Elijah Newren
2021-09-28  7:45   ` Ævar Arnfjörð Bjarmason
2021-09-28 17:25     ` Junio C Hamano
2021-09-28 21:00       ` Neeraj Singh
2021-09-28 23:34         ` Junio C Hamano
2021-09-28 23:53           ` Neeraj Singh
2021-10-07 22:01             ` Junio C Hamano
2021-10-08  6:51               ` Elijah Newren
2021-10-08 22:30                 ` Neeraj Singh
2021-10-08 23:01                 ` Junio C Hamano
2021-09-28  8:07   ` Ævar Arnfjörð Bjarmason
2021-09-28 17:27     ` Junio C Hamano
2021-09-28 13:31   ` Derrick Stolee
2021-09-28 17:33     ` Junio C Hamano
2021-09-28 20:16       ` Derrick Stolee
2021-09-28 17:16   ` Junio C Hamano
2021-09-29  6:42     ` Elijah Newren
2021-09-28 23:40   ` Jeff King
2021-09-28 23:49     ` Jeff King
2021-09-29 18:43     ` Neeraj Singh
2021-09-30  8:16       ` Jeff King
2021-10-01  7:50         ` Elijah Newren
2021-10-01 17:02           ` Junio C Hamano
2021-10-01 17:39             ` Neeraj Singh
2021-10-01 18:15               ` Elijah Newren
2021-10-01 18:12             ` Elijah Newren [this message]
2021-10-01 22:02               ` Junio C Hamano
2021-10-01 23:05                 ` Elijah Newren
2021-10-04 13:45     ` Elijah Newren
2021-09-28  8:22 ` da/difftool (was: Re: What's cooking in git.git (Sep 2021, #08; Mon, 27)) Ævar Arnfjörð Bjarmason
2021-09-28  8:23 ` ns/batched-fsync & en/remerge-diff (was " Ævar Arnfjörð Bjarmason
2021-09-28  8:31 ` sg/test-split-index-fix " Ævar Arnfjörð Bjarmason
2021-09-28  8:35 ` hn/reftable (Re: " Ævar Arnfjörð Bjarmason
2021-09-28 12:18   ` Han-Wen Nienhuys
2021-09-30  5:06     ` Carlo Arenas
2021-09-29  8:12 ` What's cooking in git.git (Sep 2021, #08; Mon, 27) Fabian Stelzer
2021-09-30 21:26   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CABPp-BEhLQ3b8OCn3EbTa=g7uh_YXmkQuuGazn=OsT=bG3ot1w@mail.gmail.com' \
    --to=newren@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jrnieder@gmail.com \
    --cc=nksingh85@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).