From: Junio C Hamano <gitster@pobox.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
git@vger.kernel.org, peff@peff.net, me@ttaylorr.com,
Derrick Stolee <derrickstolee@github.com>,
Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH] csum-file: flush less often
Date: Thu, 25 Mar 2021 11:46:20 -0700 [thread overview]
Message-ID: <xmqq8s6bvzpf.fsf@gitster.g> (raw)
In-Reply-To: <84ccabca-0bd3-d0cb-6b38-f96d75c0bbd6@gmail.com> (Derrick Stolee's message of "Thu, 25 Mar 2021 07:55:55 -0400")
Derrick Stolee <stolee@gmail.com> writes:
> But...is there something we could still do here?
>
> My confusion about flushing is mostly due to my error, but upon
> reflection the loop is doing a lot of different things, but most of
> the time we know which behavior we need at the start, in the middle,
> and at the end:
>
> 1. Fill the existing buffer with the beginning of 'buf'. If the
> hashfile's buffer is full, then flush.
"But do not do this if f->buffer is empty, and we are writing out
more than sizeof(f->buffer)." is missing, isn't it?
> 2. Flush sizeof(f->buffer) chunks directly out of 'buf' as long as
> possible.
>
> 3. Copy the remaining bytes out of 'buf' into the hashfile's buffer.
It is debatable if the existing loop, which came mostly from Nico's
a8032d12 (sha1write: don't copy full sized buffers, 2008-09-02), is
too clever; I personally find it concise and readable enough, but my
reading is tainted.
If you use a couple of helpers to reduce the repeated "crc and hash"
pattern in your variant, it may become easier to follow than the
original, but I dunno.
> Here is a rewrite that more explicitly follows this flow:
>
> void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
> {
> const int full_buffer = sizeof(f->buffer);
> unsigned left = full_buffer - f->offset;
> unsigned nr = count > left ? left : count;
>
> /*
> * Initially fill the buffer in a batch until it
> * is full, then flush.
> */
> if (f->do_crc)
> f->crc32 = crc32(f->crc32, buf, nr);
>
> memcpy(f->buffer + f->offset, buf, nr);
Here, if the f->buffer was empty, we end up memcpy a full bufferful
unconditionally. Nico's original cleverly takes advantage of the
fact that 'nr' would be the full buffer size only when the f->buffer
was empty upon entry to the function and we have more byte than the
size of the buffer to copy out directly from 'buf'.
> f->offset += nr;
> count -= nr;
> buf = (char *) buf + nr;
>
> if (left == nr)
> hashflush(f);
>
> /*
> * After filling the hashfile's buffer and flushing, take
> * batches of full_buffer bytes directly from the input
> * buffer.
> */
> while (count >= full_buffer) {
> if (f->do_crc)
> f->crc32 = crc32(f->crc32, buf, full_buffer);
>
> the_hash_algo->update_fn(&f->ctx, buf, full_buffer);
> flush(f, buf, full_buffer);
>
> count -= full_buffer;
> buf = (char *) buf + full_buffer;
> }
>
> /*
> * Capture any remaining bytes at the end of the input buffer
> * into the hashfile's buffer. We do not need to flush because
> * count is strictly less than full_buffer here.
> */
> if (count) {
> if (f->do_crc)
> f->crc32 = crc32(f->crc32, buf, count);
>
> memcpy(f->buffer + f->offset, buf, count);
> f->offset = count;
> }
>
> if (f->base)
> hashwrite(f->base, buf, count);
> }
> ...
> So, I'm of two minds here:
>
> 1. This is embarassing. I wasted everyone's time for nothing. I can retract
> this patch.
>
> 2. This is embarassing. I overstated the problem here. But we might be able
> to eke out a tiny performance boost here.
>
> I'm open to either. I think we should default to dropping this patch unless
> someone thinks the rewrite above is a better organization of the logic. (I
> can then send a v2 including that version and an updated commit message.)
3. The current code around "if (nr == sizeof(f->buffer))" might be a
bit too clever for readers who try to understand what is going
on, and the whole "while" loop may deserve a comment based on
what you wrote before your replacement implementation.
next prev parent reply other threads:[~2021-03-25 18:47 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-03-24 17:50 [PATCH] csum-file: flush less often Derrick Stolee via GitGitGadget
2021-03-25 11:55 ` Derrick Stolee
2021-03-25 18:46 ` Junio C Hamano [this message]
2021-03-25 18:52 ` Junio C Hamano
2021-03-26 3:16 ` Jeff King
2021-03-26 12:38 ` [PATCH v2] csum-file: make hashwrite() more readable Derrick Stolee via GitGitGadget
2021-03-26 21:38 ` Junio C Hamano
2021-03-28 8:38 ` Jeff King
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=xmqq8s6bvzpf.fsf@gitster.g \
--to=gitster@pobox.com \
--cc=derrickstolee@github.com \
--cc=dstolee@microsoft.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=me@ttaylorr.com \
--cc=peff@peff.net \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).