[PATCH v3 0/4] Convert index writes to use hashfile API

From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: gitster@pobox.com, peff@peff.net, stolee@gmail.com,
	git@jeffhostetler.com, Derrick Stolee <derrickstolee@github.com>
Subject: [PATCH v3 0/4] Convert index writes to use hashfile API
Date: Tue, 18 May 2021 18:32:44 +0000	[thread overview]
Message-ID: <pull.916.v3.git.1621362768.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.916.v2.git.1621254292.gitgitgadget@gmail.com>

As I prepare some ideas on index v5, one thing that strikes me as an
interesting direction to try is to use the chunk-format API. This would make
our extension model extremely simple (they become optional chunks, easily
identified by the table of contents).

But there is a huge hurdle to even starting that investigation: the index
uses its own hashing methods, separate from the hashfile API in csum-file.c!

The internals of the algorithms are mostly identical. The only possible
change is that the buffer sizes are different: 8KB for hashfile and 128KB in
read-cache.c. This change is actually relatively recent! I was able to see a
performance difference before changing that size, so I do make them equal.
This improves some other commands, too, such as git multi-pack-index write.

There is a subtle point about how the EOIE extension works in that it needs
a hash of just the previous extension headers. This won't be needed by the
chunk-format API, so leave this mostly as-is, except where it collides
somewhat with the hashfile changes.

Updates in v3
=============

 * Fixed typos in commit messages.

 * Updated buffer and check_buffer to be heap-allocated members of struct
   hashfile.

 * These buffers have variable size, allowing the uses that calculate a
   throughput progress bar can still use an 8k buffer, while the others grow
   to 128k.

 * The heap-allocated check_buffer allows the hashfile API to remain as
   thread safe as before (parallel threads can write to independent
   hashfiles, not the same one).

Updates in v2
=============

 * Sorry for the false start on v1. I'm correctly testing index writes this
   time, but I also verified other commands, such as "git add .", to be sure
   I was exercising everything. I also previously had incorrect assumptions
   about the EOIE extension, since it is not triggered with "small" repos
   like the Linux kernel repository ;).

 * This version is rebased onto a recent version of 'master' so the
   conflicts with ds/sparse-index-protections are resolved.

 * This version abandons the nested hashfile concept, since that was only
   needed for the incorrect interpretation of the EOIE extension's hash.

 * This version does unify the buffer size to 128 KB with justification.

 * There is also some more unification of writing logic to use
   write_in_full() in the hashfile API.

 * When using check_fd, the hashfile API is no longer thread safe due to a
   stack-allocated buffer moving to the global scope. I'm not sure if there
   is a better way to handle this case.

Thanks, -Stolee

Derrick Stolee (4):
  hashfile: use write_in_full()
  csum-file.h: increase hashfile buffer size
  read-cache: use hashfile instead of git_hash_ctx
  read-cache: delete unused hashing methods

 chunk-format.c |  12 ++--
 csum-file.c    |  94 +++++++++++++++---------
 csum-file.h    |   4 +-
 read-cache.c   | 191 ++++++++++++++++---------------------------------
 4 files changed, 134 insertions(+), 167 deletions(-)

base-commit: 97eea85a0a1ec66d356567808a1e4ca2367e0ce7
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-916%2Fderrickstolee%2Findex-hashfile-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-916/derrickstolee/index-hashfile-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/916

Range-diff vs v2:

 1:  b3e578ac0365 = 1:  b3e578ac0365 hashfile: use write_in_full()
 2:  9dc602f6c422 ! 2:  64ffddd79116 csum-file.h: increase hashfile buffer size
     @@ Commit message
          hashing buffers in read-cache.c, but that code was updated from 8KB to
          128KB in f279894 (read-cache: make the index write buffer size 128K,
          2021-02-18). The justification there was that do_write_index() improves
     -    from 1.02s to 0.72s.
     +    from 1.02s to 0.72s. Since our end goal is to have the index writing
     +    code use the hashfile API, we need to unify this buffer size to avoid a
     +    performance regression.

     -    There is a buffer, check_buffer, that is used to verify the check_fd
     +    There is a buffer, 'check_buffer', that is used to verify the check_fd
          file descriptor. When this buffer increases to 128K to fit the data
          being flushed, it causes the stack to overflow the limits placed in the
     -    test suite. By moving this to a static buffer, we stop using stack data
     -    for this purpose, but we lose some thread-safety. This change makes it
     -    unsafe to write to multiple hashfiles across different threads.
     +    test suite. To avoid issues with stack size, move both 'buffer' and
     +    'check_buffer' to be heap pointers within 'struct hashfile'. The
     +    'check_buffer' member is left as NULL unless check_fd is set in
     +    hashfd_check(). Both buffers are cleared as part of finalize_hashfile()
     +    which also frees the full structure.
     +
     +    Since these buffers are now on the heap, we can adjust their size based
     +    on the needs of the consumer. In particular, callers to
     +    hashfd_throughput() are expecting to report progress indicators as the
     +    buffer flushes. These callers would prefer the smaller 8k buffer to
     +    avoid large delays between updates, especially for users with slower
     +    networks. When the progress indicator is not used, the larger buffer is
     +    preferrable.

          By adding a new trace2 region in the chunk-format API, we can see that
          the writing portion of 'git multi-pack-index write' lowers from ~1.49s
     @@ csum-file.c
      +				 const void *buf,
      +				 unsigned int count)
      +{
     -+	static unsigned char check_buffer[WRITE_BUFFER_SIZE];
     -+	ssize_t ret = read_in_full(f->check_fd, check_buffer, count);
     ++	ssize_t ret = read_in_full(f->check_fd, f->check_buffer, count);
      +
      +	if (ret < 0)
      +		die_errno("%s: sha1 file read error", f->name);
      +	if (ret != count)
      +		die("%s: sha1 file truncated", f->name);
     -+	if (memcmp(buf, check_buffer, count))
     ++	if (memcmp(buf, f->check_buffer, count))
      +		die("sha1 file '%s' validation error", f->name);
      +}
      +
     @@ csum-file.c

       	if (write_in_full(f->fd, buf, count) < 0) {
       		if (errno == ENOSPC)
     -
     - ## csum-file.h ##
     -@@
     - 
     - struct progress;
     +@@ csum-file.c: void hashflush(struct hashfile *f)
     + 	}
     + }

     -+#define WRITE_BUFFER_SIZE (128 * 1024)
     ++static void free_hashfile(struct hashfile *f)
     ++{
     ++	free(f->buffer);
     ++	free(f->check_buffer);
     ++	free(f);
     ++}
      +
     - /* A SHA1-protected file */
     - struct hashfile {
     + int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int flags)
     + {
       	int fd;
     +@@ csum-file.c: int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int fl
     + 		if (close(f->check_fd))
     + 			die_errno("%s: sha1 file error on close", f->name);
     + 	}
     +-	free(f);
     ++	free_hashfile(f);
     + 	return fd;
     + }
     + 
     + void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
     + {
     + 	while (count) {
     +-		unsigned left = sizeof(f->buffer) - f->offset;
     ++		unsigned left = f->buffer_len - f->offset;
     + 		unsigned nr = count > left ? left : count;
     + 
     + 		if (f->do_crc)
     + 			f->crc32 = crc32(f->crc32, buf, nr);
     + 
     +-		if (nr == sizeof(f->buffer)) {
     ++		if (nr == f->buffer_len) {
     + 			/*
     + 			 * Flush a full batch worth of data directly
     + 			 * from the input, skipping the memcpy() to
     +@@ csum-file.c: void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
     + 	}
     + }
     + 
     +-struct hashfile *hashfd(int fd, const char *name)
     +-{
     +-	return hashfd_throughput(fd, name, NULL);
     +-}
     +-
     + struct hashfile *hashfd_check(const char *name)
     + {
     + 	int sink, check;
     +@@ csum-file.c: struct hashfile *hashfd_check(const char *name)
     + 		die_errno("unable to open '%s'", name);
     + 	f = hashfd(sink, name);
     + 	f->check_fd = check;
     ++	f->check_buffer = xmalloc(f->buffer_len);
     ++
     + 	return f;
     + }
     + 
     +-struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp)
     ++static struct hashfile *hashfd_internal(int fd, const char *name,
     ++					struct progress *tp,
     ++					size_t buffer_len)
     + {
     + 	struct hashfile *f = xmalloc(sizeof(*f));
     + 	f->fd = fd;
     +@@ csum-file.c: struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
     + 	f->name = name;
     + 	f->do_crc = 0;
     + 	the_hash_algo->init_fn(&f->ctx);
     ++
     ++	f->buffer_len = buffer_len;
     ++	f->buffer = xmalloc(buffer_len);
     ++	f->check_buffer = NULL;
     ++
     + 	return f;
     + }
     + 
     ++struct hashfile *hashfd(int fd, const char *name)
     ++{
     ++	/*
     ++	 * Since we are not going to use a progress meter to
     ++	 * measure the rate of data passing through this hashfile,
     ++	 * use a larger buffer size to reduce fsync() calls.
     ++	 */
     ++	return hashfd_internal(fd, name, NULL, 128 * 1024);
     ++}
     ++
     ++struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp)
     ++{
     ++	/*
     ++	 * Since we are expecting to report progress of the
     ++	 * write into this hashfile, use a smaller buffer
     ++	 * size so the progress indicators arrive at a more
     ++	 * frequent rate.
     ++	 */
     ++	return hashfd_internal(fd, name, tp, 8 * 1024);
     ++}
     ++
     + void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
     + {
     + 	hashflush(f);
     +
     + ## csum-file.h ##
      @@ csum-file.h: struct hashfile {
       	const char *name;
       	int do_crc;
       	uint32_t crc32;
      -	unsigned char buffer[8192];
     -+	unsigned char buffer[WRITE_BUFFER_SIZE];
     ++	size_t buffer_len;
     ++	unsigned char *buffer;
     ++	unsigned char *check_buffer;
       };

       /* Checkpoint */
 3:  b94172ccf5e9 ! 3:  afd278bef3ab read-cache: use hashfile instead of git_hash_ctx
     @@ Commit message
          codebase, I figured that these roughly equivalent code paths were never
          unified only because it got lost in the shuffle. The hashfile API has
          since been used extensively in other file formats, such as pack-indexes,
     -    mult-pack-indexes, and commit-graphs. Therefore, it seems prudent to
     +    multi-pack-indexes, and commit-graphs. Therefore, it seems prudent to
          unify the index writing code to use the same mechanism.

          I discovered this disparity while trying to create a new index format
     @@ Commit message
          temporary file with writes to the hashfile struct. This takes care of
          many of the direct interactions with the_hash_algo.

     -    There are still some remaining: the extension headers are hashed for use
     -    in the End of Index Entries (EOIE) extension. This use of the
     -    git_hash_ctx is left as-is. There are multiple reasons to not use a
     -    hashfile here, including the fact that the data is not actually writing
     -    to a file, just a hash computation. These hashes do not block our
     -    adoption of the chunk-format API in a future change to the index, so
     +    There are still some git_hash_ctx uses remaining: the extension headers
     +    are hashed for use in the End of Index Entries (EOIE) extension. This
     +    use of the git_hash_ctx is left as-is. There are multiple reasons to not
     +    use a hashfile here, including the fact that the data is not actually
     +    writing to a file, just a hash computation. These hashes do not block
     +    our adoption of the chunk-format API in a future change to the index, so
          leave it as-is.

          The internals of the algorithms are mostly identical. Previously, the
     @@ Commit message
          the index data to be exactly equal. Finally, I did extensive performance
          testing of "git update-index --force-write" on repos of various sizes,
          including one with over 2 million paths at HEAD. These tests
     -    demonstrated less than 1% difference in behavior, so the performance
     -    should be considered identical.
     +    demonstrated less than 1% difference in behavior. As expected, the
     +    performance should be considered unchanged. The previous changes to
     +    increase the hashfile buffer size from 8K to 128K ensured this change
     +    would not create a peformance regression.

          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

 4:  4b3814eb4c80 = 4:  42fb10fb2998 read-cache: delete unused hashing methods

-- 
gitgitgadget