git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] Convert index writes to use hashfile API
@ 2021-03-26 19:12 Derrick Stolee via GitGitGadget
  2021-03-26 19:12 ` [PATCH 1/3] csum-file: add nested_hashfile() Derrick Stolee via GitGitGadget
                   ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-26 19:12 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, stolee, git, Derrick Stolee

As I prepare some ideas on index v5, one thing that strikes me as an
interesting direction to try is to use the chunk-format API. This would make
our extension model extremely simple (they become optional chunks, easily
identified by the table of contents).

But there is a huge hurdle to even starting that investigation: the index
uses its own hashing methods, separate from the hashfile API in csum-file.c!

The internals of the algorithms are mostly identical. The only possible
change is that the buffer sizes are different: 8KB for hashfile and 128KB in
read-cache.c. I was unable to find a performance difference in these two
implementations, despite testing on several repo sizes.

There is a subtle point about how the EOIE extension works in that it needs
a hash of just the previous extension data. This is solved by adding a new
"nested hashfile" mechanism that computes the hash at one level and then
passes the data below to another hashfile. (The good news is that this
extension will not need to exist at all if we use the chunk-format API to
manage extensions.)

Thanks, -Stolee

Derrick Stolee (3):
  csum-file: add nested_hashfile()
  read-cache: use hashfile instead of git_hash_ctx
  read-cache: delete unused hashing methods

 csum-file.c  |  22 +++++++
 csum-file.h  |   9 +++
 read-cache.c | 182 ++++++++++++++++-----------------------------------
 3 files changed, 89 insertions(+), 124 deletions(-)


base-commit: 142430338477d9d1bb25be66267225fb58498d92
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-916%2Fderrickstolee%2Findex-hashfile-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-916/derrickstolee/index-hashfile-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/916
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/3] csum-file: add nested_hashfile()
  2021-03-26 19:12 [PATCH 0/3] Convert index writes to use hashfile API Derrick Stolee via GitGitGadget
@ 2021-03-26 19:12 ` Derrick Stolee via GitGitGadget
  2021-03-26 19:12 ` [PATCH 2/3] read-cache: use hashfile instead of git_hash_ctx Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 27+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-26 19:12 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, stolee, git, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The index writing code in do_write_index() uses a custom set of hashing
code, in part because it was introduced before the hashfile API. But
also, the End of Index Entries extension computes a hash of just the
extension data, not the entire file preceding that extension.

Before converting the index writing code to use the hashfile API, create
a concept of a "nested hashfile". By adding a 'base' member to 'struct
hashfile', we indicate that any writes to this hashfile should be passed
along to the base hashfile, too.

In the next change, the index code will use this to create a new
hashfile wose base is the hashfile for the index. The outer hashfile
will compute the hash just for the extension details. Thus, it will
finalize earlier than the base hashfile, hence there is no modification
to finalize_hashfile() here.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 csum-file.c | 22 ++++++++++++++++++++++
 csum-file.h |  9 +++++++++
 2 files changed, 31 insertions(+)

diff --git a/csum-file.c b/csum-file.c
index 0f35fa5ee47c..e73b35316e66 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -13,6 +13,9 @@
 
 static void flush(struct hashfile *f, const void *buf, unsigned int count)
 {
+	if (f->base)
+		return;
+
 	if (0 <= f->check_fd && count)  {
 		unsigned char check_buffer[8192];
 		ssize_t ret = read_in_full(f->check_fd, check_buffer, count);
@@ -116,6 +119,9 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 		}
 		f->offset = offset;
 	}
+
+	if (f->base)
+		hashwrite(f->base, buf, count);
 }
 
 struct hashfile *hashfd(int fd, const char *name)
@@ -150,6 +156,7 @@ struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
 	f->name = name;
 	f->do_crc = 0;
 	the_hash_algo->init_fn(&f->ctx);
+	f->base = NULL;
 	return f;
 }
 
@@ -184,3 +191,18 @@ uint32_t crc32_end(struct hashfile *f)
 	f->do_crc = 0;
 	return f->crc32;
 }
+
+struct hashfile *nested_hashfile(struct hashfile *f)
+{
+	struct hashfile *n = xmalloc(sizeof(*f));
+	n->fd = -1;
+	n->check_fd = -1;
+	n->offset = 0;
+	n->total = 0;
+	n->tp = NULL;
+	n->name = NULL;
+	n->do_crc = 0;
+	the_hash_algo->init_fn(&n->ctx);
+	n->base = f;
+	return n;
+}
diff --git a/csum-file.h b/csum-file.h
index e54d53d1d0b3..b8785e7ecb46 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -16,6 +16,7 @@ struct hashfile {
 	const char *name;
 	int do_crc;
 	uint32_t crc32;
+	struct hashfile *base;
 	unsigned char buffer[8192];
 };
 
@@ -42,6 +43,14 @@ void hashflush(struct hashfile *f);
 void crc32_begin(struct hashfile *);
 uint32_t crc32_end(struct hashfile *);
 
+/*
+ * A nested hashfile uses the same interface as a hashfile, and computes
+ * a hash for the input bytes while passing them to the base hashfile
+ * instead of writing them to its own file. This is useful for computing
+ * a hash of a region within a file during the write.
+ */
+struct hashfile *nested_hashfile(struct hashfile *f);
+
 /*
  * Returns the total number of bytes fed to the hashfile so far (including ones
  * that have not been written out to the descriptor yet).
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 2/3] read-cache: use hashfile instead of git_hash_ctx
  2021-03-26 19:12 [PATCH 0/3] Convert index writes to use hashfile API Derrick Stolee via GitGitGadget
  2021-03-26 19:12 ` [PATCH 1/3] csum-file: add nested_hashfile() Derrick Stolee via GitGitGadget
@ 2021-03-26 19:12 ` Derrick Stolee via GitGitGadget
  2021-03-29 15:04   ` Derrick Stolee
  2021-03-26 19:12 ` [PATCH 3/3] read-cache: delete unused hashing methods Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 27+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-26 19:12 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, stolee, git, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The do_write_index() method in read-cache.c has its own hashing logic
and buffering mechanism. Specifically, the ce_write() method was
introduced by 4990aadc (Speed up index file writing by chunking it
nicely, 2005-04-20) and similar mechanisms were introduced a few months
later in c38138cd (git-pack-objects: write the pack files with a SHA1
csum, 2005-06-26). Based on the timing, in the early days of the Git
codebase, I figured that these roughly equivalent code paths were never
unified only because it got lost in the shuffle. The hashfile API has
since been used extensively in other file formats, such as pack-indexes,
mult-pack-indexes, and commit-graphs. Therefore, it seems prudent to
unify the index writing code to use the same mechanism.

I discovered this disparity while trying to create a new index format
that uses the chunk-format API. That API uses a hashfile as its base, so
it is incompatible with the custom code in read-cache.c.

This rewrite of the logic is rather straightforward, except for the
special case of creating a nested hashfile to handle computing the hash
of the extension data just for the End of Index Entries extension. The
previous change introduced the concept for just this purpose.

The internals of the algorithms are mostly identical. The only
meaningful change is that the buffer sizes are different: 8KB for
hashfile and 128KB in read-cache.c. I was unable to find a performance
difference in these two implementations, despite testing on several repo
sizes. I also tried adjusting the buffer size of the hashfile struct for
a variety of sizes between 8KB and 128KB, and did not see a performance
change for any of the commands that currently use hashfiles.

Some static methods become orphaned in this diff, so I marked them as
MAYBE_UNUSED. The diff is much harder to read if they are deleted during
this change. Instead, they will be deleted in the following change.

In addition to the test suite passing, I computed indexes using the
previous binaries and the binaries compiled after this change, and found
the index data to be exactly equal. Finally, I did extensive performance
testing of "git update-index --really-refresh" on repos of various
sizes, including one with over 2 million paths at HEAD. These tests
demonstrated less than 1% difference in behavior, so the performance
should be considered identical.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 read-cache.c | 126 +++++++++++++++++++++++++--------------------------
 1 file changed, 62 insertions(+), 64 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index 5a907af2fb52..b9916350f331 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -25,6 +25,7 @@
 #include "fsmonitor.h"
 #include "thread-utils.h"
 #include "progress.h"
+#include "csum-file.h"
 
 /* Mask for the name length in ce_flags in the on-disk index */
 
@@ -1957,7 +1958,7 @@ static struct index_entry_offset_table *read_ieot_extension(const char *mmap, si
 static void write_ieot_extension(struct strbuf *sb, struct index_entry_offset_table *ieot);
 
 static size_t read_eoie_extension(const char *mmap, size_t mmap_size);
-static void write_eoie_extension(struct strbuf *sb, git_hash_ctx *eoie_context, size_t offset);
+static void write_eoie_extension(struct strbuf *sb, const unsigned char *hash, size_t offset);
 
 struct load_index_extensions
 {
@@ -2470,6 +2471,7 @@ int repo_index_has_changes(struct repository *repo,
 static unsigned char write_buffer[WRITE_BUFFER_SIZE];
 static unsigned long write_buffer_len;
 
+MAYBE_UNUSED
 static int ce_write_flush(git_hash_ctx *context, int fd)
 {
 	unsigned int buffered = write_buffer_len;
@@ -2482,6 +2484,7 @@ static int ce_write_flush(git_hash_ctx *context, int fd)
 	return 0;
 }
 
+MAYBE_UNUSED
 static int ce_write(git_hash_ctx *context, int fd, void *data, unsigned int len)
 {
 	while (len) {
@@ -2504,19 +2507,14 @@ static int ce_write(git_hash_ctx *context, int fd, void *data, unsigned int len)
 	return 0;
 }
 
-static int write_index_ext_header(git_hash_ctx *context, git_hash_ctx *eoie_context,
-				  int fd, unsigned int ext, unsigned int sz)
+static int write_index_ext_header(struct hashfile *f, unsigned int ext, unsigned int sz)
 {
-	ext = htonl(ext);
-	sz = htonl(sz);
-	if (eoie_context) {
-		the_hash_algo->update_fn(eoie_context, &ext, 4);
-		the_hash_algo->update_fn(eoie_context, &sz, 4);
-	}
-	return ((ce_write(context, fd, &ext, 4) < 0) ||
-		(ce_write(context, fd, &sz, 4) < 0)) ? -1 : 0;
+	hashwrite_be32(f, ext);
+	hashwrite_be32(f, sz);
+	return 0;
 }
 
+MAYBE_UNUSED
 static int ce_flush(git_hash_ctx *context, int fd, unsigned char *hash)
 {
 	unsigned int left = write_buffer_len;
@@ -2618,11 +2616,10 @@ static void copy_cache_entry_to_ondisk(struct ondisk_cache_entry *ondisk,
 	}
 }
 
-static int ce_write_entry(git_hash_ctx *c, int fd, struct cache_entry *ce,
+static int ce_write_entry(struct hashfile *f, struct cache_entry *ce,
 			  struct strbuf *previous_name, struct ondisk_cache_entry *ondisk)
 {
 	int size;
-	int result;
 	unsigned int saved_namelen;
 	int stripped_name = 0;
 	static unsigned char padding[8] = { 0x00 };
@@ -2638,11 +2635,9 @@ static int ce_write_entry(git_hash_ctx *c, int fd, struct cache_entry *ce,
 	if (!previous_name) {
 		int len = ce_namelen(ce);
 		copy_cache_entry_to_ondisk(ondisk, ce);
-		result = ce_write(c, fd, ondisk, size);
-		if (!result)
-			result = ce_write(c, fd, ce->name, len);
-		if (!result)
-			result = ce_write(c, fd, padding, align_padding_size(size, len));
+		hashwrite(f, ondisk, size);
+		hashwrite(f, ce->name, len);
+		hashwrite(f, padding, align_padding_size(size, len));
 	} else {
 		int common, to_remove, prefix_size;
 		unsigned char to_remove_vi[16];
@@ -2656,13 +2651,10 @@ static int ce_write_entry(git_hash_ctx *c, int fd, struct cache_entry *ce,
 		prefix_size = encode_varint(to_remove, to_remove_vi);
 
 		copy_cache_entry_to_ondisk(ondisk, ce);
-		result = ce_write(c, fd, ondisk, size);
-		if (!result)
-			result = ce_write(c, fd, to_remove_vi, prefix_size);
-		if (!result)
-			result = ce_write(c, fd, ce->name + common, ce_namelen(ce) - common);
-		if (!result)
-			result = ce_write(c, fd, padding, 1);
+		hashwrite(f, ondisk, size);
+		hashwrite(f, to_remove_vi, prefix_size);
+		hashwrite(f, ce->name + common, ce_namelen(ce) - common);
+		hashwrite(f, padding, 1);
 
 		strbuf_splice(previous_name, common, to_remove,
 			      ce->name + common, ce_namelen(ce) - common);
@@ -2672,7 +2664,7 @@ static int ce_write_entry(git_hash_ctx *c, int fd, struct cache_entry *ce,
 		ce->ce_flags &= ~CE_STRIP_NAME;
 	}
 
-	return result;
+	return 0;
 }
 
 /*
@@ -2784,8 +2776,7 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 			  int strip_extensions)
 {
 	uint64_t start = getnanotime();
-	int newfd = tempfile->fd;
-	git_hash_ctx c, eoie_c;
+	struct hashfile *f, *eoie_f;
 	struct cache_header hdr;
 	int i, err = 0, removed, extended, hdr_version;
 	struct cache_entry **cache = istate->cache;
@@ -2799,6 +2790,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	struct index_entry_offset_table *ieot = NULL;
 	int nr, nr_threads;
 
+	f = hashfd(tempfile->fd, tempfile->filename.buf);
+
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
 			removed++;
@@ -2827,9 +2820,7 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	hdr.hdr_version = htonl(hdr_version);
 	hdr.hdr_entries = htonl(entries - removed);
 
-	the_hash_algo->init_fn(&c);
-	if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0)
-		return -1;
+	hashwrite(f, &hdr, sizeof(hdr));
 
 	if (!HAVE_THREADS || git_config_get_index_threads(&nr_threads))
 		nr_threads = 1;
@@ -2864,12 +2855,12 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		}
 	}
 
-	offset = lseek(newfd, 0, SEEK_CUR);
+	offset = lseek(f->fd, 0, SEEK_CUR);
 	if (offset < 0) {
 		free(ieot);
 		return -1;
 	}
-	offset += write_buffer_len;
+
 	nr = 0;
 	previous_name = (hdr_version == 4) ? &previous_name_buf : NULL;
 
@@ -2904,14 +2895,14 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 			if (previous_name)
 				previous_name->buf[0] = 0;
 			nr = 0;
-			offset = lseek(newfd, 0, SEEK_CUR);
+
+			offset = lseek(f->fd, 0, SEEK_CUR);
 			if (offset < 0) {
 				free(ieot);
 				return -1;
 			}
-			offset += write_buffer_len;
 		}
-		if (ce_write_entry(&c, newfd, ce, previous_name, (struct ondisk_cache_entry *)&ondisk) < 0)
+		if (ce_write_entry(f, ce, previous_name, (struct ondisk_cache_entry *)&ondisk) < 0)
 			err = -1;
 
 		if (err)
@@ -2930,14 +2921,21 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		return err;
 	}
 
-	/* Write extension data here */
-	offset = lseek(newfd, 0, SEEK_CUR);
+	offset = lseek(f->fd, 0, SEEK_CUR);
 	if (offset < 0) {
 		free(ieot);
 		return -1;
 	}
-	offset += write_buffer_len;
-	the_hash_algo->init_fn(&eoie_c);
+
+	/*
+	 * The extensions must be hashed on their own for use in the EOIE
+	 * extension. Use a nested hashfile to compute the hash for this
+	 * region while passing the buffer to the original hashfile.
+	 */
+	if (offset && record_eoie())
+		eoie_f = nested_hashfile(f);
+	else
+		eoie_f = f;
 
 	/*
 	 * Lets write out CACHE_EXT_INDEXENTRYOFFSETTABLE first so that we
@@ -2950,8 +2948,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		write_ieot_extension(&sb, ieot);
-		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_INDEXENTRYOFFSETTABLE, sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		err = write_index_ext_header(eoie_f, CACHE_EXT_INDEXENTRYOFFSETTABLE, sb.len) < 0;
+		hashwrite(eoie_f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		free(ieot);
 		if (err)
@@ -2963,9 +2961,9 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		err = write_link_extension(&sb, istate) < 0 ||
-			write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_LINK,
-					       sb.len) < 0 ||
-			ce_write(&c, newfd, sb.buf, sb.len) < 0;
+			write_index_ext_header(eoie_f, CACHE_EXT_LINK,
+					       sb.len) < 0;
+		hashwrite(eoie_f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
@@ -2974,8 +2972,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		cache_tree_write(&sb, istate->cache_tree);
-		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_TREE, sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		err = write_index_ext_header(eoie_f, CACHE_EXT_TREE, sb.len) < 0;
+		hashwrite(eoie_f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
@@ -2984,9 +2982,9 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		resolve_undo_write(&sb, istate->resolve_undo);
-		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_RESOLVE_UNDO,
-					     sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		err = write_index_ext_header(eoie_f, CACHE_EXT_RESOLVE_UNDO,
+					     sb.len) < 0;
+		hashwrite(eoie_f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
@@ -2995,9 +2993,9 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		write_untracked_extension(&sb, istate->untracked);
-		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_UNTRACKED,
-					     sb.len) < 0 ||
-			ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		err = write_index_ext_header(eoie_f, CACHE_EXT_UNTRACKED,
+					     sb.len) < 0;
+		hashwrite(eoie_f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
@@ -3006,8 +3004,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		write_fsmonitor_extension(&sb, istate);
-		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_FSMONITOR, sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		err = write_index_ext_header(eoie_f, CACHE_EXT_FSMONITOR, sb.len) < 0;
+		hashwrite(eoie_f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
@@ -3019,19 +3017,21 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	 * read.  Write it out regardless of the strip_extensions parameter as we need it
 	 * when loading the shared index.
 	 */
-	if (offset && record_eoie()) {
+	if (f != eoie_f) {
 		struct strbuf sb = STRBUF_INIT;
+		unsigned char hash[GIT_MAX_RAWSZ];
 
-		write_eoie_extension(&sb, &eoie_c, offset);
-		err = write_index_ext_header(&c, NULL, newfd, CACHE_EXT_ENDOFINDEXENTRIES, sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		finalize_hashfile(eoie_f, hash, 0);
+
+		write_eoie_extension(&sb, hash, offset);
+		err = write_index_ext_header(f, CACHE_EXT_ENDOFINDEXENTRIES, sb.len) < 0;
+		hashwrite(f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
 	}
 
-	if (ce_flush(&c, newfd, istate->oid.hash))
-		return -1;
+	finalize_hashfile(f, istate->oid.hash, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	if (close_tempfile_gently(tempfile)) {
 		error(_("could not close '%s'"), get_tempfile_path(tempfile));
 		return -1;
@@ -3568,17 +3568,15 @@ static size_t read_eoie_extension(const char *mmap, size_t mmap_size)
 	return offset;
 }
 
-static void write_eoie_extension(struct strbuf *sb, git_hash_ctx *eoie_context, size_t offset)
+static void write_eoie_extension(struct strbuf *sb, const unsigned char *hash, size_t offset)
 {
 	uint32_t buffer;
-	unsigned char hash[GIT_MAX_RAWSZ];
 
 	/* offset */
 	put_be32(&buffer, offset);
 	strbuf_add(sb, &buffer, sizeof(uint32_t));
 
 	/* hash */
-	the_hash_algo->final_fn(hash, eoie_context);
 	strbuf_add(sb, hash, the_hash_algo->rawsz);
 }
 
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 3/3] read-cache: delete unused hashing methods
  2021-03-26 19:12 [PATCH 0/3] Convert index writes to use hashfile API Derrick Stolee via GitGitGadget
  2021-03-26 19:12 ` [PATCH 1/3] csum-file: add nested_hashfile() Derrick Stolee via GitGitGadget
  2021-03-26 19:12 ` [PATCH 2/3] read-cache: use hashfile instead of git_hash_ctx Derrick Stolee via GitGitGadget
@ 2021-03-26 19:12 ` Derrick Stolee via GitGitGadget
  2021-03-26 20:16 ` [PATCH 0/3] Convert index writes to use hashfile API Derrick Stolee
  2021-05-17 12:24 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
  4 siblings, 0 replies; 27+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-26 19:12 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, stolee, git, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

These methods were marked as MAYBE_UNUSED in the previous change to
avoid a complicated diff. Delete them entirely, since we now use the
hashfile API instead of this custom hashing code.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 read-cache.c | 64 ----------------------------------------------------
 1 file changed, 64 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index b9916350f331..3f3f2e01b9a2 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2467,46 +2467,6 @@ int repo_index_has_changes(struct repository *repo,
 	}
 }
 
-#define WRITE_BUFFER_SIZE (128 * 1024)
-static unsigned char write_buffer[WRITE_BUFFER_SIZE];
-static unsigned long write_buffer_len;
-
-MAYBE_UNUSED
-static int ce_write_flush(git_hash_ctx *context, int fd)
-{
-	unsigned int buffered = write_buffer_len;
-	if (buffered) {
-		the_hash_algo->update_fn(context, write_buffer, buffered);
-		if (write_in_full(fd, write_buffer, buffered) < 0)
-			return -1;
-		write_buffer_len = 0;
-	}
-	return 0;
-}
-
-MAYBE_UNUSED
-static int ce_write(git_hash_ctx *context, int fd, void *data, unsigned int len)
-{
-	while (len) {
-		unsigned int buffered = write_buffer_len;
-		unsigned int partial = WRITE_BUFFER_SIZE - buffered;
-		if (partial > len)
-			partial = len;
-		memcpy(write_buffer + buffered, data, partial);
-		buffered += partial;
-		if (buffered == WRITE_BUFFER_SIZE) {
-			write_buffer_len = buffered;
-			if (ce_write_flush(context, fd))
-				return -1;
-			buffered = 0;
-		}
-		write_buffer_len = buffered;
-		len -= partial;
-		data = (char *) data + partial;
-	}
-	return 0;
-}
-
 static int write_index_ext_header(struct hashfile *f, unsigned int ext, unsigned int sz)
 {
 	hashwrite_be32(f, ext);
@@ -2514,30 +2474,6 @@ static int write_index_ext_header(struct hashfile *f, unsigned int ext, unsigned
 	return 0;
 }
 
-MAYBE_UNUSED
-static int ce_flush(git_hash_ctx *context, int fd, unsigned char *hash)
-{
-	unsigned int left = write_buffer_len;
-
-	if (left) {
-		write_buffer_len = 0;
-		the_hash_algo->update_fn(context, write_buffer, left);
-	}
-
-	/* Flush first if not enough space for hash signature */
-	if (left + the_hash_algo->rawsz > WRITE_BUFFER_SIZE) {
-		if (write_in_full(fd, write_buffer, left) < 0)
-			return -1;
-		left = 0;
-	}
-
-	/* Append the hash signature at the end */
-	the_hash_algo->final_fn(write_buffer + left, context);
-	hashcpy(hash, write_buffer + left);
-	left += the_hash_algo->rawsz;
-	return (write_in_full(fd, write_buffer, left) < 0) ? -1 : 0;
-}
-
 static void ce_smudge_racily_clean_entry(struct index_state *istate,
 					 struct cache_entry *ce)
 {
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/3] Convert index writes to use hashfile API
  2021-03-26 19:12 [PATCH 0/3] Convert index writes to use hashfile API Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-03-26 19:12 ` [PATCH 3/3] read-cache: delete unused hashing methods Derrick Stolee via GitGitGadget
@ 2021-03-26 20:16 ` Derrick Stolee
  2021-05-17 12:24 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
  4 siblings, 0 replies; 27+ messages in thread
From: Derrick Stolee @ 2021-03-26 20:16 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git; +Cc: gitster, peff, git, Derrick Stolee

On 3/26/2021 3:12 PM, Derrick Stolee via GitGitGadget wrote:
> As I prepare some ideas on index v5, one thing that strikes me as an
> interesting direction to try is to use the chunk-format API. This would make
> our extension model extremely simple (they become optional chunks, easily
> identified by the table of contents).
> 
> But there is a huge hurdle to even starting that investigation: the index
> uses its own hashing methods, separate from the hashfile API in csum-file.c!
> 
> The internals of the algorithms are mostly identical. The only possible
> change is that the buffer sizes are different: 8KB for hashfile and 128KB in
> read-cache.c. I was unable to find a performance difference in these two
> implementations, despite testing on several repo sizes.

Of course, shortly after I send this series (thinking I've checked all the
details carefully) I notice that I was using "git update-index --really-refresh"
for testing, but what I really wanted was "git update-index --force-write".

In this case, I _do_ see a performance degradation using the hashfile API.
I will investigate whether this is just a poor implementation of the nesting
hashfile, or something else more tricky. Changing the buffer size doesn't do
the trick.

Please ignore this series for now. Sorry for the noise.

-Stolee

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3] read-cache: use hashfile instead of git_hash_ctx
  2021-03-26 19:12 ` [PATCH 2/3] read-cache: use hashfile instead of git_hash_ctx Derrick Stolee via GitGitGadget
@ 2021-03-29 15:04   ` Derrick Stolee
  2021-03-29 19:10     ` Derrick Stolee
  0 siblings, 1 reply; 27+ messages in thread
From: Derrick Stolee @ 2021-03-29 15:04 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: gitster, peff, git, Derrick Stolee, Derrick Stolee

On 3/26/2021 3:12 PM, Derrick Stolee via GitGitGadget wrote:
>  
> -	if (ce_flush(&c, newfd, istate->oid.hash))
> -		return -1;
> +	finalize_hashfile(f, istate->oid.hash, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
>  	if (close_tempfile_gently(tempfile)) {
>  		error(_("could not close '%s'"), get_tempfile_path(tempfile));
>  		return -1;

It was bothering me all weekend why this change made index writes
slower. The reason was this CSUM_FSYNC. Other performance measurement
(instructions, branches, branch misses, etc.) are all really close to
the old code, so this I/O is the only explanation. And truly, it is
the case.

It seems that we avoid an fsync() on the index exactly for this perf
reason, but we don't on other applications of the hashfile API because
writes are not critical paths.

I'll update this series in a v2 soon that has some other improvements
that I noticed while digging deep into things.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3] read-cache: use hashfile instead of git_hash_ctx
  2021-03-29 15:04   ` Derrick Stolee
@ 2021-03-29 19:10     ` Derrick Stolee
  0 siblings, 0 replies; 27+ messages in thread
From: Derrick Stolee @ 2021-03-29 19:10 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: gitster, peff, git, Derrick Stolee, Derrick Stolee

On 3/29/2021 11:04 AM, Derrick Stolee wrote:
> On 3/26/2021 3:12 PM, Derrick Stolee via GitGitGadget wrote:
>>  
>> -	if (ce_flush(&c, newfd, istate->oid.hash))
>> -		return -1;
>> +	finalize_hashfile(f, istate->oid.hash, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
>>  	if (close_tempfile_gently(tempfile)) {
>>  		error(_("could not close '%s'"), get_tempfile_path(tempfile));
>>  		return -1;
> 
> It was bothering me all weekend why this change made index writes
> slower. The reason was this CSUM_FSYNC. Other performance measurement
> (instructions, branches, branch misses, etc.) are all really close to
> the old code, so this I/O is the only explanation. And truly, it is
> the case.
> 
> It seems that we avoid an fsync() on the index exactly for this perf
> reason, but we don't on other applications of the hashfile API because
> writes are not critical paths.
> 
> I'll update this series in a v2 soon that has some other improvements
> that I noticed while digging deep into things.

Small update: the pull request [1] has the latest changes that I
think will work for these needs. However, they will conflict with
the upcoming changes to ds/sparse-index to use index extensions.

[1] https://github.com/gitgitgadget/git/pull/916

I'll pause this series and re-send on top of ds/sparse-index
after it solidifies.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v2 0/4] Convert index writes to use hashfile API
  2021-03-26 19:12 [PATCH 0/3] Convert index writes to use hashfile API Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2021-03-26 20:16 ` [PATCH 0/3] Convert index writes to use hashfile API Derrick Stolee
@ 2021-05-17 12:24 ` Derrick Stolee via GitGitGadget
  2021-05-17 12:24   ` [PATCH v2 1/4] hashfile: use write_in_full() Derrick Stolee via GitGitGadget
                     ` (4 more replies)
  4 siblings, 5 replies; 27+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-05-17 12:24 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, stolee, git, Derrick Stolee

As I prepare some ideas on index v5, one thing that strikes me as an
interesting direction to try is to use the chunk-format API. This would make
our extension model extremely simple (they become optional chunks, easily
identified by the table of contents).

But there is a huge hurdle to even starting that investigation: the index
uses its own hashing methods, separate from the hashfile API in csum-file.c!

The internals of the algorithms are mostly identical. The only possible
change is that the buffer sizes are different: 8KB for hashfile and 128KB in
read-cache.c. This change is actually relatively recent! I was able to see a
performance difference before changing that size, so I do make them equal.
This improves some other commands, too, such as git multi-pack-index write.

There is a subtle point about how the EOIE extension works in that it needs
a hash of just the previous extension headers. This won't be needed by the
chunk-format API, so leave this mostly as-is, except where it collides
somewhat with the hashfile changes.


Updates in v2
=============

 * Sorry for the false start on v1. I'm correctly testing index writes this
   time, but I also verified other commands, such as "git add .", to be sure
   I was exercising everything. I also previously had incorrect assumptions
   about the EOIE extension, since it is not triggered with "small" repos
   like the Linux kernel repository ;).

 * This version is rebased onto a recent version of 'master' so the
   conflicts with ds/sparse-index-protections are resolved.

 * This version abandons the nested hashfile concept, since that was only
   needed for the incorrect interpretation of the EOIE extension's hash.

 * This version does unify the buffer size to 128 KB with justification.

 * There is also some more unification of writing logic to use
   write_in_full() in the hashfile API.

 * When using check_fd, the hashfile API is no longer thread safe due to a
   stack-allocated buffer moving to the global scope. I'm not sure if there
   is a better way to handle this case.

Thanks, -Stolee

Derrick Stolee (4):
  hashfile: use write_in_full()
  csum-file.h: increase hashfile buffer size
  read-cache: use hashfile instead of git_hash_ctx
  read-cache: delete unused hashing methods

 chunk-format.c |  12 ++--
 csum-file.c    |  45 ++++++------
 csum-file.h    |   4 +-
 read-cache.c   | 191 ++++++++++++++++---------------------------------
 4 files changed, 94 insertions(+), 158 deletions(-)


base-commit: 97eea85a0a1ec66d356567808a1e4ca2367e0ce7
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-916%2Fderrickstolee%2Findex-hashfile-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-916/derrickstolee/index-hashfile-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/916

Range-diff vs v1:

 -:  ------------ > 1:  b3e578ac0365 hashfile: use write_in_full()
 1:  0eca529766fc ! 2:  9dc602f6c422 csum-file: add nested_hashfile()
     @@ Metadata
      Author: Derrick Stolee <dstolee@microsoft.com>
      
       ## Commit message ##
     -    csum-file: add nested_hashfile()
     +    csum-file.h: increase hashfile buffer size
      
     -    The index writing code in do_write_index() uses a custom set of hashing
     -    code, in part because it was introduced before the hashfile API. But
     -    also, the End of Index Entries extension computes a hash of just the
     -    extension data, not the entire file preceding that extension.
     +    The hashfile API uses a hard-coded buffer size of 8KB and has ever since
     +    it was introduced in c38138c (git-pack-objects: write the pack files
     +    with a SHA1 csum, 2005-06-26). It performs a similar function to the
     +    hashing buffers in read-cache.c, but that code was updated from 8KB to
     +    128KB in f279894 (read-cache: make the index write buffer size 128K,
     +    2021-02-18). The justification there was that do_write_index() improves
     +    from 1.02s to 0.72s.
      
     -    Before converting the index writing code to use the hashfile API, create
     -    a concept of a "nested hashfile". By adding a 'base' member to 'struct
     -    hashfile', we indicate that any writes to this hashfile should be passed
     -    along to the base hashfile, too.
     +    There is a buffer, check_buffer, that is used to verify the check_fd
     +    file descriptor. When this buffer increases to 128K to fit the data
     +    being flushed, it causes the stack to overflow the limits placed in the
     +    test suite. By moving this to a static buffer, we stop using stack data
     +    for this purpose, but we lose some thread-safety. This change makes it
     +    unsafe to write to multiple hashfiles across different threads.
      
     -    In the next change, the index code will use this to create a new
     -    hashfile wose base is the hashfile for the index. The outer hashfile
     -    will compute the hash just for the extension details. Thus, it will
     -    finalize earlier than the base hashfile, hence there is no modification
     -    to finalize_hashfile() here.
     +    By adding a new trace2 region in the chunk-format API, we can see that
     +    the writing portion of 'git multi-pack-index write' lowers from ~1.49s
     +    to ~1.47s on a Linux machine. These effects may be more pronounced or
     +    diminished on other filesystems. The end-to-end timing is too noisy to
     +    have a definitive change either way.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     - ## csum-file.c ##
     -@@
     + ## chunk-format.c ##
     +@@ chunk-format.c: void add_chunk(struct chunkfile *cf,
       
     - static void flush(struct hashfile *f, const void *buf, unsigned int count)
     + int write_chunkfile(struct chunkfile *cf, void *data)
       {
     -+	if (f->base)
     -+		return;
     +-	int i;
     ++	int i, result = 0;
     + 	uint64_t cur_offset = hashfile_total(cf->f);
     + 
     ++	trace2_region_enter("chunkfile", "write", the_repository);
      +
     - 	if (0 <= f->check_fd && count)  {
     - 		unsigned char check_buffer[8192];
     - 		ssize_t ret = read_in_full(f->check_fd, check_buffer, count);
     -@@ csum-file.c: void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
     - 		}
     - 		f->offset = offset;
     + 	/* Add the table of contents to the current offset */
     + 	cur_offset += (cf->chunks_nr + 1) * CHUNK_TOC_ENTRY_SIZE;
     + 
     +@@ chunk-format.c: int write_chunkfile(struct chunkfile *cf, void *data)
     + 
     + 	for (i = 0; i < cf->chunks_nr; i++) {
     + 		off_t start_offset = hashfile_total(cf->f);
     +-		int result = cf->chunks[i].write_fn(cf->f, data);
     ++		result = cf->chunks[i].write_fn(cf->f, data);
     + 
     + 		if (result)
     +-			return result;
     ++			goto cleanup;
     + 
     + 		if (hashfile_total(cf->f) - start_offset != cf->chunks[i].size)
     + 			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
     +@@ chunk-format.c: int write_chunkfile(struct chunkfile *cf, void *data)
     + 			    hashfile_total(cf->f) - start_offset);
       	}
     -+
     -+	if (f->base)
     -+		hashwrite(f->base, buf, count);
     - }
       
     - struct hashfile *hashfd(int fd, const char *name)
     -@@ csum-file.c: struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
     - 	f->name = name;
     - 	f->do_crc = 0;
     - 	the_hash_algo->init_fn(&f->ctx);
     -+	f->base = NULL;
     - 	return f;
     +-	return 0;
     ++cleanup:
     ++	trace2_region_leave("chunkfile", "write", the_repository);
     ++	return result;
       }
       
     -@@ csum-file.c: uint32_t crc32_end(struct hashfile *f)
     - 	f->do_crc = 0;
     - 	return f->crc32;
     - }
     -+
     -+struct hashfile *nested_hashfile(struct hashfile *f)
     + int read_table_of_contents(struct chunkfile *cf,
     +
     + ## csum-file.c ##
     +@@
     + #include "progress.h"
     + #include "csum-file.h"
     + 
     ++static void verify_buffer_or_die(struct hashfile *f,
     ++				 const void *buf,
     ++				 unsigned int count)
      +{
     -+	struct hashfile *n = xmalloc(sizeof(*f));
     -+	n->fd = -1;
     -+	n->check_fd = -1;
     -+	n->offset = 0;
     -+	n->total = 0;
     -+	n->tp = NULL;
     -+	n->name = NULL;
     -+	n->do_crc = 0;
     -+	the_hash_algo->init_fn(&n->ctx);
     -+	n->base = f;
     -+	return n;
     ++	static unsigned char check_buffer[WRITE_BUFFER_SIZE];
     ++	ssize_t ret = read_in_full(f->check_fd, check_buffer, count);
     ++
     ++	if (ret < 0)
     ++		die_errno("%s: sha1 file read error", f->name);
     ++	if (ret != count)
     ++		die("%s: sha1 file truncated", f->name);
     ++	if (memcmp(buf, check_buffer, count))
     ++		die("sha1 file '%s' validation error", f->name);
      +}
     ++
     + static void flush(struct hashfile *f, const void *buf, unsigned int count)
     + {
     +-	if (0 <= f->check_fd && count)  {
     +-		unsigned char check_buffer[8192];
     +-		ssize_t ret = read_in_full(f->check_fd, check_buffer, count);
     +-
     +-		if (ret < 0)
     +-			die_errno("%s: sha1 file read error", f->name);
     +-		if (ret != count)
     +-			die("%s: sha1 file truncated", f->name);
     +-		if (memcmp(buf, check_buffer, count))
     +-			die("sha1 file '%s' validation error", f->name);
     +-	}
     ++	if (0 <= f->check_fd && count)
     ++		verify_buffer_or_die(f, buf, count);
     + 
     + 	if (write_in_full(f->fd, buf, count) < 0) {
     + 		if (errno == ENOSPC)
      
       ## csum-file.h ##
     +@@
     + 
     + struct progress;
     + 
     ++#define WRITE_BUFFER_SIZE (128 * 1024)
     ++
     + /* A SHA1-protected file */
     + struct hashfile {
     + 	int fd;
      @@ csum-file.h: struct hashfile {
       	const char *name;
       	int do_crc;
       	uint32_t crc32;
     -+	struct hashfile *base;
     - 	unsigned char buffer[8192];
     +-	unsigned char buffer[8192];
     ++	unsigned char buffer[WRITE_BUFFER_SIZE];
       };
       
     -@@ csum-file.h: void hashflush(struct hashfile *f);
     - void crc32_begin(struct hashfile *);
     - uint32_t crc32_end(struct hashfile *);
     - 
     -+/*
     -+ * A nested hashfile uses the same interface as a hashfile, and computes
     -+ * a hash for the input bytes while passing them to the base hashfile
     -+ * instead of writing them to its own file. This is useful for computing
     -+ * a hash of a region within a file during the write.
     -+ */
     -+struct hashfile *nested_hashfile(struct hashfile *f);
     -+
     - /*
     -  * Returns the total number of bytes fed to the hashfile so far (including ones
     -  * that have not been written out to the descriptor yet).
     + /* Checkpoint */
 2:  e2611bbc007a ! 3:  b94172ccf5e9 read-cache: use hashfile instead of git_hash_ctx
     @@ Commit message
          that uses the chunk-format API. That API uses a hashfile as its base, so
          it is incompatible with the custom code in read-cache.c.
      
     -    This rewrite of the logic is rather straightforward, except for the
     -    special case of creating a nested hashfile to handle computing the hash
     -    of the extension data just for the End of Index Entries extension. The
     -    previous change introduced the concept for just this purpose.
     +    This rewrite is rather straightforward. It replaces all writes to the
     +    temporary file with writes to the hashfile struct. This takes care of
     +    many of the direct interactions with the_hash_algo.
      
     -    The internals of the algorithms are mostly identical. The only
     -    meaningful change is that the buffer sizes are different: 8KB for
     -    hashfile and 128KB in read-cache.c. I was unable to find a performance
     -    difference in these two implementations, despite testing on several repo
     -    sizes. I also tried adjusting the buffer size of the hashfile struct for
     -    a variety of sizes between 8KB and 128KB, and did not see a performance
     -    change for any of the commands that currently use hashfiles.
     +    There are still some remaining: the extension headers are hashed for use
     +    in the End of Index Entries (EOIE) extension. This use of the
     +    git_hash_ctx is left as-is. There are multiple reasons to not use a
     +    hashfile here, including the fact that the data is not actually writing
     +    to a file, just a hash computation. These hashes do not block our
     +    adoption of the chunk-format API in a future change to the index, so
     +    leave it as-is.
     +
     +    The internals of the algorithms are mostly identical. Previously, the
     +    hashfile API used a smaller 8KB buffer instead of the 128KB buffer from
     +    read-cache.c. The previous change already unified these sizes.
     +
     +    There is one subtle point: we do not pass the CSUM_FSYNC to the
     +    finalize_hashfile() method, which differs from most consumers of the
     +    hashfile API. The extra fsync() call indicated by this flag causes a
     +    significant peformance degradation that is noticeable for quick
     +    commands that write the index, such as "git add". Other consumers can
     +    absorb this cost with their more complicated data structure
     +    organization, and further writing structures such as pack-files and
     +    commit-graphs is rarely in the critical path for common user
     +    interactions.
      
          Some static methods become orphaned in this diff, so I marked them as
          MAYBE_UNUSED. The diff is much harder to read if they are deleted during
     @@ Commit message
          In addition to the test suite passing, I computed indexes using the
          previous binaries and the binaries compiled after this change, and found
          the index data to be exactly equal. Finally, I did extensive performance
     -    testing of "git update-index --really-refresh" on repos of various
     -    sizes, including one with over 2 million paths at HEAD. These tests
     +    testing of "git update-index --force-write" on repos of various sizes,
     +    including one with over 2 million paths at HEAD. These tests
          demonstrated less than 1% difference in behavior, so the performance
          should be considered identical.
      
     @@ Commit message
      
       ## read-cache.c ##
      @@
     - #include "fsmonitor.h"
       #include "thread-utils.h"
       #include "progress.h"
     + #include "sparse-index.h"
      +#include "csum-file.h"
       
       /* Mask for the name length in ce_flags in the on-disk index */
       
     -@@ read-cache.c: static struct index_entry_offset_table *read_ieot_extension(const char *mmap, si
     - static void write_ieot_extension(struct strbuf *sb, struct index_entry_offset_table *ieot);
     - 
     - static size_t read_eoie_extension(const char *mmap, size_t mmap_size);
     --static void write_eoie_extension(struct strbuf *sb, git_hash_ctx *eoie_context, size_t offset);
     -+static void write_eoie_extension(struct strbuf *sb, const unsigned char *hash, size_t offset);
     - 
     - struct load_index_extensions
     - {
      @@ read-cache.c: int repo_index_has_changes(struct repository *repo,
       static unsigned char write_buffer[WRITE_BUFFER_SIZE];
       static unsigned long write_buffer_len;
     @@ read-cache.c: static int ce_write(git_hash_ctx *context, int fd, void *data, uns
       
      -static int write_index_ext_header(git_hash_ctx *context, git_hash_ctx *eoie_context,
      -				  int fd, unsigned int ext, unsigned int sz)
     -+static int write_index_ext_header(struct hashfile *f, unsigned int ext, unsigned int sz)
     ++static int write_index_ext_header(struct hashfile *f,
     ++				  git_hash_ctx *eoie_f,
     ++				  unsigned int ext,
     ++				  unsigned int sz)
       {
      -	ext = htonl(ext);
      -	sz = htonl(sz);
      -	if (eoie_context) {
      -		the_hash_algo->update_fn(eoie_context, &ext, 4);
      -		the_hash_algo->update_fn(eoie_context, &sz, 4);
     --	}
     --	return ((ce_write(context, fd, &ext, 4) < 0) ||
     --		(ce_write(context, fd, &sz, 4) < 0)) ? -1 : 0;
      +	hashwrite_be32(f, ext);
      +	hashwrite_be32(f, sz);
     ++
     ++	if (eoie_f) {
     ++		ext = htonl(ext);
     ++		sz = htonl(sz);
     ++		the_hash_algo->update_fn(eoie_f, &ext, sizeof(ext));
     ++		the_hash_algo->update_fn(eoie_f, &sz, sizeof(sz));
     + 	}
     +-	return ((ce_write(context, fd, &ext, 4) < 0) ||
     +-		(ce_write(context, fd, &sz, 4) < 0)) ? -1 : 0;
      +	return 0;
       }
       
     @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
       	uint64_t start = getnanotime();
      -	int newfd = tempfile->fd;
      -	git_hash_ctx c, eoie_c;
     -+	struct hashfile *f, *eoie_f;
     ++	struct hashfile *f;
     ++	git_hash_ctx *eoie_c = NULL;
       	struct cache_header hdr;
       	int i, err = 0, removed, extended, hdr_version;
       	struct cache_entry **cache = istate->cache;
     @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
       	}
       
      -	offset = lseek(newfd, 0, SEEK_CUR);
     -+	offset = lseek(f->fd, 0, SEEK_CUR);
     - 	if (offset < 0) {
     - 		free(ieot);
     - 		return -1;
     - 	}
     +-	if (offset < 0) {
     +-		free(ieot);
     +-		return -1;
     +-	}
      -	offset += write_buffer_len;
     ++	offset = hashfile_total(f);
      +
       	nr = 0;
       	previous_name = (hdr_version == 4) ? &previous_name_buf : NULL;
     @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
       				previous_name->buf[0] = 0;
       			nr = 0;
      -			offset = lseek(newfd, 0, SEEK_CUR);
     -+
     -+			offset = lseek(f->fd, 0, SEEK_CUR);
     - 			if (offset < 0) {
     - 				free(ieot);
     - 				return -1;
     - 			}
     +-			if (offset < 0) {
     +-				free(ieot);
     +-				return -1;
     +-			}
      -			offset += write_buffer_len;
     ++
     ++			offset = hashfile_total(f);
       		}
      -		if (ce_write_entry(&c, newfd, ce, previous_name, (struct ondisk_cache_entry *)&ondisk) < 0)
      +		if (ce_write_entry(f, ce, previous_name, (struct ondisk_cache_entry *)&ondisk) < 0)
     @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
       
      -	/* Write extension data here */
      -	offset = lseek(newfd, 0, SEEK_CUR);
     -+	offset = lseek(f->fd, 0, SEEK_CUR);
     - 	if (offset < 0) {
     - 		free(ieot);
     - 		return -1;
     - 	}
     --	offset += write_buffer_len;
     --	the_hash_algo->init_fn(&eoie_c);
     +-	if (offset < 0) {
     +-		free(ieot);
     +-		return -1;
     ++	offset = hashfile_total(f);
      +
      +	/*
     -+	 * The extensions must be hashed on their own for use in the EOIE
     -+	 * extension. Use a nested hashfile to compute the hash for this
     -+	 * region while passing the buffer to the original hashfile.
     ++	 * The extension headers must be hashed on their own for the
     ++	 * EOIE extension. Create a hashfile here to compute that hash.
      +	 */
     -+	if (offset && record_eoie())
     -+		eoie_f = nested_hashfile(f);
     -+	else
     -+		eoie_f = f;
     ++	if (offset && record_eoie()) {
     ++		CALLOC_ARRAY(eoie_c, 1);
     ++		the_hash_algo->init_fn(eoie_c);
     + 	}
     +-	offset += write_buffer_len;
     +-	the_hash_algo->init_fn(&eoie_c);
       
       	/*
       	 * Lets write out CACHE_EXT_INDEXENTRYOFFSETTABLE first so that we
     @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
       		write_ieot_extension(&sb, ieot);
      -		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_INDEXENTRYOFFSETTABLE, sb.len) < 0
      -			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
     -+		err = write_index_ext_header(eoie_f, CACHE_EXT_INDEXENTRYOFFSETTABLE, sb.len) < 0;
     -+		hashwrite(eoie_f, sb.buf, sb.len);
     ++		err = write_index_ext_header(f, eoie_c, CACHE_EXT_INDEXENTRYOFFSETTABLE, sb.len) < 0;
     ++		hashwrite(f, sb.buf, sb.len);
       		strbuf_release(&sb);
       		free(ieot);
       		if (err)
     @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
      -			write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_LINK,
      -					       sb.len) < 0 ||
      -			ce_write(&c, newfd, sb.buf, sb.len) < 0;
     -+			write_index_ext_header(eoie_f, CACHE_EXT_LINK,
     ++			write_index_ext_header(f, eoie_c, CACHE_EXT_LINK,
      +					       sb.len) < 0;
     -+		hashwrite(eoie_f, sb.buf, sb.len);
     ++		hashwrite(f, sb.buf, sb.len);
       		strbuf_release(&sb);
       		if (err)
       			return -1;
     @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
       		cache_tree_write(&sb, istate->cache_tree);
      -		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_TREE, sb.len) < 0
      -			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
     -+		err = write_index_ext_header(eoie_f, CACHE_EXT_TREE, sb.len) < 0;
     -+		hashwrite(eoie_f, sb.buf, sb.len);
     ++		err = write_index_ext_header(f, eoie_c, CACHE_EXT_TREE, sb.len) < 0;
     ++		hashwrite(f, sb.buf, sb.len);
       		strbuf_release(&sb);
       		if (err)
       			return -1;
     @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
      -		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_RESOLVE_UNDO,
      -					     sb.len) < 0
      -			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
     -+		err = write_index_ext_header(eoie_f, CACHE_EXT_RESOLVE_UNDO,
     ++		err = write_index_ext_header(f, eoie_c, CACHE_EXT_RESOLVE_UNDO,
      +					     sb.len) < 0;
     -+		hashwrite(eoie_f, sb.buf, sb.len);
     ++		hashwrite(f, sb.buf, sb.len);
       		strbuf_release(&sb);
       		if (err)
       			return -1;
     @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
      -		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_UNTRACKED,
      -					     sb.len) < 0 ||
      -			ce_write(&c, newfd, sb.buf, sb.len) < 0;
     -+		err = write_index_ext_header(eoie_f, CACHE_EXT_UNTRACKED,
     ++		err = write_index_ext_header(f, eoie_c, CACHE_EXT_UNTRACKED,
      +					     sb.len) < 0;
     -+		hashwrite(eoie_f, sb.buf, sb.len);
     ++		hashwrite(f, sb.buf, sb.len);
       		strbuf_release(&sb);
       		if (err)
       			return -1;
     @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
       		write_fsmonitor_extension(&sb, istate);
      -		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_FSMONITOR, sb.len) < 0
      -			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
     -+		err = write_index_ext_header(eoie_f, CACHE_EXT_FSMONITOR, sb.len) < 0;
     -+		hashwrite(eoie_f, sb.buf, sb.len);
     ++		err = write_index_ext_header(f, eoie_c, CACHE_EXT_FSMONITOR, sb.len) < 0;
     ++		hashwrite(f, sb.buf, sb.len);
       		strbuf_release(&sb);
       		if (err)
       			return -1;
     + 	}
     + 	if (istate->sparse_index) {
     +-		if (write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_SPARSE_DIRECTORIES, 0) < 0)
     ++		if (write_index_ext_header(f, eoie_c, CACHE_EXT_SPARSE_DIRECTORIES, 0) < 0)
     + 			return -1;
     + 	}
     + 
      @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
       	 * read.  Write it out regardless of the strip_extensions parameter as we need it
       	 * when loading the shared index.
       	 */
      -	if (offset && record_eoie()) {
     -+	if (f != eoie_f) {
     ++	if (eoie_c) {
       		struct strbuf sb = STRBUF_INIT;
     -+		unsigned char hash[GIT_MAX_RAWSZ];
       
      -		write_eoie_extension(&sb, &eoie_c, offset);
      -		err = write_index_ext_header(&c, NULL, newfd, CACHE_EXT_ENDOFINDEXENTRIES, sb.len) < 0
      -			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
     -+		finalize_hashfile(eoie_f, hash, 0);
     -+
     -+		write_eoie_extension(&sb, hash, offset);
     -+		err = write_index_ext_header(f, CACHE_EXT_ENDOFINDEXENTRIES, sb.len) < 0;
     ++		write_eoie_extension(&sb, eoie_c, offset);
     ++		err = write_index_ext_header(f, NULL, CACHE_EXT_ENDOFINDEXENTRIES, sb.len) < 0;
      +		hashwrite(f, sb.buf, sb.len);
       		strbuf_release(&sb);
       		if (err)
     @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempf
       
      -	if (ce_flush(&c, newfd, istate->oid.hash))
      -		return -1;
     -+	finalize_hashfile(f, istate->oid.hash, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
     ++	finalize_hashfile(f, istate->oid.hash, CSUM_HASH_IN_STREAM);
       	if (close_tempfile_gently(tempfile)) {
       		error(_("could not close '%s'"), get_tempfile_path(tempfile));
       		return -1;
     -@@ read-cache.c: static size_t read_eoie_extension(const char *mmap, size_t mmap_size)
     - 	return offset;
     - }
     - 
     --static void write_eoie_extension(struct strbuf *sb, git_hash_ctx *eoie_context, size_t offset)
     -+static void write_eoie_extension(struct strbuf *sb, const unsigned char *hash, size_t offset)
     - {
     - 	uint32_t buffer;
     --	unsigned char hash[GIT_MAX_RAWSZ];
     - 
     - 	/* offset */
     - 	put_be32(&buffer, offset);
     - 	strbuf_add(sb, &buffer, sizeof(uint32_t));
     - 
     - 	/* hash */
     --	the_hash_algo->final_fn(hash, eoie_context);
     - 	strbuf_add(sb, hash, the_hash_algo->rawsz);
     - }
     - 
 3:  e2d5a8dc919b ! 4:  4b3814eb4c80 read-cache: delete unused hashing methods
     @@ read-cache.c: int repo_index_has_changes(struct repository *repo,
      -	return 0;
      -}
      -
     - static int write_index_ext_header(struct hashfile *f, unsigned int ext, unsigned int sz)
     - {
     - 	hashwrite_be32(f, ext);
     -@@ read-cache.c: static int write_index_ext_header(struct hashfile *f, unsigned int ext, unsigned
     + static int write_index_ext_header(struct hashfile *f,
     + 				  git_hash_ctx *eoie_f,
     + 				  unsigned int ext,
     +@@ read-cache.c: static int write_index_ext_header(struct hashfile *f,
       	return 0;
       }
       

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v2 1/4] hashfile: use write_in_full()
  2021-05-17 12:24 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
@ 2021-05-17 12:24   ` Derrick Stolee via GitGitGadget
  2021-05-17 12:24   ` [PATCH v2 2/4] csum-file.h: increase hashfile buffer size Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 27+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-05-17 12:24 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, stolee, git, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The flush() logic in csum-file.c was introduced originally by c38138c
(git-pack-objects: write the pack files with a SHA1 csum, 2005-06-26)
and a portion of the logic performs similar utility to write_in_full()
in wrapper.c. The history of write_in_full() is full of moves and
renames, but was originally introduced by 7230e6d (Add write_or_die(), a
helper function, 2006-08-21).

The point of these sections of code are to flush a write buffer using
xwrite() and report errors in the case of disk space issues or other
generic input/output errors. The logic in flush() can interpret the
output of write_in_full() to provide the correct error messages to
users.

The logic in the hashfile API has an additional set of logic to augment
the progress indicator between calls to xwrite(). This was introduced by
2a128d6 (add throughput display to git-push, 2007-10-30). It seems that
since the hashfile's buffer is only 8KB, these additional progress
indicators might not be incredibly necessary. Instead, update the
progress only when write_in_full() complete.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 csum-file.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 7510950fa3e9..3c26389d4914 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -25,21 +25,14 @@ static void flush(struct hashfile *f, const void *buf, unsigned int count)
 			die("sha1 file '%s' validation error", f->name);
 	}
 
-	for (;;) {
-		int ret = xwrite(f->fd, buf, count);
-		if (ret > 0) {
-			f->total += ret;
-			display_throughput(f->tp, f->total);
-			buf = (char *) buf + ret;
-			count -= ret;
-			if (count)
-				continue;
-			return;
-		}
-		if (!ret)
+	if (write_in_full(f->fd, buf, count) < 0) {
+		if (errno == ENOSPC)
 			die("sha1 file '%s' write error. Out of diskspace", f->name);
 		die_errno("sha1 file '%s' write error", f->name);
 	}
+
+	f->total += count;
+	display_throughput(f->tp, f->total);
 }
 
 void hashflush(struct hashfile *f)
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v2 2/4] csum-file.h: increase hashfile buffer size
  2021-05-17 12:24 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
  2021-05-17 12:24   ` [PATCH v2 1/4] hashfile: use write_in_full() Derrick Stolee via GitGitGadget
@ 2021-05-17 12:24   ` Derrick Stolee via GitGitGadget
  2021-05-17 21:54     ` Junio C Hamano
  2021-05-18  7:31     ` Jeff King
  2021-05-17 12:24   ` [PATCH v2 3/4] read-cache: use hashfile instead of git_hash_ctx Derrick Stolee via GitGitGadget
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 27+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-05-17 12:24 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, stolee, git, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The hashfile API uses a hard-coded buffer size of 8KB and has ever since
it was introduced in c38138c (git-pack-objects: write the pack files
with a SHA1 csum, 2005-06-26). It performs a similar function to the
hashing buffers in read-cache.c, but that code was updated from 8KB to
128KB in f279894 (read-cache: make the index write buffer size 128K,
2021-02-18). The justification there was that do_write_index() improves
from 1.02s to 0.72s.

There is a buffer, check_buffer, that is used to verify the check_fd
file descriptor. When this buffer increases to 128K to fit the data
being flushed, it causes the stack to overflow the limits placed in the
test suite. By moving this to a static buffer, we stop using stack data
for this purpose, but we lose some thread-safety. This change makes it
unsafe to write to multiple hashfiles across different threads.

By adding a new trace2 region in the chunk-format API, we can see that
the writing portion of 'git multi-pack-index write' lowers from ~1.49s
to ~1.47s on a Linux machine. These effects may be more pronounced or
diminished on other filesystems. The end-to-end timing is too noisy to
have a definitive change either way.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 chunk-format.c | 12 ++++++++----
 csum-file.c    | 28 +++++++++++++++++-----------
 csum-file.h    |  4 +++-
 3 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/chunk-format.c b/chunk-format.c
index da191e59a29d..1c3dca62e205 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -58,9 +58,11 @@ void add_chunk(struct chunkfile *cf,
 
 int write_chunkfile(struct chunkfile *cf, void *data)
 {
-	int i;
+	int i, result = 0;
 	uint64_t cur_offset = hashfile_total(cf->f);
 
+	trace2_region_enter("chunkfile", "write", the_repository);
+
 	/* Add the table of contents to the current offset */
 	cur_offset += (cf->chunks_nr + 1) * CHUNK_TOC_ENTRY_SIZE;
 
@@ -77,10 +79,10 @@ int write_chunkfile(struct chunkfile *cf, void *data)
 
 	for (i = 0; i < cf->chunks_nr; i++) {
 		off_t start_offset = hashfile_total(cf->f);
-		int result = cf->chunks[i].write_fn(cf->f, data);
+		result = cf->chunks[i].write_fn(cf->f, data);
 
 		if (result)
-			return result;
+			goto cleanup;
 
 		if (hashfile_total(cf->f) - start_offset != cf->chunks[i].size)
 			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
@@ -88,7 +90,9 @@ int write_chunkfile(struct chunkfile *cf, void *data)
 			    hashfile_total(cf->f) - start_offset);
 	}
 
-	return 0;
+cleanup:
+	trace2_region_leave("chunkfile", "write", the_repository);
+	return result;
 }
 
 int read_table_of_contents(struct chunkfile *cf,
diff --git a/csum-file.c b/csum-file.c
index 3c26389d4914..bd9939c49efa 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -11,19 +11,25 @@
 #include "progress.h"
 #include "csum-file.h"
 
+static void verify_buffer_or_die(struct hashfile *f,
+				 const void *buf,
+				 unsigned int count)
+{
+	static unsigned char check_buffer[WRITE_BUFFER_SIZE];
+	ssize_t ret = read_in_full(f->check_fd, check_buffer, count);
+
+	if (ret < 0)
+		die_errno("%s: sha1 file read error", f->name);
+	if (ret != count)
+		die("%s: sha1 file truncated", f->name);
+	if (memcmp(buf, check_buffer, count))
+		die("sha1 file '%s' validation error", f->name);
+}
+
 static void flush(struct hashfile *f, const void *buf, unsigned int count)
 {
-	if (0 <= f->check_fd && count)  {
-		unsigned char check_buffer[8192];
-		ssize_t ret = read_in_full(f->check_fd, check_buffer, count);
-
-		if (ret < 0)
-			die_errno("%s: sha1 file read error", f->name);
-		if (ret != count)
-			die("%s: sha1 file truncated", f->name);
-		if (memcmp(buf, check_buffer, count))
-			die("sha1 file '%s' validation error", f->name);
-	}
+	if (0 <= f->check_fd && count)
+		verify_buffer_or_die(f, buf, count);
 
 	if (write_in_full(f->fd, buf, count) < 0) {
 		if (errno == ENOSPC)
diff --git a/csum-file.h b/csum-file.h
index e54d53d1d0b3..bc88eb86fc28 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -5,6 +5,8 @@
 
 struct progress;
 
+#define WRITE_BUFFER_SIZE (128 * 1024)
+
 /* A SHA1-protected file */
 struct hashfile {
 	int fd;
@@ -16,7 +18,7 @@ struct hashfile {
 	const char *name;
 	int do_crc;
 	uint32_t crc32;
-	unsigned char buffer[8192];
+	unsigned char buffer[WRITE_BUFFER_SIZE];
 };
 
 /* Checkpoint */
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v2 3/4] read-cache: use hashfile instead of git_hash_ctx
  2021-05-17 12:24 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
  2021-05-17 12:24   ` [PATCH v2 1/4] hashfile: use write_in_full() Derrick Stolee via GitGitGadget
  2021-05-17 12:24   ` [PATCH v2 2/4] csum-file.h: increase hashfile buffer size Derrick Stolee via GitGitGadget
@ 2021-05-17 12:24   ` Derrick Stolee via GitGitGadget
  2021-05-17 22:13     ` Junio C Hamano
  2021-05-17 12:24   ` [PATCH v2 4/4] read-cache: delete unused hashing methods Derrick Stolee via GitGitGadget
  2021-05-18 18:32   ` [PATCH v3 0/4] Convert index writes to use hashfile API Derrick Stolee via GitGitGadget
  4 siblings, 1 reply; 27+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-05-17 12:24 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, stolee, git, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The do_write_index() method in read-cache.c has its own hashing logic
and buffering mechanism. Specifically, the ce_write() method was
introduced by 4990aadc (Speed up index file writing by chunking it
nicely, 2005-04-20) and similar mechanisms were introduced a few months
later in c38138cd (git-pack-objects: write the pack files with a SHA1
csum, 2005-06-26). Based on the timing, in the early days of the Git
codebase, I figured that these roughly equivalent code paths were never
unified only because it got lost in the shuffle. The hashfile API has
since been used extensively in other file formats, such as pack-indexes,
mult-pack-indexes, and commit-graphs. Therefore, it seems prudent to
unify the index writing code to use the same mechanism.

I discovered this disparity while trying to create a new index format
that uses the chunk-format API. That API uses a hashfile as its base, so
it is incompatible with the custom code in read-cache.c.

This rewrite is rather straightforward. It replaces all writes to the
temporary file with writes to the hashfile struct. This takes care of
many of the direct interactions with the_hash_algo.

There are still some remaining: the extension headers are hashed for use
in the End of Index Entries (EOIE) extension. This use of the
git_hash_ctx is left as-is. There are multiple reasons to not use a
hashfile here, including the fact that the data is not actually writing
to a file, just a hash computation. These hashes do not block our
adoption of the chunk-format API in a future change to the index, so
leave it as-is.

The internals of the algorithms are mostly identical. Previously, the
hashfile API used a smaller 8KB buffer instead of the 128KB buffer from
read-cache.c. The previous change already unified these sizes.

There is one subtle point: we do not pass the CSUM_FSYNC to the
finalize_hashfile() method, which differs from most consumers of the
hashfile API. The extra fsync() call indicated by this flag causes a
significant peformance degradation that is noticeable for quick
commands that write the index, such as "git add". Other consumers can
absorb this cost with their more complicated data structure
organization, and further writing structures such as pack-files and
commit-graphs is rarely in the critical path for common user
interactions.

Some static methods become orphaned in this diff, so I marked them as
MAYBE_UNUSED. The diff is much harder to read if they are deleted during
this change. Instead, they will be deleted in the following change.

In addition to the test suite passing, I computed indexes using the
previous binaries and the binaries compiled after this change, and found
the index data to be exactly equal. Finally, I did extensive performance
testing of "git update-index --force-write" on repos of various sizes,
including one with over 2 million paths at HEAD. These tests
demonstrated less than 1% difference in behavior, so the performance
should be considered identical.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 read-cache.c | 137 +++++++++++++++++++++++++--------------------------
 1 file changed, 66 insertions(+), 71 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index fbf3a4ce7d5d..1c0bda81e7e7 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -26,6 +26,7 @@
 #include "thread-utils.h"
 #include "progress.h"
 #include "sparse-index.h"
+#include "csum-file.h"
 
 /* Mask for the name length in ce_flags in the on-disk index */
 
@@ -2519,6 +2520,7 @@ int repo_index_has_changes(struct repository *repo,
 static unsigned char write_buffer[WRITE_BUFFER_SIZE];
 static unsigned long write_buffer_len;
 
+MAYBE_UNUSED
 static int ce_write_flush(git_hash_ctx *context, int fd)
 {
 	unsigned int buffered = write_buffer_len;
@@ -2531,6 +2533,7 @@ static int ce_write_flush(git_hash_ctx *context, int fd)
 	return 0;
 }
 
+MAYBE_UNUSED
 static int ce_write(git_hash_ctx *context, int fd, void *data, unsigned int len)
 {
 	while (len) {
@@ -2553,19 +2556,24 @@ static int ce_write(git_hash_ctx *context, int fd, void *data, unsigned int len)
 	return 0;
 }
 
-static int write_index_ext_header(git_hash_ctx *context, git_hash_ctx *eoie_context,
-				  int fd, unsigned int ext, unsigned int sz)
+static int write_index_ext_header(struct hashfile *f,
+				  git_hash_ctx *eoie_f,
+				  unsigned int ext,
+				  unsigned int sz)
 {
-	ext = htonl(ext);
-	sz = htonl(sz);
-	if (eoie_context) {
-		the_hash_algo->update_fn(eoie_context, &ext, 4);
-		the_hash_algo->update_fn(eoie_context, &sz, 4);
+	hashwrite_be32(f, ext);
+	hashwrite_be32(f, sz);
+
+	if (eoie_f) {
+		ext = htonl(ext);
+		sz = htonl(sz);
+		the_hash_algo->update_fn(eoie_f, &ext, sizeof(ext));
+		the_hash_algo->update_fn(eoie_f, &sz, sizeof(sz));
 	}
-	return ((ce_write(context, fd, &ext, 4) < 0) ||
-		(ce_write(context, fd, &sz, 4) < 0)) ? -1 : 0;
+	return 0;
 }
 
+MAYBE_UNUSED
 static int ce_flush(git_hash_ctx *context, int fd, unsigned char *hash)
 {
 	unsigned int left = write_buffer_len;
@@ -2667,11 +2675,10 @@ static void copy_cache_entry_to_ondisk(struct ondisk_cache_entry *ondisk,
 	}
 }
 
-static int ce_write_entry(git_hash_ctx *c, int fd, struct cache_entry *ce,
+static int ce_write_entry(struct hashfile *f, struct cache_entry *ce,
 			  struct strbuf *previous_name, struct ondisk_cache_entry *ondisk)
 {
 	int size;
-	int result;
 	unsigned int saved_namelen;
 	int stripped_name = 0;
 	static unsigned char padding[8] = { 0x00 };
@@ -2687,11 +2694,9 @@ static int ce_write_entry(git_hash_ctx *c, int fd, struct cache_entry *ce,
 	if (!previous_name) {
 		int len = ce_namelen(ce);
 		copy_cache_entry_to_ondisk(ondisk, ce);
-		result = ce_write(c, fd, ondisk, size);
-		if (!result)
-			result = ce_write(c, fd, ce->name, len);
-		if (!result)
-			result = ce_write(c, fd, padding, align_padding_size(size, len));
+		hashwrite(f, ondisk, size);
+		hashwrite(f, ce->name, len);
+		hashwrite(f, padding, align_padding_size(size, len));
 	} else {
 		int common, to_remove, prefix_size;
 		unsigned char to_remove_vi[16];
@@ -2705,13 +2710,10 @@ static int ce_write_entry(git_hash_ctx *c, int fd, struct cache_entry *ce,
 		prefix_size = encode_varint(to_remove, to_remove_vi);
 
 		copy_cache_entry_to_ondisk(ondisk, ce);
-		result = ce_write(c, fd, ondisk, size);
-		if (!result)
-			result = ce_write(c, fd, to_remove_vi, prefix_size);
-		if (!result)
-			result = ce_write(c, fd, ce->name + common, ce_namelen(ce) - common);
-		if (!result)
-			result = ce_write(c, fd, padding, 1);
+		hashwrite(f, ondisk, size);
+		hashwrite(f, to_remove_vi, prefix_size);
+		hashwrite(f, ce->name + common, ce_namelen(ce) - common);
+		hashwrite(f, padding, 1);
 
 		strbuf_splice(previous_name, common, to_remove,
 			      ce->name + common, ce_namelen(ce) - common);
@@ -2721,7 +2723,7 @@ static int ce_write_entry(git_hash_ctx *c, int fd, struct cache_entry *ce,
 		ce->ce_flags &= ~CE_STRIP_NAME;
 	}
 
-	return result;
+	return 0;
 }
 
 /*
@@ -2833,8 +2835,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 			  int strip_extensions)
 {
 	uint64_t start = getnanotime();
-	int newfd = tempfile->fd;
-	git_hash_ctx c, eoie_c;
+	struct hashfile *f;
+	git_hash_ctx *eoie_c = NULL;
 	struct cache_header hdr;
 	int i, err = 0, removed, extended, hdr_version;
 	struct cache_entry **cache = istate->cache;
@@ -2848,6 +2850,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	struct index_entry_offset_table *ieot = NULL;
 	int nr, nr_threads;
 
+	f = hashfd(tempfile->fd, tempfile->filename.buf);
+
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
 			removed++;
@@ -2876,9 +2880,7 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	hdr.hdr_version = htonl(hdr_version);
 	hdr.hdr_entries = htonl(entries - removed);
 
-	the_hash_algo->init_fn(&c);
-	if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0)
-		return -1;
+	hashwrite(f, &hdr, sizeof(hdr));
 
 	if (!HAVE_THREADS || git_config_get_index_threads(&nr_threads))
 		nr_threads = 1;
@@ -2913,12 +2915,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		}
 	}
 
-	offset = lseek(newfd, 0, SEEK_CUR);
-	if (offset < 0) {
-		free(ieot);
-		return -1;
-	}
-	offset += write_buffer_len;
+	offset = hashfile_total(f);
+
 	nr = 0;
 	previous_name = (hdr_version == 4) ? &previous_name_buf : NULL;
 
@@ -2953,14 +2951,10 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 			if (previous_name)
 				previous_name->buf[0] = 0;
 			nr = 0;
-			offset = lseek(newfd, 0, SEEK_CUR);
-			if (offset < 0) {
-				free(ieot);
-				return -1;
-			}
-			offset += write_buffer_len;
+
+			offset = hashfile_total(f);
 		}
-		if (ce_write_entry(&c, newfd, ce, previous_name, (struct ondisk_cache_entry *)&ondisk) < 0)
+		if (ce_write_entry(f, ce, previous_name, (struct ondisk_cache_entry *)&ondisk) < 0)
 			err = -1;
 
 		if (err)
@@ -2979,14 +2973,16 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		return err;
 	}
 
-	/* Write extension data here */
-	offset = lseek(newfd, 0, SEEK_CUR);
-	if (offset < 0) {
-		free(ieot);
-		return -1;
+	offset = hashfile_total(f);
+
+	/*
+	 * The extension headers must be hashed on their own for the
+	 * EOIE extension. Create a hashfile here to compute that hash.
+	 */
+	if (offset && record_eoie()) {
+		CALLOC_ARRAY(eoie_c, 1);
+		the_hash_algo->init_fn(eoie_c);
 	}
-	offset += write_buffer_len;
-	the_hash_algo->init_fn(&eoie_c);
 
 	/*
 	 * Lets write out CACHE_EXT_INDEXENTRYOFFSETTABLE first so that we
@@ -2999,8 +2995,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		write_ieot_extension(&sb, ieot);
-		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_INDEXENTRYOFFSETTABLE, sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		err = write_index_ext_header(f, eoie_c, CACHE_EXT_INDEXENTRYOFFSETTABLE, sb.len) < 0;
+		hashwrite(f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		free(ieot);
 		if (err)
@@ -3012,9 +3008,9 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		err = write_link_extension(&sb, istate) < 0 ||
-			write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_LINK,
-					       sb.len) < 0 ||
-			ce_write(&c, newfd, sb.buf, sb.len) < 0;
+			write_index_ext_header(f, eoie_c, CACHE_EXT_LINK,
+					       sb.len) < 0;
+		hashwrite(f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
@@ -3023,8 +3019,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		cache_tree_write(&sb, istate->cache_tree);
-		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_TREE, sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		err = write_index_ext_header(f, eoie_c, CACHE_EXT_TREE, sb.len) < 0;
+		hashwrite(f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
@@ -3033,9 +3029,9 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		resolve_undo_write(&sb, istate->resolve_undo);
-		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_RESOLVE_UNDO,
-					     sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		err = write_index_ext_header(f, eoie_c, CACHE_EXT_RESOLVE_UNDO,
+					     sb.len) < 0;
+		hashwrite(f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
@@ -3044,9 +3040,9 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		write_untracked_extension(&sb, istate->untracked);
-		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_UNTRACKED,
-					     sb.len) < 0 ||
-			ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		err = write_index_ext_header(f, eoie_c, CACHE_EXT_UNTRACKED,
+					     sb.len) < 0;
+		hashwrite(f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
@@ -3055,14 +3051,14 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		write_fsmonitor_extension(&sb, istate);
-		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_FSMONITOR, sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		err = write_index_ext_header(f, eoie_c, CACHE_EXT_FSMONITOR, sb.len) < 0;
+		hashwrite(f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
 	}
 	if (istate->sparse_index) {
-		if (write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_SPARSE_DIRECTORIES, 0) < 0)
+		if (write_index_ext_header(f, eoie_c, CACHE_EXT_SPARSE_DIRECTORIES, 0) < 0)
 			return -1;
 	}
 
@@ -3072,19 +3068,18 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	 * read.  Write it out regardless of the strip_extensions parameter as we need it
 	 * when loading the shared index.
 	 */
-	if (offset && record_eoie()) {
+	if (eoie_c) {
 		struct strbuf sb = STRBUF_INIT;
 
-		write_eoie_extension(&sb, &eoie_c, offset);
-		err = write_index_ext_header(&c, NULL, newfd, CACHE_EXT_ENDOFINDEXENTRIES, sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		write_eoie_extension(&sb, eoie_c, offset);
+		err = write_index_ext_header(f, NULL, CACHE_EXT_ENDOFINDEXENTRIES, sb.len) < 0;
+		hashwrite(f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
 	}
 
-	if (ce_flush(&c, newfd, istate->oid.hash))
-		return -1;
+	finalize_hashfile(f, istate->oid.hash, CSUM_HASH_IN_STREAM);
 	if (close_tempfile_gently(tempfile)) {
 		error(_("could not close '%s'"), get_tempfile_path(tempfile));
 		return -1;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v2 4/4] read-cache: delete unused hashing methods
  2021-05-17 12:24 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2021-05-17 12:24   ` [PATCH v2 3/4] read-cache: use hashfile instead of git_hash_ctx Derrick Stolee via GitGitGadget
@ 2021-05-17 12:24   ` Derrick Stolee via GitGitGadget
  2021-05-18 18:32   ` [PATCH v3 0/4] Convert index writes to use hashfile API Derrick Stolee via GitGitGadget
  4 siblings, 0 replies; 27+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-05-17 12:24 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, stolee, git, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

These methods were marked as MAYBE_UNUSED in the previous change to
avoid a complicated diff. Delete them entirely, since we now use the
hashfile API instead of this custom hashing code.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 read-cache.c | 64 ----------------------------------------------------
 1 file changed, 64 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index 1c0bda81e7e7..aa6751c6a092 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2516,46 +2516,6 @@ int repo_index_has_changes(struct repository *repo,
 	}
 }
 
-#define WRITE_BUFFER_SIZE (128 * 1024)
-static unsigned char write_buffer[WRITE_BUFFER_SIZE];
-static unsigned long write_buffer_len;
-
-MAYBE_UNUSED
-static int ce_write_flush(git_hash_ctx *context, int fd)
-{
-	unsigned int buffered = write_buffer_len;
-	if (buffered) {
-		the_hash_algo->update_fn(context, write_buffer, buffered);
-		if (write_in_full(fd, write_buffer, buffered) < 0)
-			return -1;
-		write_buffer_len = 0;
-	}
-	return 0;
-}
-
-MAYBE_UNUSED
-static int ce_write(git_hash_ctx *context, int fd, void *data, unsigned int len)
-{
-	while (len) {
-		unsigned int buffered = write_buffer_len;
-		unsigned int partial = WRITE_BUFFER_SIZE - buffered;
-		if (partial > len)
-			partial = len;
-		memcpy(write_buffer + buffered, data, partial);
-		buffered += partial;
-		if (buffered == WRITE_BUFFER_SIZE) {
-			write_buffer_len = buffered;
-			if (ce_write_flush(context, fd))
-				return -1;
-			buffered = 0;
-		}
-		write_buffer_len = buffered;
-		len -= partial;
-		data = (char *) data + partial;
-	}
-	return 0;
-}
-
 static int write_index_ext_header(struct hashfile *f,
 				  git_hash_ctx *eoie_f,
 				  unsigned int ext,
@@ -2573,30 +2533,6 @@ static int write_index_ext_header(struct hashfile *f,
 	return 0;
 }
 
-MAYBE_UNUSED
-static int ce_flush(git_hash_ctx *context, int fd, unsigned char *hash)
-{
-	unsigned int left = write_buffer_len;
-
-	if (left) {
-		write_buffer_len = 0;
-		the_hash_algo->update_fn(context, write_buffer, left);
-	}
-
-	/* Flush first if not enough space for hash signature */
-	if (left + the_hash_algo->rawsz > WRITE_BUFFER_SIZE) {
-		if (write_in_full(fd, write_buffer, left) < 0)
-			return -1;
-		left = 0;
-	}
-
-	/* Append the hash signature at the end */
-	the_hash_algo->final_fn(write_buffer + left, context);
-	hashcpy(hash, write_buffer + left);
-	left += the_hash_algo->rawsz;
-	return (write_in_full(fd, write_buffer, left) < 0) ? -1 : 0;
-}
-
 static void ce_smudge_racily_clean_entry(struct index_state *istate,
 					 struct cache_entry *ce)
 {
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/4] csum-file.h: increase hashfile buffer size
  2021-05-17 12:24   ` [PATCH v2 2/4] csum-file.h: increase hashfile buffer size Derrick Stolee via GitGitGadget
@ 2021-05-17 21:54     ` Junio C Hamano
  2021-05-18  7:33       ` Jeff King
  2021-05-18  7:31     ` Jeff King
  1 sibling, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2021-05-17 21:54 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, stolee, git, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> The hashfile API uses a hard-coded buffer size of 8KB and has ever since
> it was introduced in c38138c (git-pack-objects: write the pack files
> with a SHA1 csum, 2005-06-26). It performs a similar function to the
> hashing buffers in read-cache.c, but that code was updated from 8KB to
> 128KB in f279894 (read-cache: make the index write buffer size 128K,
> 2021-02-18). The justification there was that do_write_index() improves
> from 1.02s to 0.72s.
>
> There is a buffer, check_buffer, that is used to verify the check_fd
> file descriptor. When this buffer increases to 128K to fit the data
> being flushed, it causes the stack to overflow the limits placed in the
> test suite. By moving this to a static buffer, we stop using stack data
> for this purpose, but we lose some thread-safety. This change makes it
> unsafe to write to multiple hashfiles across different threads.

So... is the longer term plan to heap-allocate the buffer itself,
and replace the .buffer member of hashfile from an embedded char
array to a pointer?  The hashfile structure is initialized once per
the entire file (like the on-disk index), and the same buffer will
be used throughout the life of that hashfile instance, so it may not
be too bad to allocate on the heap.

Just after the previous step justified its simplification of its
progress logic based on how small the buffer is, this step makes it
16 times as big, which felt a tiny bit dishonest.  We probably
should say somewhere that 128k is still small enough that the
rewrite in the previous step is still valid ;-)


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 3/4] read-cache: use hashfile instead of git_hash_ctx
  2021-05-17 12:24   ` [PATCH v2 3/4] read-cache: use hashfile instead of git_hash_ctx Derrick Stolee via GitGitGadget
@ 2021-05-17 22:13     ` Junio C Hamano
  2021-05-18 14:16       ` Derrick Stolee
  0 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2021-05-17 22:13 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, stolee, git, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> ...
> mult-pack-indexes, and commit-graphs. Therefore, it seems prudent to

multi-pack, I would say.

> There are still some remaining: the extension headers are hashed for use

some remaining what?  I first read an unwritten word as "issues",
but I think the answer is "uses of git_hash_ctx".

> in the End of Index Entries (EOIE) extension. This use of the
> git_hash_ctx is left as-is. There are multiple reasons to not use a
> hashfile here, including ...

> In addition to the test suite passing, I computed indexes using the
> previous binaries and the binaries compiled after this change, and found
> the index data to be exactly equal. Finally, I did extensive performance
> testing of "git update-index --force-write" on repos of various sizes,
> including one with over 2 million paths at HEAD. These tests
> demonstrated less than 1% difference in behavior, so the performance
> should be considered identical.

Hmph, does that mean 128k buffer is overkill and if we wanted to
unify the buffer sizes we should have used 8k instead?

Wait, the removal of fsync has made things faster in general, hasn't
it?  Did something else degrade performance to cancel that gain?

The patch looks an obvious improvement.  What was open-coded in
longhand is now a well structured series of API calls and the result
is much easier to follow and maintain.

Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/4] csum-file.h: increase hashfile buffer size
  2021-05-17 12:24   ` [PATCH v2 2/4] csum-file.h: increase hashfile buffer size Derrick Stolee via GitGitGadget
  2021-05-17 21:54     ` Junio C Hamano
@ 2021-05-18  7:31     ` Jeff King
  2021-05-18  7:42       ` Jeff King
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff King @ 2021-05-18  7:31 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, stolee, git, Derrick Stolee, Derrick Stolee

On Mon, May 17, 2021 at 12:24:50PM +0000, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The hashfile API uses a hard-coded buffer size of 8KB and has ever since
> it was introduced in c38138c (git-pack-objects: write the pack files
> with a SHA1 csum, 2005-06-26). It performs a similar function to the
> hashing buffers in read-cache.c, but that code was updated from 8KB to
> 128KB in f279894 (read-cache: make the index write buffer size 128K,
> 2021-02-18). The justification there was that do_write_index() improves
> from 1.02s to 0.72s.
> 
> There is a buffer, check_buffer, that is used to verify the check_fd
> file descriptor. When this buffer increases to 128K to fit the data
> being flushed, it causes the stack to overflow the limits placed in the
> test suite. By moving this to a static buffer, we stop using stack data
> for this purpose, but we lose some thread-safety. This change makes it
> unsafe to write to multiple hashfiles across different threads.
> 
> By adding a new trace2 region in the chunk-format API, we can see that
> the writing portion of 'git multi-pack-index write' lowers from ~1.49s
> to ~1.47s on a Linux machine. These effects may be more pronounced or
> diminished on other filesystems. The end-to-end timing is too noisy to
> have a definitive change either way.

I think there is one thing missing from this commit message: why we want
to do this. You mentioned that read-cache got larger by using a bigger
buffer. But here we use a bigger buffer, and it produces no improvement
larger than the noise. And on top of it, you describe the static-buffer
downsides. So why not just skip it? :)

And the answer is in the larger series: we want to be able to make use
of the hashfile API in read-cache, but without regressing the
performance. One sentence at the end of the first paragraph would
clarify that quite a bit, I think.

> +static void verify_buffer_or_die(struct hashfile *f,
> +				 const void *buf,
> +				 unsigned int count)
> +{
> +	static unsigned char check_buffer[WRITE_BUFFER_SIZE];
> +	ssize_t ret = read_in_full(f->check_fd, check_buffer, count);
> +
> +	if (ret < 0)
> +		die_errno("%s: sha1 file read error", f->name);
> +	if (ret != count)
> +		die("%s: sha1 file truncated", f->name);
> +	if (memcmp(buf, check_buffer, count))
> +		die("sha1 file '%s' validation error", f->name);
> +}

Does this have to use the same-size buffer? We could read and check
smaller chunks, like:

  while (count > 0) {
	static unsigned char chunk[1024];
	unsigned int chunk_len = sizeof(chunk) < count ? sizeof(chunk) : count;
	ssize_t ret = read_in_full(f->check_fd, chunk, chunk_len);

        if (ret < 0)
	   ...
	if (ret != count)
	   ...
	if (memcmp(buf, chunk, chunk_len))
	   ...
	buf += chunk_len;
	count -= chunk_len;
  }

We may prefer to use the larger buffer size for performance, but I think
this "check" mode is only used for "index-pack --verify" and similar.
The performance may matter a lot less to us there than for more
frequently used code paths like index writing.

I don't have a strong preference either way, but it's nice to avoid
introducing non-reentrancy to a function (Junio's heap suggestion is
also quite reasonable).

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/4] csum-file.h: increase hashfile buffer size
  2021-05-17 21:54     ` Junio C Hamano
@ 2021-05-18  7:33       ` Jeff King
  2021-05-18 14:44         ` Derrick Stolee
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2021-05-18  7:33 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, stolee, git,
	Derrick Stolee, Derrick Stolee

On Tue, May 18, 2021 at 06:54:57AM +0900, Junio C Hamano wrote:

> Just after the previous step justified its simplification of its
> progress logic based on how small the buffer is, this step makes it
> 16 times as big, which felt a tiny bit dishonest.  We probably
> should say somewhere that 128k is still small enough that the
> rewrite in the previous step is still valid ;-)

I noticed that, too. I'm not sure if still is small enough. For local
pack writes, etc, it seems fine. But what about "index-pack --stdin"
reading over the network?

Updating progress every 8k instead of every 128k seems like it would be
more responsive, especially if the network is slow or jittery. I dunno.
Maybe that is too small to care about for the modern world, but I just
want to make sure we are not being blinded by the fast networks all of
us presumably enjoy. :)

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/4] csum-file.h: increase hashfile buffer size
  2021-05-18  7:31     ` Jeff King
@ 2021-05-18  7:42       ` Jeff King
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2021-05-18  7:42 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, stolee, git, Derrick Stolee, Derrick Stolee

On Tue, May 18, 2021 at 03:31:28AM -0400, Jeff King wrote:

> Does this have to use the same-size buffer? We could read and check
> smaller chunks, like:
> 
>   while (count > 0) {
> 	static unsigned char chunk[1024];
> 	unsigned int chunk_len = sizeof(chunk) < count ? sizeof(chunk) : count;
> 	ssize_t ret = read_in_full(f->check_fd, chunk, chunk_len);
> 
> 	if (ret < 0)
> 	   ...
> 	if (ret != count)
> 	   ...
> 	if (memcmp(buf, chunk, chunk_len))
> 	   ...
> 	buf += chunk_len;
> 	count -= chunk_len;
>   }

That should be "ret != chunk_len" in the middle conditional, of course.
In case you do go this route (I typed this straight into my email, so
other bugs may be lurking. But I noticed that one. :) ).

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 3/4] read-cache: use hashfile instead of git_hash_ctx
  2021-05-17 22:13     ` Junio C Hamano
@ 2021-05-18 14:16       ` Derrick Stolee
  0 siblings, 0 replies; 27+ messages in thread
From: Derrick Stolee @ 2021-05-18 14:16 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, peff, git, Derrick Stolee, Derrick Stolee

On 5/17/2021 6:13 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> ...
>> mult-pack-indexes, and commit-graphs. Therefore, it seems prudent to
> 
> multi-pack, I would say.
> 
>> There are still some remaining: the extension headers are hashed for use
> 
> some remaining what?  I first read an unwritten word as "issues",
> but I think the answer is "uses of git_hash_ctx".

Thanks for pointing these out. I will fix them.

>> in the End of Index Entries (EOIE) extension. This use of the
>> git_hash_ctx is left as-is. There are multiple reasons to not use a
>> hashfile here, including ...
> 
>> In addition to the test suite passing, I computed indexes using the
>> previous binaries and the binaries compiled after this change, and found
>> the index data to be exactly equal. Finally, I did extensive performance
>> testing of "git update-index --force-write" on repos of various sizes,
>> including one with over 2 million paths at HEAD. These tests
>> demonstrated less than 1% difference in behavior, so the performance
>> should be considered identical.
> 
> Hmph, does that mean 128k buffer is overkill and if we wanted to
> unify the buffer sizes we should have used 8k instead?

The buffer was previously increased to 128k because it makes a
difference in performance when writing the index.

The thing I'm measuring here is the difference between the old
writing code and the new hashfile code. Using the hashfile API
(with an identical buffer size) does not have a meaningful
performance impact, as it should.

I can make this clearer.

> Wait, the removal of fsync has made things faster in general, hasn't
> it?  Did something else degrade performance to cancel that gain?

Are you thinking about [1], which originally was talking about a
change to fsync() calls, but really ended up just making the same
behavior more readable?

[1] https://lore.kernel.org/git/pull.914.v2.git.1616762291574.gitgitgadget@gmail.com/

I was focused on that because I had initially seen a performance
degradation when I did this refactor. It turns out that my measurements
were not robust enough to the noise, which has been remedied.

> The patch looks an obvious improvement.  What was open-coded in
> longhand is now a well structured series of API calls and the result
> is much easier to follow and maintain.

That is the goal. I'm glad you agree.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 2/4] csum-file.h: increase hashfile buffer size
  2021-05-18  7:33       ` Jeff King
@ 2021-05-18 14:44         ` Derrick Stolee
  0 siblings, 0 replies; 27+ messages in thread
From: Derrick Stolee @ 2021-05-18 14:44 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, git, Derrick Stolee,
	Derrick Stolee

On 5/18/2021 3:33 AM, Jeff King wrote:
> On Tue, May 18, 2021 at 06:54:57AM +0900, Junio C Hamano wrote:
> 
>> Just after the previous step justified its simplification of its
>> progress logic based on how small the buffer is, this step makes it
>> 16 times as big, which felt a tiny bit dishonest.  We probably
>> should say somewhere that 128k is still small enough that the
>> rewrite in the previous step is still valid ;-)
> 
> I noticed that, too. I'm not sure if still is small enough. For local
> pack writes, etc, it seems fine. But what about "index-pack --stdin"
> reading over the network?
> 
> Updating progress every 8k instead of every 128k seems like it would be
> more responsive, especially if the network is slow or jittery. I dunno.
> Maybe that is too small to care about for the modern world, but I just
> want to make sure we are not being blinded by the fast networks all of
> us presumably enjoy. :)

This is a good point.

If we combine the earlier suggestion of using the heap to store the
buffer, then we can change the buffer size based on the use case: we
can use 8k for data that might be streaming from a network, and 128k
for local-only data (index, commit-graph, multi-pack-index are all
probably safe).

For the case of NFS, I think we should probably assume that the NFS
server is not across a slow connection, which is a case we cannot
make for streaming a pack-file from a remote server.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 0/4] Convert index writes to use hashfile API
  2021-05-17 12:24 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2021-05-17 12:24   ` [PATCH v2 4/4] read-cache: delete unused hashing methods Derrick Stolee via GitGitGadget
@ 2021-05-18 18:32   ` Derrick Stolee via GitGitGadget
  2021-05-18 18:32     ` [PATCH v3 1/4] hashfile: use write_in_full() Derrick Stolee via GitGitGadget
                       ` (3 more replies)
  4 siblings, 4 replies; 27+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-05-18 18:32 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, stolee, git, Derrick Stolee

As I prepare some ideas on index v5, one thing that strikes me as an
interesting direction to try is to use the chunk-format API. This would make
our extension model extremely simple (they become optional chunks, easily
identified by the table of contents).

But there is a huge hurdle to even starting that investigation: the index
uses its own hashing methods, separate from the hashfile API in csum-file.c!

The internals of the algorithms are mostly identical. The only possible
change is that the buffer sizes are different: 8KB for hashfile and 128KB in
read-cache.c. This change is actually relatively recent! I was able to see a
performance difference before changing that size, so I do make them equal.
This improves some other commands, too, such as git multi-pack-index write.

There is a subtle point about how the EOIE extension works in that it needs
a hash of just the previous extension headers. This won't be needed by the
chunk-format API, so leave this mostly as-is, except where it collides
somewhat with the hashfile changes.


Updates in v3
=============

 * Fixed typos in commit messages.

 * Updated buffer and check_buffer to be heap-allocated members of struct
   hashfile.

 * These buffers have variable size, allowing the uses that calculate a
   throughput progress bar can still use an 8k buffer, while the others grow
   to 128k.

 * The heap-allocated check_buffer allows the hashfile API to remain as
   thread safe as before (parallel threads can write to independent
   hashfiles, not the same one).


Updates in v2
=============

 * Sorry for the false start on v1. I'm correctly testing index writes this
   time, but I also verified other commands, such as "git add .", to be sure
   I was exercising everything. I also previously had incorrect assumptions
   about the EOIE extension, since it is not triggered with "small" repos
   like the Linux kernel repository ;).

 * This version is rebased onto a recent version of 'master' so the
   conflicts with ds/sparse-index-protections are resolved.

 * This version abandons the nested hashfile concept, since that was only
   needed for the incorrect interpretation of the EOIE extension's hash.

 * This version does unify the buffer size to 128 KB with justification.

 * There is also some more unification of writing logic to use
   write_in_full() in the hashfile API.

 * When using check_fd, the hashfile API is no longer thread safe due to a
   stack-allocated buffer moving to the global scope. I'm not sure if there
   is a better way to handle this case.

Thanks, -Stolee

Derrick Stolee (4):
  hashfile: use write_in_full()
  csum-file.h: increase hashfile buffer size
  read-cache: use hashfile instead of git_hash_ctx
  read-cache: delete unused hashing methods

 chunk-format.c |  12 ++--
 csum-file.c    |  94 +++++++++++++++---------
 csum-file.h    |   4 +-
 read-cache.c   | 191 ++++++++++++++++---------------------------------
 4 files changed, 134 insertions(+), 167 deletions(-)


base-commit: 97eea85a0a1ec66d356567808a1e4ca2367e0ce7
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-916%2Fderrickstolee%2Findex-hashfile-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-916/derrickstolee/index-hashfile-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/916

Range-diff vs v2:

 1:  b3e578ac0365 = 1:  b3e578ac0365 hashfile: use write_in_full()
 2:  9dc602f6c422 ! 2:  64ffddd79116 csum-file.h: increase hashfile buffer size
     @@ Commit message
          hashing buffers in read-cache.c, but that code was updated from 8KB to
          128KB in f279894 (read-cache: make the index write buffer size 128K,
          2021-02-18). The justification there was that do_write_index() improves
     -    from 1.02s to 0.72s.
     +    from 1.02s to 0.72s. Since our end goal is to have the index writing
     +    code use the hashfile API, we need to unify this buffer size to avoid a
     +    performance regression.
      
     -    There is a buffer, check_buffer, that is used to verify the check_fd
     +    There is a buffer, 'check_buffer', that is used to verify the check_fd
          file descriptor. When this buffer increases to 128K to fit the data
          being flushed, it causes the stack to overflow the limits placed in the
     -    test suite. By moving this to a static buffer, we stop using stack data
     -    for this purpose, but we lose some thread-safety. This change makes it
     -    unsafe to write to multiple hashfiles across different threads.
     +    test suite. To avoid issues with stack size, move both 'buffer' and
     +    'check_buffer' to be heap pointers within 'struct hashfile'. The
     +    'check_buffer' member is left as NULL unless check_fd is set in
     +    hashfd_check(). Both buffers are cleared as part of finalize_hashfile()
     +    which also frees the full structure.
     +
     +    Since these buffers are now on the heap, we can adjust their size based
     +    on the needs of the consumer. In particular, callers to
     +    hashfd_throughput() are expecting to report progress indicators as the
     +    buffer flushes. These callers would prefer the smaller 8k buffer to
     +    avoid large delays between updates, especially for users with slower
     +    networks. When the progress indicator is not used, the larger buffer is
     +    preferrable.
      
          By adding a new trace2 region in the chunk-format API, we can see that
          the writing portion of 'git multi-pack-index write' lowers from ~1.49s
     @@ csum-file.c
      +				 const void *buf,
      +				 unsigned int count)
      +{
     -+	static unsigned char check_buffer[WRITE_BUFFER_SIZE];
     -+	ssize_t ret = read_in_full(f->check_fd, check_buffer, count);
     ++	ssize_t ret = read_in_full(f->check_fd, f->check_buffer, count);
      +
      +	if (ret < 0)
      +		die_errno("%s: sha1 file read error", f->name);
      +	if (ret != count)
      +		die("%s: sha1 file truncated", f->name);
     -+	if (memcmp(buf, check_buffer, count))
     ++	if (memcmp(buf, f->check_buffer, count))
      +		die("sha1 file '%s' validation error", f->name);
      +}
      +
     @@ csum-file.c
       
       	if (write_in_full(f->fd, buf, count) < 0) {
       		if (errno == ENOSPC)
     -
     - ## csum-file.h ##
     -@@
     - 
     - struct progress;
     +@@ csum-file.c: void hashflush(struct hashfile *f)
     + 	}
     + }
       
     -+#define WRITE_BUFFER_SIZE (128 * 1024)
     ++static void free_hashfile(struct hashfile *f)
     ++{
     ++	free(f->buffer);
     ++	free(f->check_buffer);
     ++	free(f);
     ++}
      +
     - /* A SHA1-protected file */
     - struct hashfile {
     + int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int flags)
     + {
       	int fd;
     +@@ csum-file.c: int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int fl
     + 		if (close(f->check_fd))
     + 			die_errno("%s: sha1 file error on close", f->name);
     + 	}
     +-	free(f);
     ++	free_hashfile(f);
     + 	return fd;
     + }
     + 
     + void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
     + {
     + 	while (count) {
     +-		unsigned left = sizeof(f->buffer) - f->offset;
     ++		unsigned left = f->buffer_len - f->offset;
     + 		unsigned nr = count > left ? left : count;
     + 
     + 		if (f->do_crc)
     + 			f->crc32 = crc32(f->crc32, buf, nr);
     + 
     +-		if (nr == sizeof(f->buffer)) {
     ++		if (nr == f->buffer_len) {
     + 			/*
     + 			 * Flush a full batch worth of data directly
     + 			 * from the input, skipping the memcpy() to
     +@@ csum-file.c: void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
     + 	}
     + }
     + 
     +-struct hashfile *hashfd(int fd, const char *name)
     +-{
     +-	return hashfd_throughput(fd, name, NULL);
     +-}
     +-
     + struct hashfile *hashfd_check(const char *name)
     + {
     + 	int sink, check;
     +@@ csum-file.c: struct hashfile *hashfd_check(const char *name)
     + 		die_errno("unable to open '%s'", name);
     + 	f = hashfd(sink, name);
     + 	f->check_fd = check;
     ++	f->check_buffer = xmalloc(f->buffer_len);
     ++
     + 	return f;
     + }
     + 
     +-struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp)
     ++static struct hashfile *hashfd_internal(int fd, const char *name,
     ++					struct progress *tp,
     ++					size_t buffer_len)
     + {
     + 	struct hashfile *f = xmalloc(sizeof(*f));
     + 	f->fd = fd;
     +@@ csum-file.c: struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
     + 	f->name = name;
     + 	f->do_crc = 0;
     + 	the_hash_algo->init_fn(&f->ctx);
     ++
     ++	f->buffer_len = buffer_len;
     ++	f->buffer = xmalloc(buffer_len);
     ++	f->check_buffer = NULL;
     ++
     + 	return f;
     + }
     + 
     ++struct hashfile *hashfd(int fd, const char *name)
     ++{
     ++	/*
     ++	 * Since we are not going to use a progress meter to
     ++	 * measure the rate of data passing through this hashfile,
     ++	 * use a larger buffer size to reduce fsync() calls.
     ++	 */
     ++	return hashfd_internal(fd, name, NULL, 128 * 1024);
     ++}
     ++
     ++struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp)
     ++{
     ++	/*
     ++	 * Since we are expecting to report progress of the
     ++	 * write into this hashfile, use a smaller buffer
     ++	 * size so the progress indicators arrive at a more
     ++	 * frequent rate.
     ++	 */
     ++	return hashfd_internal(fd, name, tp, 8 * 1024);
     ++}
     ++
     + void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
     + {
     + 	hashflush(f);
     +
     + ## csum-file.h ##
      @@ csum-file.h: struct hashfile {
       	const char *name;
       	int do_crc;
       	uint32_t crc32;
      -	unsigned char buffer[8192];
     -+	unsigned char buffer[WRITE_BUFFER_SIZE];
     ++	size_t buffer_len;
     ++	unsigned char *buffer;
     ++	unsigned char *check_buffer;
       };
       
       /* Checkpoint */
 3:  b94172ccf5e9 ! 3:  afd278bef3ab read-cache: use hashfile instead of git_hash_ctx
     @@ Commit message
          codebase, I figured that these roughly equivalent code paths were never
          unified only because it got lost in the shuffle. The hashfile API has
          since been used extensively in other file formats, such as pack-indexes,
     -    mult-pack-indexes, and commit-graphs. Therefore, it seems prudent to
     +    multi-pack-indexes, and commit-graphs. Therefore, it seems prudent to
          unify the index writing code to use the same mechanism.
      
          I discovered this disparity while trying to create a new index format
     @@ Commit message
          temporary file with writes to the hashfile struct. This takes care of
          many of the direct interactions with the_hash_algo.
      
     -    There are still some remaining: the extension headers are hashed for use
     -    in the End of Index Entries (EOIE) extension. This use of the
     -    git_hash_ctx is left as-is. There are multiple reasons to not use a
     -    hashfile here, including the fact that the data is not actually writing
     -    to a file, just a hash computation. These hashes do not block our
     -    adoption of the chunk-format API in a future change to the index, so
     +    There are still some git_hash_ctx uses remaining: the extension headers
     +    are hashed for use in the End of Index Entries (EOIE) extension. This
     +    use of the git_hash_ctx is left as-is. There are multiple reasons to not
     +    use a hashfile here, including the fact that the data is not actually
     +    writing to a file, just a hash computation. These hashes do not block
     +    our adoption of the chunk-format API in a future change to the index, so
          leave it as-is.
      
          The internals of the algorithms are mostly identical. Previously, the
     @@ Commit message
          the index data to be exactly equal. Finally, I did extensive performance
          testing of "git update-index --force-write" on repos of various sizes,
          including one with over 2 million paths at HEAD. These tests
     -    demonstrated less than 1% difference in behavior, so the performance
     -    should be considered identical.
     +    demonstrated less than 1% difference in behavior. As expected, the
     +    performance should be considered unchanged. The previous changes to
     +    increase the hashfile buffer size from 8K to 128K ensured this change
     +    would not create a peformance regression.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
 4:  4b3814eb4c80 = 4:  42fb10fb2998 read-cache: delete unused hashing methods

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 1/4] hashfile: use write_in_full()
  2021-05-18 18:32   ` [PATCH v3 0/4] Convert index writes to use hashfile API Derrick Stolee via GitGitGadget
@ 2021-05-18 18:32     ` Derrick Stolee via GitGitGadget
  2021-05-18 18:32     ` [PATCH v3 2/4] csum-file.h: increase hashfile buffer size Derrick Stolee via GitGitGadget
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-05-18 18:32 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, stolee, git, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The flush() logic in csum-file.c was introduced originally by c38138c
(git-pack-objects: write the pack files with a SHA1 csum, 2005-06-26)
and a portion of the logic performs similar utility to write_in_full()
in wrapper.c. The history of write_in_full() is full of moves and
renames, but was originally introduced by 7230e6d (Add write_or_die(), a
helper function, 2006-08-21).

The point of these sections of code are to flush a write buffer using
xwrite() and report errors in the case of disk space issues or other
generic input/output errors. The logic in flush() can interpret the
output of write_in_full() to provide the correct error messages to
users.

The logic in the hashfile API has an additional set of logic to augment
the progress indicator between calls to xwrite(). This was introduced by
2a128d6 (add throughput display to git-push, 2007-10-30). It seems that
since the hashfile's buffer is only 8KB, these additional progress
indicators might not be incredibly necessary. Instead, update the
progress only when write_in_full() complete.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 csum-file.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 7510950fa3e9..3c26389d4914 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -25,21 +25,14 @@ static void flush(struct hashfile *f, const void *buf, unsigned int count)
 			die("sha1 file '%s' validation error", f->name);
 	}
 
-	for (;;) {
-		int ret = xwrite(f->fd, buf, count);
-		if (ret > 0) {
-			f->total += ret;
-			display_throughput(f->tp, f->total);
-			buf = (char *) buf + ret;
-			count -= ret;
-			if (count)
-				continue;
-			return;
-		}
-		if (!ret)
+	if (write_in_full(f->fd, buf, count) < 0) {
+		if (errno == ENOSPC)
 			die("sha1 file '%s' write error. Out of diskspace", f->name);
 		die_errno("sha1 file '%s' write error", f->name);
 	}
+
+	f->total += count;
+	display_throughput(f->tp, f->total);
 }
 
 void hashflush(struct hashfile *f)
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 2/4] csum-file.h: increase hashfile buffer size
  2021-05-18 18:32   ` [PATCH v3 0/4] Convert index writes to use hashfile API Derrick Stolee via GitGitGadget
  2021-05-18 18:32     ` [PATCH v3 1/4] hashfile: use write_in_full() Derrick Stolee via GitGitGadget
@ 2021-05-18 18:32     ` Derrick Stolee via GitGitGadget
  2021-11-25 12:14       ` t4216-log-bloom.sh fails with -v (but not --verbose-log) Ævar Arnfjörð Bjarmason
  2021-05-18 18:32     ` [PATCH v3 3/4] read-cache: use hashfile instead of git_hash_ctx Derrick Stolee via GitGitGadget
  2021-05-18 18:32     ` [PATCH v3 4/4] read-cache: delete unused hashing methods Derrick Stolee via GitGitGadget
  3 siblings, 1 reply; 27+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-05-18 18:32 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, stolee, git, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The hashfile API uses a hard-coded buffer size of 8KB and has ever since
it was introduced in c38138c (git-pack-objects: write the pack files
with a SHA1 csum, 2005-06-26). It performs a similar function to the
hashing buffers in read-cache.c, but that code was updated from 8KB to
128KB in f279894 (read-cache: make the index write buffer size 128K,
2021-02-18). The justification there was that do_write_index() improves
from 1.02s to 0.72s. Since our end goal is to have the index writing
code use the hashfile API, we need to unify this buffer size to avoid a
performance regression.

There is a buffer, 'check_buffer', that is used to verify the check_fd
file descriptor. When this buffer increases to 128K to fit the data
being flushed, it causes the stack to overflow the limits placed in the
test suite. To avoid issues with stack size, move both 'buffer' and
'check_buffer' to be heap pointers within 'struct hashfile'. The
'check_buffer' member is left as NULL unless check_fd is set in
hashfd_check(). Both buffers are cleared as part of finalize_hashfile()
which also frees the full structure.

Since these buffers are now on the heap, we can adjust their size based
on the needs of the consumer. In particular, callers to
hashfd_throughput() are expecting to report progress indicators as the
buffer flushes. These callers would prefer the smaller 8k buffer to
avoid large delays between updates, especially for users with slower
networks. When the progress indicator is not used, the larger buffer is
preferrable.

By adding a new trace2 region in the chunk-format API, we can see that
the writing portion of 'git multi-pack-index write' lowers from ~1.49s
to ~1.47s on a Linux machine. These effects may be more pronounced or
diminished on other filesystems. The end-to-end timing is too noisy to
have a definitive change either way.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 chunk-format.c | 12 +++++---
 csum-file.c    | 77 +++++++++++++++++++++++++++++++++++++-------------
 csum-file.h    |  4 ++-
 3 files changed, 68 insertions(+), 25 deletions(-)

diff --git a/chunk-format.c b/chunk-format.c
index da191e59a29d..1c3dca62e205 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -58,9 +58,11 @@ void add_chunk(struct chunkfile *cf,
 
 int write_chunkfile(struct chunkfile *cf, void *data)
 {
-	int i;
+	int i, result = 0;
 	uint64_t cur_offset = hashfile_total(cf->f);
 
+	trace2_region_enter("chunkfile", "write", the_repository);
+
 	/* Add the table of contents to the current offset */
 	cur_offset += (cf->chunks_nr + 1) * CHUNK_TOC_ENTRY_SIZE;
 
@@ -77,10 +79,10 @@ int write_chunkfile(struct chunkfile *cf, void *data)
 
 	for (i = 0; i < cf->chunks_nr; i++) {
 		off_t start_offset = hashfile_total(cf->f);
-		int result = cf->chunks[i].write_fn(cf->f, data);
+		result = cf->chunks[i].write_fn(cf->f, data);
 
 		if (result)
-			return result;
+			goto cleanup;
 
 		if (hashfile_total(cf->f) - start_offset != cf->chunks[i].size)
 			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
@@ -88,7 +90,9 @@ int write_chunkfile(struct chunkfile *cf, void *data)
 			    hashfile_total(cf->f) - start_offset);
 	}
 
-	return 0;
+cleanup:
+	trace2_region_leave("chunkfile", "write", the_repository);
+	return result;
 }
 
 int read_table_of_contents(struct chunkfile *cf,
diff --git a/csum-file.c b/csum-file.c
index 3c26389d4914..3487d28ed7ad 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -11,19 +11,24 @@
 #include "progress.h"
 #include "csum-file.h"
 
+static void verify_buffer_or_die(struct hashfile *f,
+				 const void *buf,
+				 unsigned int count)
+{
+	ssize_t ret = read_in_full(f->check_fd, f->check_buffer, count);
+
+	if (ret < 0)
+		die_errno("%s: sha1 file read error", f->name);
+	if (ret != count)
+		die("%s: sha1 file truncated", f->name);
+	if (memcmp(buf, f->check_buffer, count))
+		die("sha1 file '%s' validation error", f->name);
+}
+
 static void flush(struct hashfile *f, const void *buf, unsigned int count)
 {
-	if (0 <= f->check_fd && count)  {
-		unsigned char check_buffer[8192];
-		ssize_t ret = read_in_full(f->check_fd, check_buffer, count);
-
-		if (ret < 0)
-			die_errno("%s: sha1 file read error", f->name);
-		if (ret != count)
-			die("%s: sha1 file truncated", f->name);
-		if (memcmp(buf, check_buffer, count))
-			die("sha1 file '%s' validation error", f->name);
-	}
+	if (0 <= f->check_fd && count)
+		verify_buffer_or_die(f, buf, count);
 
 	if (write_in_full(f->fd, buf, count) < 0) {
 		if (errno == ENOSPC)
@@ -46,6 +51,13 @@ void hashflush(struct hashfile *f)
 	}
 }
 
+static void free_hashfile(struct hashfile *f)
+{
+	free(f->buffer);
+	free(f->check_buffer);
+	free(f);
+}
+
 int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int flags)
 {
 	int fd;
@@ -75,20 +87,20 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int fl
 		if (close(f->check_fd))
 			die_errno("%s: sha1 file error on close", f->name);
 	}
-	free(f);
+	free_hashfile(f);
 	return fd;
 }
 
 void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 {
 	while (count) {
-		unsigned left = sizeof(f->buffer) - f->offset;
+		unsigned left = f->buffer_len - f->offset;
 		unsigned nr = count > left ? left : count;
 
 		if (f->do_crc)
 			f->crc32 = crc32(f->crc32, buf, nr);
 
-		if (nr == sizeof(f->buffer)) {
+		if (nr == f->buffer_len) {
 			/*
 			 * Flush a full batch worth of data directly
 			 * from the input, skipping the memcpy() to
@@ -114,11 +126,6 @@ void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 	}
 }
 
-struct hashfile *hashfd(int fd, const char *name)
-{
-	return hashfd_throughput(fd, name, NULL);
-}
-
 struct hashfile *hashfd_check(const char *name)
 {
 	int sink, check;
@@ -132,10 +139,14 @@ struct hashfile *hashfd_check(const char *name)
 		die_errno("unable to open '%s'", name);
 	f = hashfd(sink, name);
 	f->check_fd = check;
+	f->check_buffer = xmalloc(f->buffer_len);
+
 	return f;
 }
 
-struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp)
+static struct hashfile *hashfd_internal(int fd, const char *name,
+					struct progress *tp,
+					size_t buffer_len)
 {
 	struct hashfile *f = xmalloc(sizeof(*f));
 	f->fd = fd;
@@ -146,9 +157,35 @@ struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
 	f->name = name;
 	f->do_crc = 0;
 	the_hash_algo->init_fn(&f->ctx);
+
+	f->buffer_len = buffer_len;
+	f->buffer = xmalloc(buffer_len);
+	f->check_buffer = NULL;
+
 	return f;
 }
 
+struct hashfile *hashfd(int fd, const char *name)
+{
+	/*
+	 * Since we are not going to use a progress meter to
+	 * measure the rate of data passing through this hashfile,
+	 * use a larger buffer size to reduce fsync() calls.
+	 */
+	return hashfd_internal(fd, name, NULL, 128 * 1024);
+}
+
+struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp)
+{
+	/*
+	 * Since we are expecting to report progress of the
+	 * write into this hashfile, use a smaller buffer
+	 * size so the progress indicators arrive at a more
+	 * frequent rate.
+	 */
+	return hashfd_internal(fd, name, tp, 8 * 1024);
+}
+
 void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
 {
 	hashflush(f);
diff --git a/csum-file.h b/csum-file.h
index e54d53d1d0b3..3044bd19ab65 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -16,7 +16,9 @@ struct hashfile {
 	const char *name;
 	int do_crc;
 	uint32_t crc32;
-	unsigned char buffer[8192];
+	size_t buffer_len;
+	unsigned char *buffer;
+	unsigned char *check_buffer;
 };
 
 /* Checkpoint */
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 3/4] read-cache: use hashfile instead of git_hash_ctx
  2021-05-18 18:32   ` [PATCH v3 0/4] Convert index writes to use hashfile API Derrick Stolee via GitGitGadget
  2021-05-18 18:32     ` [PATCH v3 1/4] hashfile: use write_in_full() Derrick Stolee via GitGitGadget
  2021-05-18 18:32     ` [PATCH v3 2/4] csum-file.h: increase hashfile buffer size Derrick Stolee via GitGitGadget
@ 2021-05-18 18:32     ` Derrick Stolee via GitGitGadget
  2021-05-18 18:32     ` [PATCH v3 4/4] read-cache: delete unused hashing methods Derrick Stolee via GitGitGadget
  3 siblings, 0 replies; 27+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-05-18 18:32 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, stolee, git, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The do_write_index() method in read-cache.c has its own hashing logic
and buffering mechanism. Specifically, the ce_write() method was
introduced by 4990aadc (Speed up index file writing by chunking it
nicely, 2005-04-20) and similar mechanisms were introduced a few months
later in c38138cd (git-pack-objects: write the pack files with a SHA1
csum, 2005-06-26). Based on the timing, in the early days of the Git
codebase, I figured that these roughly equivalent code paths were never
unified only because it got lost in the shuffle. The hashfile API has
since been used extensively in other file formats, such as pack-indexes,
multi-pack-indexes, and commit-graphs. Therefore, it seems prudent to
unify the index writing code to use the same mechanism.

I discovered this disparity while trying to create a new index format
that uses the chunk-format API. That API uses a hashfile as its base, so
it is incompatible with the custom code in read-cache.c.

This rewrite is rather straightforward. It replaces all writes to the
temporary file with writes to the hashfile struct. This takes care of
many of the direct interactions with the_hash_algo.

There are still some git_hash_ctx uses remaining: the extension headers
are hashed for use in the End of Index Entries (EOIE) extension. This
use of the git_hash_ctx is left as-is. There are multiple reasons to not
use a hashfile here, including the fact that the data is not actually
writing to a file, just a hash computation. These hashes do not block
our adoption of the chunk-format API in a future change to the index, so
leave it as-is.

The internals of the algorithms are mostly identical. Previously, the
hashfile API used a smaller 8KB buffer instead of the 128KB buffer from
read-cache.c. The previous change already unified these sizes.

There is one subtle point: we do not pass the CSUM_FSYNC to the
finalize_hashfile() method, which differs from most consumers of the
hashfile API. The extra fsync() call indicated by this flag causes a
significant peformance degradation that is noticeable for quick
commands that write the index, such as "git add". Other consumers can
absorb this cost with their more complicated data structure
organization, and further writing structures such as pack-files and
commit-graphs is rarely in the critical path for common user
interactions.

Some static methods become orphaned in this diff, so I marked them as
MAYBE_UNUSED. The diff is much harder to read if they are deleted during
this change. Instead, they will be deleted in the following change.

In addition to the test suite passing, I computed indexes using the
previous binaries and the binaries compiled after this change, and found
the index data to be exactly equal. Finally, I did extensive performance
testing of "git update-index --force-write" on repos of various sizes,
including one with over 2 million paths at HEAD. These tests
demonstrated less than 1% difference in behavior. As expected, the
performance should be considered unchanged. The previous changes to
increase the hashfile buffer size from 8K to 128K ensured this change
would not create a peformance regression.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 read-cache.c | 137 +++++++++++++++++++++++++--------------------------
 1 file changed, 66 insertions(+), 71 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index fbf3a4ce7d5d..1c0bda81e7e7 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -26,6 +26,7 @@
 #include "thread-utils.h"
 #include "progress.h"
 #include "sparse-index.h"
+#include "csum-file.h"
 
 /* Mask for the name length in ce_flags in the on-disk index */
 
@@ -2519,6 +2520,7 @@ int repo_index_has_changes(struct repository *repo,
 static unsigned char write_buffer[WRITE_BUFFER_SIZE];
 static unsigned long write_buffer_len;
 
+MAYBE_UNUSED
 static int ce_write_flush(git_hash_ctx *context, int fd)
 {
 	unsigned int buffered = write_buffer_len;
@@ -2531,6 +2533,7 @@ static int ce_write_flush(git_hash_ctx *context, int fd)
 	return 0;
 }
 
+MAYBE_UNUSED
 static int ce_write(git_hash_ctx *context, int fd, void *data, unsigned int len)
 {
 	while (len) {
@@ -2553,19 +2556,24 @@ static int ce_write(git_hash_ctx *context, int fd, void *data, unsigned int len)
 	return 0;
 }
 
-static int write_index_ext_header(git_hash_ctx *context, git_hash_ctx *eoie_context,
-				  int fd, unsigned int ext, unsigned int sz)
+static int write_index_ext_header(struct hashfile *f,
+				  git_hash_ctx *eoie_f,
+				  unsigned int ext,
+				  unsigned int sz)
 {
-	ext = htonl(ext);
-	sz = htonl(sz);
-	if (eoie_context) {
-		the_hash_algo->update_fn(eoie_context, &ext, 4);
-		the_hash_algo->update_fn(eoie_context, &sz, 4);
+	hashwrite_be32(f, ext);
+	hashwrite_be32(f, sz);
+
+	if (eoie_f) {
+		ext = htonl(ext);
+		sz = htonl(sz);
+		the_hash_algo->update_fn(eoie_f, &ext, sizeof(ext));
+		the_hash_algo->update_fn(eoie_f, &sz, sizeof(sz));
 	}
-	return ((ce_write(context, fd, &ext, 4) < 0) ||
-		(ce_write(context, fd, &sz, 4) < 0)) ? -1 : 0;
+	return 0;
 }
 
+MAYBE_UNUSED
 static int ce_flush(git_hash_ctx *context, int fd, unsigned char *hash)
 {
 	unsigned int left = write_buffer_len;
@@ -2667,11 +2675,10 @@ static void copy_cache_entry_to_ondisk(struct ondisk_cache_entry *ondisk,
 	}
 }
 
-static int ce_write_entry(git_hash_ctx *c, int fd, struct cache_entry *ce,
+static int ce_write_entry(struct hashfile *f, struct cache_entry *ce,
 			  struct strbuf *previous_name, struct ondisk_cache_entry *ondisk)
 {
 	int size;
-	int result;
 	unsigned int saved_namelen;
 	int stripped_name = 0;
 	static unsigned char padding[8] = { 0x00 };
@@ -2687,11 +2694,9 @@ static int ce_write_entry(git_hash_ctx *c, int fd, struct cache_entry *ce,
 	if (!previous_name) {
 		int len = ce_namelen(ce);
 		copy_cache_entry_to_ondisk(ondisk, ce);
-		result = ce_write(c, fd, ondisk, size);
-		if (!result)
-			result = ce_write(c, fd, ce->name, len);
-		if (!result)
-			result = ce_write(c, fd, padding, align_padding_size(size, len));
+		hashwrite(f, ondisk, size);
+		hashwrite(f, ce->name, len);
+		hashwrite(f, padding, align_padding_size(size, len));
 	} else {
 		int common, to_remove, prefix_size;
 		unsigned char to_remove_vi[16];
@@ -2705,13 +2710,10 @@ static int ce_write_entry(git_hash_ctx *c, int fd, struct cache_entry *ce,
 		prefix_size = encode_varint(to_remove, to_remove_vi);
 
 		copy_cache_entry_to_ondisk(ondisk, ce);
-		result = ce_write(c, fd, ondisk, size);
-		if (!result)
-			result = ce_write(c, fd, to_remove_vi, prefix_size);
-		if (!result)
-			result = ce_write(c, fd, ce->name + common, ce_namelen(ce) - common);
-		if (!result)
-			result = ce_write(c, fd, padding, 1);
+		hashwrite(f, ondisk, size);
+		hashwrite(f, to_remove_vi, prefix_size);
+		hashwrite(f, ce->name + common, ce_namelen(ce) - common);
+		hashwrite(f, padding, 1);
 
 		strbuf_splice(previous_name, common, to_remove,
 			      ce->name + common, ce_namelen(ce) - common);
@@ -2721,7 +2723,7 @@ static int ce_write_entry(git_hash_ctx *c, int fd, struct cache_entry *ce,
 		ce->ce_flags &= ~CE_STRIP_NAME;
 	}
 
-	return result;
+	return 0;
 }
 
 /*
@@ -2833,8 +2835,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 			  int strip_extensions)
 {
 	uint64_t start = getnanotime();
-	int newfd = tempfile->fd;
-	git_hash_ctx c, eoie_c;
+	struct hashfile *f;
+	git_hash_ctx *eoie_c = NULL;
 	struct cache_header hdr;
 	int i, err = 0, removed, extended, hdr_version;
 	struct cache_entry **cache = istate->cache;
@@ -2848,6 +2850,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	struct index_entry_offset_table *ieot = NULL;
 	int nr, nr_threads;
 
+	f = hashfd(tempfile->fd, tempfile->filename.buf);
+
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
 			removed++;
@@ -2876,9 +2880,7 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	hdr.hdr_version = htonl(hdr_version);
 	hdr.hdr_entries = htonl(entries - removed);
 
-	the_hash_algo->init_fn(&c);
-	if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0)
-		return -1;
+	hashwrite(f, &hdr, sizeof(hdr));
 
 	if (!HAVE_THREADS || git_config_get_index_threads(&nr_threads))
 		nr_threads = 1;
@@ -2913,12 +2915,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		}
 	}
 
-	offset = lseek(newfd, 0, SEEK_CUR);
-	if (offset < 0) {
-		free(ieot);
-		return -1;
-	}
-	offset += write_buffer_len;
+	offset = hashfile_total(f);
+
 	nr = 0;
 	previous_name = (hdr_version == 4) ? &previous_name_buf : NULL;
 
@@ -2953,14 +2951,10 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 			if (previous_name)
 				previous_name->buf[0] = 0;
 			nr = 0;
-			offset = lseek(newfd, 0, SEEK_CUR);
-			if (offset < 0) {
-				free(ieot);
-				return -1;
-			}
-			offset += write_buffer_len;
+
+			offset = hashfile_total(f);
 		}
-		if (ce_write_entry(&c, newfd, ce, previous_name, (struct ondisk_cache_entry *)&ondisk) < 0)
+		if (ce_write_entry(f, ce, previous_name, (struct ondisk_cache_entry *)&ondisk) < 0)
 			err = -1;
 
 		if (err)
@@ -2979,14 +2973,16 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		return err;
 	}
 
-	/* Write extension data here */
-	offset = lseek(newfd, 0, SEEK_CUR);
-	if (offset < 0) {
-		free(ieot);
-		return -1;
+	offset = hashfile_total(f);
+
+	/*
+	 * The extension headers must be hashed on their own for the
+	 * EOIE extension. Create a hashfile here to compute that hash.
+	 */
+	if (offset && record_eoie()) {
+		CALLOC_ARRAY(eoie_c, 1);
+		the_hash_algo->init_fn(eoie_c);
 	}
-	offset += write_buffer_len;
-	the_hash_algo->init_fn(&eoie_c);
 
 	/*
 	 * Lets write out CACHE_EXT_INDEXENTRYOFFSETTABLE first so that we
@@ -2999,8 +2995,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		write_ieot_extension(&sb, ieot);
-		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_INDEXENTRYOFFSETTABLE, sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		err = write_index_ext_header(f, eoie_c, CACHE_EXT_INDEXENTRYOFFSETTABLE, sb.len) < 0;
+		hashwrite(f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		free(ieot);
 		if (err)
@@ -3012,9 +3008,9 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		err = write_link_extension(&sb, istate) < 0 ||
-			write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_LINK,
-					       sb.len) < 0 ||
-			ce_write(&c, newfd, sb.buf, sb.len) < 0;
+			write_index_ext_header(f, eoie_c, CACHE_EXT_LINK,
+					       sb.len) < 0;
+		hashwrite(f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
@@ -3023,8 +3019,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		cache_tree_write(&sb, istate->cache_tree);
-		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_TREE, sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		err = write_index_ext_header(f, eoie_c, CACHE_EXT_TREE, sb.len) < 0;
+		hashwrite(f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
@@ -3033,9 +3029,9 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		resolve_undo_write(&sb, istate->resolve_undo);
-		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_RESOLVE_UNDO,
-					     sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		err = write_index_ext_header(f, eoie_c, CACHE_EXT_RESOLVE_UNDO,
+					     sb.len) < 0;
+		hashwrite(f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
@@ -3044,9 +3040,9 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		write_untracked_extension(&sb, istate->untracked);
-		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_UNTRACKED,
-					     sb.len) < 0 ||
-			ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		err = write_index_ext_header(f, eoie_c, CACHE_EXT_UNTRACKED,
+					     sb.len) < 0;
+		hashwrite(f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
@@ -3055,14 +3051,14 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		struct strbuf sb = STRBUF_INIT;
 
 		write_fsmonitor_extension(&sb, istate);
-		err = write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_FSMONITOR, sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		err = write_index_ext_header(f, eoie_c, CACHE_EXT_FSMONITOR, sb.len) < 0;
+		hashwrite(f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
 	}
 	if (istate->sparse_index) {
-		if (write_index_ext_header(&c, &eoie_c, newfd, CACHE_EXT_SPARSE_DIRECTORIES, 0) < 0)
+		if (write_index_ext_header(f, eoie_c, CACHE_EXT_SPARSE_DIRECTORIES, 0) < 0)
 			return -1;
 	}
 
@@ -3072,19 +3068,18 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	 * read.  Write it out regardless of the strip_extensions parameter as we need it
 	 * when loading the shared index.
 	 */
-	if (offset && record_eoie()) {
+	if (eoie_c) {
 		struct strbuf sb = STRBUF_INIT;
 
-		write_eoie_extension(&sb, &eoie_c, offset);
-		err = write_index_ext_header(&c, NULL, newfd, CACHE_EXT_ENDOFINDEXENTRIES, sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		write_eoie_extension(&sb, eoie_c, offset);
+		err = write_index_ext_header(f, NULL, CACHE_EXT_ENDOFINDEXENTRIES, sb.len) < 0;
+		hashwrite(f, sb.buf, sb.len);
 		strbuf_release(&sb);
 		if (err)
 			return -1;
 	}
 
-	if (ce_flush(&c, newfd, istate->oid.hash))
-		return -1;
+	finalize_hashfile(f, istate->oid.hash, CSUM_HASH_IN_STREAM);
 	if (close_tempfile_gently(tempfile)) {
 		error(_("could not close '%s'"), get_tempfile_path(tempfile));
 		return -1;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 4/4] read-cache: delete unused hashing methods
  2021-05-18 18:32   ` [PATCH v3 0/4] Convert index writes to use hashfile API Derrick Stolee via GitGitGadget
                       ` (2 preceding siblings ...)
  2021-05-18 18:32     ` [PATCH v3 3/4] read-cache: use hashfile instead of git_hash_ctx Derrick Stolee via GitGitGadget
@ 2021-05-18 18:32     ` Derrick Stolee via GitGitGadget
  3 siblings, 0 replies; 27+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-05-18 18:32 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, stolee, git, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

These methods were marked as MAYBE_UNUSED in the previous change to
avoid a complicated diff. Delete them entirely, since we now use the
hashfile API instead of this custom hashing code.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 read-cache.c | 64 ----------------------------------------------------
 1 file changed, 64 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index 1c0bda81e7e7..aa6751c6a092 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2516,46 +2516,6 @@ int repo_index_has_changes(struct repository *repo,
 	}
 }
 
-#define WRITE_BUFFER_SIZE (128 * 1024)
-static unsigned char write_buffer[WRITE_BUFFER_SIZE];
-static unsigned long write_buffer_len;
-
-MAYBE_UNUSED
-static int ce_write_flush(git_hash_ctx *context, int fd)
-{
-	unsigned int buffered = write_buffer_len;
-	if (buffered) {
-		the_hash_algo->update_fn(context, write_buffer, buffered);
-		if (write_in_full(fd, write_buffer, buffered) < 0)
-			return -1;
-		write_buffer_len = 0;
-	}
-	return 0;
-}
-
-MAYBE_UNUSED
-static int ce_write(git_hash_ctx *context, int fd, void *data, unsigned int len)
-{
-	while (len) {
-		unsigned int buffered = write_buffer_len;
-		unsigned int partial = WRITE_BUFFER_SIZE - buffered;
-		if (partial > len)
-			partial = len;
-		memcpy(write_buffer + buffered, data, partial);
-		buffered += partial;
-		if (buffered == WRITE_BUFFER_SIZE) {
-			write_buffer_len = buffered;
-			if (ce_write_flush(context, fd))
-				return -1;
-			buffered = 0;
-		}
-		write_buffer_len = buffered;
-		len -= partial;
-		data = (char *) data + partial;
-	}
-	return 0;
-}
-
 static int write_index_ext_header(struct hashfile *f,
 				  git_hash_ctx *eoie_f,
 				  unsigned int ext,
@@ -2573,30 +2533,6 @@ static int write_index_ext_header(struct hashfile *f,
 	return 0;
 }
 
-MAYBE_UNUSED
-static int ce_flush(git_hash_ctx *context, int fd, unsigned char *hash)
-{
-	unsigned int left = write_buffer_len;
-
-	if (left) {
-		write_buffer_len = 0;
-		the_hash_algo->update_fn(context, write_buffer, left);
-	}
-
-	/* Flush first if not enough space for hash signature */
-	if (left + the_hash_algo->rawsz > WRITE_BUFFER_SIZE) {
-		if (write_in_full(fd, write_buffer, left) < 0)
-			return -1;
-		left = 0;
-	}
-
-	/* Append the hash signature at the end */
-	the_hash_algo->final_fn(write_buffer + left, context);
-	hashcpy(hash, write_buffer + left);
-	left += the_hash_algo->rawsz;
-	return (write_in_full(fd, write_buffer, left) < 0) ? -1 : 0;
-}
-
 static void ce_smudge_racily_clean_entry(struct index_state *istate,
 					 struct cache_entry *ce)
 {
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 27+ messages in thread

* t4216-log-bloom.sh fails with -v (but not --verbose-log)
  2021-05-18 18:32     ` [PATCH v3 2/4] csum-file.h: increase hashfile buffer size Derrick Stolee via GitGitGadget
@ 2021-11-25 12:14       ` Ævar Arnfjörð Bjarmason
  2021-11-26  4:08         ` Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-25 12:14 UTC (permalink / raw)
  To: Git ML; +Cc: Derrick Stolee, Taylor Blau, Jeff Hostetler

I haven't looked much into $subject, but there's an interesting
regression in 2ca245f8be5 (csum-file.h: increase hashfile buffer size,
2021-05-18) where it fails with -v, but not --verbose-log. Discovered
while running it manually.

This is a regression in v2.33.0 (not v2.34.0!), so nothing urgent, and
this is pretty obscure anyway.

For the original change see:
https://lore.kernel.org/git/64ffddd791160895b8e6730ebcddfac8458653f2.1621362768.git.gitgitgadget@gmail.com/

To reproduce it:

    make test T=t4216-log-bloom.sh GIT_TEST_OPTS=-v DEFAULT_TEST_TARGET=test

Removing the -v, or making it --verbose-log will make it succeed. It
fails on:
    
    + GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 GIT_TRACE2_EVENT=/home/avar/g/git/t/trash directory.t4216-log-bloom/limits/trace git commit-graph write --reachable --changed-paths
    + test_max_changed_paths 10 trace
    + grep "max_changed_paths":10 trace
    error: last command exited with $?=1
    not ok 135 - correctly report changes over limit

If it's run with/without a debugging shimmy to save away the "trace" we
can see that it doesn't include the max_changed_paths trace2 payload
with -v. I.e. for some reason we either don't run
trace2_bloom_filter_settings(), or don't log that data in the same way.

It's probably easy to figure out for someone familiar with this code,
but I couldn't see it from some quick skimming. Hence this
E-Mail.

FYI This is our only failure when running the whole test suite with -j1
-x (I didn't test a full run with just -v, takes a while, I was using
-j1 to get sensible output at the end without needing to scroll up).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: t4216-log-bloom.sh fails with -v (but not --verbose-log)
  2021-11-25 12:14       ` t4216-log-bloom.sh fails with -v (but not --verbose-log) Ævar Arnfjörð Bjarmason
@ 2021-11-26  4:08         ` Jeff King
  2021-11-29 13:49           ` Derrick Stolee
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2021-11-26  4:08 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git ML, Derrick Stolee, Taylor Blau, Jeff Hostetler

On Thu, Nov 25, 2021 at 01:14:45PM +0100, Ævar Arnfjörð Bjarmason wrote:

> I haven't looked much into $subject, but there's an interesting
> regression in 2ca245f8be5 (csum-file.h: increase hashfile buffer size,
> 2021-05-18) where it fails with -v, but not --verbose-log. Discovered
> while running it manually.
> 
> This is a regression in v2.33.0 (not v2.34.0!), so nothing urgent, and
> this is pretty obscure anyway.
> 
> For the original change see:
> https://lore.kernel.org/git/64ffddd791160895b8e6730ebcddfac8458653f2.1621362768.git.gitgitgadget@gmail.com/

Interesting. This patch makes it go away (the "5" is cargo-culted from
earlier in the script):

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 50f206db55..2f3a1cd210 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -231,6 +231,7 @@ test_expect_success 'correctly report changes over limit' '
 
 		# Commit has 7 file and 4 directory adds
 		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			GIT_TRACE2_EVENT_NESTING=5 \
 			GIT_TRACE2_EVENT="$(pwd)/trace" \
 			git commit-graph write --reachable --changed-paths &&
 		test_max_changed_paths 10 trace &&
@@ -263,6 +264,7 @@ test_expect_success 'correctly report changes over limit' '
 		# start from scratch and rebuild
 		rm -f .git/objects/info/commit-graph &&
 		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			GIT_TRACE2_EVENT_NESTING=5 \
 			GIT_TRACE2_EVENT="$(pwd)/trace-edit" \
 			git commit-graph write --reachable --changed-paths &&
 		test_max_changed_paths 10 trace-edit &&
@@ -280,6 +282,7 @@ test_expect_success 'correctly report changes over limit' '
 		# start from scratch and rebuild
 		rm -f .git/objects/info/commit-graph &&
 		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=11 \
+			GIT_TRACE2_EVENT_NESTING=5 \
 			GIT_TRACE2_EVENT="$(pwd)/trace-update" \
 			git commit-graph write --reachable --changed-paths &&
 		test_max_changed_paths 11 trace-update &&

The commit in question (2ca245f8be) puts the writing into a new trace2
region ("chunkfile"), so it makes sense that the nesting increases by
one. But what's interesting is that the nesting is different depending
on whether stderr is a terminal. I guess because the progress code
starts its own region.

The default nesting max for trace2 is 2. That seems kind of low given
this example, but I don't know enough about the tradeoffs to say what
bad things might happen if it's raised. But the above patch really seems
like a hack, and that this quiet omission would absolutely confuse real
users who are trying to use trace2 for debugging.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: t4216-log-bloom.sh fails with -v (but not --verbose-log)
  2021-11-26  4:08         ` Jeff King
@ 2021-11-29 13:49           ` Derrick Stolee
  0 siblings, 0 replies; 27+ messages in thread
From: Derrick Stolee @ 2021-11-29 13:49 UTC (permalink / raw)
  To: Jeff King, Ævar Arnfjörð Bjarmason
  Cc: Git ML, Derrick Stolee, Taylor Blau, Jeff Hostetler

On 11/25/2021 11:08 PM, Jeff King wrote:
> On Thu, Nov 25, 2021 at 01:14:45PM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
>> I haven't looked much into $subject, but there's an interesting
>> regression in 2ca245f8be5 (csum-file.h: increase hashfile buffer size,
>> 2021-05-18) where it fails with -v, but not --verbose-log. Discovered
>> while running it manually.
>>
>> This is a regression in v2.33.0 (not v2.34.0!), so nothing urgent, and
>> this is pretty obscure anyway.
>>
>> For the original change see:
>> https://lore.kernel.org/git/64ffddd791160895b8e6730ebcddfac8458653f2.1621362768.git.gitgitgadget@gmail.com/
> 
> Interesting. This patch makes it go away (the "5" is cargo-culted from
> earlier in the script):
...
>  		# Commit has 7 file and 4 directory adds
>  		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
> +			GIT_TRACE2_EVENT_NESTING=5 \
>  			GIT_TRACE2_EVENT="$(pwd)/trace" \
>  			git commit-graph write --reachable --changed-paths &&
>  		test_max_changed_paths 10 trace &&
...
> The commit in question (2ca245f8be) puts the writing into a new trace2
> region ("chunkfile"), so it makes sense that the nesting increases by
> one. But what's interesting is that the nesting is different depending
> on whether stderr is a terminal. I guess because the progress code
> starts its own region.
> 
> The default nesting max for trace2 is 2. That seems kind of low given
> this example, but I don't know enough about the tradeoffs to say what
> bad things might happen if it's raised. But the above patch really seems
> like a hack, and that this quiet omission would absolutely confuse real
> users who are trying to use trace2 for debugging.

Thanks, both, for identifying the problem and the root cause. I have
sent a patch series [1] that sets a deeper GIT_TRACE2_EVENT_NESTING
across the test suite to avoid this kind of issue in the future (along
with removing the existing uses scattered across the tests).

Thanks,
-Stolee

[1] https://lore.kernel.org/git/pull.1085.git.1638193666.gitgitgadget@gmail.com/

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2021-11-29 15:42 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-26 19:12 [PATCH 0/3] Convert index writes to use hashfile API Derrick Stolee via GitGitGadget
2021-03-26 19:12 ` [PATCH 1/3] csum-file: add nested_hashfile() Derrick Stolee via GitGitGadget
2021-03-26 19:12 ` [PATCH 2/3] read-cache: use hashfile instead of git_hash_ctx Derrick Stolee via GitGitGadget
2021-03-29 15:04   ` Derrick Stolee
2021-03-29 19:10     ` Derrick Stolee
2021-03-26 19:12 ` [PATCH 3/3] read-cache: delete unused hashing methods Derrick Stolee via GitGitGadget
2021-03-26 20:16 ` [PATCH 0/3] Convert index writes to use hashfile API Derrick Stolee
2021-05-17 12:24 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
2021-05-17 12:24   ` [PATCH v2 1/4] hashfile: use write_in_full() Derrick Stolee via GitGitGadget
2021-05-17 12:24   ` [PATCH v2 2/4] csum-file.h: increase hashfile buffer size Derrick Stolee via GitGitGadget
2021-05-17 21:54     ` Junio C Hamano
2021-05-18  7:33       ` Jeff King
2021-05-18 14:44         ` Derrick Stolee
2021-05-18  7:31     ` Jeff King
2021-05-18  7:42       ` Jeff King
2021-05-17 12:24   ` [PATCH v2 3/4] read-cache: use hashfile instead of git_hash_ctx Derrick Stolee via GitGitGadget
2021-05-17 22:13     ` Junio C Hamano
2021-05-18 14:16       ` Derrick Stolee
2021-05-17 12:24   ` [PATCH v2 4/4] read-cache: delete unused hashing methods Derrick Stolee via GitGitGadget
2021-05-18 18:32   ` [PATCH v3 0/4] Convert index writes to use hashfile API Derrick Stolee via GitGitGadget
2021-05-18 18:32     ` [PATCH v3 1/4] hashfile: use write_in_full() Derrick Stolee via GitGitGadget
2021-05-18 18:32     ` [PATCH v3 2/4] csum-file.h: increase hashfile buffer size Derrick Stolee via GitGitGadget
2021-11-25 12:14       ` t4216-log-bloom.sh fails with -v (but not --verbose-log) Ævar Arnfjörð Bjarmason
2021-11-26  4:08         ` Jeff King
2021-11-29 13:49           ` Derrick Stolee
2021-05-18 18:32     ` [PATCH v3 3/4] read-cache: use hashfile instead of git_hash_ctx Derrick Stolee via GitGitGadget
2021-05-18 18:32     ` [PATCH v3 4/4] read-cache: delete unused hashing methods Derrick Stolee via GitGitGadget

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).