git.vger.kernel.org archive mirror
* [PATCH 00/17] Refactor chunk-format into an API
@ 2021-01-26 16:01 Derrick Stolee via GitGitGadget
  2021-01-26 16:01 ` [PATCH 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
                   ` (19 more replies)
  0 siblings, 20 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee

This restarts the topic previously submitted [1], which was dropped because
ak/corrected-commit-date was still in progress. This version is based on
that branch.

[1]
https://lore.kernel.org/git/pull.804.git.1607012215.gitgitgadget@gmail.com/

This version also changes the approach to use a more dynamic interaction
with a struct chunkfile pointer. This idea is credited to Taylor Blau [2],
but I started again from scratch. I also go further to make struct chunkfile
anonymous to API consumers. It is defined only in chunk-format.c, which
should hopefully deter future users from interacting with that data
directly.

[2] https://lore.kernel.org/git/X8%2FI%2FRzXZksio+ri@nand.local/

This combined API reduces duplicated logic. More importantly, it ensures
that similar file formats have similar protections against bad data.
The multi-pack-index code did not have as many guards as the commit-graph
code did, but now they both share a common base that checks for things like
duplicate chunks or offsets outside the size of the file.
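
As a rough illustration of the kind of shared guard meant here, the sketch
below checks an already-decoded table of contents for duplicate chunk IDs
and for offsets past the end of the file. The names toc_entry, toc_status,
and validate_toc are hypothetical and do not appear in the series; the
checks in chunk-format.c may be structured quite differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, already-decoded table-of-contents entry. */
struct toc_entry {
	uint32_t id;
	uint64_t offset;
};

enum toc_status {
	TOC_OK = 0,
	TOC_DUPLICATE_ID,
	TOC_OFFSET_OUT_OF_RANGE
};

/*
 * Check nr chunk entries plus one terminating entry (toc holds
 * nr + 1 entries) against a file that is file_size bytes long.
 */
enum toc_status validate_toc(const struct toc_entry *toc, size_t nr,
			     uint64_t file_size)
{
	size_t i, j;

	assert(toc != NULL);

	/* No entry, including the terminator, may point past the file. */
	for (i = 0; i <= nr; i++)
		if (toc[i].offset > file_size)
			return TOC_OFFSET_OUT_OF_RANGE;

	/* A quadratic scan is acceptable: chunk counts are tiny. */
	for (i = 0; i < nr; i++)
		for (j = i + 1; j < nr; j++)
			if (toc[i].id == toc[j].id)
				return TOC_DUPLICATE_ID;

	return TOC_OK;
}
```

Having one such routine means a guard added for one format is picked up by
the other without being copied.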

Here are some stats for the end-to-end change:

 * 638 insertions(+), 456 deletions(-).
 * commit-graph.c: 171 insertions(+), 192 deletions(-)
 * midx.c: 196 insertions(+), 260 deletions(-)

While the overall code size increases, the consumers do get a
bit smaller. Boilerplate, such as adapting methods to match the
chunk_write_fn and chunk_read_fn prototypes, makes up many of these
insertions. The "interesting" code gets a lot smaller and cleaner.

Thanks, -Stolee

Derrick Stolee (17):
  commit-graph: anonymize data in chunk_write_fn
  chunk-format: create chunk format write API
  commit-graph: use chunk-format write API
  midx: rename pack_info to write_midx_context
  midx: use context in write_midx_pack_names()
  midx: add entries to write_midx_context
  midx: add pack_perm to write_midx_context
  midx: add num_large_offsets to write_midx_context
  midx: return success/failure in chunk write methods
  midx: drop chunk progress during write
  midx: use chunk-format API in write_midx_internal()
  chunk-format: create read chunk API
  commit-graph: use chunk-format read API
  midx: use chunk-format read API
  midx: use 64-bit multiplication for chunk sizes
  chunk-format: restore duplicate chunk checks
  chunk-format: add technical docs

 Documentation/technical/chunk-format.txt      |  54 +++
 .../technical/commit-graph-format.txt         |   3 +
 Documentation/technical/pack-format.txt       |   3 +
 Makefile                                      |   1 +
 chunk-format.c                                | 165 +++++++
 chunk-format.h                                |  41 ++
 commit-graph.c                                | 363 +++++++-------
 midx.c                                        | 456 ++++++++----------
 t/t5318-commit-graph.sh                       |   2 +-
 t/t5319-multi-pack-index.sh                   |   6 +-
 10 files changed, 638 insertions(+), 456 deletions(-)
 create mode 100644 Documentation/technical/chunk-format.txt
 create mode 100644 chunk-format.c
 create mode 100644 chunk-format.h


base-commit: 5a3b130cad0d5c770f766e3af6d32b41766374c0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-848%2Fderrickstolee%2Fchunk-format%2Frefactor-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-848/derrickstolee/chunk-format/refactor-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/848
-- 
gitgitgadget


* [PATCH 01/17] commit-graph: anonymize data in chunk_write_fn
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
@ 2021-01-26 16:01 ` Derrick Stolee via GitGitGadget
  2021-01-27  1:53   ` Chris Torek
  2021-01-26 16:01 ` [PATCH 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Before creating an API around file formats that use chunks and
tables of contents, update the commit-graph write code to use
prototypes that will match this new API.

Specifically, convert chunk_write_fn to take a "void *data" parameter
instead of the commit-graph-specific "struct write_commit_graph_context"
pointer.
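
The pattern can be pictured with a small stand-alone sketch. Everything in
it is hypothetical (sink, example_write_fn, example_context,
write_example_chunk, and run_chunk are stand-ins, not git code); it only
shows how a writer with a generic prototype recovers its format-specific
context from the "void *data" argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for git's "struct hashfile": it only counts bytes. */
struct sink {
	uint64_t total;
};

/* Generic prototype, analogous in shape to chunk_write_fn. */
typedef int (*example_write_fn)(struct sink *s, void *data);

/* Hypothetical format-specific context. */
struct example_context {
	uint32_t nr;
};

/* A writer that matches the generic prototype. */
int write_example_chunk(struct sink *s, void *data)
{
	/* In C, "void *" converts implicitly; no cast is required. */
	struct example_context *ctx = data;

	s->total += (uint64_t)ctx->nr * sizeof(uint32_t);
	return 0;
}

/* The generic caller never needs to know the context type. */
int run_chunk(struct sink *s, example_write_fn fn, void *data)
{
	assert(s && fn);
	return fn(s, data);
}
```

The cost of this flexibility is that the compiler can no longer check that
the caller passes the context type the writer expects.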

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 38 ++++++++++++++++++++++++++++----------
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index f3bde2ad95a..b26ed72396e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1040,8 +1040,10 @@ struct write_commit_graph_context {
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
-				    struct write_commit_graph_context *ctx)
+				    void *data)
 {
+	struct write_commit_graph_context *ctx =
+		(struct write_commit_graph_context *)data;
 	int i, count = 0;
 	struct commit **list = ctx->commits.list;
 
@@ -1066,8 +1068,10 @@ static int write_graph_chunk_fanout(struct hashfile *f,
 }
 
 static int write_graph_chunk_oids(struct hashfile *f,
-				  struct write_commit_graph_context *ctx)
+				  void *data)
 {
+	struct write_commit_graph_context *ctx =
+		(struct write_commit_graph_context *)data;
 	struct commit **list = ctx->commits.list;
 	int count;
 	for (count = 0; count < ctx->commits.nr; count++, list++) {
@@ -1085,8 +1089,10 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
 }
 
 static int write_graph_chunk_data(struct hashfile *f,
-				  struct write_commit_graph_context *ctx)
+				  void *data)
 {
+	struct write_commit_graph_context *ctx =
+		(struct write_commit_graph_context *)data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	uint32_t num_extra_edges = 0;
@@ -1187,8 +1193,10 @@ static int write_graph_chunk_data(struct hashfile *f,
 }
 
 static int write_graph_chunk_generation_data(struct hashfile *f,
-					      struct write_commit_graph_context *ctx)
+					     void *data)
 {
+	struct write_commit_graph_context *ctx =
+		(struct write_commit_graph_context *)data;
 	int i, num_generation_data_overflows = 0;
 
 	for (i = 0; i < ctx->commits.nr; i++) {
@@ -1208,8 +1216,10 @@ static int write_graph_chunk_generation_data(struct hashfile *f,
 }
 
 static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
-						       struct write_commit_graph_context *ctx)
+						      void *data)
 {
+	struct write_commit_graph_context *ctx =
+		(struct write_commit_graph_context *)data;
 	int i;
 	for (i = 0; i < ctx->commits.nr; i++) {
 		struct commit *c = ctx->commits.list[i];
@@ -1226,8 +1236,10 @@ static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
 }
 
 static int write_graph_chunk_extra_edges(struct hashfile *f,
-					 struct write_commit_graph_context *ctx)
+					 void *data)
 {
+	struct write_commit_graph_context *ctx =
+		(struct write_commit_graph_context *)data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	struct commit_list *parent;
@@ -1280,8 +1292,10 @@ static int write_graph_chunk_extra_edges(struct hashfile *f,
 }
 
 static int write_graph_chunk_bloom_indexes(struct hashfile *f,
-					   struct write_commit_graph_context *ctx)
+					   void *data)
 {
+	struct write_commit_graph_context *ctx =
+		(struct write_commit_graph_context *)data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	uint32_t cur_pos = 0;
@@ -1315,8 +1329,10 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
 }
 
 static int write_graph_chunk_bloom_data(struct hashfile *f,
-					struct write_commit_graph_context *ctx)
+					void *data)
 {
+	struct write_commit_graph_context *ctx =
+		(struct write_commit_graph_context *)data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 
@@ -1737,8 +1753,10 @@ static int write_graph_chunk_base_1(struct hashfile *f,
 }
 
 static int write_graph_chunk_base(struct hashfile *f,
-				  struct write_commit_graph_context *ctx)
+				    void *data)
 {
+	struct write_commit_graph_context *ctx =
+		(struct write_commit_graph_context *)data;
 	int num = write_graph_chunk_base_1(f, ctx->new_base_graph);
 
 	if (num != ctx->num_commit_graphs_after - 1) {
@@ -1750,7 +1768,7 @@ static int write_graph_chunk_base(struct hashfile *f,
 }
 
 typedef int (*chunk_write_fn)(struct hashfile *f,
-			      struct write_commit_graph_context *ctx);
+			      void *data);
 
 struct chunk_info {
 	uint32_t id;
-- 
gitgitgadget



* [PATCH 02/17] chunk-format: create chunk format write API
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
  2021-01-26 16:01 ` [PATCH 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
@ 2021-01-26 16:01 ` Derrick Stolee via GitGitGadget
  2021-01-27  2:42   ` Taylor Blau
  2021-01-26 16:01 ` [PATCH 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In anticipation of combining the logic from the commit-graph and
multi-pack-index file formats, create a new chunk-format API. Use a
'struct chunkfile' pointer to keep track of data that has been
registered for writes. This struct is opaque outside of
chunk-format.c, which keeps users from interfering with the data
directly.

The next change will use this API in commit-graph.c, but the general
approach is:

 1. initialize the chunkfile with init_chunkfile(f).
 2. add chunks in the intended writing order with add_chunk().
 3. write any header information to the hashfile f.
 4. write the chunkfile data using write_chunkfile().
 5. free the chunkfile struct using free_chunkfile().
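
The five steps can be sketched with a self-contained toy model. All toy_*
names are hypothetical stand-ins: toy_file replaces git's hashfile with an
in-memory buffer, the chunk array is fixed-size, and a size mismatch
returns -1 rather than calling BUG(). It mirrors the shape of the API, not
its implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_LOOKUP_WIDTH 12 /* 4-byte id + 8-byte offset */

/* Stand-in for git's "struct hashfile": an in-memory buffer. */
struct toy_file {
	unsigned char buf[256];
	size_t len;
};

typedef int (*toy_write_fn)(struct toy_file *f, void *data);

struct toy_chunk {
	uint32_t id;
	uint64_t size;
	toy_write_fn write_fn;
};

struct toy_chunkfile {
	struct toy_file *f;
	struct toy_chunk chunks[8];
	size_t nr;
};

/* Append the low "bytes" bytes of v in big-endian order. */
void toy_write_be(struct toy_file *f, uint64_t v, int bytes)
{
	assert(f->len + bytes <= sizeof(f->buf));
	while (bytes--)
		f->buf[f->len++] = (unsigned char)(v >> (8 * bytes));
}

void toy_add_chunk(struct toy_chunkfile *cf, uint32_t id,
		   toy_write_fn fn, uint64_t size)
{
	assert(cf->nr < 8);
	cf->chunks[cf->nr].id = id;
	cf->chunks[cf->nr].write_fn = fn;
	cf->chunks[cf->nr].size = size;
	cf->nr++;
}

int toy_write_chunkfile(struct toy_chunkfile *cf, void *data)
{
	size_t i;
	uint64_t offset = cf->f->len + (cf->nr + 1) * TOY_LOOKUP_WIDTH;

	/* Table of contents: one row per chunk plus a terminator. */
	for (i = 0; i < cf->nr; i++) {
		toy_write_be(cf->f, cf->chunks[i].id, 4);
		toy_write_be(cf->f, offset, 8);
		offset += cf->chunks[i].size;
	}
	toy_write_be(cf->f, 0, 4);
	toy_write_be(cf->f, offset, 8);

	/* Chunk data, checking each writer against its declared size. */
	for (i = 0; i < cf->nr; i++) {
		size_t start = cf->f->len;
		int result = cf->chunks[i].write_fn(cf->f, data);

		if (result)
			return result;
		if (cf->f->len - start != cf->chunks[i].size)
			return -1;
	}
	return 0;
}

/* Example writer: emits four bytes taken from the context. */
int toy_write_numbers(struct toy_file *f, void *data)
{
	toy_write_be(f, *(uint32_t *)data, 4);
	return 0;
}

/* Walk through the five steps; returns the total bytes written. */
size_t toy_demo(void)
{
	struct toy_file f = { { 0 }, 0 };
	struct toy_chunkfile cf = { &f, { { 0 } }, 0 };       /* step 1 */
	uint32_t value = 42;

	toy_add_chunk(&cf, 0x4e554d53, toy_write_numbers, 4); /* step 2 */
	toy_write_be(&f, 0x544f5921, 4);                      /* step 3 */
	if (toy_write_chunkfile(&cf, &value))                 /* step 4 */
		return 0;
	return f.len; /* step 5: nothing to free in this toy */
}
```

In this toy, one 4-byte chunk after a 4-byte header yields a 4-byte
header, a 24-byte table of contents (two 12-byte rows), and 4 bytes of
chunk data.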

Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile       |  1 +
 chunk-format.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++
 chunk-format.h | 20 +++++++++++
 3 files changed, 112 insertions(+)
 create mode 100644 chunk-format.c
 create mode 100644 chunk-format.h

diff --git a/Makefile b/Makefile
index 7b64106930a..50a7663841e 100644
--- a/Makefile
+++ b/Makefile
@@ -854,6 +854,7 @@ LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o
 LIB_OBJS += chdir-notify.o
 LIB_OBJS += checkout.o
+LIB_OBJS += chunk-format.o
 LIB_OBJS += color.o
 LIB_OBJS += column.o
 LIB_OBJS += combine-diff.o
diff --git a/chunk-format.c b/chunk-format.c
new file mode 100644
index 00000000000..2ce37ecc6bb
--- /dev/null
+++ b/chunk-format.c
@@ -0,0 +1,91 @@
+#include "cache.h"
+#include "chunk-format.h"
+#include "csum-file.h"
+#define CHUNK_LOOKUP_WIDTH 12
+
+/*
+ * When writing a chunk-based file format, collect the chunks in
+ * an array of chunk_info structs. The size stores the _expected_
+ * amount of data that will be written by write_fn.
+ */
+struct chunk_info {
+	uint32_t id;
+	uint64_t size;
+	chunk_write_fn write_fn;
+};
+
+struct chunkfile {
+	struct hashfile *f;
+
+	struct chunk_info *chunks;
+	size_t chunks_nr;
+	size_t chunks_alloc;
+};
+
+struct chunkfile *init_chunkfile(struct hashfile *f)
+{
+	struct chunkfile *cf = xcalloc(1, sizeof(*cf));
+	cf->f = f;
+	return cf;
+}
+
+void free_chunkfile(struct chunkfile *cf)
+{
+	if (!cf)
+		return;
+	free(cf->chunks);
+	free(cf);
+}
+
+int get_num_chunks(struct chunkfile *cf)
+{
+	return cf->chunks_nr;
+}
+
+void add_chunk(struct chunkfile *cf,
+	       uint64_t id,
+	       chunk_write_fn fn,
+	       size_t size)
+{
+	ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc);
+
+	cf->chunks[cf->chunks_nr].id = id;
+	cf->chunks[cf->chunks_nr].write_fn = fn;
+	cf->chunks[cf->chunks_nr].size = size;
+	cf->chunks_nr++;
+}
+
+int write_chunkfile(struct chunkfile *cf, void *data)
+{
+	int i;
+	size_t cur_offset = cf->f->offset + cf->f->total;
+
+	/* Add the table of contents to the current offset */
+	cur_offset += (cf->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH;
+
+	for (i = 0; i < cf->chunks_nr; i++) {
+		hashwrite_be32(cf->f, cf->chunks[i].id);
+		hashwrite_be64(cf->f, cur_offset);
+
+		cur_offset += cf->chunks[i].size;
+	}
+
+	/* Trailing entry marks the end of the chunks */
+	hashwrite_be32(cf->f, 0);
+	hashwrite_be64(cf->f, cur_offset);
+
+	for (i = 0; i < cf->chunks_nr; i++) {
+		uint64_t start_offset = cf->f->total + cf->f->offset;
+		int result = cf->chunks[i].write_fn(cf->f, data);
+
+		if (result)
+			return result;
+
+		if (cf->f->total + cf->f->offset != start_offset + cf->chunks[i].size)
+			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
+			    cf->chunks[i].size, cf->chunks[i].id,
+			    cf->f->total + cf->f->offset - start_offset);
+	}
+
+	return 0;
+}
diff --git a/chunk-format.h b/chunk-format.h
new file mode 100644
index 00000000000..bfaed672813
--- /dev/null
+++ b/chunk-format.h
@@ -0,0 +1,20 @@
+#ifndef CHUNK_FORMAT_H
+#define CHUNK_FORMAT_H
+
+#include "git-compat-util.h"
+
+struct hashfile;
+struct chunkfile;
+
+struct chunkfile *init_chunkfile(struct hashfile *f);
+void free_chunkfile(struct chunkfile *cf);
+int get_num_chunks(struct chunkfile *cf);
+typedef int (*chunk_write_fn)(struct hashfile *f,
+			      void *data);
+void add_chunk(struct chunkfile *cf,
+	       uint64_t id,
+	       chunk_write_fn fn,
+	       size_t size);
+int write_chunkfile(struct chunkfile *cf, void *data);
+
+#endif
-- 
gitgitgadget



* [PATCH 03/17] commit-graph: use chunk-format write API
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
  2021-01-26 16:01 ` [PATCH 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
  2021-01-26 16:01 ` [PATCH 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
@ 2021-01-26 16:01 ` Derrick Stolee via GitGitGadget
  2021-01-27  2:47   ` Taylor Blau
  2021-01-26 16:01 ` [PATCH 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The commit-graph write logic is ready to make use of the chunk-format
write API. Each chunk write method already has the correct prototype.
We only need to use the 'struct chunkfile' pointer and the correct API
calls.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 118 ++++++++++++++++---------------------------------
 1 file changed, 37 insertions(+), 81 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index b26ed72396e..b2c0f233eab 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -19,6 +19,7 @@
 #include "shallow.h"
 #include "json-writer.h"
 #include "trace2.h"
+#include "chunk-format.h"
 
 void git_test_write_commit_graph_or_die(void)
 {
@@ -1767,27 +1768,17 @@ static int write_graph_chunk_base(struct hashfile *f,
 	return 0;
 }
 
-typedef int (*chunk_write_fn)(struct hashfile *f,
-			      void *data);
-
-struct chunk_info {
-	uint32_t id;
-	uint64_t size;
-	chunk_write_fn write_fn;
-};
-
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
 	int fd;
 	struct hashfile *f;
 	struct lock_file lk = LOCK_INIT;
-	struct chunk_info chunks[MAX_NUM_CHUNKS + 1];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = 3;
-	uint64_t chunk_offset;
 	struct object_id file_hash;
+	struct chunkfile *cf;
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
@@ -1833,76 +1824,50 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	}
 
-	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
-	chunks[0].size = GRAPH_FANOUT_SIZE;
-	chunks[0].write_fn = write_graph_chunk_fanout;
-	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
-	chunks[1].size = hashsz * ctx->commits.nr;
-	chunks[1].write_fn = write_graph_chunk_oids;
-	chunks[2].id = GRAPH_CHUNKID_DATA;
-	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
-	chunks[2].write_fn = write_graph_chunk_data;
+	cf = init_chunkfile(f);
+
+	add_chunk(cf, GRAPH_CHUNKID_OIDFANOUT,
+		  write_graph_chunk_fanout, GRAPH_FANOUT_SIZE);
+	add_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP,
+		  write_graph_chunk_oids, hashsz * ctx->commits.nr);
+	add_chunk(cf, GRAPH_CHUNKID_DATA,
+		  write_graph_chunk_data, (hashsz + 16) * ctx->commits.nr);
 
 	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
 		ctx->write_generation_data = 0;
-	if (ctx->write_generation_data) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
-		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
-		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
-		num_chunks++;
-	}
-	if (ctx->num_generation_data_overflows) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW;
-		chunks[num_chunks].size = sizeof(timestamp_t) * ctx->num_generation_data_overflows;
-		chunks[num_chunks].write_fn = write_graph_chunk_generation_data_overflow;
-		num_chunks++;
-	}
-	if (ctx->num_extra_edges) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
-		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
-		chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
-		num_chunks++;
-	}
+	if (ctx->write_generation_data)
+		add_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
+			  write_graph_chunk_generation_data,
+			  sizeof(uint32_t) * ctx->commits.nr);
+	if (ctx->num_generation_data_overflows)
+		add_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
+			  write_graph_chunk_generation_data_overflow,
+			  sizeof(timestamp_t) * ctx->num_generation_data_overflows);
+	if (ctx->num_extra_edges)
+		add_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES,
+			  write_graph_chunk_extra_edges,
+			  4 * ctx->num_extra_edges);
 	if (ctx->changed_paths) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
-		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
-		chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
-		num_chunks++;
-		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
-		chunks[num_chunks].size = sizeof(uint32_t) * 3
-					  + ctx->total_bloom_filter_data_size;
-		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
-		num_chunks++;
-	}
-	if (ctx->num_commit_graphs_after > 1) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
-		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
-		chunks[num_chunks].write_fn = write_graph_chunk_base;
-		num_chunks++;
-	}
-
-	chunks[num_chunks].id = 0;
-	chunks[num_chunks].size = 0;
+		add_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
+			  write_graph_chunk_bloom_indexes,
+			  sizeof(uint32_t) * ctx->commits.nr);
+		add_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
+			  write_graph_chunk_bloom_data,
+			  sizeof(uint32_t) * 3
+				+ ctx->total_bloom_filter_data_size);
+	}
+	if (ctx->num_commit_graphs_after > 1)
+		add_chunk(cf, GRAPH_CHUNKID_BASE,
+			  write_graph_chunk_base,
+			  hashsz * (ctx->num_commit_graphs_after - 1));
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
 	hashwrite_u8(f, oid_version());
-	hashwrite_u8(f, num_chunks);
+	hashwrite_u8(f, get_num_chunks(cf));
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
-	chunk_offset = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
-	for (i = 0; i <= num_chunks; i++) {
-		uint32_t chunk_write[3];
-
-		chunk_write[0] = htonl(chunks[i].id);
-		chunk_write[1] = htonl(chunk_offset >> 32);
-		chunk_write[2] = htonl(chunk_offset & 0xffffffff);
-		hashwrite(f, chunk_write, 12);
-
-		chunk_offset += chunks[i].size;
-	}
-
 	if (ctx->report_progress) {
 		strbuf_addf(&progress_title,
 			    Q_("Writing out commit graph in %d pass",
@@ -1914,17 +1879,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			num_chunks * ctx->commits.nr);
 	}
 
-	for (i = 0; i < num_chunks; i++) {
-		uint64_t start_offset = f->total + f->offset;
-
-		if (chunks[i].write_fn(f, ctx))
-			return -1;
-
-		if (f->total + f->offset != start_offset + chunks[i].size)
-			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
-			    chunks[i].size, chunks[i].id,
-			    f->total + f->offset - start_offset);
-	}
+	write_chunkfile(cf, ctx);
 
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
@@ -1941,6 +1896,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 	close_commit_graph(ctx->r->objects);
 	finalize_hashfile(f, file_hash.hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
+	free_chunkfile(cf);
 
 	if (ctx->split) {
 		FILE *chainf = fdopen_lock_file(&lk, "w");
-- 
gitgitgadget



* [PATCH 04/17] midx: rename pack_info to write_midx_context
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-01-26 16:01 ` [PATCH 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
@ 2021-01-26 16:01 ` Derrick Stolee via GitGitGadget
  2021-01-27  2:49   ` Taylor Blau
  2021-01-26 16:01 ` [PATCH 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to streamline our chunk-based file formats, align some of
the code structure in write_midx_internal() to be similar to the
patterns in write_commit_graph_file().

Specifically, let's create a "struct write_midx_context" that can be
used as a data parameter to abstract function types.

This change only renames "struct pack_list" to "struct
write_midx_context" and the names of instances from "packs" to "ctx". In
future changes, we will expand the data inside "struct
write_midx_context" and align our chunk-writing method with the
chunk-format API.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 130 ++++++++++++++++++++++++++++-----------------------------
 1 file changed, 65 insertions(+), 65 deletions(-)

diff --git a/midx.c b/midx.c
index 79c282b070d..dfc1a289246 100644
--- a/midx.c
+++ b/midx.c
@@ -451,7 +451,7 @@ static int pack_info_compare(const void *_a, const void *_b)
 	return strcmp(a->pack_name, b->pack_name);
 }
 
-struct pack_list {
+struct write_midx_context {
 	struct pack_info *info;
 	uint32_t nr;
 	uint32_t alloc;
@@ -463,37 +463,37 @@ struct pack_list {
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 			     const char *file_name, void *data)
 {
-	struct pack_list *packs = (struct pack_list *)data;
+	struct write_midx_context *ctx = (struct write_midx_context *)data;
 
 	if (ends_with(file_name, ".idx")) {
-		display_progress(packs->progress, ++packs->pack_paths_checked);
-		if (packs->m && midx_contains_pack(packs->m, file_name))
+		display_progress(ctx->progress, ++ctx->pack_paths_checked);
+		if (ctx->m && midx_contains_pack(ctx->m, file_name))
 			return;
 
-		ALLOC_GROW(packs->info, packs->nr + 1, packs->alloc);
+		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
 
-		packs->info[packs->nr].p = add_packed_git(full_path,
-							  full_path_len,
-							  0);
+		ctx->info[ctx->nr].p = add_packed_git(full_path,
+						      full_path_len,
+						      0);
 
-		if (!packs->info[packs->nr].p) {
+		if (!ctx->info[ctx->nr].p) {
 			warning(_("failed to add packfile '%s'"),
 				full_path);
 			return;
 		}
 
-		if (open_pack_index(packs->info[packs->nr].p)) {
+		if (open_pack_index(ctx->info[ctx->nr].p)) {
 			warning(_("failed to open pack-index '%s'"),
 				full_path);
-			close_pack(packs->info[packs->nr].p);
-			FREE_AND_NULL(packs->info[packs->nr].p);
+			close_pack(ctx->info[ctx->nr].p);
+			FREE_AND_NULL(ctx->info[ctx->nr].p);
 			return;
 		}
 
-		packs->info[packs->nr].pack_name = xstrdup(file_name);
-		packs->info[packs->nr].orig_pack_int_id = packs->nr;
-		packs->info[packs->nr].expired = 0;
-		packs->nr++;
+		ctx->info[ctx->nr].pack_name = xstrdup(file_name);
+		ctx->info[ctx->nr].orig_pack_int_id = ctx->nr;
+		ctx->info[ctx->nr].expired = 0;
+		ctx->nr++;
 	}
 }
 
@@ -801,7 +801,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint32_t i;
 	struct hashfile *f = NULL;
 	struct lock_file lk;
-	struct pack_list packs;
+	struct write_midx_context ctx = { 0 };
 	uint32_t *pack_perm = NULL;
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
@@ -820,40 +820,40 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 			  midx_name);
 
 	if (m)
-		packs.m = m;
+		ctx.m = m;
 	else
-		packs.m = load_multi_pack_index(object_dir, 1);
-
-	packs.nr = 0;
-	packs.alloc = packs.m ? packs.m->num_packs : 16;
-	packs.info = NULL;
-	ALLOC_ARRAY(packs.info, packs.alloc);
-
-	if (packs.m) {
-		for (i = 0; i < packs.m->num_packs; i++) {
-			ALLOC_GROW(packs.info, packs.nr + 1, packs.alloc);
-
-			packs.info[packs.nr].orig_pack_int_id = i;
-			packs.info[packs.nr].pack_name = xstrdup(packs.m->pack_names[i]);
-			packs.info[packs.nr].p = NULL;
-			packs.info[packs.nr].expired = 0;
-			packs.nr++;
+		ctx.m = load_multi_pack_index(object_dir, 1);
+
+	ctx.nr = 0;
+	ctx.alloc = ctx.m ? ctx.m->num_packs : 16;
+	ctx.info = NULL;
+	ALLOC_ARRAY(ctx.info, ctx.alloc);
+
+	if (ctx.m) {
+		for (i = 0; i < ctx.m->num_packs; i++) {
+			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
+
+			ctx.info[ctx.nr].orig_pack_int_id = i;
+			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
+			ctx.info[ctx.nr].p = NULL;
+			ctx.info[ctx.nr].expired = 0;
+			ctx.nr++;
 		}
 	}
 
-	packs.pack_paths_checked = 0;
+	ctx.pack_paths_checked = 0;
 	if (flags & MIDX_PROGRESS)
-		packs.progress = start_delayed_progress(_("Adding packfiles to multi-pack-index"), 0);
+		ctx.progress = start_delayed_progress(_("Adding packfiles to multi-pack-index"), 0);
 	else
-		packs.progress = NULL;
+		ctx.progress = NULL;
 
-	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
-	stop_progress(&packs.progress);
+	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
+	stop_progress(&ctx.progress);
 
-	if (packs.m && packs.nr == packs.m->num_packs && !packs_to_drop)
+	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop)
 		goto cleanup;
 
-	entries = get_sorted_entries(packs.m, packs.info, packs.nr, &nr_entries);
+	entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &nr_entries);
 
 	for (i = 0; i < nr_entries; i++) {
 		if (entries[i].offset > 0x7fffffff)
@@ -862,19 +862,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 			large_offsets_needed = 1;
 	}
 
-	QSORT(packs.info, packs.nr, pack_info_compare);
+	QSORT(ctx.info, ctx.nr, pack_info_compare);
 
 	if (packs_to_drop && packs_to_drop->nr) {
 		int drop_index = 0;
 		int missing_drops = 0;
 
-		for (i = 0; i < packs.nr && drop_index < packs_to_drop->nr; i++) {
-			int cmp = strcmp(packs.info[i].pack_name,
+		for (i = 0; i < ctx.nr && drop_index < packs_to_drop->nr; i++) {
+			int cmp = strcmp(ctx.info[i].pack_name,
 					 packs_to_drop->items[drop_index].string);
 
 			if (!cmp) {
 				drop_index++;
-				packs.info[i].expired = 1;
+				ctx.info[i].expired = 1;
 			} else if (cmp > 0) {
 				error(_("did not see pack-file %s to drop"),
 				      packs_to_drop->items[drop_index].string);
@@ -882,7 +882,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				missing_drops++;
 				i--;
 			} else {
-				packs.info[i].expired = 0;
+				ctx.info[i].expired = 0;
 			}
 		}
 
@@ -898,19 +898,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	 *
 	 * pack_perm[old_id] = new_id
 	 */
-	ALLOC_ARRAY(pack_perm, packs.nr);
-	for (i = 0; i < packs.nr; i++) {
-		if (packs.info[i].expired) {
+	ALLOC_ARRAY(pack_perm, ctx.nr);
+	for (i = 0; i < ctx.nr; i++) {
+		if (ctx.info[i].expired) {
 			dropped_packs++;
-			pack_perm[packs.info[i].orig_pack_int_id] = PACK_EXPIRED;
+			pack_perm[ctx.info[i].orig_pack_int_id] = PACK_EXPIRED;
 		} else {
-			pack_perm[packs.info[i].orig_pack_int_id] = i - dropped_packs;
+			pack_perm[ctx.info[i].orig_pack_int_id] = i - dropped_packs;
 		}
 	}
 
-	for (i = 0; i < packs.nr; i++) {
-		if (!packs.info[i].expired)
-			pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
+	for (i = 0; i < ctx.nr; i++) {
+		if (!ctx.info[i].expired)
+			pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
 	}
 
 	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
@@ -921,19 +921,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
-	if (packs.m)
-		close_midx(packs.m);
+	if (ctx.m)
+		close_midx(ctx.m);
 
 	cur_chunk = 0;
 	num_chunks = large_offsets_needed ? 5 : 4;
 
-	if (packs.nr - dropped_packs == 0) {
+	if (ctx.nr - dropped_packs == 0) {
 		error(_("no pack files to index."));
 		result = 1;
 		goto cleanup;
 	}
 
-	written = write_midx_header(f, num_chunks, packs.nr - dropped_packs);
+	written = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
 
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
 	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
@@ -990,7 +990,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, packs.info, packs.nr);
+				written += write_midx_pack_names(f, ctx.info, ctx.nr);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
@@ -1027,15 +1027,15 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	commit_lock_file(&lk);
 
 cleanup:
-	for (i = 0; i < packs.nr; i++) {
-		if (packs.info[i].p) {
-			close_pack(packs.info[i].p);
-			free(packs.info[i].p);
+	for (i = 0; i < ctx.nr; i++) {
+		if (ctx.info[i].p) {
+			close_pack(ctx.info[i].p);
+			free(ctx.info[i].p);
 		}
-		free(packs.info[i].pack_name);
+		free(ctx.info[i].pack_name);
 	}
 
-	free(packs.info);
+	free(ctx.info);
 	free(entries);
 	free(pack_perm);
 	free(midx_name);
-- 
gitgitgadget



* [PATCH 05/17] midx: use context in write_midx_pack_names()
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2021-01-26 16:01 ` [PATCH 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
@ 2021-01-26 16:01 ` Derrick Stolee via GitGitGadget
  2021-01-26 16:01 ` [PATCH 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format
API, start converting chunk writing methods to match chunk_write_fn. The
first case is to convert write_midx_pack_names() to take "void *data".
We already have the necessary data in "struct write_midx_context", so
this conversion is rather mechanical.
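
For readers unfamiliar with the pattern, here is a minimal sketch of a
writer that takes an opaque "void *data" and recovers its own context
from it. Every name starting with "sketch_" is hypothetical and invented
for illustration; none of this is code from midx.c, and the output
handle is a plain buffer rather than a struct hashfile.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an output handle; not Git's struct hashfile. */
struct sketch_buf {
	char bytes[256];
	size_t len;
};

/* Same shape as the converted writers: an output handle plus opaque data. */
typedef size_t (*sketch_write_fn)(struct sketch_buf *out, void *data);

/* Hypothetical context holding everything the writer needs. */
struct sketch_context {
	const char **names;
	size_t nr;
};

/* Writes each name with its trailing NUL and reports the bytes written. */
static size_t sketch_write_names(struct sketch_buf *out, void *data)
{
	struct sketch_context *ctx = data; /* void * converts implicitly in C */
	size_t written = 0;
	size_t i;

	assert(out && ctx);
	for (i = 0; i < ctx->nr; i++) {
		size_t len = strlen(ctx->names[i]) + 1;

		if (out->len + len > sizeof(out->bytes))
			break; /* sketch only: stop rather than overflow */
		memcpy(out->bytes + out->len, ctx->names[i], len);
		out->len += len;
		written += len;
	}
	return written;
}

/* A generic caller needs no knowledge of what "data" points to. */
static size_t sketch_run(sketch_write_fn fn, struct sketch_buf *out, void *data)
{
	return fn(out, data);
}
```

The point of the shape is that sketch_run() can drive any writer with
the same two arguments, which is what lets a shared loop replace a
per-chunk switch statement.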

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/midx.c b/midx.c
index dfc1a289246..f348a70e018 100644
--- a/midx.c
+++ b/midx.c
@@ -643,27 +643,26 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 	return deduplicated_entries;
 }
 
-static size_t write_midx_pack_names(struct hashfile *f,
-				    struct pack_info *info,
-				    uint32_t num_packs)
+static size_t write_midx_pack_names(struct hashfile *f, void *data)
 {
+	struct write_midx_context *ctx = (struct write_midx_context *)data;
 	uint32_t i;
 	unsigned char padding[MIDX_CHUNK_ALIGNMENT];
 	size_t written = 0;
 
-	for (i = 0; i < num_packs; i++) {
+	for (i = 0; i < ctx->nr; i++) {
 		size_t writelen;
 
-		if (info[i].expired)
+		if (ctx->info[i].expired)
 			continue;
 
-		if (i && strcmp(info[i].pack_name, info[i - 1].pack_name) <= 0)
+		if (i && strcmp(ctx->info[i].pack_name, ctx->info[i - 1].pack_name) <= 0)
 			BUG("incorrect pack-file order: %s before %s",
-			    info[i - 1].pack_name,
-			    info[i].pack_name);
+			    ctx->info[i - 1].pack_name,
+			    ctx->info[i].pack_name);
 
-		writelen = strlen(info[i].pack_name) + 1;
-		hashwrite(f, info[i].pack_name, writelen);
+		writelen = strlen(ctx->info[i].pack_name) + 1;
+		hashwrite(f, ctx->info[i].pack_name, writelen);
 		written += writelen;
 	}
 
@@ -990,7 +989,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, ctx.info, ctx.nr);
+				written += write_midx_pack_names(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
-- 
gitgitgadget



* [PATCH 06/17] midx: add entries to write_midx_context
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2021-01-26 16:01 ` [PATCH 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
@ 2021-01-26 16:01 ` Derrick Stolee via GitGitGadget
  2021-01-26 16:01 ` [PATCH 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "struct pack_midx_entry *entries" list and its count
into the context.

Update write_midx_oid_fanout() and write_midx_oid_lookup() to take the
context directly, as these are easy conversions with this new data.

Only the callers of write_midx_object_offsets() and
write_midx_large_offsets() are updated here, since additional data is
needed in the context before those methods can match chunk_write_fn.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 49 ++++++++++++++++++++++++++-----------------------
 1 file changed, 26 insertions(+), 23 deletions(-)

diff --git a/midx.c b/midx.c
index f348a70e018..ee6f3504c6a 100644
--- a/midx.c
+++ b/midx.c
@@ -458,6 +458,9 @@ struct write_midx_context {
 	struct multi_pack_index *m;
 	struct progress *progress;
 	unsigned pack_paths_checked;
+
+	struct pack_midx_entry *entries;
+	uint32_t entries_nr;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -678,11 +681,11 @@ static size_t write_midx_pack_names(struct hashfile *f, void *data)
 }
 
 static size_t write_midx_oid_fanout(struct hashfile *f,
-				    struct pack_midx_entry *objects,
-				    uint32_t nr_objects)
+				    void *data)
 {
-	struct pack_midx_entry *list = objects;
-	struct pack_midx_entry *last = objects + nr_objects;
+	struct write_midx_context *ctx = (struct write_midx_context *)data;
+	struct pack_midx_entry *list = ctx->entries;
+	struct pack_midx_entry *last = ctx->entries + ctx->entries_nr;
 	uint32_t count = 0;
 	uint32_t i;
 
@@ -706,18 +709,19 @@ static size_t write_midx_oid_fanout(struct hashfile *f,
 	return MIDX_CHUNK_FANOUT_SIZE;
 }
 
-static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
-				    struct pack_midx_entry *objects,
-				    uint32_t nr_objects)
+static size_t write_midx_oid_lookup(struct hashfile *f,
+				    void *data)
 {
-	struct pack_midx_entry *list = objects;
+	struct write_midx_context *ctx = (struct write_midx_context *)data;
+	unsigned char hash_len = the_hash_algo->rawsz;
+	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i;
 	size_t written = 0;
 
-	for (i = 0; i < nr_objects; i++) {
+	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
 
-		if (i < nr_objects - 1) {
+		if (i < ctx->entries_nr - 1) {
 			struct pack_midx_entry *next = list;
 			if (oidcmp(&obj->oid, &next->oid) >= 0)
 				BUG("OIDs not in order: %s >= %s",
@@ -805,8 +809,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
-	uint32_t nr_entries, num_large_offsets = 0;
-	struct pack_midx_entry *entries = NULL;
+	uint32_t num_large_offsets = 0;
 	struct progress *progress = NULL;
 	int large_offsets_needed = 0;
 	int pack_name_concat_len = 0;
@@ -852,12 +855,12 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop)
 		goto cleanup;
 
-	entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &nr_entries);
+	ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr);
 
-	for (i = 0; i < nr_entries; i++) {
-		if (entries[i].offset > 0x7fffffff)
+	for (i = 0; i < ctx.entries_nr; i++) {
+		if (ctx.entries[i].offset > 0x7fffffff)
 			num_large_offsets++;
-		if (entries[i].offset > 0xffffffff)
+		if (ctx.entries[i].offset > 0xffffffff)
 			large_offsets_needed = 1;
 	}
 
@@ -947,10 +950,10 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OBJECTOFFSETS;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * the_hash_algo->rawsz;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * the_hash_algo->rawsz;
 
 	cur_chunk++;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_CHUNK_OFFSET_WIDTH;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
 	if (large_offsets_needed) {
 		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
 
@@ -993,19 +996,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
-				written += write_midx_oid_fanout(f, entries, nr_entries);
+				written += write_midx_oid_fanout(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDLOOKUP:
-				written += write_midx_oid_lookup(f, the_hash_algo->rawsz, entries, nr_entries);
+				written += write_midx_oid_lookup(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, entries, nr_entries);
+				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, ctx.entries, ctx.entries_nr);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
-				written += write_midx_large_offsets(f, num_large_offsets, entries, nr_entries);
+				written += write_midx_large_offsets(f, num_large_offsets, ctx.entries, ctx.entries_nr);
 				break;
 
 			default:
@@ -1035,7 +1038,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	}
 
 	free(ctx.info);
-	free(entries);
+	free(ctx.entries);
 	free(pack_perm);
 	free(midx_name);
 	return result;
-- 
gitgitgadget



* [PATCH 07/17] midx: add pack_perm to write_midx_context
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2021-01-26 16:01 ` [PATCH 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
@ 2021-01-26 16:01 ` Derrick Stolee via GitGitGadget
  2021-01-26 16:01 ` [PATCH 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "uint32_t *pack_perm" and large_offsets_needed bit
into the context.

Update write_midx_object_offsets() to match chunk_write_fn.
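
As a reminder of what pack_perm holds, the following sketch rebuilds the
"pack_perm[old_id] = new_id" mapping in isolation. The "sketch_" names
and the sentinel value are assumptions made for this example; the real
loop lives in write_midx_internal() and uses PACK_EXPIRED.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PACK_EXPIRED UINT32_MAX /* assumed sentinel value */

/* Hypothetical, reduced version of the per-pack bookkeeping. */
struct sketch_pack {
	uint32_t orig_id; /* position in the previous index */
	int expired;
};

/*
 * Fills perm[old_id] = new_id for surviving packs and the sentinel for
 * expired ones. The packs array is assumed to be in its final sorted
 * order already. Returns the number of dropped packs.
 */
static uint32_t sketch_build_perm(const struct sketch_pack *packs,
				  uint32_t nr, uint32_t *perm)
{
	uint32_t i, dropped = 0;

	for (i = 0; i < nr; i++) {
		assert(packs[i].orig_id < nr);
		if (packs[i].expired) {
			dropped++;
			perm[packs[i].orig_id] = SKETCH_PACK_EXPIRED;
		} else {
			/* later packs shift down past each dropped one */
			perm[packs[i].orig_id] = i - dropped;
		}
	}
	return dropped;
}
```

A writer holding this array can translate each object's old pack id in
one lookup, and can treat the sentinel as a sign that an object was
wrongly kept from an expired pack.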

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 40 +++++++++++++++++++++-------------------
 1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/midx.c b/midx.c
index ee6f3504c6a..66feff096e8 100644
--- a/midx.c
+++ b/midx.c
@@ -461,6 +461,9 @@ struct write_midx_context {
 
 	struct pack_midx_entry *entries;
 	uint32_t entries_nr;
+
+	uint32_t *pack_perm;
+	unsigned large_offsets_needed:1;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -736,27 +739,27 @@ static size_t write_midx_oid_lookup(struct hashfile *f,
 	return written;
 }
 
-static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_needed,
-					uint32_t *perm,
-					struct pack_midx_entry *objects, uint32_t nr_objects)
+static size_t write_midx_object_offsets(struct hashfile *f,
+					void *data)
 {
-	struct pack_midx_entry *list = objects;
+	struct write_midx_context *ctx = (struct write_midx_context *)data;
+	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i, nr_large_offset = 0;
 	size_t written = 0;
 
-	for (i = 0; i < nr_objects; i++) {
+	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
 
-		if (perm[obj->pack_int_id] == PACK_EXPIRED)
+		if (ctx->pack_perm[obj->pack_int_id] == PACK_EXPIRED)
 			BUG("object %s is in an expired pack with int-id %d",
 			    oid_to_hex(&obj->oid),
 			    obj->pack_int_id);
 
-		hashwrite_be32(f, perm[obj->pack_int_id]);
+		hashwrite_be32(f, ctx->pack_perm[obj->pack_int_id]);
 
-		if (large_offset_needed && obj->offset >> 31)
+		if (ctx->large_offsets_needed && obj->offset >> 31)
 			hashwrite_be32(f, MIDX_LARGE_OFFSET_NEEDED | nr_large_offset++);
-		else if (!large_offset_needed && obj->offset >> 32)
+		else if (!ctx->large_offsets_needed && obj->offset >> 32)
 			BUG("object %s requires a large offset (%"PRIx64") but the MIDX is not writing large offsets!",
 			    oid_to_hex(&obj->oid),
 			    obj->offset);
@@ -805,13 +808,11 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
-	uint32_t *pack_perm = NULL;
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 	uint32_t num_large_offsets = 0;
 	struct progress *progress = NULL;
-	int large_offsets_needed = 0;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
@@ -857,11 +858,12 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr);
 
+	ctx.large_offsets_needed = 0;
 	for (i = 0; i < ctx.entries_nr; i++) {
 		if (ctx.entries[i].offset > 0x7fffffff)
 			num_large_offsets++;
 		if (ctx.entries[i].offset > 0xffffffff)
-			large_offsets_needed = 1;
+			ctx.large_offsets_needed = 1;
 	}
 
 	QSORT(ctx.info, ctx.nr, pack_info_compare);
@@ -900,13 +902,13 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	 *
 	 * pack_perm[old_id] = new_id
 	 */
-	ALLOC_ARRAY(pack_perm, ctx.nr);
+	ALLOC_ARRAY(ctx.pack_perm, ctx.nr);
 	for (i = 0; i < ctx.nr; i++) {
 		if (ctx.info[i].expired) {
 			dropped_packs++;
-			pack_perm[ctx.info[i].orig_pack_int_id] = PACK_EXPIRED;
+			ctx.pack_perm[ctx.info[i].orig_pack_int_id] = PACK_EXPIRED;
 		} else {
-			pack_perm[ctx.info[i].orig_pack_int_id] = i - dropped_packs;
+			ctx.pack_perm[ctx.info[i].orig_pack_int_id] = i - dropped_packs;
 		}
 	}
 
@@ -927,7 +929,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		close_midx(ctx.m);
 
 	cur_chunk = 0;
-	num_chunks = large_offsets_needed ? 5 : 4;
+	num_chunks = ctx.large_offsets_needed ? 5 : 4;
 
 	if (ctx.nr - dropped_packs == 0) {
 		error(_("no pack files to index."));
@@ -954,7 +956,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	cur_chunk++;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
-	if (large_offsets_needed) {
+	if (ctx.large_offsets_needed) {
 		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
 
 		cur_chunk++;
@@ -1004,7 +1006,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, ctx.entries, ctx.entries_nr);
+				written += write_midx_object_offsets(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
@@ -1039,7 +1041,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	free(ctx.info);
 	free(ctx.entries);
-	free(pack_perm);
+	free(ctx.pack_perm);
 	free(midx_name);
 	return result;
 }
-- 
gitgitgadget



* [PATCH 08/17] midx: add num_large_offsets to write_midx_context
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (6 preceding siblings ...)
  2021-01-26 16:01 ` [PATCH 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
@ 2021-01-26 16:01 ` Derrick Stolee via GitGitGadget
  2021-01-26 16:01 ` [PATCH 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "uint32_t num_large_offsets" into the context. With
this new data, write_midx_large_offsets() now matches the
chunk_write_fn type.
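
The two values now stored in the context come from the same scan over
the entries. A rough sketch of that scan, and of how a 32-bit slot might
be chosen for each offset, follows. The "sketch_" names are hypothetical
and the flag value is an assumption standing in for
MIDX_LARGE_OFFSET_NEEDED; the thresholds mirror the comparisons visible
in the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_LARGE_FLAG 0x80000000u /* assumed high-bit marker */

struct sketch_offsets_summary {
	uint32_t num_large; /* offsets that do not fit in 31 bits */
	unsigned needed:1;  /* some offset does not fit in 32 bits */
};

/* One pass over the offsets collects both values. */
static struct sketch_offsets_summary sketch_scan(const uint64_t *offsets,
						 size_t nr)
{
	struct sketch_offsets_summary s = { 0, 0 };
	size_t i;

	for (i = 0; i < nr; i++) {
		if (offsets[i] > 0x7fffffff)
			s.num_large++;
		if (offsets[i] > 0xffffffff)
			s.needed = 1;
	}
	return s;
}

/*
 * Returns the 32-bit value stored for one offset. When the large table
 * is in use, offsets with bit 31 or higher set are replaced by the flag
 * plus their index in that table.
 */
static uint32_t sketch_encode(uint64_t offset, int needed,
			      uint32_t *next_large)
{
	if (needed && (offset >> 31))
		return SKETCH_LARGE_FLAG | (*next_large)++;
	assert(!(offset >> 32)); /* must fit when no large table exists */
	return (uint32_t)offset;
}
```

Note that the count uses the 31-bit threshold while the flag uses the
32-bit one: once any offset forces the large table to exist, every
offset that would collide with the marker bit has to move there too.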

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/midx.c b/midx.c
index 66feff096e8..40b815f8877 100644
--- a/midx.c
+++ b/midx.c
@@ -464,6 +464,7 @@ struct write_midx_context {
 
 	uint32_t *pack_perm;
 	unsigned large_offsets_needed:1;
+	uint32_t num_large_offsets;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -772,11 +773,14 @@ static size_t write_midx_object_offsets(struct hashfile *f,
 	return written;
 }
 
-static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_offset,
-				       struct pack_midx_entry *objects, uint32_t nr_objects)
+static size_t write_midx_large_offsets(struct hashfile *f,
+				       void *data)
 {
-	struct pack_midx_entry *list = objects, *end = objects + nr_objects;
+	struct write_midx_context *ctx = (struct write_midx_context *)data;
+	struct pack_midx_entry *list = ctx->entries;
+	struct pack_midx_entry *end = ctx->entries + ctx->entries_nr;
 	size_t written = 0;
+	uint32_t nr_large_offset = ctx->num_large_offsets;
 
 	while (nr_large_offset) {
 		struct pack_midx_entry *obj;
@@ -811,7 +815,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
-	uint32_t num_large_offsets = 0;
 	struct progress *progress = NULL;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
@@ -861,7 +864,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	ctx.large_offsets_needed = 0;
 	for (i = 0; i < ctx.entries_nr; i++) {
 		if (ctx.entries[i].offset > 0x7fffffff)
-			num_large_offsets++;
+			ctx.num_large_offsets++;
 		if (ctx.entries[i].offset > 0xffffffff)
 			ctx.large_offsets_needed = 1;
 	}
@@ -961,7 +964,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		cur_chunk++;
 		chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
-					   num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
+					   ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
 	}
 
 	chunk_ids[cur_chunk] = 0;
@@ -1010,7 +1013,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
-				written += write_midx_large_offsets(f, num_large_offsets, ctx.entries, ctx.entries_nr);
+				written += write_midx_large_offsets(f, &ctx);
 				break;
 
 			default:
-- 
gitgitgadget



* [PATCH 09/17] midx: return success/failure in chunk write methods
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (7 preceding siblings ...)
  2021-01-26 16:01 ` [PATCH 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
@ 2021-01-26 16:01 ` Derrick Stolee via GitGitGadget
  2021-01-26 16:01 ` [PATCH 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Historically, the chunk-writing methods in midx.c have returned the
amount of data written so the writer method could compare this with the
table of contents. This presents some interesting issues:

1. If a chunk writing method has a bug that miscalculates the written
   bytes, then we can satisfy the table of contents without actually
   writing the right amount of data to the hashfile. The commit-graph
   writing code checks the hashfile struct directly for a more robust
   verification.

2. There is no way for a chunk writing method to gracefully fail.
   Returning an int presents an opportunity to fail without a die().

3. The current pattern doesn't match chunk_write_fn type exactly, so we
   cannot share code with commit-graph.c.

For these reasons, convert the midx chunk writer methods to return an
'int'. Since none of them fail at the moment, they all return 0.
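
To illustrate points 1 and 2, here is a small sketch in which the
driver measures progress from the output handle itself and the writers
only report success or failure. All "sketch_" names are hypothetical,
the handle merely counts bytes, and a mismatch is returned as an error
here rather than raised with BUG() as the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical output handle that counts every byte it accepts. */
struct sketch_out {
	uint64_t position;
};

static void sketch_emit(struct sketch_out *out, size_t len)
{
	out->position += len; /* a real writer would also store the bytes */
}

typedef int (*sketch_chunk_fn)(struct sketch_out *out, void *data);

struct sketch_chunk {
	sketch_chunk_fn write;
	uint64_t expected_size;
};

/*
 * Runs each writer and compares the handle's own position with the
 * planned size of the chunk. Returns 0 on success, -1 on a writer
 * failure or a size mismatch.
 */
static int sketch_write_all(struct sketch_out *out,
			    const struct sketch_chunk *chunks,
			    size_t nr, void *data)
{
	size_t i;

	assert(out && chunks);
	for (i = 0; i < nr; i++) {
		uint64_t start = out->position;

		if (chunks[i].write(out, data))
			return -1;
		if (out->position - start != chunks[i].expected_size)
			return -1;
	}
	return 0;
}

static int sketch_write_eight(struct sketch_out *out, void *data)
{
	(void)data;
	sketch_emit(out, 8);
	return 0;
}

static int sketch_write_short(struct sketch_out *out, void *data)
{
	(void)data;
	sketch_emit(out, 4); /* deliberately writes less than planned */
	return 0;
}
```

Because the check reads the handle's position, a writer cannot satisfy
it by returning a wrong number; the short writer above is caught even
though it reports success.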

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 63 +++++++++++++++++++++++++---------------------------------
 1 file changed, 27 insertions(+), 36 deletions(-)

diff --git a/midx.c b/midx.c
index 40b815f8877..852dd5b776e 100644
--- a/midx.c
+++ b/midx.c
@@ -650,7 +650,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 	return deduplicated_entries;
 }
 
-static size_t write_midx_pack_names(struct hashfile *f, void *data)
+static int write_midx_pack_names(struct hashfile *f, void *data)
 {
 	struct write_midx_context *ctx = (struct write_midx_context *)data;
 	uint32_t i;
@@ -678,14 +678,13 @@ static size_t write_midx_pack_names(struct hashfile *f, void *data)
 	if (i < MIDX_CHUNK_ALIGNMENT) {
 		memset(padding, 0, sizeof(padding));
 		hashwrite(f, padding, i);
-		written += i;
 	}
 
-	return written;
+	return 0;
 }
 
-static size_t write_midx_oid_fanout(struct hashfile *f,
-				    void *data)
+static int write_midx_oid_fanout(struct hashfile *f,
+				 void *data)
 {
 	struct write_midx_context *ctx = (struct write_midx_context *)data;
 	struct pack_midx_entry *list = ctx->entries;
@@ -710,17 +709,16 @@ static size_t write_midx_oid_fanout(struct hashfile *f,
 		list = next;
 	}
 
-	return MIDX_CHUNK_FANOUT_SIZE;
+	return 0;
 }
 
-static size_t write_midx_oid_lookup(struct hashfile *f,
-				    void *data)
+static int write_midx_oid_lookup(struct hashfile *f,
+				 void *data)
 {
 	struct write_midx_context *ctx = (struct write_midx_context *)data;
 	unsigned char hash_len = the_hash_algo->rawsz;
 	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i;
-	size_t written = 0;
 
 	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
@@ -734,19 +732,17 @@ static size_t write_midx_oid_lookup(struct hashfile *f,
 		}
 
 		hashwrite(f, obj->oid.hash, (int)hash_len);
-		written += hash_len;
 	}
 
-	return written;
+	return 0;
 }
 
-static size_t write_midx_object_offsets(struct hashfile *f,
-					void *data)
+static int write_midx_object_offsets(struct hashfile *f,
+				     void *data)
 {
 	struct write_midx_context *ctx = (struct write_midx_context *)data;
 	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i, nr_large_offset = 0;
-	size_t written = 0;
 
 	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
@@ -766,20 +762,17 @@ static size_t write_midx_object_offsets(struct hashfile *f,
 			    obj->offset);
 		else
 			hashwrite_be32(f, (uint32_t)obj->offset);
-
-		written += MIDX_CHUNK_OFFSET_WIDTH;
 	}
 
-	return written;
+	return 0;
 }
 
-static size_t write_midx_large_offsets(struct hashfile *f,
-				       void *data)
+static int write_midx_large_offsets(struct hashfile *f,
+				    void *data)
 {
 	struct write_midx_context *ctx = (struct write_midx_context *)data;
 	struct pack_midx_entry *list = ctx->entries;
 	struct pack_midx_entry *end = ctx->entries + ctx->entries_nr;
-	size_t written = 0;
 	uint32_t nr_large_offset = ctx->num_large_offsets;
 
 	while (nr_large_offset) {
@@ -795,12 +788,12 @@ static size_t write_midx_large_offsets(struct hashfile *f,
 		if (!(offset >> 31))
 			continue;
 
-		written += hashwrite_be64(f, offset);
+		hashwrite_be64(f, offset);
 
 		nr_large_offset--;
 	}
 
-	return written;
+	return 0;
 }
 
 static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
@@ -812,7 +805,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
-	uint64_t written = 0;
+	uint64_t header_size = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 	struct progress *progress = NULL;
@@ -940,10 +933,10 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		goto cleanup;
 	}
 
-	written = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
+	header_size = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
 
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
-	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
+	chunk_offsets[cur_chunk] = header_size + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
@@ -981,39 +974,37 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		hashwrite_be32(f, chunk_ids[i]);
 		hashwrite_be64(f, chunk_offsets[i]);
-
-		written += MIDX_CHUNKLOOKUP_WIDTH;
 	}
 
 	if (flags & MIDX_PROGRESS)
 		progress = start_delayed_progress(_("Writing chunks to multi-pack-index"),
 					  num_chunks);
 	for (i = 0; i < num_chunks; i++) {
-		if (written != chunk_offsets[i])
+		if (f->total + f->offset != chunk_offsets[i])
 			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
 			    chunk_offsets[i],
-			    written,
+			    f->total + f->offset,
 			    chunk_ids[i]);
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, &ctx);
+				write_midx_pack_names(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
-				written += write_midx_oid_fanout(f, &ctx);
+				write_midx_oid_fanout(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDLOOKUP:
-				written += write_midx_oid_lookup(f, &ctx);
+				write_midx_oid_lookup(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, &ctx);
+				write_midx_object_offsets(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
-				written += write_midx_large_offsets(f, &ctx);
+				write_midx_large_offsets(f, &ctx);
 				break;
 
 			default:
@@ -1025,9 +1016,9 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	}
 	stop_progress(&progress);
 
-	if (written != chunk_offsets[num_chunks])
+	if (f->total + f->offset != chunk_offsets[num_chunks])
 		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
-		    written,
+		    f->total + f->offset,
 		    chunk_offsets[num_chunks]);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
-- 
gitgitgadget



* [PATCH 10/17] midx: drop chunk progress during write
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (8 preceding siblings ...)
  2021-01-26 16:01 ` [PATCH 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
@ 2021-01-26 16:01 ` Derrick Stolee via GitGitGadget
  2021-01-26 16:01 ` [PATCH 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Most expensive operations in write_midx_internal() use the context
struct's progress member, and these indicate the progress of the
expensive operations within the chunk writing methods. However, there is
a competing progress struct that counts the progress over all chunks.
This is not very helpful compared to the others, so drop it.

This also reduces our barriers to combining the chunk writing code with
chunk-format.c.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/midx.c b/midx.c
index 852dd5b776e..145c6bd0913 100644
--- a/midx.c
+++ b/midx.c
@@ -808,7 +808,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint64_t header_size = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
-	struct progress *progress = NULL;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
@@ -976,9 +975,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		hashwrite_be64(f, chunk_offsets[i]);
 	}
 
-	if (flags & MIDX_PROGRESS)
-		progress = start_delayed_progress(_("Writing chunks to multi-pack-index"),
-					  num_chunks);
 	for (i = 0; i < num_chunks; i++) {
 		if (f->total + f->offset != chunk_offsets[i])
 			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
@@ -1011,10 +1007,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				BUG("trying to write unknown chunk id %"PRIx32,
 				    chunk_ids[i]);
 		}
-
-		display_progress(progress, i + 1);
 	}
-	stop_progress(&progress);
 
 	if (f->total + f->offset != chunk_offsets[num_chunks])
 		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
-- 
gitgitgadget



* [PATCH 11/17] midx: use chunk-format API in write_midx_internal()
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (9 preceding siblings ...)
  2021-01-26 16:01 ` [PATCH 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
@ 2021-01-26 16:01 ` Derrick Stolee via GitGitGadget
  2021-01-26 16:01 ` [PATCH 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The chunk-format API allows writing the table of contents and all chunks
using the anonymous 'struct chunkfile' type. We only need to convert our
local chunk logic to this API for the multi-pack-index writes to share
that logic with the commit-graph file writes.
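
For intuition about what the shared code takes over, the sketch below
registers chunks by id and size and then derives the table of contents
offsets, following the arithmetic of the hand-written code this patch
removes. It is not the chunk-format.c implementation: the "sketch_"
names and the fixed-size arrays are assumptions, and the write callback
that add_chunk() also takes is left out to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_CHUNKS 8
#define SKETCH_TOC_ENTRY_WIDTH 12 /* 4-byte id plus 8-byte offset */

/* Hypothetical registry of chunks in the order they will be written. */
struct sketch_chunkfile {
	uint32_t ids[SKETCH_MAX_CHUNKS];
	uint64_t sizes[SKETCH_MAX_CHUNKS];
	size_t nr;
};

/* Registers a chunk; returns 0 on success, -1 if the table is full. */
static int sketch_add_chunk(struct sketch_chunkfile *cf, uint32_t id,
			    uint64_t size)
{
	assert(cf);
	if (cf->nr == SKETCH_MAX_CHUNKS)
		return -1;
	cf->ids[cf->nr] = id;
	cf->sizes[cf->nr] = size;
	cf->nr++;
	return 0;
}

/*
 * Fills offsets[0..nr], which is nr + 1 entries. offsets[i] is where
 * chunk i starts; offsets[nr] is the terminating entry that marks the
 * end of the last chunk.
 */
static void sketch_compute_toc(const struct sketch_chunkfile *cf,
			       uint64_t header_size, uint64_t *offsets)
{
	size_t i;

	/* the first chunk starts after the header and the table itself */
	offsets[0] = header_size + (cf->nr + 1) * SKETCH_TOC_ENTRY_WIDTH;
	for (i = 0; i < cf->nr; i++)
		offsets[i + 1] = offsets[i] + cf->sizes[i];
}
```

With the sizes supplied at registration time, the caller no longer has
to track cur_chunk or chain the offsets by hand, which is where most of
the deleted lines in this patch come from.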

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 104 +++++++++++----------------------------------------------
 1 file changed, 19 insertions(+), 85 deletions(-)

diff --git a/midx.c b/midx.c
index 145c6bd0913..0bfd2d802b6 100644
--- a/midx.c
+++ b/midx.c
@@ -11,6 +11,7 @@
 #include "trace2.h"
 #include "run-command.h"
 #include "repository.h"
+#include "chunk-format.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -799,18 +800,15 @@ static int write_midx_large_offsets(struct hashfile *f,
 static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
 			       struct string_list *packs_to_drop, unsigned flags)
 {
-	unsigned char cur_chunk, num_chunks = 0;
 	char *midx_name;
 	uint32_t i;
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
-	uint64_t header_size = 0;
-	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
-	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
+	struct chunkfile *cf;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name))
@@ -923,98 +921,34 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	if (ctx.m)
 		close_midx(ctx.m);
 
-	cur_chunk = 0;
-	num_chunks = ctx.large_offsets_needed ? 5 : 4;
-
 	if (ctx.nr - dropped_packs == 0) {
 		error(_("no pack files to index."));
 		result = 1;
 		goto cleanup;
 	}
 
-	header_size = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
-
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
-	chunk_offsets[cur_chunk] = header_size + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
-
-	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
-
-	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + MIDX_CHUNK_FANOUT_SIZE;
-
-	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OBJECTOFFSETS;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * the_hash_algo->rawsz;
-
-	cur_chunk++;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
-	if (ctx.large_offsets_needed) {
-		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
-
-		cur_chunk++;
-		chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
-					   ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
-	}
-
-	chunk_ids[cur_chunk] = 0;
-
-	for (i = 0; i <= num_chunks; i++) {
-		if (i && chunk_offsets[i] < chunk_offsets[i - 1])
-			BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
-			    chunk_offsets[i - 1],
-			    chunk_offsets[i]);
-
-		if (chunk_offsets[i] % MIDX_CHUNK_ALIGNMENT)
-			BUG("chunk offset %"PRIu64" is not properly aligned",
-			    chunk_offsets[i]);
-
-		hashwrite_be32(f, chunk_ids[i]);
-		hashwrite_be64(f, chunk_offsets[i]);
-	}
-
-	for (i = 0; i < num_chunks; i++) {
-		if (f->total + f->offset != chunk_offsets[i])
-			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
-			    chunk_offsets[i],
-			    f->total + f->offset,
-			    chunk_ids[i]);
+	cf = init_chunkfile(f);
 
-		switch (chunk_ids[i]) {
-			case MIDX_CHUNKID_PACKNAMES:
-				write_midx_pack_names(f, &ctx);
-				break;
+	add_chunk(cf, MIDX_CHUNKID_PACKNAMES,
+		  write_midx_pack_names, pack_name_concat_len);
+	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT,
+		  write_midx_oid_fanout, MIDX_CHUNK_FANOUT_SIZE);
+	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
+		  write_midx_oid_lookup, ctx.entries_nr * the_hash_algo->rawsz);
+	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
+		  write_midx_object_offsets,
+		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH);
 
-			case MIDX_CHUNKID_OIDFANOUT:
-				write_midx_oid_fanout(f, &ctx);
-				break;
-
-			case MIDX_CHUNKID_OIDLOOKUP:
-				write_midx_oid_lookup(f, &ctx);
-				break;
-
-			case MIDX_CHUNKID_OBJECTOFFSETS:
-				write_midx_object_offsets(f, &ctx);
-				break;
-
-			case MIDX_CHUNKID_LARGEOFFSETS:
-				write_midx_large_offsets(f, &ctx);
-				break;
-
-			default:
-				BUG("trying to write unknown chunk id %"PRIx32,
-				    chunk_ids[i]);
-		}
-	}
+	if (ctx.large_offsets_needed)
+		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
+			write_midx_large_offsets,
+			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
 
-	if (f->total + f->offset != chunk_offsets[num_chunks])
-		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
-		    f->total + f->offset,
-		    chunk_offsets[num_chunks]);
+	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
+	write_chunkfile(cf, &ctx);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
+	free_chunkfile(cf);
 	commit_lock_file(&lk);
 
 cleanup:
-- 
gitgitgadget

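The write flow used above (register each chunk with its expected size, then let the API call the writers in order) can be sketched in isolation. This is a minimal toy, not the code in chunk-format.c: every `toy_*` name, the fixed-size in-memory buffer, and the return codes are assumptions made for illustration. It shows only the size check that the technical docs in this series describe, where the bytes a callback writes must match the size declared for that chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_CHUNKS 8

/* Hypothetical in-memory stand-in for 'struct hashfile'. */
struct toy_out {
	unsigned char buf[256];
	size_t len;
};

typedef int (*toy_write_fn)(struct toy_out *out, void *data);

struct toy_chunkfile {
	struct {
		uint32_t id;
		size_t size;
		toy_write_fn fn;
	} chunks[TOY_MAX_CHUNKS];
	int nr;
};

/* Record a chunk id, its writer, and the size the writer promises. */
int toy_add_chunk(struct toy_chunkfile *cf, uint32_t id,
		  toy_write_fn fn, size_t size)
{
	assert(cf && fn);
	if (cf->nr == TOY_MAX_CHUNKS)
		return -1;
	cf->chunks[cf->nr].id = id;
	cf->chunks[cf->nr].size = size;
	cf->chunks[cf->nr].fn = fn;
	cf->nr++;
	return 0;
}

/*
 * Call every writer in registration order. Fail if a writer reports an
 * error or writes a different number of bytes than it declared.
 */
int toy_write_chunks(struct toy_chunkfile *cf, struct toy_out *out, void *data)
{
	int i;

	for (i = 0; i < cf->nr; i++) {
		size_t before = out->len;

		if (cf->chunks[i].fn(out, data))
			return -1;
		if (out->len - before != cf->chunks[i].size)
			return -1;
	}
	return 0;
}

/* Example writer: append the NUL-terminated string passed as 'data'. */
int toy_write_string(struct toy_out *out, void *data)
{
	const char *s = data;
	size_t n = strlen(s);

	if (n > sizeof(out->buf) - out->len)
		return -1;
	memcpy(out->buf + out->len, s, n);
	out->len += n;
	return 0;
}
```

Keeping the declared size next to the callback is what lets the table of contents be computed before any chunk data is written.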


* [PATCH 12/17] chunk-format: create read chunk API
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (10 preceding siblings ...)
  2021-01-26 16:01 ` [PATCH 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
@ 2021-01-26 16:01 ` Derrick Stolee via GitGitGadget
  2021-01-27  3:02   ` Taylor Blau
  2021-01-26 16:01 ` [PATCH 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add the capability to read the table of contents, then pair the chunks
with necessary logic using read_chunk_fn pointers. Callers will be added
in future changes, but the typical outline will be:

 1. initialize a 'struct chunkfile' with init_chunkfile(NULL).
 2. call read_table_of_contents().
 3. for each chunk to parse, call pair_chunk() with appropriate pointers.
 4. call free_chunkfile() to clear the 'struct chunkfile' data.

We are re-using the anonymous 'struct chunkfile' data, as it is internal
to the chunk-format API. This gives it essentially two modes: write and
read. If the same struct instance were used for both reads and writes,
then there would be failures.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 chunk-format.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++
 chunk-format.h | 21 +++++++++++++++++
 2 files changed, 85 insertions(+)

diff --git a/chunk-format.c b/chunk-format.c
index 2ce37ecc6bb..674d31d5e58 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -12,6 +12,8 @@ struct chunk_info {
 	uint32_t id;
 	uint64_t size;
 	chunk_write_fn write_fn;
+
+	const void *start;
 };
 
 struct chunkfile {
@@ -89,3 +91,65 @@ int write_chunkfile(struct chunkfile *cf, void *data)
 
 	return 0;
 }
+
+int read_table_of_contents(struct chunkfile *cf,
+			   const unsigned char *mfile,
+			   size_t mfile_size,
+			   uint64_t toc_offset,
+			   int toc_length)
+{
+	uint32_t chunk_id;
+	const unsigned char *table_of_contents = mfile + toc_offset;
+
+	ALLOC_GROW(cf->chunks, toc_length, cf->chunks_alloc);
+
+	while (toc_length--) {
+		uint64_t chunk_offset, next_chunk_offset;
+
+		chunk_id = get_be32(table_of_contents);
+		chunk_offset = get_be64(table_of_contents + 4);
+
+		if (!chunk_id) {
+			error(_("terminating chunk id appears earlier than expected"));
+			return 1;
+		}
+
+		table_of_contents += CHUNK_LOOKUP_WIDTH;
+		next_chunk_offset = get_be64(table_of_contents + 4);
+
+		if (next_chunk_offset < chunk_offset ||
+		    next_chunk_offset > mfile_size - the_hash_algo->rawsz) {
+			error(_("improper chunk offset(s) %"PRIx64" and %"PRIx64""),
+			      chunk_offset, next_chunk_offset);
+			return -1;
+		}
+
+		cf->chunks[cf->chunks_nr].id = chunk_id;
+		cf->chunks[cf->chunks_nr].start = mfile + chunk_offset;
+		cf->chunks[cf->chunks_nr].size = next_chunk_offset - chunk_offset;
+		cf->chunks_nr++;
+	}
+
+	chunk_id = get_be32(table_of_contents);
+	if (chunk_id) {
+		error(_("final chunk has non-zero id %"PRIx32""), chunk_id);
+		return -1;
+	}
+
+	return 0;
+}
+
+int pair_chunk(struct chunkfile *cf,
+	       uint32_t chunk_id,
+	       chunk_read_fn fn,
+	       void *data)
+{
+	int i;
+
+	for (i = 0; i < cf->chunks_nr; i++) {
+		if (cf->chunks[i].id == chunk_id)
+			return fn(cf->chunks[i].start, cf->chunks[i].size, data);
+	}
+
+	return CHUNK_NOT_FOUND;
+}
diff --git a/chunk-format.h b/chunk-format.h
index bfaed672813..250e08b8e6a 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -17,4 +17,25 @@ void add_chunk(struct chunkfile *cf,
 	       size_t size);
 int write_chunkfile(struct chunkfile *cf, void *data);
 
+int read_table_of_contents(struct chunkfile *cf,
+			   const unsigned char *mfile,
+			   size_t mfile_size,
+			   uint64_t toc_offset,
+			   int toc_length);
+
+/*
+ * When reading a table of contents, we find the chunk with matching 'id'
+ * then call its read_fn to populate the necessary 'data' based on the
+ * chunk start and size.
+ */
+typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
+			     size_t chunk_size, void *data);
+
+
+#define CHUNK_NOT_FOUND (-2)
+int pair_chunk(struct chunkfile *cf,
+	       uint32_t chunk_id,
+	       chunk_read_fn fn,
+	       void *data);
+
 #endif
-- 
gitgitgadget

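To make the parsing step in this patch concrete, here is a small self-contained sketch of reading a table of contents from a byte buffer. It is not the posted read_table_of_contents(): the `toy_*` names, the fixed chunk limit, and the single -1 error code are assumptions, and the trailing-hash adjustment from the patch is left out to keep the example short. It does keep the same checks: a zero ID must not appear early, offsets must be ordered and in bounds, and the final row must carry ID 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ROW_WIDTH 12 /* 4-byte id + 8-byte offset */
#define TOY_MAX_CHUNKS 8

struct toy_chunk {
	uint32_t id;
	const unsigned char *start;
	uint64_t size;
};

/* Read a big-endian (network order) 32-bit integer. */
static uint32_t toy_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a big-endian 64-bit integer. */
static uint64_t toy_be64(const unsigned char *p)
{
	return ((uint64_t)toy_be32(p) << 32) | toy_be32(p + 4);
}

/*
 * Fill 'out' with 'nr' chunks described by the table of contents at
 * 'toc_offset'. Returns 0 on success and -1 on malformed input.
 */
int toy_read_toc(const unsigned char *file, size_t file_size,
		 size_t toc_offset, int nr, struct toy_chunk *out)
{
	const unsigned char *row;
	int i;

	assert(file && out);
	if (nr < 0 || nr > TOY_MAX_CHUNKS)
		return -1;
	/* The table needs nr rows plus one terminating row. */
	if (toc_offset > file_size ||
	    (file_size - toc_offset) / TOY_ROW_WIDTH < (size_t)nr + 1)
		return -1;

	row = file + toc_offset;
	for (i = 0; i < nr; i++, row += TOY_ROW_WIDTH) {
		uint32_t id = toy_be32(row);
		uint64_t start = toy_be64(row + 4);
		uint64_t end = toy_be64(row + TOY_ROW_WIDTH + 4);

		if (!id)
			return -1; /* terminator appeared too early */
		if (end < start || end > file_size)
			return -1; /* unordered or out-of-bounds offsets */

		out[i].id = id;
		out[i].start = file + start;
		out[i].size = end - start;
	}

	/* The row after the last chunk must carry the zero id. */
	return toy_be32(row) ? -1 : 0;
}
```

Reading the next row's offset to get the current chunk's end is why the table needs one more row than there are chunks.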


* [PATCH 13/17] commit-graph: use chunk-format read API
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (11 preceding siblings ...)
  2021-01-26 16:01 ` [PATCH 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
@ 2021-01-26 16:01 ` Derrick Stolee via GitGitGadget
  2021-01-26 16:01 ` [PATCH 14/17] midx: " Derrick Stolee via GitGitGadget
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Instead of parsing the table of contents directly, use the chunk-format
API methods read_table_of_contents() and pair_chunk(). This conversion
temporarily loses the duplicate-chunk detection; a later change in this
series restores it.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 209 ++++++++++++++++++++--------------------
 t/t5318-commit-graph.sh |   2 +-
 2 files changed, 108 insertions(+), 103 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index b2c0f233eab..44c06d0fb67 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -299,15 +299,99 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	return 0;
 }
 
+static int graph_read_oid_fanout(const unsigned char *chunk_start,
+				 size_t chunk_size, void *data)
+{
+	struct commit_graph *g = (struct commit_graph *)data;
+	g->chunk_oid_fanout = (uint32_t*)chunk_start;
+	return 0;
+}
+
+static int graph_read_oid_lookup(const unsigned char *chunk_start,
+				 size_t chunk_size, void *data)
+{
+	struct commit_graph *g = (struct commit_graph *)data;
+	g->chunk_oid_lookup = chunk_start;
+	g->num_commits = chunk_size / g->hash_len;
+	return 0;
+}
+
+static int graph_read_data(const unsigned char *chunk_start,
+				 size_t chunk_size, void *data)
+{
+	struct commit_graph *g = (struct commit_graph *)data;
+	g->chunk_commit_data = chunk_start;
+	return 0;
+}
+
+static int graph_read_extra_edges(const unsigned char *chunk_start,
+				  size_t chunk_size, void *data)
+{
+	struct commit_graph *g = (struct commit_graph *)data;
+	g->chunk_extra_edges = chunk_start;
+	return 0;
+}
+
+static int graph_read_base_graphs(const unsigned char *chunk_start,
+				  size_t chunk_size, void *data)
+{
+	struct commit_graph *g = (struct commit_graph *)data;
+	g->chunk_base_graphs = chunk_start;
+	return 0;
+}
+
+static int graph_read_generation_data(const unsigned char *chunk_start,
+				      size_t chunk_size, void *data)
+{
+	struct commit_graph *g = (struct commit_graph *)data;
+	g->chunk_generation_data = chunk_start;
+	return 0;
+}
+
+static int graph_read_generation_overflow(const unsigned char *chunk_start,
+					  size_t chunk_size, void *data)
+{
+	struct commit_graph *g = (struct commit_graph *)data;
+	g->chunk_generation_data_overflow = chunk_start;
+	return 0;
+}
+
+static int graph_read_bloom_indices(const unsigned char *chunk_start,
+				    size_t chunk_size, void *data)
+{
+	struct commit_graph *g = (struct commit_graph *)data;
+	g->chunk_bloom_indexes = chunk_start;
+	return 0;
+}
+
+static int graph_read_bloom_data(const unsigned char *chunk_start,
+				  size_t chunk_size, void *data)
+{
+	struct commit_graph *g = (struct commit_graph *)data;
+	uint32_t hash_version;
+	g->chunk_bloom_data = chunk_start;
+	hash_version = get_be32(chunk_start);
+
+	if (hash_version != 1)
+		return 0;
+
+	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
+	g->bloom_filter_settings->hash_version = hash_version;
+	g->bloom_filter_settings->num_hashes = get_be32(chunk_start + 4);
+	g->bloom_filter_settings->bits_per_entry = get_be32(chunk_start + 8);
+	g->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;
+
+	return 0;
+}
+
 struct commit_graph *parse_commit_graph(struct repository *r,
 					void *graph_map, size_t graph_size)
 {
-	const unsigned char *data, *chunk_lookup;
-	uint32_t i;
+	const unsigned char *data;
 	struct commit_graph *graph;
-	uint64_t next_chunk_offset;
 	uint32_t graph_signature;
 	unsigned char graph_version, hash_version;
+	struct chunkfile *cf = NULL;
 
 	if (!graph_map)
 		return NULL;
@@ -356,108 +440,27 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		return NULL;
 	}
 
-	chunk_lookup = data + 8;
-	next_chunk_offset = get_be64(chunk_lookup + 4);
-	for (i = 0; i < graph->num_chunks; i++) {
-		uint32_t chunk_id;
-		uint64_t chunk_offset = next_chunk_offset;
-		int chunk_repeated = 0;
-
-		chunk_id = get_be32(chunk_lookup + 0);
-
-		chunk_lookup += GRAPH_CHUNKLOOKUP_WIDTH;
-		next_chunk_offset = get_be64(chunk_lookup + 4);
-
-		if (chunk_offset > graph_size - the_hash_algo->rawsz) {
-			error(_("commit-graph improper chunk offset %08x%08x"), (uint32_t)(chunk_offset >> 32),
-			      (uint32_t)chunk_offset);
-			goto free_and_return;
-		}
-
-		switch (chunk_id) {
-		case GRAPH_CHUNKID_OIDFANOUT:
-			if (graph->chunk_oid_fanout)
-				chunk_repeated = 1;
-			else
-				graph->chunk_oid_fanout = (uint32_t*)(data + chunk_offset);
-			break;
-
-		case GRAPH_CHUNKID_OIDLOOKUP:
-			if (graph->chunk_oid_lookup)
-				chunk_repeated = 1;
-			else {
-				graph->chunk_oid_lookup = data + chunk_offset;
-				graph->num_commits = (next_chunk_offset - chunk_offset)
-						     / graph->hash_len;
-			}
-			break;
-
-		case GRAPH_CHUNKID_DATA:
-			if (graph->chunk_commit_data)
-				chunk_repeated = 1;
-			else
-				graph->chunk_commit_data = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_GENERATION_DATA:
-			if (graph->chunk_generation_data)
-				chunk_repeated = 1;
-			else
-				graph->chunk_generation_data = data + chunk_offset;
-			break;
+	cf = init_chunkfile(NULL);
 
-		case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
-			if (graph->chunk_generation_data_overflow)
-				chunk_repeated = 1;
-			else
-				graph->chunk_generation_data_overflow = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_EXTRAEDGES:
-			if (graph->chunk_extra_edges)
-				chunk_repeated = 1;
-			else
-				graph->chunk_extra_edges = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_BASE:
-			if (graph->chunk_base_graphs)
-				chunk_repeated = 1;
-			else
-				graph->chunk_base_graphs = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_BLOOMINDEXES:
-			if (graph->chunk_bloom_indexes)
-				chunk_repeated = 1;
-			else if (r->settings.commit_graph_read_changed_paths)
-				graph->chunk_bloom_indexes = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_BLOOMDATA:
-			if (graph->chunk_bloom_data)
-				chunk_repeated = 1;
-			else if (r->settings.commit_graph_read_changed_paths) {
-				uint32_t hash_version;
-				graph->chunk_bloom_data = data + chunk_offset;
-				hash_version = get_be32(data + chunk_offset);
-
-				if (hash_version != 1)
-					break;
+	if (read_table_of_contents(cf, graph->data, graph_size,
+				   GRAPH_HEADER_SIZE, graph->num_chunks))
+		goto free_and_return;
 
-				graph->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
-				graph->bloom_filter_settings->hash_version = hash_version;
-				graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
-				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
-				graph->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;
-			}
-			break;
-		}
+	pair_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, graph_read_oid_fanout, graph);
+	pair_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);
+	pair_chunk(cf, GRAPH_CHUNKID_DATA, graph_read_data, graph);
+	pair_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES, graph_read_extra_edges, graph);
+	pair_chunk(cf, GRAPH_CHUNKID_BASE, graph_read_base_graphs, graph);
+	pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
+		   graph_read_generation_data, graph);
+	pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
+		   graph_read_generation_overflow, graph);
 
-		if (chunk_repeated) {
-			error(_("commit-graph chunk id %08x appears multiple times"), chunk_id);
-			goto free_and_return;
-		}
+	if (r->settings.commit_graph_read_changed_paths) {
+		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
+			   graph_read_bloom_indices, graph);
+		pair_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
+			   graph_read_bloom_data, graph);
 	}
 
 	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
@@ -474,9 +477,11 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 	if (verify_commit_graph_lite(graph))
 		goto free_and_return;
 
+	free_chunkfile(cf);
 	return graph;
 
 free_and_return:
+	free_chunkfile(cf);
 	free(graph->bloom_filter_settings);
 	free(graph);
 	return NULL;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index fa27df579a5..c7da741284e 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -564,7 +564,7 @@ test_expect_success 'detect bad hash version' '
 
 test_expect_success 'detect low chunk count' '
 	corrupt_graph_and_verify $GRAPH_BYTE_CHUNK_COUNT "\01" \
-		"missing the .* chunk"
+		"final chunk has non-zero id"
 '
 
 test_expect_success 'detect missing OID fanout chunk' '
-- 
gitgitgadget



* [PATCH 14/17] midx: use chunk-format read API
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (12 preceding siblings ...)
  2021-01-26 16:01 ` [PATCH 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
@ 2021-01-26 16:01 ` Derrick Stolee via GitGitGadget
  2021-01-27  3:06   ` Taylor Blau
  2021-01-26 16:01 ` [PATCH 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Instead of parsing the table of contents directly, use the chunk-format
API methods read_table_of_contents() and pair_chunk(). In particular, we
can use the return value of pair_chunk() to generate an error when a
required chunk is missing.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 103 ++++++++++++++++++++----------------
 t/t5319-multi-pack-index.sh |   6 +--
 2 files changed, 60 insertions(+), 49 deletions(-)

diff --git a/midx.c b/midx.c
index 0bfd2d802b6..dd019c00795 100644
--- a/midx.c
+++ b/midx.c
@@ -54,6 +54,51 @@ static char *get_midx_filename(const char *object_dir)
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
 
+static int midx_read_pack_names(const unsigned char *chunk_start,
+				size_t chunk_size, void *data)
+{
+	struct multi_pack_index *m = (struct multi_pack_index *)data;
+	m->chunk_pack_names = chunk_start;
+	return 0;
+}
+
+static int midx_read_oid_fanout(const unsigned char *chunk_start,
+				size_t chunk_size, void *data)
+{
+	struct multi_pack_index *m = (struct multi_pack_index *)data;
+	m->chunk_oid_fanout = (uint32_t *)chunk_start;
+
+	if (chunk_size != 4 * 256) {
+		error(_("multi-pack-index OID fanout is of the wrong size"));
+		return 1;
+	}
+	return 0;
+}
+
+static int midx_read_oid_lookup(const unsigned char *chunk_start,
+				size_t chunk_size, void *data)
+{
+	struct multi_pack_index *m = (struct multi_pack_index *)data;
+	m->chunk_oid_lookup = chunk_start;
+	return 0;
+}
+
+static int midx_read_offsets(const unsigned char *chunk_start,
+			     size_t chunk_size, void *data)
+{
+	struct multi_pack_index *m = (struct multi_pack_index *)data;
+	m->chunk_object_offsets = chunk_start;
+	return 0;
+}
+
+static int midx_read_large_offsets(const unsigned char *chunk_start,
+				   size_t chunk_size, void *data)
+{
+	struct multi_pack_index *m = (struct multi_pack_index *)data;
+	m->chunk_large_offsets = chunk_start;
+	return 0;
+}
+
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local)
 {
 	struct multi_pack_index *m = NULL;
@@ -65,6 +110,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 	char *midx_name = get_midx_filename(object_dir);
 	uint32_t i;
 	const char *cur_pack_name;
+	struct chunkfile *cf = NULL;
 
 	fd = git_open(midx_name);
 
@@ -114,58 +160,23 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 
 	m->num_packs = get_be32(m->data + MIDX_BYTE_NUM_PACKS);
 
-	for (i = 0; i < m->num_chunks; i++) {
-		uint32_t chunk_id = get_be32(m->data + MIDX_HEADER_SIZE +
-					     MIDX_CHUNKLOOKUP_WIDTH * i);
-		uint64_t chunk_offset = get_be64(m->data + MIDX_HEADER_SIZE + 4 +
-						 MIDX_CHUNKLOOKUP_WIDTH * i);
-
-		if (chunk_offset >= m->data_len)
-			die(_("invalid chunk offset (too large)"));
-
-		switch (chunk_id) {
-			case MIDX_CHUNKID_PACKNAMES:
-				m->chunk_pack_names = m->data + chunk_offset;
-				break;
-
-			case MIDX_CHUNKID_OIDFANOUT:
-				m->chunk_oid_fanout = (uint32_t *)(m->data + chunk_offset);
-				break;
-
-			case MIDX_CHUNKID_OIDLOOKUP:
-				m->chunk_oid_lookup = m->data + chunk_offset;
-				break;
-
-			case MIDX_CHUNKID_OBJECTOFFSETS:
-				m->chunk_object_offsets = m->data + chunk_offset;
-				break;
-
-			case MIDX_CHUNKID_LARGEOFFSETS:
-				m->chunk_large_offsets = m->data + chunk_offset;
-				break;
-
-			case 0:
-				die(_("terminating multi-pack-index chunk id appears earlier than expected"));
-				break;
-
-			default:
-				/*
-				 * Do nothing on unrecognized chunks, allowing future
-				 * extensions to add optional chunks.
-				 */
-				break;
-		}
-	}
+	cf = init_chunkfile(NULL);
 
-	if (!m->chunk_pack_names)
+	if (read_table_of_contents(cf, m->data, midx_size,
+				   MIDX_HEADER_SIZE, m->num_chunks))
+		goto cleanup_fail;
+
+	if (pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, midx_read_pack_names, m) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required pack-name chunk"));
-	if (!m->chunk_oid_fanout)
+	if (pair_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required OID fanout chunk"));
-	if (!m->chunk_oid_lookup)
+	if (pair_chunk(cf, MIDX_CHUNKID_OIDLOOKUP, midx_read_oid_lookup, m) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required OID lookup chunk"));
-	if (!m->chunk_object_offsets)
+	if (pair_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, midx_read_offsets, m) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required object offsets chunk"));
 
+	pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, midx_read_large_offsets, m);
+
 	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
 
 	m->pack_names = xcalloc(m->num_packs, sizeof(*m->pack_names));
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 297de502a94..ad4e878b65b 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -314,12 +314,12 @@ test_expect_success 'verify bad OID version' '
 
 test_expect_success 'verify truncated chunk count' '
 	corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\01" $objdir \
-		"missing required"
+		"final chunk has non-zero id"
 '
 
 test_expect_success 'verify extended chunk count' '
 	corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\07" $objdir \
-		"terminating multi-pack-index chunk id appears earlier than expected"
+		"terminating chunk id appears earlier than expected"
 '
 
 test_expect_success 'verify missing required chunk' '
@@ -329,7 +329,7 @@ test_expect_success 'verify missing required chunk' '
 
 test_expect_success 'verify invalid chunk offset' '
 	corrupt_midx_and_verify $MIDX_BYTE_CHUNK_OFFSET "\01" $objdir \
-		"invalid chunk offset (too large)"
+		"improper chunk offset(s)"
 '
 
 test_expect_success 'verify packnames out of order' '
-- 
gitgitgadget

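The required-versus-optional distinction this patch draws from pair_chunk()'s return value can be sketched on its own. The code below is a hedged illustration, not the midx.c code: the `toy_*` names, the two-field index struct, and the chunk ID constants are assumptions. It shows one way a caller might separate "chunk absent" (the -2 sentinel, as CHUNK_NOT_FOUND is defined in the series) from "chunk present but rejected by its callback".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CHUNK_NOT_FOUND (-2) /* same value as CHUNK_NOT_FOUND above */
#define TOY_ID_FANOUT 0x4f494446 /* "OIDF", illustrative */
#define TOY_ID_LARGE  0x4c4f4646 /* "LOFF", illustrative */

typedef int (*toy_read_fn)(const unsigned char *start, size_t size, void *data);

struct toy_chunk {
	uint32_t id;
	const unsigned char *start;
	size_t size;
};

struct toy_index {
	const unsigned char *fanout;
	const unsigned char *large_offsets;
};

/* Call 'fn' on the chunk with a matching id, or report that none exists. */
int toy_pair_chunk(const struct toy_chunk *chunks, int nr,
		   uint32_t id, toy_read_fn fn, void *data)
{
	int i;

	for (i = 0; i < nr; i++)
		if (chunks[i].id == id)
			return fn(chunks[i].start, chunks[i].size, data);
	return TOY_CHUNK_NOT_FOUND;
}

static int toy_read_fanout(const unsigned char *start, size_t size, void *data)
{
	struct toy_index *idx = data;

	if (size != 4 * 256)
		return 1; /* present, but the wrong size */
	idx->fanout = start;
	return 0;
}

static int toy_read_large(const unsigned char *start, size_t size, void *data)
{
	struct toy_index *idx = data;

	(void)size;
	idx->large_offsets = start;
	return 0;
}

/* Returns 0 on success, -1 if a required chunk is missing or malformed. */
int toy_load(const struct toy_chunk *chunks, int nr, struct toy_index *idx)
{
	int ret;

	assert(chunks && idx);

	/* Required chunk: any nonzero result is fatal. */
	ret = toy_pair_chunk(chunks, nr, TOY_ID_FANOUT, toy_read_fanout, idx);
	if (ret)
		return -1;

	/* Optional chunk: absence is fine, a callback error is not. */
	ret = toy_pair_chunk(chunks, nr, TOY_ID_LARGE, toy_read_large, idx);
	if (ret && ret != TOY_CHUNK_NOT_FOUND)
		return -1;

	return 0;
}
```

As posted, the callers in load_multi_pack_index() compare the result only against CHUNK_NOT_FOUND, so the nonzero return from midx_read_oid_fanout() on a wrongly sized chunk looks like it would go unnoticed. The sketch treats any nonzero result for a required chunk as fatal, which is one possible way to close that gap.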


* [PATCH 15/17] midx: use 64-bit multiplication for chunk sizes
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (13 preceding siblings ...)
  2021-01-26 16:01 ` [PATCH 14/17] midx: " Derrick Stolee via GitGitGadget
@ 2021-01-26 16:01 ` Derrick Stolee via GitGitGadget
  2021-01-26 16:01 ` [PATCH 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When calculating the sizes of certain chunks, we should always use
64-bit multiplication. This allows us to properly predict the chunk
sizes without risk of overflow.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/midx.c b/midx.c
index dd019c00795..47aaeb804b8 100644
--- a/midx.c
+++ b/midx.c
@@ -945,7 +945,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT,
 		  write_midx_oid_fanout, MIDX_CHUNK_FANOUT_SIZE);
 	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
-		  write_midx_oid_lookup, ctx.entries_nr * the_hash_algo->rawsz);
+		  write_midx_oid_lookup, (uint64_t)ctx.entries_nr * the_hash_algo->rawsz);
 	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
 		  write_midx_object_offsets,
 		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH);
@@ -953,7 +953,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	if (ctx.large_offsets_needed)
 		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
 			write_midx_large_offsets,
-			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
+			(uint64_t)ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
 
 	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
 	write_chunkfile(cf, &ctx);
-- 
gitgitgadget

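The overflow this cast guards against is easy to reproduce in a toy. The sketch below is an illustration only: `TOY_ENTRY_WIDTH` and both function names are hypothetical, and the width is deliberately declared as a 32-bit value so that the narrow product wraps on any platform. Whether the original expressions could actually wrap depends on the types of the real width macros and of `the_hash_algo->rawsz`, which this message does not show.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-entry width in bytes, declared as a 32-bit value. */
#define TOY_ENTRY_WIDTH ((uint32_t)8)

/* Product kept in 32 bits: the result wraps modulo 2^32. */
uint64_t toy_chunk_size_narrow(uint32_t nr)
{
	uint32_t product = nr * TOY_ENTRY_WIDTH;

	return product;
}

/* Widen one operand first, so the multiplication happens in 64 bits. */
uint64_t toy_chunk_size_wide(uint32_t nr)
{
	return (uint64_t)nr * TOY_ENTRY_WIDTH;
}
```

With 0x20000000 entries the narrow version returns 0 while the widened one returns 0x100000000. A table of contents built from the narrow value would record the wrong chunk boundaries.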


* [PATCH 16/17] chunk-format: restore duplicate chunk checks
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (14 preceding siblings ...)
  2021-01-26 16:01 ` [PATCH 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
@ 2021-01-26 16:01 ` Derrick Stolee via GitGitGadget
  2021-01-26 16:01 ` [PATCH 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Before refactoring into the chunk-format API, the commit-graph parsing
logic included checks for duplicate chunks. It is unlikely that we would
desire a chunk-based file format that allows duplicate chunk IDs in the
table of contents, so add duplicate checks into
read_table_of_contents().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 chunk-format.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/chunk-format.c b/chunk-format.c
index 674d31d5e58..3c833038096 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -14,6 +14,7 @@ struct chunk_info {
 	chunk_write_fn write_fn;
 
 	const void *start;
+	unsigned found:1;
 };
 
 struct chunkfile {
@@ -98,6 +99,7 @@ int read_table_of_contents(struct chunkfile *cf,
 			   uint64_t toc_offset,
 			   int toc_length)
 {
+	int i;
 	uint32_t chunk_id;
 	const unsigned char *table_of_contents = mfile + toc_offset;
 
@@ -124,6 +126,14 @@ int read_table_of_contents(struct chunkfile *cf,
 			return -1;
 		}
 
+		for (i = 0; i < cf->chunks_nr; i++) {
+			if (cf->chunks[i].id == chunk_id) {
+				error(_("duplicate chunk ID %"PRIx32" found"),
+					chunk_id);
+				return -1;
+			}
+		}
+
 		cf->chunks[cf->chunks_nr].id = chunk_id;
 		cf->chunks[cf->chunks_nr].start = mfile + chunk_offset;
 		cf->chunks[cf->chunks_nr].size = next_chunk_offset - chunk_offset;
-- 
gitgitgadget

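The duplicate check added here is a linear scan over the IDs recorded so far, performed before each new entry is stored. A stand-alone sketch of that idea follows; the `toy_*` names and the fixed capacity are assumptions, and the real code reports the problem through error() rather than a bare return code.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_CHUNKS 8

struct toy_toc {
	uint32_t ids[TOY_MAX_CHUNKS];
	int nr;
};

/*
 * Append 'id' unless it is already present. Returns 0 on success and -1
 * for a duplicate id or a full table.
 */
int toy_toc_add(struct toy_toc *toc, uint32_t id)
{
	int i;

	assert(toc);
	if (toc->nr == TOY_MAX_CHUNKS)
		return -1;

	/* Compare against every id stored so far. */
	for (i = 0; i < toc->nr; i++)
		if (toc->ids[i] == id)
			return -1;

	toc->ids[toc->nr++] = id;
	return 0;
}
```

The scan is quadratic in the number of chunks overall, which should be harmless for formats that carry only a handful of chunks.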


* [PATCH 17/17] chunk-format: add technical docs
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (15 preceding siblings ...)
  2021-01-26 16:01 ` [PATCH 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
@ 2021-01-26 16:01 ` Derrick Stolee via GitGitGadget
  2021-01-26 22:37 ` [PATCH 00/17] Refactor chunk-format into an API Junio C Hamano
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-26 16:01 UTC (permalink / raw)
  To: git; +Cc: me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The chunk-based file format is now an API in the code, but we should
also take time to document it as a file format. Specifically, it matches
the CHUNK LOOKUP sections of the commit-graph and multi-pack-index
files, but there are some commonalities that should be grouped in this
document.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/chunk-format.txt      | 54 +++++++++++++++++++
 .../technical/commit-graph-format.txt         |  3 ++
 Documentation/technical/pack-format.txt       |  3 ++
 3 files changed, 60 insertions(+)
 create mode 100644 Documentation/technical/chunk-format.txt

diff --git a/Documentation/technical/chunk-format.txt b/Documentation/technical/chunk-format.txt
new file mode 100644
index 00000000000..3db3792dea2
--- /dev/null
+++ b/Documentation/technical/chunk-format.txt
@@ -0,0 +1,54 @@
+Chunk-based file formats
+========================
+
+Some file formats in Git use a common concept of "chunks" to describe
+sections of the file. This allows structured access to a large file by
+scanning a small "table of contents" for the remaining data. This common
+format is used by the `commit-graph` and `multi-pack-index` files. See
+link:technical/pack-format.html[the `multi-pack-index` format] and
+link:technical/commit-graph-format.html[the `commit-graph` format] for
+how they use the chunks to describe structured data.
+
+A chunk-based file format begins with some header information custom to
+that format. That header should include enough information to identify
+the file type, format version, and number of chunks in the file. From this
+information, a reader can determine the start of the chunk-based region.
+
+The chunk-based region starts with a table of contents describing where
+each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
+where C is the number of chunks. Consider the following table:
+
+  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
+  |--------------------|------------------------|
+  | ID[0]              | OFFSET[0]              |
+  | ...                | ...                    |
+  | ID[C-1]            | OFFSET[C-1]            |
+  | 0x00000000         | OFFSET[C]              |
+
+Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
+Each integer is stored in network-byte order.
+
+The chunk identifier `ID[i]` is a label for the data stored within this
+file from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
+size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
+and `OFFSET[i]`. This requires that the chunk data appears contiguously
+in the same order as the table of contents.
+
+The final entry in the table of contents must have an ID of four zero
+bytes. This confirms that the table of contents is ending and provides
+the offset for the end of the chunk-based data.
+
+Note: The chunk-based format expects that the file contains _at least_ a
+trailing hash after `OFFSET[C]`.
+
+Functions for working with chunk-based file formats are declared in
+`chunk-format.h`. Using these methods provides extra checks that assist
+developers when creating new file formats, including:
+
+ 1. Writing and reading the table of contents.
+
+ 2. Verifying that the data written in a chunk matches the expected size
+    that was recorded in the table of contents.
+
+ 3. Checking that a table of contents describes offsets properly within
+    the file boundaries.
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index b6658eff188..87971c27dd7 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -61,6 +61,9 @@ CHUNK LOOKUP:
       the length using the next chunk position if necessary.) Each chunk
       ID appears at most once.
 
+  The CHUNK LOOKUP matches the table of contents from
+  link:technical/chunk-format.html[the chunk-based file format].
+
   The remaining data in the body is described one chunk at a time, and
   these chunks may be given in any order. Chunks are required unless
   otherwise specified.
diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index f96b2e605f3..2fb1e60d29e 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -301,6 +301,9 @@ CHUNK LOOKUP:
 	    (Chunks are provided in file-order, so you can infer the length
 	    using the next chunk position if necessary.)
 
+	The CHUNK LOOKUP matches the table of contents from
+	link:technical/chunk-format.html[the chunk-based file format].
+
 	The remaining data in the body is described one chunk at a time, and
 	these chunks may be given in any order. Chunks are required unless
 	otherwise specified.
-- 
gitgitgadget

* Re: [PATCH 00/17] Refactor chunk-format into an API
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (16 preceding siblings ...)
  2021-01-26 16:01 ` [PATCH 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
@ 2021-01-26 22:37 ` Junio C Hamano
  2021-01-27  2:29 ` Taylor Blau
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  19 siblings, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-01-26 22:37 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, szeder.dev, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This is a restart on the topic previously submitted [1] but dropped because
> ak/corrected-commit-date was still in progress. This version is based on
> that branch.

Nice to see that we have an endorsement on ak/corrected-commit-date
topic ;-)

I've scanned the patches in this round and they were a pleasant read.

I may have other comments after a more careful reading, but so far,
I am happy with what I see here.

Thanks.

* Re: [PATCH 01/17] commit-graph: anonymize data in chunk_write_fn
  2021-01-26 16:01 ` [PATCH 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
@ 2021-01-27  1:53   ` Chris Torek
  2021-01-27  2:36     ` Taylor Blau
  0 siblings, 1 reply; 120+ messages in thread
From: Chris Torek @ 2021-01-27  1:53 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git List, Taylor Blau, Junio C Hamano, René Scharfe,
	szeder.dev, Derrick Stolee, Derrick Stolee

Note: this is purely style, and minor, but I'll ask...

On Tue, Jan 26, 2021 at 8:08 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>  static int write_graph_chunk_fanout(struct hashfile *f,
> -                                   struct write_commit_graph_context *ctx)
> +                                   void *data)
>  {
> +       struct write_commit_graph_context *ctx =
> +               (struct write_commit_graph_context *)data;

Why bother with the cast on the last line here?  In C,
conversion from `void *` to `struct whatever *` is fine.

(the change itself looks fine, btw)

Chris

* Re: [PATCH 00/17] Refactor chunk-format into an API
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (17 preceding siblings ...)
  2021-01-26 22:37 ` [PATCH 00/17] Refactor chunk-format into an API Junio C Hamano
@ 2021-01-27  2:29 ` Taylor Blau
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  19 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-27  2:29 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, l.s.r, szeder.dev, Derrick Stolee

On Tue, Jan 26, 2021 at 04:01:09PM +0000, Derrick Stolee via GitGitGadget wrote:
> This version also changes the approach to use a more dynamic interaction
> with a struct chunkfile pointer. This idea is credited to Taylor Blau [2],
> but I started again from scratch. I also go further to make struct chunkfile
> anonymous to API consumers. It is defined only in chunk-format.c, which
> should hopefully deter future users from interacting with that data
> directly.
>
> [2] https://lore.kernel.org/git/X8%2FI%2FRzXZksio+ri@nand.local/

Great; I am very happy that you found my patch to be useful. I'm glad
that you decided to start from scratch, too, since as I recall there
were some unresolved test issues that I punted on in case you decided to
abandon the topic altogether.

> This combined API is beneficial to reduce duplicated logic. Or rather, to
> ensure that similar file formats have similar protections against bad data.
> The multi-pack-index code did not have as many guards as the commit-graph
> code did, but now they both share a common base that checks for things like
> duplicate chunks or offsets outside the size of the file.

Definitely good.

> Here are some stats for the end-to-end change:
>
>  * 638 insertions(+), 456 deletions(-).
>  * commit-graph.c: 171 insertions(+), 192 deletions(-)
>  * midx.c: 196 insertions(+), 260 deletions(-)
>
> While there is an overall increase to the code size, the consumers do get a
> bit smaller. Boilerplate things like abstracting method to match
> chunk_write_fn and chunk_read_fn make up a lot of these insertions. The
> "interesting" code gets a lot smaller and cleaner.

Like I said in [1], I don't think a net +182 line diff is reason alone
not to pursue this topic. I don't think that a chunked index v3 will
come as part of my work on the on-disk revindex format, but I do think
that it's something brian may be interested in. So, I'm feeling rather
certain that we'll eventually have new callers, at which point this will
reduce duplication overall.

[1]: https://lore.kernel.org/git/X8%2FK1dUgUmwp8ZOv@nand.local/

Thanks,
Taylor

* Re: [PATCH 01/17] commit-graph: anonymize data in chunk_write_fn
  2021-01-27  1:53   ` Chris Torek
@ 2021-01-27  2:36     ` Taylor Blau
  0 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-27  2:36 UTC (permalink / raw)
  To: Chris Torek
  Cc: Derrick Stolee via GitGitGadget, Git List, Taylor Blau,
	Junio C Hamano, René Scharfe, szeder.dev, Derrick Stolee,
	Derrick Stolee

On Tue, Jan 26, 2021 at 05:53:39PM -0800, Chris Torek wrote:
> Note: this is purely style, and minor, but I'll ask...
>
> On Tue, Jan 26, 2021 at 8:08 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >  static int write_graph_chunk_fanout(struct hashfile *f,
> > -                                   struct write_commit_graph_context *ctx)
> > +                                   void *data)
> >  {
> > +       struct write_commit_graph_context *ctx =
> > +               (struct write_commit_graph_context *)data;
>
> Why bother with the cast on the last line here?  In C,
> conversion from `void *` to `struct whatever *` is fine.
>
> (the change itself looks fine, btw)

Agreed. It's not a correctness issue, but I find these unnecessary casts
to detract from readability. If you do end up rerolling this series,
I'd rather see

    struct write_commit_graph_context *ctx = data;

...but I don't think that this (non-)issue alone is worth a reroll.
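
For illustration, here is a minimal standalone sketch of the cast-free form.
The struct and function names are hypothetical stand-ins, not code from the
series; it only relies on the C rule that `void *` converts implicitly to
any object pointer type (C++ would still require the cast).

```c
/* Hypothetical context type, standing in for write_commit_graph_context. */
struct example_context {
	int commits_nr;
};

/* Generic callback signature: the context arrives as an untyped pointer. */
typedef int (*example_write_fn)(void *data);

static int example_write_chunk(void *data)
{
	/* No cast needed: 'void *' converts implicitly in C. */
	struct example_context *ctx = data;

	return ctx->commits_nr;
}

/* Invoke the callback with the caller-supplied context. */
static int example_run(example_write_fn fn, void *data)
{
	return fn(data);
}
```

The caller passes `&ctx` directly; the conversion to `void *` at the call
site is implicit as well.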

Thanks,
Taylor

* Re: [PATCH 02/17] chunk-format: create chunk format write API
  2021-01-26 16:01 ` [PATCH 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
@ 2021-01-27  2:42   ` Taylor Blau
  2021-01-27 13:49     ` Derrick Stolee
  0 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-01-27  2:42 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

On Tue, Jan 26, 2021 at 04:01:11PM +0000, Derrick Stolee via GitGitGadget wrote:
> +/*
> + * When writing a chunk-based file format, collect the chunks in
> + * an array of chunk_info structs. The size stores the _expected_
> + * amount of data that will be written by write_fn.
> + */
> +struct chunk_info {
> +	uint32_t id;
> +	uint64_t size;

Hmm. Would we not want an off_t to indicate the size here?

I wondered briefly if we even needed a size field at all, since calling
write_fn would tell us the number of bytes written. But I suppose you
want to know ahead of time so that you can write the file in one pass
(beginning with the table of contents, which certainly needs to know the
size).

> +	/* Trailing entry marks the end of the chunks */
> +	hashwrite_be32(cf->f, 0);
> +	hashwrite_be64(cf->f, cur_offset);
> +
> +	for (i = 0; i < cf->chunks_nr; i++) {
> +		uint64_t start_offset = cf->f->total + cf->f->offset;
> +		int result = cf->chunks[i].write_fn(cf->f, data);
> +
> +		if (result)
> +			return result;
> +
> +		if (cf->f->total + cf->f->offset != start_offset + cf->chunks[i].size)

I don't think this is a practical concern, but a malicious caller could
overflow this by passing a bogus "size" parameter. Maybe:

    uint64_t end_offset = ...;

    if (end_offset - start_offset != cf->chunks[i].size)
      BUG(...)

?
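
Spelled out as a standalone sketch (the helper name and its parameters are
hypothetical, not part of the patch), the check could look like the
following. One caveat worth stating: with unsigned 64-bit arithmetic the
addition form and the subtraction form agree modulo 2^64, so this sketch
also rejects an end offset that lies before the start offset.

```c
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if exactly 'expected' bytes lie between
 * 'start_offset' and 'end_offset', and 0 otherwise.
 */
static int example_wrote_expected(uint64_t start_offset,
				  uint64_t end_offset,
				  uint64_t expected)
{
	/* Offsets only grow while writing; anything else is bogus. */
	if (end_offset < start_offset)
		return 0;

	/*
	 * With end_offset >= start_offset the difference cannot wrap,
	 * whereas 'start_offset + expected' can wrap for a huge 'expected'.
	 */
	return end_offset - start_offset == expected;
}
```

A real caller would presumably turn a zero return into a BUG(), as in the
snippet above.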

> diff --git a/chunk-format.h b/chunk-format.h
> new file mode 100644
> index 00000000000..bfaed672813
> --- /dev/null
> +++ b/chunk-format.h
> @@ -0,0 +1,20 @@
> +#ifndef CHUNK_FORMAT_H
> +#define CHUNK_FORMAT_H
> +
> +#include "git-compat-util.h"
> +
> +struct hashfile;
> +struct chunkfile;
> +
> +struct chunkfile *init_chunkfile(struct hashfile *f);
> +void free_chunkfile(struct chunkfile *cf);
> +int get_num_chunks(struct chunkfile *cf);
> +typedef int (*chunk_write_fn)(struct hashfile *f,
> +			      void *data);
> +void add_chunk(struct chunkfile *cf,
> +	       uint64_t id,
> +	       chunk_write_fn fn,
> +	       size_t size);
> +int write_chunkfile(struct chunkfile *cf, void *data);

Very clean API.

Thanks,
Taylor

* Re: [PATCH 03/17] commit-graph: use chunk-format write API
  2021-01-26 16:01 ` [PATCH 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
@ 2021-01-27  2:47   ` Taylor Blau
  0 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-27  2:47 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

On Tue, Jan 26, 2021 at 04:01:12PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The commit-graph write logic is ready to make use of the chunk-format
> write API. Each chunk write method is already in the correct prototype.
> We only need to use the 'struct chunkfile' pointer and the correct API
> calls.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Nicely done. The majority of this patch was remarkably easy to read,
which I attribute to you doing the necessary prep work to make the
callbacks usable by the new API. Thank you.

> @@ -1941,6 +1896,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>
>  	close_commit_graph(ctx->r->objects);
>  	finalize_hashfile(f, file_hash.hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
> +	free_chunkfile(cf);

Since chunkfiles are so tightly coupled to hashfiles (i.e., you can only
"construct" a chunkfile given a 'struct hashfile*'), I wonder whether
this should be:

    finalize_chunkfile(cf, ...)

instead. It seems kind of weird to give up ownership of 'f' down to the
chunkfile API only to reach down into it again.

I could even buy that you'd always want to finalize and free a chunkfile
at the same time, and so perhaps the calls could be combined, but that
may be a step too far.

Thanks,
Taylor

* Re: [PATCH 04/17] midx: rename pack_info to write_midx_context
  2021-01-26 16:01 ` [PATCH 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
@ 2021-01-27  2:49   ` Taylor Blau
  0 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-27  2:49 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

On Tue, Jan 26, 2021 at 04:01:13PM +0000, Derrick Stolee via GitGitGadget wrote:
> This change only renames "struct pack_info" to "struct
> write_midx_context" and the names of instances from "packs" to "ctx". In
> future changes, we will expand the data inside "struct
> write_midx_context" and align our chunk-writing method with the
> chunk-format API.

Thanks for saying that; that makes clear what is (and isn't) going on
here.

> @@ -463,37 +463,37 @@ struct pack_list {
>  static void add_pack_to_midx(const char *full_path, size_t full_path_len,
>  			     const char *file_name, void *data)
>  {
> -	struct pack_list *packs = (struct pack_list *)data;
> +	struct write_midx_context *ctx = (struct write_midx_context *)data;

Same comments as earlier about the unnecessary cast on the right-hand
side of this (and the below) assignment.

Otherwise this patch looks obviously fine to me.


Thanks,
Taylor

* Re: [PATCH 12/17] chunk-format: create read chunk API
  2021-01-26 16:01 ` [PATCH 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
@ 2021-01-27  3:02   ` Taylor Blau
  0 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-27  3:02 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

On Tue, Jan 26, 2021 at 04:01:21PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/chunk-format.c b/chunk-format.c
> index 2ce37ecc6bb..674d31d5e58 100644
> --- a/chunk-format.c
> +++ b/chunk-format.c
> @@ -12,6 +12,8 @@ struct chunk_info {
>  	uint32_t id;
>  	uint64_t size;
>  	chunk_write_fn write_fn;
> +
> +	const void *start;

It may be clearer to fold both of these into an anonymous union along
with an enum to indicate which mode we're in. But, I could also buy that
that is more error prone, so perhaps just a comment along the lines of
"exactly one of these is NULL" would suffice, too.
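
To make the union idea concrete, a rough sketch follows. All `example_`
names are hypothetical, and the union is given a name (`u`) on the
assumption that C11 anonymous unions may not be available to the project.

```c
#include <stddef.h>
#include <stdint.h>

struct hashfile; /* opaque here; only used through a pointer */

typedef int (*chunk_write_fn)(struct hashfile *f, void *data);

/* Hypothetical tag recording which union member is valid. */
enum example_chunk_mode {
	EXAMPLE_CHUNK_WRITE,
	EXAMPLE_CHUNK_READ
};

struct example_chunk_info {
	uint32_t id;
	uint64_t size;
	enum example_chunk_mode mode;
	union {
		chunk_write_fn write_fn; /* valid in EXAMPLE_CHUNK_WRITE mode */
		const void *start;       /* valid in EXAMPLE_CHUNK_READ mode */
	} u;
};

/* Return the mapped start of a chunk, or NULL if it was set up for writing. */
static const void *example_chunk_start(const struct example_chunk_info *c)
{
	return c->mode == EXAMPLE_CHUNK_READ ? c->u.start : NULL;
}
```

The cost is that every access has to go through the tag, which is the
"more error prone" side of the trade-off.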

>  };
>
>  struct chunkfile {
> @@ -89,3 +91,65 @@ int write_chunkfile(struct chunkfile *cf, void *data)
>
>  	return 0;
>  }
> +
> +int read_table_of_contents(struct chunkfile *cf,
> +			   const unsigned char *mfile,
> +			   size_t mfile_size,

Assuming that mfile and mfile_size are a pointer to a memory mapped
region and its size? If so, a nit is that I'd expect "data" and "size"
instead of "mfile".

I think that it's probably going too far to have the chunkfile API
handle mapping its own memory, so in that way I don't think it's wrong
for the callers to be handling that.

OTOH, it does seem a little weird to temporarily hand off ownership like
this. I don't think I have a better suggestion, though.

The implementation of this function looks good to me.
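
For readers following along without the patch at hand, here is a
simplified sketch of what parsing such a table of contents involves, based
only on the format description (12-byte rows, big-endian integers, a
terminating row with a zero ID). It is not the patch's implementation: the
`example_` names are hypothetical, and duplicate-ID detection is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_TOC_ROW_WIDTH 12 /* 4-byte ID + 8-byte offset */

struct example_chunk {
	uint32_t id;
	uint64_t offset;
	uint64_t size;
};

/* Read big-endian integers byte by byte (portable and alignment-safe). */
static uint32_t example_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t example_get_be64(const unsigned char *p)
{
	return ((uint64_t)example_get_be32(p) << 32) | example_get_be32(p + 4);
}

/*
 * Parse 'nr' chunk rows plus the terminating row, starting at 'toc_offset'.
 * Returns 0 on success and -1 on malformed input. 'out' must have room
 * for 'nr' entries.
 */
static int example_parse_toc(const unsigned char *data, size_t size,
			     uint64_t toc_offset, size_t nr,
			     struct example_chunk *out)
{
	const unsigned char *row;
	size_t i;

	/* All nr + 1 rows must fit inside the mapped region. */
	if (toc_offset > size ||
	    nr >= (size - toc_offset) / EXAMPLE_TOC_ROW_WIDTH)
		return -1;

	row = data + toc_offset;
	for (i = 0; i < nr; i++, row += EXAMPLE_TOC_ROW_WIDTH) {
		uint64_t start = example_get_be64(row + 4);
		/* The next row's offset marks where this chunk ends. */
		uint64_t end = example_get_be64(row + EXAMPLE_TOC_ROW_WIDTH + 4);

		out[i].id = example_get_be32(row);
		/* ID zero is reserved for the terminating row. */
		if (!out[i].id)
			return -1;
		/* Chunks must be in order and end inside the file. */
		if (end < start || end > size)
			return -1;
		out[i].offset = start;
		out[i].size = end - start;
	}

	/* The final row must carry the all-zero chunk ID. */
	return example_get_be32(row) ? -1 : 0;
}
```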

> +int pair_chunk(struct chunkfile *cf,
> +	       uint32_t chunk_id,
> +	       chunk_read_fn fn,
> +	       void *data)
> +{
> +	int i;
> +
> +	for (i = 0; i < cf->chunks_nr; i++) {
> +		if (cf->chunks[i].id == chunk_id)
> +			return fn(cf->chunks[i].start, cf->chunks[i].size, data);
> +	}
> +
> +	return CHUNK_NOT_FOUND;
> +}
> diff --git a/chunk-format.h b/chunk-format.h
> index bfaed672813..250e08b8e6a 100644
> --- a/chunk-format.h
> +++ b/chunk-format.h
> @@ -17,4 +17,25 @@ void add_chunk(struct chunkfile *cf,
>  	       size_t size);
>  int write_chunkfile(struct chunkfile *cf, void *data);
>
> +int read_table_of_contents(struct chunkfile *cf,
> +			   const unsigned char *mfile,
> +			   size_t mfile_size,
> +			   uint64_t toc_offset,
> +			   int toc_length);
> +
> +/*
> + * When reading a table of contents, we find the chunk with matching 'id'
> + * then call its read_fn to populate the necessary 'data' based on the
> + * chunk start and size.
> + */
> +typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
> +			     size_t chunk_size, void *data);
> +
> +
> +#define CHUNK_NOT_FOUND (-2)
> +int pair_chunk(struct chunkfile *cf,
> +	       uint32_t chunk_id,
> +	       chunk_read_fn fn,
> +	       void *data);

From reading the implementation, I take it that this function calls fn
with the location and size of the requested chunk, along with the user
supplied data.

I'm not sure that "pair" gives me that same sense. Maybe "read" or
"lookup" would be better?

Dunno.

Thanks,
Taylor

* Re: [PATCH 14/17] midx: use chunk-format read API
  2021-01-26 16:01 ` [PATCH 14/17] midx: " Derrick Stolee via GitGitGadget
@ 2021-01-27  3:06   ` Taylor Blau
  2021-01-27 13:50     ` Derrick Stolee
  0 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-01-27  3:06 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

On Tue, Jan 26, 2021 at 04:01:23PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Instead of parsing the table of contents directly, use the chunk-format
> API methods read_table_of_contents() and pair_chunk(). In particular, we
> can use the return value of pair_chunk() to generate an error when a
> required chunk is missing.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  midx.c                      | 103 ++++++++++++++++++++----------------
>  t/t5319-multi-pack-index.sh |   6 +--
>  2 files changed, 60 insertions(+), 49 deletions(-)
>
> diff --git a/midx.c b/midx.c
> index 0bfd2d802b6..dd019c00795 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -54,6 +54,51 @@ static char *get_midx_filename(const char *object_dir)
>  	return xstrfmt("%s/pack/multi-pack-index", object_dir);
>  }
>
> +static int midx_read_pack_names(const unsigned char *chunk_start,
> +				size_t chunk_size, void *data)
> +{
> +	struct multi_pack_index *m = (struct multi_pack_index *)data;
> +	m->chunk_pack_names = chunk_start;
> +	return 0;
> +}

There are a lot of these callbacks that just assign some 'void **' to
point at chunk_start.

Maybe a good use of the "pair_chunk" name would be something like:

    int pair_chunk(struct chunkfile *cf, uint32_t id, const unsigned char **p);

which does the same as what you wrote here. So instead of what you
wrote, you could write:

    pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names);

This would be in addition to the richer callback-style function which
allows the caller greater flexibility (e.g., for the Bloom filter
related readers in the commit-graph code).
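
A rough sketch of how the two flavors could sit side by side is below. The
`example_` names and the simplified `struct example_chunkfile` are
hypothetical stand-ins for the real (opaque) types; the pointer flavor is
written as a thin wrapper around the callback flavor.

```c
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_CHUNK_NOT_FOUND (-2)

struct example_chunk {
	uint32_t id;
	const unsigned char *start;
	size_t size;
};

/* Hypothetical, simplified stand-in for the opaque 'struct chunkfile'. */
struct example_chunkfile {
	const struct example_chunk *chunks;
	size_t chunks_nr;
};

typedef int (*example_read_fn)(const unsigned char *chunk_start,
			       size_t chunk_size, void *data);

/* Callback flavor: find the chunk and hand its location to 'fn'. */
static int example_read_chunk(const struct example_chunkfile *cf, uint32_t id,
			      example_read_fn fn, void *data)
{
	size_t i;

	for (i = 0; i < cf->chunks_nr; i++)
		if (cf->chunks[i].id == id)
			return fn(cf->chunks[i].start, cf->chunks[i].size, data);

	return EXAMPLE_CHUNK_NOT_FOUND;
}

/* Shared callback that just stores the chunk start in '*data'. */
static int example_assign_start(const unsigned char *chunk_start,
				size_t chunk_size, void *data)
{
	const unsigned char **p = data;

	(void)chunk_size; /* unused in the simple case */
	*p = chunk_start;
	return 0;
}

/* Pointer flavor: covers callers that only need the chunk's address. */
static int example_pair_chunk(const struct example_chunkfile *cf, uint32_t id,
			      const unsigned char **p)
{
	return example_read_chunk(cf, id, example_assign_start, p);
}
```

With something like this, the many one-line callbacks collapse into one
shared helper, and only readers that need the chunk size or extra
validation have to supply their own callback.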

Thanks,
Taylor

* Re: [PATCH 02/17] chunk-format: create chunk format write API
  2021-01-27  2:42   ` Taylor Blau
@ 2021-01-27 13:49     ` Derrick Stolee
  0 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee @ 2021-01-27 13:49 UTC (permalink / raw)
  To: Taylor Blau, Derrick Stolee via GitGitGadget
  Cc: git, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

On 1/26/2021 9:42 PM, Taylor Blau wrote:
> On Tue, Jan 26, 2021 at 04:01:11PM +0000, Derrick Stolee via GitGitGadget wrote:
>> +/*
>> + * When writing a chunk-based file format, collect the chunks in
>> + * an array of chunk_info structs. The size stores the _expected_
>> + * amount of data that will be written by write_fn.
>> + */
>> +struct chunk_info {
>> +	uint32_t id;
>> +	uint64_t size;
> 
> Hmm. Would we not want an off_t to indicate the size here?
> 
> I wondered briefly if we even needed a size field at all, since calling
> write_fn would tell us the number of bytes written. But I suppose you
> want to know ahead of time so that you can write the file in one pass
> (beginning with the table of contents, which certainly needs to know the
> size).

Is off_t 64-bits on a 32-bit machine? This is intentionally typed
to be "64 bits no matter what" because it correlates with the file
format's size for the chunk offsets.

>> +		if (cf->f->total + cf->f->offset != start_offset + cf->chunks[i].size)
> 
> I don't think this is a practical concern, but a malicious caller could
> overflow this by passing a bogus "size" parameter. Maybe:
> 
>     uint64_t end_offset = ...;
> 
>     if (end_offset - start_offset != cf->chunks[i].size)
>       BUG(...)

Sure.

Thanks,
-Stolee

* Re: [PATCH 14/17] midx: use chunk-format read API
  2021-01-27  3:06   ` Taylor Blau
@ 2021-01-27 13:50     ` Derrick Stolee
  0 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee @ 2021-01-27 13:50 UTC (permalink / raw)
  To: Taylor Blau, Derrick Stolee via GitGitGadget
  Cc: git, gitster, l.s.r, szeder.dev, Derrick Stolee, Derrick Stolee

On 1/26/2021 10:06 PM, Taylor Blau wrote:
> On Tue, Jan 26, 2021 at 04:01:23PM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> Instead of parsing the table of contents directly, use the chunk-format
>> API methods read_table_of_contents() and pair_chunk(). In particular, we
>> can use the return value of pair_chunk() to generate an error when a
>> required chunk is missing.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  midx.c                      | 103 ++++++++++++++++++++----------------
>>  t/t5319-multi-pack-index.sh |   6 +--
>>  2 files changed, 60 insertions(+), 49 deletions(-)
>>
>> diff --git a/midx.c b/midx.c
>> index 0bfd2d802b6..dd019c00795 100644
>> --- a/midx.c
>> +++ b/midx.c
>> @@ -54,6 +54,51 @@ static char *get_midx_filename(const char *object_dir)
>>  	return xstrfmt("%s/pack/multi-pack-index", object_dir);
>>  }
>>
>> +static int midx_read_pack_names(const unsigned char *chunk_start,
>> +				size_t chunk_size, void *data)
>> +{
>> +	struct multi_pack_index *m = (struct multi_pack_index *)data;
>> +	m->chunk_pack_names = chunk_start;
>> +	return 0;
>> +}
> 
> There are a lot of these callbacks that just assign some 'void **' to
> point at chunk_start.
> 
> Maybe a good use of the "pair_chunk" name would be something like:
> 
>     int pair_chunk(struct chunkfile *cf, uint32_t id, const unsigned char **p);
> 
> which does the same as what you wrote here. So instead of what you
> wrote, you could instead:
> 
>     pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names);
> 
> This would be in addition to the richer callback-style function which
> allows the caller greater flexibility (e.g., for the Bloom filter
> related readers in the commit-graph code).

You're right that _most_ callers just want to assign a pointer,
so this mechanism would be better. I'll make a different function,
read_chunk() perhaps, that relies on a callback for advanced users.

Thanks,
-Stolee

* [PATCH v2 00/17] Refactor chunk-format into an API
  2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                   ` (18 preceding siblings ...)
  2021-01-27  2:29 ` Taylor Blau
@ 2021-01-27 15:01 ` Derrick Stolee via GitGitGadget
  2021-01-27 15:01   ` [PATCH v2 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
                     ` (19 more replies)
  19 siblings, 20 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee

This is a restart on the topic previously submitted [1] but dropped because
ak/corrected-commit-date was still in progress. This version is based on
that branch.

[1]
https://lore.kernel.org/git/pull.804.git.1607012215.gitgitgadget@gmail.com/

This version also changes the approach to use a more dynamic interaction
with a struct chunkfile pointer. This idea is credited to Taylor Blau [2],
but I started again from scratch. I also go further to make struct chunkfile
anonymous to API consumers. It is defined only in chunk-format.c, which
should hopefully deter future users from interacting with that data
directly.

[2] https://lore.kernel.org/git/X8%2FI%2FRzXZksio+ri@nand.local/

This combined API is beneficial to reduce duplicated logic. Or rather, to
ensure that similar file formats have similar protections against bad data.
The multi-pack-index code did not have as many guards as the commit-graph
code did, but now they both share a common base that checks for things like
duplicate chunks or offsets outside the size of the file.

Here are some stats for the end-to-end change:

 * 570 insertions(+), 456 deletions(-).
 * commit-graph.c: 107 insertions(+), 192 deletions(-)
 * midx.c: 164 insertions(+), 260 deletions(-)

While there is an overall increase to the code size, the consumers do get
smaller. Boilerplate such as abstracting methods to match chunk_write_fn
and chunk_read_fn make up a lot of these insertions. The "interesting" code
gets a lot smaller and cleaner.


Updates in V2
=============

 * The method pair_chunk() now automatically sets a pointer while
   read_chunk() uses the callback. This greatly reduces the code size.

 * Pointer casts are now implicit instead of explicit.

 * Extra care is taken to not overflow when verifying chunk sizes on write.

Thanks, -Stolee

Derrick Stolee (17):
  commit-graph: anonymize data in chunk_write_fn
  chunk-format: create chunk format write API
  commit-graph: use chunk-format write API
  midx: rename pack_info to write_midx_context
  midx: use context in write_midx_pack_names()
  midx: add entries to write_midx_context
  midx: add pack_perm to write_midx_context
  midx: add num_large_offsets to write_midx_context
  midx: return success/failure in chunk write methods
  midx: drop chunk progress during write
  midx: use chunk-format API in write_midx_internal()
  chunk-format: create read chunk API
  commit-graph: use chunk-format read API
  midx: use chunk-format read API
  midx: use 64-bit multiplication for chunk sizes
  chunk-format: restore duplicate chunk checks
  chunk-format: add technical docs

 Documentation/technical/chunk-format.txt      |  54 +++
 .../technical/commit-graph-format.txt         |   3 +
 Documentation/technical/pack-format.txt       |   3 +
 Makefile                                      |   1 +
 chunk-format.c                                | 181 ++++++++
 chunk-format.h                                |  53 +++
 commit-graph.c                                | 299 +++++-------
 midx.c                                        | 424 +++++++-----------
 t/t5318-commit-graph.sh                       |   2 +-
 t/t5319-multi-pack-index.sh                   |   6 +-
 10 files changed, 570 insertions(+), 456 deletions(-)
 create mode 100644 Documentation/technical/chunk-format.txt
 create mode 100644 chunk-format.c
 create mode 100644 chunk-format.h


base-commit: 5a3b130cad0d5c770f766e3af6d32b41766374c0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-848%2Fderrickstolee%2Fchunk-format%2Frefactor-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-848/derrickstolee/chunk-format/refactor-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/848

Range-diff vs v1:

  1:  09b32829e4f !  1:  243dcec9436 commit-graph: anonymize data in chunk_write_fn
     @@ commit-graph.c: struct write_commit_graph_context {
      -				    struct write_commit_graph_context *ctx)
      +				    void *data)
       {
     -+	struct write_commit_graph_context *ctx =
     -+		(struct write_commit_graph_context *)data;
     ++	struct write_commit_graph_context *ctx = data;
       	int i, count = 0;
       	struct commit **list = ctx->commits.list;
       
     @@ commit-graph.c: static int write_graph_chunk_fanout(struct hashfile *f,
      -				  struct write_commit_graph_context *ctx)
      +				  void *data)
       {
     -+	struct write_commit_graph_context *ctx =
     -+		(struct write_commit_graph_context *)data;
     ++	struct write_commit_graph_context *ctx = data;
       	struct commit **list = ctx->commits.list;
       	int count;
       	for (count = 0; count < ctx->commits.nr; count++, list++) {
     @@ commit-graph.c: static const unsigned char *commit_to_sha1(size_t index, void *t
      -				  struct write_commit_graph_context *ctx)
      +				  void *data)
       {
     -+	struct write_commit_graph_context *ctx =
     -+		(struct write_commit_graph_context *)data;
     ++	struct write_commit_graph_context *ctx = data;
       	struct commit **list = ctx->commits.list;
       	struct commit **last = ctx->commits.list + ctx->commits.nr;
       	uint32_t num_extra_edges = 0;
     @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
      -					      struct write_commit_graph_context *ctx)
      +					     void *data)
       {
     -+	struct write_commit_graph_context *ctx =
     -+		(struct write_commit_graph_context *)data;
     ++	struct write_commit_graph_context *ctx = data;
       	int i, num_generation_data_overflows = 0;
       
       	for (i = 0; i < ctx->commits.nr; i++) {
     @@ commit-graph.c: static int write_graph_chunk_generation_data(struct hashfile *f,
      -						       struct write_commit_graph_context *ctx)
      +						      void *data)
       {
     -+	struct write_commit_graph_context *ctx =
     -+		(struct write_commit_graph_context *)data;
     ++	struct write_commit_graph_context *ctx = data;
       	int i;
       	for (i = 0; i < ctx->commits.nr; i++) {
       		struct commit *c = ctx->commits.list[i];
     @@ commit-graph.c: static int write_graph_chunk_generation_data_overflow(struct has
      -					 struct write_commit_graph_context *ctx)
      +					 void *data)
       {
     -+	struct write_commit_graph_context *ctx =
     -+		(struct write_commit_graph_context *)data;
     ++	struct write_commit_graph_context *ctx = data;
       	struct commit **list = ctx->commits.list;
       	struct commit **last = ctx->commits.list + ctx->commits.nr;
       	struct commit_list *parent;
     @@ commit-graph.c: static int write_graph_chunk_extra_edges(struct hashfile *f,
      -					   struct write_commit_graph_context *ctx)
      +					   void *data)
       {
     -+	struct write_commit_graph_context *ctx =
     -+		(struct write_commit_graph_context *)data;
     ++	struct write_commit_graph_context *ctx = data;
       	struct commit **list = ctx->commits.list;
       	struct commit **last = ctx->commits.list + ctx->commits.nr;
       	uint32_t cur_pos = 0;
     @@ commit-graph.c: static void trace2_bloom_filter_settings(struct write_commit_gra
      -					struct write_commit_graph_context *ctx)
      +					void *data)
       {
     -+	struct write_commit_graph_context *ctx =
     -+		(struct write_commit_graph_context *)data;
     ++	struct write_commit_graph_context *ctx = data;
       	struct commit **list = ctx->commits.list;
       	struct commit **last = ctx->commits.list + ctx->commits.nr;
       
     @@ commit-graph.c: static int write_graph_chunk_base_1(struct hashfile *f,
      -				  struct write_commit_graph_context *ctx)
      +				    void *data)
       {
     -+	struct write_commit_graph_context *ctx =
     -+		(struct write_commit_graph_context *)data;
     ++	struct write_commit_graph_context *ctx = data;
       	int num = write_graph_chunk_base_1(f, ctx->new_base_graph);
       
       	if (num != ctx->num_commit_graphs_after - 1) {
  2:  9bd273f8c94 !  2:  814512f2167 chunk-format: create chunk format write API
     @@ chunk-format.c (new)
      +		if (result)
      +			return result;
      +
     -+		if (cf->f->total + cf->f->offset != start_offset + cf->chunks[i].size)
     ++		if (cf->f->total + cf->f->offset - start_offset != cf->chunks[i].size)
      +			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
      +			    cf->chunks[i].size, cf->chunks[i].id,
      +			    cf->f->total + cf->f->offset - start_offset);
  3:  a3d6177a352 =  3:  70af6e3083f commit-graph: use chunk-format write API
  4:  9fe5ee8611c !  4:  0cac7890bed midx: rename pack_info to write_midx_context
     @@ midx.c: struct pack_list {
       			     const char *file_name, void *data)
       {
      -	struct pack_list *packs = (struct pack_list *)data;
     -+	struct write_midx_context *ctx = (struct write_midx_context *)data;
     ++	struct write_midx_context *ctx = data;
       
       	if (ends_with(file_name, ".idx")) {
      -		display_progress(packs->progress, ++packs->pack_paths_checked);
  5:  14a0246b982 !  5:  4a4e90b129a midx: use context in write_midx_pack_names()
     @@ midx.c: static struct pack_midx_entry *get_sorted_entries(struct multi_pack_inde
      -				    uint32_t num_packs)
      +static size_t write_midx_pack_names(struct hashfile *f, void *data)
       {
     -+	struct write_midx_context *ctx = (struct write_midx_context *)data;
     ++	struct write_midx_context *ctx = data;
       	uint32_t i;
       	unsigned char padding[MIDX_CHUNK_ALIGNMENT];
       	size_t written = 0;
  6:  79f479ef7d1 !  6:  30ad423997b midx: add entries to write_midx_context
     @@ midx.c: static size_t write_midx_pack_names(struct hashfile *f, void *data)
       {
      -	struct pack_midx_entry *list = objects;
      -	struct pack_midx_entry *last = objects + nr_objects;
     -+	struct write_midx_context *ctx = (struct write_midx_context *)data;
     ++	struct write_midx_context *ctx = data;
      +	struct pack_midx_entry *list = ctx->entries;
      +	struct pack_midx_entry *last = ctx->entries + ctx->entries_nr;
       	uint32_t count = 0;
     @@ midx.c: static size_t write_midx_oid_fanout(struct hashfile *f,
      +				    void *data)
       {
      -	struct pack_midx_entry *list = objects;
     -+	struct write_midx_context *ctx = (struct write_midx_context *)data;
     ++	struct write_midx_context *ctx = data;
      +	unsigned char hash_len = the_hash_algo->rawsz;
      +	struct pack_midx_entry *list = ctx->entries;
       	uint32_t i;
  7:  0b4ce3f1732 !  7:  2f1c496f3ab midx: add pack_perm to write_midx_context
     @@ midx.c: static size_t write_midx_oid_lookup(struct hashfile *f,
      +					void *data)
       {
      -	struct pack_midx_entry *list = objects;
     -+	struct write_midx_context *ctx = (struct write_midx_context *)data;
     ++	struct write_midx_context *ctx = data;
      +	struct pack_midx_entry *list = ctx->entries;
       	uint32_t i, nr_large_offset = 0;
       	size_t written = 0;
  8:  eabc7b73647 !  8:  c4939548e51 midx: add num_large_offsets to write_midx_context
     @@ midx.c: static size_t write_midx_object_offsets(struct hashfile *f,
      +				       void *data)
       {
      -	struct pack_midx_entry *list = objects, *end = objects + nr_objects;
     -+	struct write_midx_context *ctx = (struct write_midx_context *)data;
     ++	struct write_midx_context *ctx = data;
      +	struct pack_midx_entry *list = ctx->entries;
      +	struct pack_midx_entry *end = ctx->entries + ctx->entries_nr;
       	size_t written = 0;
  9:  909ca28e0ba !  9:  b3cc73c2256 midx: return success/failure in chunk write methods
     @@ midx.c: static struct pack_midx_entry *get_sorted_entries(struct multi_pack_inde
      -static size_t write_midx_pack_names(struct hashfile *f, void *data)
      +static int write_midx_pack_names(struct hashfile *f, void *data)
       {
     - 	struct write_midx_context *ctx = (struct write_midx_context *)data;
     + 	struct write_midx_context *ctx = data;
       	uint32_t i;
      @@ midx.c: static size_t write_midx_pack_names(struct hashfile *f, void *data)
       	if (i < MIDX_CHUNK_ALIGNMENT) {
     @@ midx.c: static size_t write_midx_pack_names(struct hashfile *f, void *data)
      +static int write_midx_oid_fanout(struct hashfile *f,
      +				 void *data)
       {
     - 	struct write_midx_context *ctx = (struct write_midx_context *)data;
     + 	struct write_midx_context *ctx = data;
       	struct pack_midx_entry *list = ctx->entries;
      @@ midx.c: static size_t write_midx_oid_fanout(struct hashfile *f,
       		list = next;
     @@ midx.c: static size_t write_midx_oid_fanout(struct hashfile *f,
      +static int write_midx_oid_lookup(struct hashfile *f,
      +				 void *data)
       {
     - 	struct write_midx_context *ctx = (struct write_midx_context *)data;
     + 	struct write_midx_context *ctx = data;
       	unsigned char hash_len = the_hash_algo->rawsz;
       	struct pack_midx_entry *list = ctx->entries;
       	uint32_t i;
     @@ midx.c: static size_t write_midx_oid_lookup(struct hashfile *f,
      +static int write_midx_object_offsets(struct hashfile *f,
      +				     void *data)
       {
     - 	struct write_midx_context *ctx = (struct write_midx_context *)data;
     + 	struct write_midx_context *ctx = data;
       	struct pack_midx_entry *list = ctx->entries;
       	uint32_t i, nr_large_offset = 0;
      -	size_t written = 0;
     @@ midx.c: static size_t write_midx_object_offsets(struct hashfile *f,
      +static int write_midx_large_offsets(struct hashfile *f,
      +				    void *data)
       {
     - 	struct write_midx_context *ctx = (struct write_midx_context *)data;
     + 	struct write_midx_context *ctx = data;
       	struct pack_midx_entry *list = ctx->entries;
       	struct pack_midx_entry *end = ctx->entries + ctx->entries_nr;
      -	size_t written = 0;
 10:  e613ffa9ac6 = 10:  78744d3b701 midx: drop chunk progress during write
 11:  49cfb4f63e2 = 11:  07dc0cf8c68 midx: use chunk-format API in write_midx_internal()
 12:  e3475633e1d ! 12:  d8d8e9e2aa3 chunk-format: create read chunk API
     @@ Commit message
      
           1. initialize a 'struct chunkfile' with init_chunkfile(NULL).
           2. call read_table_of_contents().
     -     3. for each chunk to parse, call pair_chunk() with appropriate pointers.
     +     3. for each chunk to parse,
     +        a. call pair_chunk() to assign a pointer with the chunk position, or
     +        b. call read_chunk() to run a callback on the chunk start and size.
           4. call free_chunkfile() to clear the 'struct chunkfile' data.
      
          We are re-using the anonymous 'struct chunkfile' data, as it is internal
     @@ chunk-format.c: int write_chunkfile(struct chunkfile *cf, void *data)
      +
      +int pair_chunk(struct chunkfile *cf,
      +	       uint32_t chunk_id,
     ++	       const unsigned char **p)
     ++{
     ++	int i;
     ++
     ++	for (i = 0; i < cf->chunks_nr; i++) {
     ++		if (cf->chunks[i].id == chunk_id) {
     ++			*p = cf->chunks[i].start;
     ++			return 0;
     ++		}
     ++	}
     ++
     ++	return CHUNK_NOT_FOUND;
     ++}
     ++
     ++int read_chunk(struct chunkfile *cf,
     ++	       uint32_t chunk_id,
      +	       chunk_read_fn fn,
      +	       void *data)
      +{
     @@ chunk-format.h: void add_chunk(struct chunkfile *cf,
      +			   uint64_t toc_offset,
      +			   int toc_length);
      +
     ++#define CHUNK_NOT_FOUND (-2)
     ++
      +/*
     -+ * When reading a table of contents, we find the chunk with matching 'id'
     -+ * then call its read_fn to populate the necessary 'data' based on the
     -+ * chunk start and size.
     ++ * Find 'chunk_id' in the given chunkfile and assign the
     ++ * given pointer to the position in the mmap'd file where
     ++ * that chunk begins.
     ++ *
     ++ * Returns CHUNK_NOT_FOUND if the chunk does not exist.
      + */
     ++int pair_chunk(struct chunkfile *cf,
     ++	       uint32_t chunk_id,
     ++	       const unsigned char **p);
     ++
      +typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
      +			     size_t chunk_size, void *data);
     -+
     -+
     -+#define CHUNK_NOT_FOUND (-2)
     -+int pair_chunk(struct chunkfile *cf,
     ++/*
     ++ * Find 'chunk_id' in the given chunkfile and call the
     ++ * given chunk_read_fn method with the information for
     ++ * that chunk.
     ++ *
     ++ * Returns CHUNK_NOT_FOUND if the chunk does not exist.
     ++ */
     ++int read_chunk(struct chunkfile *cf,
      +	       uint32_t chunk_id,
      +	       chunk_read_fn fn,
      +	       void *data);
 13:  7339990f07d ! 13:  8744d278596 commit-graph: use chunk-format read API
     @@ commit-graph.c: static int verify_commit_graph_lite(struct commit_graph *g)
       	return 0;
       }
       
     -+static int graph_read_oid_fanout(const unsigned char *chunk_start,
     -+				 size_t chunk_size, void *data)
     -+{
     -+	struct commit_graph *g = (struct commit_graph *)data;
     -+	g->chunk_oid_fanout = (uint32_t*)chunk_start;
     -+	return 0;
     -+}
     -+
      +static int graph_read_oid_lookup(const unsigned char *chunk_start,
      +				 size_t chunk_size, void *data)
      +{
     -+	struct commit_graph *g = (struct commit_graph *)data;
     ++	struct commit_graph *g = data;
      +	g->chunk_oid_lookup = chunk_start;
      +	g->num_commits = chunk_size / g->hash_len;
      +	return 0;
      +}
      +
     -+static int graph_read_data(const unsigned char *chunk_start,
     -+				 size_t chunk_size, void *data)
     -+{
     -+	struct commit_graph *g = (struct commit_graph *)data;
     -+	g->chunk_commit_data = chunk_start;
     -+	return 0;
     -+}
     -+
     -+static int graph_read_extra_edges(const unsigned char *chunk_start,
     -+				  size_t chunk_size, void *data)
     -+{
     -+	struct commit_graph *g = (struct commit_graph *)data;
     -+	g->chunk_extra_edges = chunk_start;
     -+	return 0;
     -+}
     -+
     -+static int graph_read_base_graphs(const unsigned char *chunk_start,
     -+				  size_t chunk_size, void *data)
     -+{
     -+	struct commit_graph *g = (struct commit_graph *)data;
     -+	g->chunk_base_graphs = chunk_start;
     -+	return 0;
     -+}
     -+
     -+static int graph_read_generation_data(const unsigned char *chunk_start,
     -+				      size_t chunk_size, void *data)
     -+{
     -+	struct commit_graph *g = (struct commit_graph *)data;
     -+	g->chunk_generation_data = chunk_start;
     -+	return 0;
     -+}
     -+
     -+static int graph_read_generation_overflow(const unsigned char *chunk_start,
     -+					  size_t chunk_size, void *data)
     -+{
     -+	struct commit_graph *g = (struct commit_graph *)data;
     -+	g->chunk_generation_data_overflow = chunk_start;
     -+	return 0;
     -+}
     -+
     -+static int graph_read_bloom_indices(const unsigned char *chunk_start,
     -+				    size_t chunk_size, void *data)
     -+{
     -+	struct commit_graph *g = (struct commit_graph *)data;
     -+	g->chunk_bloom_indexes = chunk_start;
     -+	return 0;
     -+}
     -+
      +static int graph_read_bloom_data(const unsigned char *chunk_start,
      +				  size_t chunk_size, void *data)
      +{
     -+	struct commit_graph *g = (struct commit_graph *)data;
     ++	struct commit_graph *g = data;
      +	uint32_t hash_version;
      +	g->chunk_bloom_data = chunk_start;
      +	hash_version = get_be32(chunk_start);
     @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repository *r,
      -						     / graph->hash_len;
      -			}
      -			break;
     --
     ++	cf = init_chunkfile(NULL);
     + 
      -		case GRAPH_CHUNKID_DATA:
      -			if (graph->chunk_commit_data)
      -				chunk_repeated = 1;
     @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repository *r,
      -			else
      -				graph->chunk_generation_data = data + chunk_offset;
      -			break;
     -+	cf = init_chunkfile(NULL);
     - 
     +-
      -		case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
      -			if (graph->chunk_generation_data_overflow)
      -				chunk_repeated = 1;
     @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repository *r,
      -			}
      -			break;
      -		}
     -+	pair_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, graph_read_oid_fanout, graph);
     -+	pair_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);
     -+	pair_chunk(cf, GRAPH_CHUNKID_DATA, graph_read_data, graph);
     -+	pair_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES, graph_read_extra_edges, graph);
     -+	pair_chunk(cf, GRAPH_CHUNKID_BASE, graph_read_base_graphs, graph);
     ++	pair_chunk(cf, GRAPH_CHUNKID_OIDFANOUT,
     ++		   (const unsigned char **)&graph->chunk_oid_fanout);
     ++	read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);
     ++	pair_chunk(cf, GRAPH_CHUNKID_DATA, &graph->chunk_commit_data);
     ++	pair_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges);
     ++	pair_chunk(cf, GRAPH_CHUNKID_BASE, &graph->chunk_base_graphs);
      +	pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
     -+		   graph_read_generation_data, graph);
     ++		   &graph->chunk_generation_data);
      +	pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
     -+		   graph_read_generation_overflow, graph);
     ++		   &graph->chunk_generation_data_overflow);
       
      -		if (chunk_repeated) {
      -			error(_("commit-graph chunk id %08x appears multiple times"), chunk_id);
     @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repository *r,
      -		}
      +	if (r->settings.commit_graph_read_changed_paths) {
      +		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
     -+			   graph_read_bloom_indices, graph);
     -+		pair_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
     ++			   &graph->chunk_bloom_indexes);
     ++		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
      +			   graph_read_bloom_data, graph);
       	}
       
 14:  cb145e0e32a ! 14:  750c03253c9 midx: use chunk-format read API
     @@ midx.c: static char *get_midx_filename(const char *object_dir)
       	return xstrfmt("%s/pack/multi-pack-index", object_dir);
       }
       
     -+static int midx_read_pack_names(const unsigned char *chunk_start,
     -+				size_t chunk_size, void *data)
     -+{
     -+	struct multi_pack_index *m = (struct multi_pack_index *)data;
     -+	m->chunk_pack_names = chunk_start;
     -+	return 0;
     -+}
     -+
      +static int midx_read_oid_fanout(const unsigned char *chunk_start,
      +				size_t chunk_size, void *data)
      +{
     -+	struct multi_pack_index *m = (struct multi_pack_index *)data;
     ++	struct multi_pack_index *m = data;
      +	m->chunk_oid_fanout = (uint32_t *)chunk_start;
      +
      +	if (chunk_size != 4 * 256) {
     @@ midx.c: static char *get_midx_filename(const char *object_dir)
      +	}
      +	return 0;
      +}
     -+
     -+static int midx_read_oid_lookup(const unsigned char *chunk_start,
     -+				size_t chunk_size, void *data)
     -+{
     -+	struct multi_pack_index *m = (struct multi_pack_index *)data;
     -+	m->chunk_oid_lookup = chunk_start;
     -+	return 0;
     -+}
     -+
     -+static int midx_read_offsets(const unsigned char *chunk_start,
     -+			     size_t chunk_size, void *data)
     -+{
     -+	struct multi_pack_index *m = (struct multi_pack_index *)data;
     -+	m->chunk_object_offsets = chunk_start;
     -+	return 0;
     -+}
     -+
     -+static int midx_read_large_offsets(const unsigned char *chunk_start,
     -+				   size_t chunk_size, void *data)
     -+{
     -+	struct multi_pack_index *m = (struct multi_pack_index *)data;
     -+	m->chunk_large_offsets = chunk_start;
     -+	return 0;
     -+}
      +
       struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local)
       {
     @@ midx.c: struct multi_pack_index *load_multi_pack_index(const char *object_dir, i
      +				   MIDX_HEADER_SIZE, m->num_chunks))
      +		goto cleanup_fail;
      +
     -+	if (pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, midx_read_pack_names, m) == CHUNK_NOT_FOUND)
     ++	if (pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names) == CHUNK_NOT_FOUND)
       		die(_("multi-pack-index missing required pack-name chunk"));
      -	if (!m->chunk_oid_fanout)
     -+	if (pair_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m) == CHUNK_NOT_FOUND)
     ++	if (read_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m) == CHUNK_NOT_FOUND)
       		die(_("multi-pack-index missing required OID fanout chunk"));
      -	if (!m->chunk_oid_lookup)
     -+	if (pair_chunk(cf, MIDX_CHUNKID_OIDLOOKUP, midx_read_oid_lookup, m) == CHUNK_NOT_FOUND)
     ++	if (pair_chunk(cf, MIDX_CHUNKID_OIDLOOKUP, &m->chunk_oid_lookup) == CHUNK_NOT_FOUND)
       		die(_("multi-pack-index missing required OID lookup chunk"));
      -	if (!m->chunk_object_offsets)
     -+	if (pair_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, midx_read_offsets, m) == CHUNK_NOT_FOUND)
     ++	if (pair_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, &m->chunk_object_offsets) == CHUNK_NOT_FOUND)
       		die(_("multi-pack-index missing required object offsets chunk"));
       
     -+	pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, midx_read_large_offsets, m);
     ++	pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets);
      +
       	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
       
 15:  f6c58ff72d2 = 15:  83d292532a0 midx: use 64-bit multiplication for chunk sizes
 16:  62a23842aa6 = 16:  669eeec707a chunk-format: restore duplicate chunk checks
 17:  05cbd0a8d93 = 17:  8f3985ab5df chunk-format: add technical docs

-- 
gitgitgadget

* [PATCH v2 01/17] commit-graph: anonymize data in chunk_write_fn
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
@ 2021-01-27 15:01   ` Derrick Stolee via GitGitGadget
  2021-01-27 15:01   ` [PATCH v2 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In preparation for creating an API around file formats using chunks and
tables of contents, prepare the commit-graph write code to use
prototypes that will match this new API.

Specifically, convert chunk_write_fn to take a "void *data" parameter
instead of the commit-graph-specific "struct write_commit_graph_context"
pointer.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index f3bde2ad95a..fae7d1b6393 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1040,8 +1040,9 @@ struct write_commit_graph_context {
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
-				    struct write_commit_graph_context *ctx)
+				    void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	int i, count = 0;
 	struct commit **list = ctx->commits.list;
 
@@ -1066,8 +1067,9 @@ static int write_graph_chunk_fanout(struct hashfile *f,
 }
 
 static int write_graph_chunk_oids(struct hashfile *f,
-				  struct write_commit_graph_context *ctx)
+				  void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	struct commit **list = ctx->commits.list;
 	int count;
 	for (count = 0; count < ctx->commits.nr; count++, list++) {
@@ -1085,8 +1087,9 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
 }
 
 static int write_graph_chunk_data(struct hashfile *f,
-				  struct write_commit_graph_context *ctx)
+				  void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	uint32_t num_extra_edges = 0;
@@ -1187,8 +1190,9 @@ static int write_graph_chunk_data(struct hashfile *f,
 }
 
 static int write_graph_chunk_generation_data(struct hashfile *f,
-					      struct write_commit_graph_context *ctx)
+					     void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	int i, num_generation_data_overflows = 0;
 
 	for (i = 0; i < ctx->commits.nr; i++) {
@@ -1208,8 +1212,9 @@ static int write_graph_chunk_generation_data(struct hashfile *f,
 }
 
 static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
-						       struct write_commit_graph_context *ctx)
+						      void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	int i;
 	for (i = 0; i < ctx->commits.nr; i++) {
 		struct commit *c = ctx->commits.list[i];
@@ -1226,8 +1231,9 @@ static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
 }
 
 static int write_graph_chunk_extra_edges(struct hashfile *f,
-					 struct write_commit_graph_context *ctx)
+					 void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	struct commit_list *parent;
@@ -1280,8 +1286,9 @@ static int write_graph_chunk_extra_edges(struct hashfile *f,
 }
 
 static int write_graph_chunk_bloom_indexes(struct hashfile *f,
-					   struct write_commit_graph_context *ctx)
+					   void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	uint32_t cur_pos = 0;
@@ -1315,8 +1322,9 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
 }
 
 static int write_graph_chunk_bloom_data(struct hashfile *f,
-					struct write_commit_graph_context *ctx)
+					void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 
@@ -1737,8 +1745,9 @@ static int write_graph_chunk_base_1(struct hashfile *f,
 }
 
 static int write_graph_chunk_base(struct hashfile *f,
-				  struct write_commit_graph_context *ctx)
+				    void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	int num = write_graph_chunk_base_1(f, ctx->new_base_graph);
 
 	if (num != ctx->num_commit_graphs_after - 1) {
@@ -1750,7 +1759,7 @@ static int write_graph_chunk_base(struct hashfile *f,
 }
 
 typedef int (*chunk_write_fn)(struct hashfile *f,
-			      struct write_commit_graph_context *ctx);
+			      void *data);
 
 struct chunk_info {
 	uint32_t id;
-- 
gitgitgadget


* [PATCH v2 02/17] chunk-format: create chunk format write API
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2021-01-27 15:01   ` [PATCH v2 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
@ 2021-01-27 15:01   ` Derrick Stolee via GitGitGadget
  2021-02-04 21:24     ` Junio C Hamano
  2021-01-27 15:01   ` [PATCH v2 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
                     ` (17 subsequent siblings)
  19 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In anticipation of combining the logic from the commit-graph and
multi-pack-index file formats, create a new chunk-format API. Use a
'struct chunkfile' pointer to keep track of data that has been
registered for writes. This struct is anonymous outside of
chunk-format.c to ensure no user attempts to interfere with the data.

The next change will use this API in commit-graph.c, but the general
approach is:

 1. initialize the chunkfile with init_chunkfile(f).
 2. add chunks in the intended writing order with add_chunk().
 3. write any header information to the hashfile f.
 4. write the chunkfile data using write_chunkfile().
 5. free the chunkfile struct using free_chunkfile().

Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile       |  1 +
 chunk-format.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++
 chunk-format.h | 20 +++++++++++
 3 files changed, 112 insertions(+)
 create mode 100644 chunk-format.c
 create mode 100644 chunk-format.h

diff --git a/Makefile b/Makefile
index 7b64106930a..50a7663841e 100644
--- a/Makefile
+++ b/Makefile
@@ -854,6 +854,7 @@ LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o
 LIB_OBJS += chdir-notify.o
 LIB_OBJS += checkout.o
+LIB_OBJS += chunk-format.o
 LIB_OBJS += color.o
 LIB_OBJS += column.o
 LIB_OBJS += combine-diff.o
diff --git a/chunk-format.c b/chunk-format.c
new file mode 100644
index 00000000000..ab914c55856
--- /dev/null
+++ b/chunk-format.c
@@ -0,0 +1,91 @@
+#include "cache.h"
+#include "chunk-format.h"
+#include "csum-file.h"
+#define CHUNK_LOOKUP_WIDTH 12
+
+/*
+ * When writing a chunk-based file format, collect the chunks in
+ * an array of chunk_info structs. The size stores the _expected_
+ * amount of data that will be written by write_fn.
+ */
+struct chunk_info {
+	uint32_t id;
+	uint64_t size;
+	chunk_write_fn write_fn;
+};
+
+struct chunkfile {
+	struct hashfile *f;
+
+	struct chunk_info *chunks;
+	size_t chunks_nr;
+	size_t chunks_alloc;
+};
+
+struct chunkfile *init_chunkfile(struct hashfile *f)
+{
+	struct chunkfile *cf = xcalloc(1, sizeof(*cf));
+	cf->f = f;
+	return cf;
+}
+
+void free_chunkfile(struct chunkfile *cf)
+{
+	if (!cf)
+		return;
+	free(cf->chunks);
+	free(cf);
+}
+
+int get_num_chunks(struct chunkfile *cf)
+{
+	return cf->chunks_nr;
+}
+
+void add_chunk(struct chunkfile *cf,
+	       uint64_t id,
+	       chunk_write_fn fn,
+	       size_t size)
+{
+	ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc);
+
+	cf->chunks[cf->chunks_nr].id = id;
+	cf->chunks[cf->chunks_nr].write_fn = fn;
+	cf->chunks[cf->chunks_nr].size = size;
+	cf->chunks_nr++;
+}
+
+int write_chunkfile(struct chunkfile *cf, void *data)
+{
+	int i;
+	size_t cur_offset = cf->f->offset + cf->f->total;
+
+	/* Add the table of contents to the current offset */
+	cur_offset += (cf->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH;
+
+	for (i = 0; i < cf->chunks_nr; i++) {
+		hashwrite_be32(cf->f, cf->chunks[i].id);
+		hashwrite_be64(cf->f, cur_offset);
+
+		cur_offset += cf->chunks[i].size;
+	}
+
+	/* Trailing entry marks the end of the chunks */
+	hashwrite_be32(cf->f, 0);
+	hashwrite_be64(cf->f, cur_offset);
+
+	for (i = 0; i < cf->chunks_nr; i++) {
+		uint64_t start_offset = cf->f->total + cf->f->offset;
+		int result = cf->chunks[i].write_fn(cf->f, data);
+
+		if (result)
+			return result;
+
+		if (cf->f->total + cf->f->offset - start_offset != cf->chunks[i].size)
+			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
+			    cf->chunks[i].size, cf->chunks[i].id,
+			    cf->f->total + cf->f->offset - start_offset);
+	}
+
+	return 0;
+}
diff --git a/chunk-format.h b/chunk-format.h
new file mode 100644
index 00000000000..bfaed672813
--- /dev/null
+++ b/chunk-format.h
@@ -0,0 +1,20 @@
+#ifndef CHUNK_FORMAT_H
+#define CHUNK_FORMAT_H
+
+#include "git-compat-util.h"
+
+struct hashfile;
+struct chunkfile;
+
+struct chunkfile *init_chunkfile(struct hashfile *f);
+void free_chunkfile(struct chunkfile *cf);
+int get_num_chunks(struct chunkfile *cf);
+typedef int (*chunk_write_fn)(struct hashfile *f,
+			      void *data);
+void add_chunk(struct chunkfile *cf,
+	       uint64_t id,
+	       chunk_write_fn fn,
+	       size_t size);
+int write_chunkfile(struct chunkfile *cf, void *data);
+
+#endif
-- 
gitgitgadget


* [PATCH v2 03/17] commit-graph: use chunk-format write API
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2021-01-27 15:01   ` [PATCH v2 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
  2021-01-27 15:01   ` [PATCH v2 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
@ 2021-01-27 15:01   ` Derrick Stolee via GitGitGadget
  2021-01-27 15:01   ` [PATCH v2 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
                     ` (16 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The commit-graph write logic is ready to make use of the chunk-format
write API. Each chunk write method already has the correct prototype.
We only need to use the 'struct chunkfile' pointer and the correct API
calls.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 118 ++++++++++++++++---------------------------------
 1 file changed, 37 insertions(+), 81 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index fae7d1b6393..ba33777dcb8 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -19,6 +19,7 @@
 #include "shallow.h"
 #include "json-writer.h"
 #include "trace2.h"
+#include "chunk-format.h"
 
 void git_test_write_commit_graph_or_die(void)
 {
@@ -1758,27 +1759,17 @@ static int write_graph_chunk_base(struct hashfile *f,
 	return 0;
 }
 
-typedef int (*chunk_write_fn)(struct hashfile *f,
-			      void *data);
-
-struct chunk_info {
-	uint32_t id;
-	uint64_t size;
-	chunk_write_fn write_fn;
-};
-
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
 	int fd;
 	struct hashfile *f;
 	struct lock_file lk = LOCK_INIT;
-	struct chunk_info chunks[MAX_NUM_CHUNKS + 1];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = 3;
-	uint64_t chunk_offset;
 	struct object_id file_hash;
+	struct chunkfile *cf;
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
@@ -1824,76 +1815,50 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	}
 
-	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
-	chunks[0].size = GRAPH_FANOUT_SIZE;
-	chunks[0].write_fn = write_graph_chunk_fanout;
-	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
-	chunks[1].size = hashsz * ctx->commits.nr;
-	chunks[1].write_fn = write_graph_chunk_oids;
-	chunks[2].id = GRAPH_CHUNKID_DATA;
-	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
-	chunks[2].write_fn = write_graph_chunk_data;
+	cf = init_chunkfile(f);
+
+	add_chunk(cf, GRAPH_CHUNKID_OIDFANOUT,
+		  write_graph_chunk_fanout, GRAPH_FANOUT_SIZE);
+	add_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP,
+		  write_graph_chunk_oids, hashsz * ctx->commits.nr);
+	add_chunk(cf, GRAPH_CHUNKID_DATA,
+		  write_graph_chunk_data, (hashsz + 16) * ctx->commits.nr);
 
 	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
 		ctx->write_generation_data = 0;
-	if (ctx->write_generation_data) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
-		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
-		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
-		num_chunks++;
-	}
-	if (ctx->num_generation_data_overflows) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW;
-		chunks[num_chunks].size = sizeof(timestamp_t) * ctx->num_generation_data_overflows;
-		chunks[num_chunks].write_fn = write_graph_chunk_generation_data_overflow;
-		num_chunks++;
-	}
-	if (ctx->num_extra_edges) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
-		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
-		chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
-		num_chunks++;
-	}
+	if (ctx->write_generation_data)
+		add_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
+			  write_graph_chunk_generation_data,
+			  sizeof(uint32_t) * ctx->commits.nr);
+	if (ctx->num_generation_data_overflows)
+		add_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
+			  write_graph_chunk_generation_data_overflow,
+			  sizeof(timestamp_t) * ctx->num_generation_data_overflows);
+	if (ctx->num_extra_edges)
+		add_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES,
+			  write_graph_chunk_extra_edges,
+			  4 * ctx->num_extra_edges);
 	if (ctx->changed_paths) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
-		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
-		chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
-		num_chunks++;
-		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
-		chunks[num_chunks].size = sizeof(uint32_t) * 3
-					  + ctx->total_bloom_filter_data_size;
-		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
-		num_chunks++;
-	}
-	if (ctx->num_commit_graphs_after > 1) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
-		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
-		chunks[num_chunks].write_fn = write_graph_chunk_base;
-		num_chunks++;
-	}
-
-	chunks[num_chunks].id = 0;
-	chunks[num_chunks].size = 0;
+		add_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
+			  write_graph_chunk_bloom_indexes,
+			  sizeof(uint32_t) * ctx->commits.nr);
+		add_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
+			  write_graph_chunk_bloom_data,
+			  sizeof(uint32_t) * 3
+				+ ctx->total_bloom_filter_data_size);
+	}
+	if (ctx->num_commit_graphs_after > 1)
+		add_chunk(cf, GRAPH_CHUNKID_BASE,
+			  write_graph_chunk_base,
+			  hashsz * (ctx->num_commit_graphs_after - 1));
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
 	hashwrite_u8(f, oid_version());
-	hashwrite_u8(f, num_chunks);
+	hashwrite_u8(f, get_num_chunks(cf));
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
-	chunk_offset = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
-	for (i = 0; i <= num_chunks; i++) {
-		uint32_t chunk_write[3];
-
-		chunk_write[0] = htonl(chunks[i].id);
-		chunk_write[1] = htonl(chunk_offset >> 32);
-		chunk_write[2] = htonl(chunk_offset & 0xffffffff);
-		hashwrite(f, chunk_write, 12);
-
-		chunk_offset += chunks[i].size;
-	}
-
 	if (ctx->report_progress) {
 		strbuf_addf(&progress_title,
 			    Q_("Writing out commit graph in %d pass",
@@ -1905,17 +1870,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			num_chunks * ctx->commits.nr);
 	}
 
-	for (i = 0; i < num_chunks; i++) {
-		uint64_t start_offset = f->total + f->offset;
-
-		if (chunks[i].write_fn(f, ctx))
-			return -1;
-
-		if (f->total + f->offset != start_offset + chunks[i].size)
-			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
-			    chunks[i].size, chunks[i].id,
-			    f->total + f->offset - start_offset);
-	}
+	write_chunkfile(cf, ctx);
 
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
@@ -1932,6 +1887,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 	close_commit_graph(ctx->r->objects);
 	finalize_hashfile(f, file_hash.hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
+	free_chunkfile(cf);
 
 	if (ctx->split) {
 		FILE *chainf = fdopen_lock_file(&lk, "w");
-- 
gitgitgadget
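
For readers following the loop removed from write_commit_graph_file()
above: it laid out the table of contents by starting the first chunk at
8 + (num_chunks + 1) * 12 bytes and advancing by each chunk's size. Below
is a minimal sketch of that arithmetic only, assuming the 8-byte header
and 12-byte entries visible in the removed code. All demo_* names are
hypothetical and are not part of Git or of the chunk-format API.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_HEADER_SIZE 8      /* bytes written before the table of contents */
#define DEMO_TOC_ENTRY_WIDTH 12 /* 4-byte chunk ID + 8-byte offset */

/* Hypothetical stand-in for the removed "struct chunk_info". */
struct demo_chunk {
	uint32_t id;
	uint64_t size;
};

/* Store a 32-bit value in network (big-endian) byte order. */
static void demo_put_be32(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}

/*
 * Fill "toc" with nr + 1 entries: one per chunk, plus a terminating entry
 * whose ID is zero and whose offset marks the end of the last chunk.
 * "toc" must have room for (nr + 1) * DEMO_TOC_ENTRY_WIDTH bytes.
 * Returns the offset stored in the terminating entry.
 */
static uint64_t demo_fill_toc(const struct demo_chunk *chunks, size_t nr,
			      unsigned char *toc)
{
	uint64_t offset = DEMO_HEADER_SIZE +
			  (uint64_t)(nr + 1) * DEMO_TOC_ENTRY_WIDTH;
	size_t i;

	for (i = 0; i <= nr; i++) {
		unsigned char *entry = toc + i * DEMO_TOC_ENTRY_WIDTH;

		demo_put_be32(entry, i < nr ? chunks[i].id : 0);
		demo_put_be32(entry + 4, (uint32_t)(offset >> 32));
		demo_put_be32(entry + 8, (uint32_t)(offset & 0xffffffff));

		if (i < nr)
			offset += chunks[i].size;
	}
	return offset;
}
```

With two chunks of 1024 and 40 bytes, the first chunk starts at offset
8 + 3 * 12 = 44, the second at 1068, and the terminating entry records
1108.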



* [PATCH v2 04/17] midx: rename pack_info to write_midx_context
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2021-01-27 15:01   ` [PATCH v2 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
@ 2021-01-27 15:01   ` Derrick Stolee via GitGitGadget
  2021-01-27 15:01   ` [PATCH v2 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
                     ` (15 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to streamline our chunk-based file formats, align some of
the code structure in write_midx_internal() to be similar to the
patterns in write_commit_graph_file().

Specifically, let's create a "struct write_midx_context" that can be
used as a data parameter to abstract function types.

This change only renames "struct pack_list" to "struct
write_midx_context" and the names of instances from "packs" to "ctx". In
future changes, we will expand the data inside "struct
write_midx_context" and align our chunk-writing method with the
chunk-format API.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 130 ++++++++++++++++++++++++++++-----------------------------
 1 file changed, 65 insertions(+), 65 deletions(-)

diff --git a/midx.c b/midx.c
index 79c282b070d..561f65a63a5 100644
--- a/midx.c
+++ b/midx.c
@@ -451,7 +451,7 @@ static int pack_info_compare(const void *_a, const void *_b)
 	return strcmp(a->pack_name, b->pack_name);
 }
 
-struct pack_list {
+struct write_midx_context {
 	struct pack_info *info;
 	uint32_t nr;
 	uint32_t alloc;
@@ -463,37 +463,37 @@ struct pack_list {
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 			     const char *file_name, void *data)
 {
-	struct pack_list *packs = (struct pack_list *)data;
+	struct write_midx_context *ctx = data;
 
 	if (ends_with(file_name, ".idx")) {
-		display_progress(packs->progress, ++packs->pack_paths_checked);
-		if (packs->m && midx_contains_pack(packs->m, file_name))
+		display_progress(ctx->progress, ++ctx->pack_paths_checked);
+		if (ctx->m && midx_contains_pack(ctx->m, file_name))
 			return;
 
-		ALLOC_GROW(packs->info, packs->nr + 1, packs->alloc);
+		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
 
-		packs->info[packs->nr].p = add_packed_git(full_path,
-							  full_path_len,
-							  0);
+		ctx->info[ctx->nr].p = add_packed_git(full_path,
+						      full_path_len,
+						      0);
 
-		if (!packs->info[packs->nr].p) {
+		if (!ctx->info[ctx->nr].p) {
 			warning(_("failed to add packfile '%s'"),
 				full_path);
 			return;
 		}
 
-		if (open_pack_index(packs->info[packs->nr].p)) {
+		if (open_pack_index(ctx->info[ctx->nr].p)) {
 			warning(_("failed to open pack-index '%s'"),
 				full_path);
-			close_pack(packs->info[packs->nr].p);
-			FREE_AND_NULL(packs->info[packs->nr].p);
+			close_pack(ctx->info[ctx->nr].p);
+			FREE_AND_NULL(ctx->info[ctx->nr].p);
 			return;
 		}
 
-		packs->info[packs->nr].pack_name = xstrdup(file_name);
-		packs->info[packs->nr].orig_pack_int_id = packs->nr;
-		packs->info[packs->nr].expired = 0;
-		packs->nr++;
+		ctx->info[ctx->nr].pack_name = xstrdup(file_name);
+		ctx->info[ctx->nr].orig_pack_int_id = ctx->nr;
+		ctx->info[ctx->nr].expired = 0;
+		ctx->nr++;
 	}
 }
 
@@ -801,7 +801,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint32_t i;
 	struct hashfile *f = NULL;
 	struct lock_file lk;
-	struct pack_list packs;
+	struct write_midx_context ctx = { 0 };
 	uint32_t *pack_perm = NULL;
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
@@ -820,40 +820,40 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 			  midx_name);
 
 	if (m)
-		packs.m = m;
+		ctx.m = m;
 	else
-		packs.m = load_multi_pack_index(object_dir, 1);
-
-	packs.nr = 0;
-	packs.alloc = packs.m ? packs.m->num_packs : 16;
-	packs.info = NULL;
-	ALLOC_ARRAY(packs.info, packs.alloc);
-
-	if (packs.m) {
-		for (i = 0; i < packs.m->num_packs; i++) {
-			ALLOC_GROW(packs.info, packs.nr + 1, packs.alloc);
-
-			packs.info[packs.nr].orig_pack_int_id = i;
-			packs.info[packs.nr].pack_name = xstrdup(packs.m->pack_names[i]);
-			packs.info[packs.nr].p = NULL;
-			packs.info[packs.nr].expired = 0;
-			packs.nr++;
+		ctx.m = load_multi_pack_index(object_dir, 1);
+
+	ctx.nr = 0;
+	ctx.alloc = ctx.m ? ctx.m->num_packs : 16;
+	ctx.info = NULL;
+	ALLOC_ARRAY(ctx.info, ctx.alloc);
+
+	if (ctx.m) {
+		for (i = 0; i < ctx.m->num_packs; i++) {
+			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
+
+			ctx.info[ctx.nr].orig_pack_int_id = i;
+			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
+			ctx.info[ctx.nr].p = NULL;
+			ctx.info[ctx.nr].expired = 0;
+			ctx.nr++;
 		}
 	}
 
-	packs.pack_paths_checked = 0;
+	ctx.pack_paths_checked = 0;
 	if (flags & MIDX_PROGRESS)
-		packs.progress = start_delayed_progress(_("Adding packfiles to multi-pack-index"), 0);
+		ctx.progress = start_delayed_progress(_("Adding packfiles to multi-pack-index"), 0);
 	else
-		packs.progress = NULL;
+		ctx.progress = NULL;
 
-	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
-	stop_progress(&packs.progress);
+	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
+	stop_progress(&ctx.progress);
 
-	if (packs.m && packs.nr == packs.m->num_packs && !packs_to_drop)
+	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop)
 		goto cleanup;
 
-	entries = get_sorted_entries(packs.m, packs.info, packs.nr, &nr_entries);
+	entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &nr_entries);
 
 	for (i = 0; i < nr_entries; i++) {
 		if (entries[i].offset > 0x7fffffff)
@@ -862,19 +862,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 			large_offsets_needed = 1;
 	}
 
-	QSORT(packs.info, packs.nr, pack_info_compare);
+	QSORT(ctx.info, ctx.nr, pack_info_compare);
 
 	if (packs_to_drop && packs_to_drop->nr) {
 		int drop_index = 0;
 		int missing_drops = 0;
 
-		for (i = 0; i < packs.nr && drop_index < packs_to_drop->nr; i++) {
-			int cmp = strcmp(packs.info[i].pack_name,
+		for (i = 0; i < ctx.nr && drop_index < packs_to_drop->nr; i++) {
+			int cmp = strcmp(ctx.info[i].pack_name,
 					 packs_to_drop->items[drop_index].string);
 
 			if (!cmp) {
 				drop_index++;
-				packs.info[i].expired = 1;
+				ctx.info[i].expired = 1;
 			} else if (cmp > 0) {
 				error(_("did not see pack-file %s to drop"),
 				      packs_to_drop->items[drop_index].string);
@@ -882,7 +882,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				missing_drops++;
 				i--;
 			} else {
-				packs.info[i].expired = 0;
+				ctx.info[i].expired = 0;
 			}
 		}
 
@@ -898,19 +898,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	 *
 	 * pack_perm[old_id] = new_id
 	 */
-	ALLOC_ARRAY(pack_perm, packs.nr);
-	for (i = 0; i < packs.nr; i++) {
-		if (packs.info[i].expired) {
+	ALLOC_ARRAY(pack_perm, ctx.nr);
+	for (i = 0; i < ctx.nr; i++) {
+		if (ctx.info[i].expired) {
 			dropped_packs++;
-			pack_perm[packs.info[i].orig_pack_int_id] = PACK_EXPIRED;
+			pack_perm[ctx.info[i].orig_pack_int_id] = PACK_EXPIRED;
 		} else {
-			pack_perm[packs.info[i].orig_pack_int_id] = i - dropped_packs;
+			pack_perm[ctx.info[i].orig_pack_int_id] = i - dropped_packs;
 		}
 	}
 
-	for (i = 0; i < packs.nr; i++) {
-		if (!packs.info[i].expired)
-			pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
+	for (i = 0; i < ctx.nr; i++) {
+		if (!ctx.info[i].expired)
+			pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
 	}
 
 	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
@@ -921,19 +921,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
-	if (packs.m)
-		close_midx(packs.m);
+	if (ctx.m)
+		close_midx(ctx.m);
 
 	cur_chunk = 0;
 	num_chunks = large_offsets_needed ? 5 : 4;
 
-	if (packs.nr - dropped_packs == 0) {
+	if (ctx.nr - dropped_packs == 0) {
 		error(_("no pack files to index."));
 		result = 1;
 		goto cleanup;
 	}
 
-	written = write_midx_header(f, num_chunks, packs.nr - dropped_packs);
+	written = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
 
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
 	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
@@ -990,7 +990,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, packs.info, packs.nr);
+				written += write_midx_pack_names(f, ctx.info, ctx.nr);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
@@ -1027,15 +1027,15 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	commit_lock_file(&lk);
 
 cleanup:
-	for (i = 0; i < packs.nr; i++) {
-		if (packs.info[i].p) {
-			close_pack(packs.info[i].p);
-			free(packs.info[i].p);
+	for (i = 0; i < ctx.nr; i++) {
+		if (ctx.info[i].p) {
+			close_pack(ctx.info[i].p);
+			free(ctx.info[i].p);
 		}
-		free(packs.info[i].pack_name);
+		free(ctx.info[i].pack_name);
 	}
 
-	free(packs.info);
+	free(ctx.info);
 	free(entries);
 	free(pack_perm);
 	free(midx_name);
-- 
gitgitgadget



* [PATCH v2 05/17] midx: use context in write_midx_pack_names()
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2021-01-27 15:01   ` [PATCH v2 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
@ 2021-01-27 15:01   ` Derrick Stolee via GitGitGadget
  2021-01-27 15:01   ` [PATCH v2 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
                     ` (14 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API,
start converting chunk writing methods to match chunk_write_fn. The
first case is to convert write_midx_pack_names() to take "void *data".
We already have the necessary data in "struct write_midx_context", so
this conversion is rather mechanical.
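
As an illustration of the conversion described above, here is a minimal
sketch of a writer that receives its inputs through "void *data" instead
of separate parameters. It is not the patch's code: "struct demo_sink" is
a hypothetical in-memory stand-in for "struct hashfile", the demo_* names
are invented for this example, and the alignment padding that the real
function writes is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory sink standing in for git's "struct hashfile". */
struct demo_sink {
	unsigned char buf[256];
	size_t len;
};

struct demo_pack_info {
	const char *pack_name;
	unsigned expired:1;
};

/* Simplified stand-in for "struct write_midx_context". */
struct demo_context {
	struct demo_pack_info *info;
	uint32_t nr;
};

/* Shared callback shape: the sink plus one opaque pointer. */
typedef size_t (*demo_write_fn)(struct demo_sink *f, void *data);

static void demo_write(struct demo_sink *f, const void *src, size_t n)
{
	/* The demo silently drops writes that would overflow its buffer. */
	if (n > sizeof(f->buf) - f->len)
		return;
	memcpy(f->buf + f->len, src, n);
	f->len += n;
}

/* Writes each non-expired name with its NUL; returns the bytes written. */
static size_t demo_write_pack_names(struct demo_sink *f, void *data)
{
	struct demo_context *ctx = data; /* no cast needed from void * in C */
	size_t written = 0;
	uint32_t i;

	for (i = 0; i < ctx->nr; i++) {
		size_t writelen;

		if (ctx->info[i].expired)
			continue;

		writelen = strlen(ctx->info[i].pack_name) + 1;
		demo_write(f, ctx->info[i].pack_name, writelen);
		written += writelen;
	}
	return written;
}
```

Because every writer now has the same two-argument shape, a caller can
store them behind one function-pointer type such as demo_write_fn and
invoke them uniformly.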

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/midx.c b/midx.c
index 561f65a63a5..88452b04433 100644
--- a/midx.c
+++ b/midx.c
@@ -643,27 +643,26 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 	return deduplicated_entries;
 }
 
-static size_t write_midx_pack_names(struct hashfile *f,
-				    struct pack_info *info,
-				    uint32_t num_packs)
+static size_t write_midx_pack_names(struct hashfile *f, void *data)
 {
+	struct write_midx_context *ctx = data;
 	uint32_t i;
 	unsigned char padding[MIDX_CHUNK_ALIGNMENT];
 	size_t written = 0;
 
-	for (i = 0; i < num_packs; i++) {
+	for (i = 0; i < ctx->nr; i++) {
 		size_t writelen;
 
-		if (info[i].expired)
+		if (ctx->info[i].expired)
 			continue;
 
-		if (i && strcmp(info[i].pack_name, info[i - 1].pack_name) <= 0)
+		if (i && strcmp(ctx->info[i].pack_name, ctx->info[i - 1].pack_name) <= 0)
 			BUG("incorrect pack-file order: %s before %s",
-			    info[i - 1].pack_name,
-			    info[i].pack_name);
+			    ctx->info[i - 1].pack_name,
+			    ctx->info[i].pack_name);
 
-		writelen = strlen(info[i].pack_name) + 1;
-		hashwrite(f, info[i].pack_name, writelen);
+		writelen = strlen(ctx->info[i].pack_name) + 1;
+		hashwrite(f, ctx->info[i].pack_name, writelen);
 		written += writelen;
 	}
 
@@ -990,7 +989,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, ctx.info, ctx.nr);
+				written += write_midx_pack_names(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
-- 
gitgitgadget



* [PATCH v2 06/17] midx: add entries to write_midx_context
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2021-01-27 15:01   ` [PATCH v2 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
@ 2021-01-27 15:01   ` Derrick Stolee via GitGitGadget
  2021-01-27 15:01   ` [PATCH v2 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
                     ` (13 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "struct pack_midx_entry *entries" list and its count
into the context.

Update write_midx_oid_fanout() and write_midx_oid_lookup() to take the
context directly, as these are easy conversions with this new data.

Only the callers of write_midx_object_offsets() and
write_midx_large_offsets() are updated here, since additional data is
needed in the context before those methods can match chunk_write_fn.
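
To make the role of the new "entries" and "entries_nr" fields concrete,
the following is a rough sketch of how a fanout table can be derived from
a sorted entry list reached through the context. It fills an in-memory
array rather than writing to a hashfile, assumes 20-byte object IDs, and
uses hypothetical demo_* names throughout.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_HASH_LEN 20 /* assumed SHA-1 sized object IDs */

struct demo_entry {
	unsigned char oid[DEMO_HASH_LEN];
};

/* Simplified stand-in for the entry fields of "struct write_midx_context". */
struct demo_context {
	struct demo_entry *entries;
	uint32_t entries_nr;
};

/*
 * fanout[b] receives the number of entries whose first ID byte is <= b.
 * The entries must already be sorted by ID; unsorted input leaves later
 * entries uncounted.
 */
static void demo_fill_fanout(uint32_t fanout[256], void *data)
{
	struct demo_context *ctx = data;
	struct demo_entry *list = ctx->entries;
	struct demo_entry *last = ctx->entries + ctx->entries_nr;
	uint32_t count = 0;
	int b;

	for (b = 0; b < 256; b++) {
		/* Consume every entry that starts with byte value b. */
		while (list < last && list->oid[0] == b) {
			count++;
			list++;
		}
		fanout[b] = count;
	}
}
```

The final slot, fanout[255], always equals entries_nr for sorted input,
which gives a reader the total object count without a separate field.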

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 49 ++++++++++++++++++++++++++-----------------------
 1 file changed, 26 insertions(+), 23 deletions(-)

diff --git a/midx.c b/midx.c
index 88452b04433..4520ef82b91 100644
--- a/midx.c
+++ b/midx.c
@@ -458,6 +458,9 @@ struct write_midx_context {
 	struct multi_pack_index *m;
 	struct progress *progress;
 	unsigned pack_paths_checked;
+
+	struct pack_midx_entry *entries;
+	uint32_t entries_nr;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -678,11 +681,11 @@ static size_t write_midx_pack_names(struct hashfile *f, void *data)
 }
 
 static size_t write_midx_oid_fanout(struct hashfile *f,
-				    struct pack_midx_entry *objects,
-				    uint32_t nr_objects)
+				    void *data)
 {
-	struct pack_midx_entry *list = objects;
-	struct pack_midx_entry *last = objects + nr_objects;
+	struct write_midx_context *ctx = data;
+	struct pack_midx_entry *list = ctx->entries;
+	struct pack_midx_entry *last = ctx->entries + ctx->entries_nr;
 	uint32_t count = 0;
 	uint32_t i;
 
@@ -706,18 +709,19 @@ static size_t write_midx_oid_fanout(struct hashfile *f,
 	return MIDX_CHUNK_FANOUT_SIZE;
 }
 
-static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
-				    struct pack_midx_entry *objects,
-				    uint32_t nr_objects)
+static size_t write_midx_oid_lookup(struct hashfile *f,
+				    void *data)
 {
-	struct pack_midx_entry *list = objects;
+	struct write_midx_context *ctx = data;
+	unsigned char hash_len = the_hash_algo->rawsz;
+	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i;
 	size_t written = 0;
 
-	for (i = 0; i < nr_objects; i++) {
+	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
 
-		if (i < nr_objects - 1) {
+		if (i < ctx->entries_nr - 1) {
 			struct pack_midx_entry *next = list;
 			if (oidcmp(&obj->oid, &next->oid) >= 0)
 				BUG("OIDs not in order: %s >= %s",
@@ -805,8 +809,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
-	uint32_t nr_entries, num_large_offsets = 0;
-	struct pack_midx_entry *entries = NULL;
+	uint32_t num_large_offsets = 0;
 	struct progress *progress = NULL;
 	int large_offsets_needed = 0;
 	int pack_name_concat_len = 0;
@@ -852,12 +855,12 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop)
 		goto cleanup;
 
-	entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &nr_entries);
+	ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr);
 
-	for (i = 0; i < nr_entries; i++) {
-		if (entries[i].offset > 0x7fffffff)
+	for (i = 0; i < ctx.entries_nr; i++) {
+		if (ctx.entries[i].offset > 0x7fffffff)
 			num_large_offsets++;
-		if (entries[i].offset > 0xffffffff)
+		if (ctx.entries[i].offset > 0xffffffff)
 			large_offsets_needed = 1;
 	}
 
@@ -947,10 +950,10 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OBJECTOFFSETS;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * the_hash_algo->rawsz;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * the_hash_algo->rawsz;
 
 	cur_chunk++;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_CHUNK_OFFSET_WIDTH;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
 	if (large_offsets_needed) {
 		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
 
@@ -993,19 +996,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
-				written += write_midx_oid_fanout(f, entries, nr_entries);
+				written += write_midx_oid_fanout(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDLOOKUP:
-				written += write_midx_oid_lookup(f, the_hash_algo->rawsz, entries, nr_entries);
+				written += write_midx_oid_lookup(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, entries, nr_entries);
+				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, ctx.entries, ctx.entries_nr);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
-				written += write_midx_large_offsets(f, num_large_offsets, entries, nr_entries);
+				written += write_midx_large_offsets(f, num_large_offsets, ctx.entries, ctx.entries_nr);
 				break;
 
 			default:
@@ -1035,7 +1038,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	}
 
 	free(ctx.info);
-	free(entries);
+	free(ctx.entries);
 	free(pack_perm);
 	free(midx_name);
 	return result;
-- 
gitgitgadget



* [PATCH v2 07/17] midx: add pack_perm to write_midx_context
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (5 preceding siblings ...)
  2021-01-27 15:01   ` [PATCH v2 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
@ 2021-01-27 15:01   ` Derrick Stolee via GitGitGadget
  2021-01-27 15:01   ` [PATCH v2 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
                     ` (12 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "uint32_t *pack_perm" and large_offsets_needed bit
into the context.

Update write_midx_object_offsets() to match chunk_write_fn.
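
For context, the pack_perm array moved into the context here maps each
original pack ID to its position after expired packs are dropped
(pack_perm[old_id] = new_id). A minimal sketch of that mapping follows.
The demo_* names are hypothetical, and DEMO_PACK_EXPIRED is an assumed
sentinel, since the value of PACK_EXPIRED is not shown in this patch.

```c
#include <stdint.h>

#define DEMO_PACK_EXPIRED UINT32_MAX /* assumed sentinel value */

struct demo_pack_info {
	uint32_t orig_pack_int_id;
	unsigned expired:1;
};

/*
 * "info" holds nr packs already sorted into their final order.
 * Fills perm[old_id] = new_id, marks expired packs with the sentinel,
 * and returns how many packs were dropped.
 */
static uint32_t demo_fill_pack_perm(const struct demo_pack_info *info,
				    uint32_t nr, uint32_t *perm)
{
	uint32_t i, dropped = 0;

	for (i = 0; i < nr; i++) {
		if (info[i].expired) {
			dropped++;
			perm[info[i].orig_pack_int_id] = DEMO_PACK_EXPIRED;
		} else {
			/* Later packs shift down by the number dropped so far. */
			perm[info[i].orig_pack_int_id] = i - dropped;
		}
	}
	return dropped;
}
```

A writer can then translate an entry's old pack ID with a single array
lookup, and treat the sentinel as a sign that the entry should not have
survived into the new file.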

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 40 +++++++++++++++++++++-------------------
 1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/midx.c b/midx.c
index 4520ef82b91..cd994e333ec 100644
--- a/midx.c
+++ b/midx.c
@@ -461,6 +461,9 @@ struct write_midx_context {
 
 	struct pack_midx_entry *entries;
 	uint32_t entries_nr;
+
+	uint32_t *pack_perm;
+	unsigned large_offsets_needed:1;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -736,27 +739,27 @@ static size_t write_midx_oid_lookup(struct hashfile *f,
 	return written;
 }
 
-static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_needed,
-					uint32_t *perm,
-					struct pack_midx_entry *objects, uint32_t nr_objects)
+static size_t write_midx_object_offsets(struct hashfile *f,
+					void *data)
 {
-	struct pack_midx_entry *list = objects;
+	struct write_midx_context *ctx = data;
+	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i, nr_large_offset = 0;
 	size_t written = 0;
 
-	for (i = 0; i < nr_objects; i++) {
+	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
 
-		if (perm[obj->pack_int_id] == PACK_EXPIRED)
+		if (ctx->pack_perm[obj->pack_int_id] == PACK_EXPIRED)
 			BUG("object %s is in an expired pack with int-id %d",
 			    oid_to_hex(&obj->oid),
 			    obj->pack_int_id);
 
-		hashwrite_be32(f, perm[obj->pack_int_id]);
+		hashwrite_be32(f, ctx->pack_perm[obj->pack_int_id]);
 
-		if (large_offset_needed && obj->offset >> 31)
+		if (ctx->large_offsets_needed && obj->offset >> 31)
 			hashwrite_be32(f, MIDX_LARGE_OFFSET_NEEDED | nr_large_offset++);
-		else if (!large_offset_needed && obj->offset >> 32)
+		else if (!ctx->large_offsets_needed && obj->offset >> 32)
 			BUG("object %s requires a large offset (%"PRIx64") but the MIDX is not writing large offsets!",
 			    oid_to_hex(&obj->oid),
 			    obj->offset);
@@ -805,13 +808,11 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
-	uint32_t *pack_perm = NULL;
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 	uint32_t num_large_offsets = 0;
 	struct progress *progress = NULL;
-	int large_offsets_needed = 0;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
@@ -857,11 +858,12 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr);
 
+	ctx.large_offsets_needed = 0;
 	for (i = 0; i < ctx.entries_nr; i++) {
 		if (ctx.entries[i].offset > 0x7fffffff)
 			num_large_offsets++;
 		if (ctx.entries[i].offset > 0xffffffff)
-			large_offsets_needed = 1;
+			ctx.large_offsets_needed = 1;
 	}
 
 	QSORT(ctx.info, ctx.nr, pack_info_compare);
@@ -900,13 +902,13 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	 *
 	 * pack_perm[old_id] = new_id
 	 */
-	ALLOC_ARRAY(pack_perm, ctx.nr);
+	ALLOC_ARRAY(ctx.pack_perm, ctx.nr);
 	for (i = 0; i < ctx.nr; i++) {
 		if (ctx.info[i].expired) {
 			dropped_packs++;
-			pack_perm[ctx.info[i].orig_pack_int_id] = PACK_EXPIRED;
+			ctx.pack_perm[ctx.info[i].orig_pack_int_id] = PACK_EXPIRED;
 		} else {
-			pack_perm[ctx.info[i].orig_pack_int_id] = i - dropped_packs;
+			ctx.pack_perm[ctx.info[i].orig_pack_int_id] = i - dropped_packs;
 		}
 	}
 
@@ -927,7 +929,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		close_midx(ctx.m);
 
 	cur_chunk = 0;
-	num_chunks = large_offsets_needed ? 5 : 4;
+	num_chunks = ctx.large_offsets_needed ? 5 : 4;
 
 	if (ctx.nr - dropped_packs == 0) {
 		error(_("no pack files to index."));
@@ -954,7 +956,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	cur_chunk++;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
-	if (large_offsets_needed) {
+	if (ctx.large_offsets_needed) {
 		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
 
 		cur_chunk++;
@@ -1004,7 +1006,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, ctx.entries, ctx.entries_nr);
+				written += write_midx_object_offsets(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
@@ -1039,7 +1041,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	free(ctx.info);
 	free(ctx.entries);
-	free(pack_perm);
+	free(ctx.pack_perm);
 	free(midx_name);
 	return result;
 }
-- 
gitgitgadget



* [PATCH v2 08/17] midx: add num_large_offsets to write_midx_context
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (6 preceding siblings ...)
  2021-01-27 15:01   ` [PATCH v2 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
@ 2021-01-27 15:01   ` Derrick Stolee via GitGitGadget
  2021-01-27 15:01   ` [PATCH v2 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
                     ` (11 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "uint32_t num_large_offsets" into the context. With
this new data, write_midx_large_offsets() now matches the
chunk_write_fn type.
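
The two counters carried in the context interact in a way that is easy
to miss: num_large_offsets counts every offset above 0x7fffffff, while
large_offsets_needed is set only once some offset exceeds 0xffffffff.
The sketch below walks through that logic on a plain array of offsets.
It is an illustration, not the patch's code: the demo_* names are
hypothetical, and treating the flag as the high bit of the 32-bit field
is an assumption, since the value of MIDX_LARGE_OFFSET_NEEDED is not
shown here.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_LARGE_OFFSET_NEEDED 0x80000000u /* assumed flag bit */

struct demo_context {
	const uint64_t *offsets;
	uint32_t entries_nr;
	uint32_t num_large_offsets;
	unsigned large_offsets_needed:1;
};

/* Mirrors the scan over entries in write_midx_internal(). */
static void demo_scan_offsets(struct demo_context *ctx)
{
	uint32_t i;

	ctx->num_large_offsets = 0;
	ctx->large_offsets_needed = 0;
	for (i = 0; i < ctx->entries_nr; i++) {
		if (ctx->offsets[i] > 0x7fffffff)
			ctx->num_large_offsets++;
		if (ctx->offsets[i] > 0xffffffff)
			ctx->large_offsets_needed = 1;
	}
}

/*
 * Computes the 32-bit offset field for each entry into "out".
 * Returns 0 on success, or -1 if an offset cannot be represented
 * because large offsets were not enabled (the real code calls BUG()).
 */
static int demo_encode_offsets(const struct demo_context *ctx, uint32_t *out)
{
	uint32_t i, nr_large = 0;

	for (i = 0; i < ctx->entries_nr; i++) {
		uint64_t off = ctx->offsets[i];

		if (ctx->large_offsets_needed && off >> 31)
			out[i] = DEMO_LARGE_OFFSET_NEEDED | nr_large++;
		else if (!ctx->large_offsets_needed && off >> 32)
			return -1;
		else
			out[i] = (uint32_t)off;
	}
	return 0;
}
```

Note that an offset such as 0x90000000 is stored inline when no entry
exceeds 32 bits, but becomes an index into the large-offset chunk as
soon as any entry does.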

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/midx.c b/midx.c
index cd994e333ec..5be081f229a 100644
--- a/midx.c
+++ b/midx.c
@@ -464,6 +464,7 @@ struct write_midx_context {
 
 	uint32_t *pack_perm;
 	unsigned large_offsets_needed:1;
+	uint32_t num_large_offsets;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -772,11 +773,14 @@ static size_t write_midx_object_offsets(struct hashfile *f,
 	return written;
 }
 
-static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_offset,
-				       struct pack_midx_entry *objects, uint32_t nr_objects)
+static size_t write_midx_large_offsets(struct hashfile *f,
+				       void *data)
 {
-	struct pack_midx_entry *list = objects, *end = objects + nr_objects;
+	struct write_midx_context *ctx = data;
+	struct pack_midx_entry *list = ctx->entries;
+	struct pack_midx_entry *end = ctx->entries + ctx->entries_nr;
 	size_t written = 0;
+	uint32_t nr_large_offset = ctx->num_large_offsets;
 
 	while (nr_large_offset) {
 		struct pack_midx_entry *obj;
@@ -811,7 +815,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
-	uint32_t num_large_offsets = 0;
 	struct progress *progress = NULL;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
@@ -861,7 +864,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	ctx.large_offsets_needed = 0;
 	for (i = 0; i < ctx.entries_nr; i++) {
 		if (ctx.entries[i].offset > 0x7fffffff)
-			num_large_offsets++;
+			ctx.num_large_offsets++;
 		if (ctx.entries[i].offset > 0xffffffff)
 			ctx.large_offsets_needed = 1;
 	}
@@ -961,7 +964,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		cur_chunk++;
 		chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
-					   num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
+					   ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
 	}
 
 	chunk_ids[cur_chunk] = 0;
@@ -1010,7 +1013,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
-				written += write_midx_large_offsets(f, num_large_offsets, ctx.entries, ctx.entries_nr);
+				written += write_midx_large_offsets(f, &ctx);
 				break;
 
 			default:
-- 
gitgitgadget



* [PATCH v2 09/17] midx: return success/failure in chunk write methods
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (7 preceding siblings ...)
  2021-01-27 15:01   ` [PATCH v2 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
@ 2021-01-27 15:01   ` Derrick Stolee via GitGitGadget
  2021-02-04 22:59     ` Junio C Hamano
  2021-01-27 15:01   ` [PATCH v2 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
                     ` (10 subsequent siblings)
  19 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Historically, the chunk-writing methods in midx.c have returned the
amount of data written so the writer method could compare this with the
table of contents. This presents some interesting issues:

1. If a chunk writing method has a bug that miscalculates the written
   bytes, then we can satisfy the table of contents without actually
   writing the right amount of data to the hashfile. The commit-graph
   writing code checks the hashfile struct directly for a more robust
   verification.

2. There is no way for a chunk writing method to gracefully fail.
   Returning an int presents an opportunity to fail without a die().

3. The current pattern doesn't match the chunk_write_fn type exactly, so
   we cannot share code with commit-graph.c.

For these reasons, convert the midx chunk writer methods to return an
'int'. Since none of them fail at the moment, they all return 0.
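
To illustrate the first two points, here is a minimal, self-contained
sketch. It is not code from midx.c: 'struct toy_file' is a hypothetical
stand-in for 'struct hashfile' that only counts bytes, and every toy_*
name is invented for this example. The writers return 0 on success, and
the driver measures the file position itself instead of trusting a byte
count reported by the writer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for 'struct hashfile': it only counts bytes. */
struct toy_file {
	uint64_t pos; /* bytes written so far */
};

static void toy_write(struct toy_file *f, const void *buf, size_t len)
{
	(void)buf; /* a real writer would hash and buffer the data */
	f->pos += len;
}

/* Same shape as chunk_write_fn: 0 on success, nonzero on failure. */
typedef int (*toy_chunk_write_fn)(struct toy_file *f, void *data);

struct toy_ctx {
	uint32_t nr; /* number of 4-byte entries to write */
};

static int toy_write_fanout(struct toy_file *f, void *data)
{
	uint32_t fanout[256];

	(void)data;
	memset(fanout, 0, sizeof(fanout));
	toy_write(f, fanout, sizeof(fanout));
	return 0;
}

static int toy_write_entries(struct toy_file *f, void *data)
{
	struct toy_ctx *ctx = data;
	uint32_t i;

	for (i = 0; i < ctx->nr; i++)
		toy_write(f, &i, sizeof(i));
	return 0;
}

/*
 * Run each writer and compare the bytes that actually reached the file
 * with the size promised for that chunk. Returns 0 on success, -1 if a
 * writer reported failure, -2 on a size mismatch.
 */
static int toy_write_chunks(struct toy_file *f, toy_chunk_write_fn *fns,
			    const uint64_t *sizes, size_t nr, void *data)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		uint64_t start = f->pos;

		if (fns[i](f, data))
			return -1;
		if (f->pos - start != sizes[i])
			return -2;
	}
	return 0;
}

/* Write a fanout chunk plus three entries, claiming the given size. */
int toy_demo(uint64_t claimed_entries_size)
{
	struct toy_file f = { 0 };
	struct toy_ctx ctx = { 3 };
	toy_chunk_write_fn fns[] = { toy_write_fanout, toy_write_entries };
	uint64_t sizes[2];

	sizes[0] = sizeof(uint32_t) * 256;
	sizes[1] = claimed_entries_size;
	return toy_write_chunks(&f, fns, sizes, 2, &ctx);
}
```

Calling toy_demo(12) succeeds because three 4-byte entries were really
written, while toy_demo(16) is rejected even though no writer reported
an error.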

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 63 +++++++++++++++++++++++++---------------------------------
 1 file changed, 27 insertions(+), 36 deletions(-)

diff --git a/midx.c b/midx.c
index 5be081f229a..e23a5fc4903 100644
--- a/midx.c
+++ b/midx.c
@@ -650,7 +650,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 	return deduplicated_entries;
 }
 
-static size_t write_midx_pack_names(struct hashfile *f, void *data)
+static int write_midx_pack_names(struct hashfile *f, void *data)
 {
 	struct write_midx_context *ctx = data;
 	uint32_t i;
@@ -678,14 +678,13 @@ static size_t write_midx_pack_names(struct hashfile *f, void *data)
 	if (i < MIDX_CHUNK_ALIGNMENT) {
 		memset(padding, 0, sizeof(padding));
 		hashwrite(f, padding, i);
-		written += i;
 	}
 
-	return written;
+	return 0;
 }
 
-static size_t write_midx_oid_fanout(struct hashfile *f,
-				    void *data)
+static int write_midx_oid_fanout(struct hashfile *f,
+				 void *data)
 {
 	struct write_midx_context *ctx = data;
 	struct pack_midx_entry *list = ctx->entries;
@@ -710,17 +709,16 @@ static size_t write_midx_oid_fanout(struct hashfile *f,
 		list = next;
 	}
 
-	return MIDX_CHUNK_FANOUT_SIZE;
+	return 0;
 }
 
-static size_t write_midx_oid_lookup(struct hashfile *f,
-				    void *data)
+static int write_midx_oid_lookup(struct hashfile *f,
+				 void *data)
 {
 	struct write_midx_context *ctx = data;
 	unsigned char hash_len = the_hash_algo->rawsz;
 	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i;
-	size_t written = 0;
 
 	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
@@ -734,19 +732,17 @@ static size_t write_midx_oid_lookup(struct hashfile *f,
 		}
 
 		hashwrite(f, obj->oid.hash, (int)hash_len);
-		written += hash_len;
 	}
 
-	return written;
+	return 0;
 }
 
-static size_t write_midx_object_offsets(struct hashfile *f,
-					void *data)
+static int write_midx_object_offsets(struct hashfile *f,
+				     void *data)
 {
 	struct write_midx_context *ctx = data;
 	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i, nr_large_offset = 0;
-	size_t written = 0;
 
 	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
@@ -766,20 +762,17 @@ static size_t write_midx_object_offsets(struct hashfile *f,
 			    obj->offset);
 		else
 			hashwrite_be32(f, (uint32_t)obj->offset);
-
-		written += MIDX_CHUNK_OFFSET_WIDTH;
 	}
 
-	return written;
+	return 0;
 }
 
-static size_t write_midx_large_offsets(struct hashfile *f,
-				       void *data)
+static int write_midx_large_offsets(struct hashfile *f,
+				    void *data)
 {
 	struct write_midx_context *ctx = data;
 	struct pack_midx_entry *list = ctx->entries;
 	struct pack_midx_entry *end = ctx->entries + ctx->entries_nr;
-	size_t written = 0;
 	uint32_t nr_large_offset = ctx->num_large_offsets;
 
 	while (nr_large_offset) {
@@ -795,12 +788,12 @@ static size_t write_midx_large_offsets(struct hashfile *f,
 		if (!(offset >> 31))
 			continue;
 
-		written += hashwrite_be64(f, offset);
+		hashwrite_be64(f, offset);
 
 		nr_large_offset--;
 	}
 
-	return written;
+	return 0;
 }
 
 static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
@@ -812,7 +805,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
-	uint64_t written = 0;
+	uint64_t header_size = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 	struct progress *progress = NULL;
@@ -940,10 +933,10 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		goto cleanup;
 	}
 
-	written = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
+	header_size = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
 
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
-	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
+	chunk_offsets[cur_chunk] = header_size + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
@@ -981,39 +974,37 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		hashwrite_be32(f, chunk_ids[i]);
 		hashwrite_be64(f, chunk_offsets[i]);
-
-		written += MIDX_CHUNKLOOKUP_WIDTH;
 	}
 
 	if (flags & MIDX_PROGRESS)
 		progress = start_delayed_progress(_("Writing chunks to multi-pack-index"),
 					  num_chunks);
 	for (i = 0; i < num_chunks; i++) {
-		if (written != chunk_offsets[i])
+		if (f->total + f->offset != chunk_offsets[i])
 			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
 			    chunk_offsets[i],
-			    written,
+			    f->total + f->offset,
 			    chunk_ids[i]);
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, &ctx);
+				write_midx_pack_names(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
-				written += write_midx_oid_fanout(f, &ctx);
+				write_midx_oid_fanout(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDLOOKUP:
-				written += write_midx_oid_lookup(f, &ctx);
+				write_midx_oid_lookup(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, &ctx);
+				write_midx_object_offsets(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
-				written += write_midx_large_offsets(f, &ctx);
+				write_midx_large_offsets(f, &ctx);
 				break;
 
 			default:
@@ -1025,9 +1016,9 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	}
 	stop_progress(&progress);
 
-	if (written != chunk_offsets[num_chunks])
+	if (f->total + f->offset != chunk_offsets[num_chunks])
 		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
-		    written,
+		    f->total + f->offset,
 		    chunk_offsets[num_chunks]);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
-- 
gitgitgadget



* [PATCH v2 10/17] midx: drop chunk progress during write
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (8 preceding siblings ...)
  2021-01-27 15:01   ` [PATCH v2 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
@ 2021-01-27 15:01   ` Derrick Stolee via GitGitGadget
  2021-01-27 15:01   ` [PATCH v2 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
                     ` (9 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Most expensive operations in write_midx_internal() use the context
struct's progress member, and these indicate the progress of the
expensive operations within the chunk writing methods. However, there is
a competing progress struct that counts the progress over all chunks.
This is not very helpful compared to the others, so drop it.

This also reduces our barriers to combining the chunk writing code with
chunk-format.c.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/midx.c b/midx.c
index e23a5fc4903..6ee262aab79 100644
--- a/midx.c
+++ b/midx.c
@@ -808,7 +808,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint64_t header_size = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
-	struct progress *progress = NULL;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
@@ -976,9 +975,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		hashwrite_be64(f, chunk_offsets[i]);
 	}
 
-	if (flags & MIDX_PROGRESS)
-		progress = start_delayed_progress(_("Writing chunks to multi-pack-index"),
-					  num_chunks);
 	for (i = 0; i < num_chunks; i++) {
 		if (f->total + f->offset != chunk_offsets[i])
 			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
@@ -1011,10 +1007,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				BUG("trying to write unknown chunk id %"PRIx32,
 				    chunk_ids[i]);
 		}
-
-		display_progress(progress, i + 1);
 	}
-	stop_progress(&progress);
 
 	if (f->total + f->offset != chunk_offsets[num_chunks])
 		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
-- 
gitgitgadget



* [PATCH v2 11/17] midx: use chunk-format API in write_midx_internal()
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (9 preceding siblings ...)
  2021-01-27 15:01   ` [PATCH v2 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
@ 2021-01-27 15:01   ` Derrick Stolee via GitGitGadget
  2021-01-27 15:01   ` [PATCH v2 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
                     ` (8 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The chunk-format API allows writing the table of contents and all chunks
using the anonymous 'struct chunkfile' type. We only need to convert our
local chunk logic to this API for the multi-pack-index writes to share
that logic with the commit-graph file writes.
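
As a rough illustration of the bookkeeping that moves out of midx.c,
the sketch below derives every table-of-contents offset from the chunk
sizes alone. It is a toy model under stated assumptions, not the code in
chunk-format.c: the toy_* names are hypothetical, the header size is a
plain parameter, and the chunk ids and sizes are illustrative values.
The 12-byte entry width matches the 4-byte id plus 8-byte offset that
the old loop wrote by hand.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_TOC_ENTRY_WIDTH 12 /* 4-byte id + 8-byte offset */
#define TOY_MAX_CHUNKS 8

/* Hypothetical, simplified model of a write-mode chunkfile. */
struct toy_chunkfile {
	uint32_t ids[TOY_MAX_CHUNKS];
	uint64_t sizes[TOY_MAX_CHUNKS];
	size_t nr;
};

static int toy_add_chunk(struct toy_chunkfile *cf, uint32_t id, uint64_t size)
{
	if (cf->nr >= TOY_MAX_CHUNKS)
		return -1;
	cf->ids[cf->nr] = id;
	cf->sizes[cf->nr] = size;
	cf->nr++;
	return 0;
}

/*
 * Fill offsets[0..nr]. The first chunk starts right after the header
 * and the table of contents, which has one entry per chunk plus a
 * terminating entry. offsets[nr] is the end of the last chunk.
 */
static void toy_compute_offsets(const struct toy_chunkfile *cf,
				uint64_t header_size, uint64_t *offsets)
{
	uint64_t cur = header_size + (cf->nr + 1) * TOY_TOC_ENTRY_WIDTH;
	size_t i;

	for (i = 0; i < cf->nr; i++) {
		offsets[i] = cur;
		cur += cf->sizes[i];
	}
	offsets[cf->nr] = cur;
}

/* Lay out a tiny index with two objects; return the final offset. */
uint64_t toy_midx_end_offset(int large_offsets_needed)
{
	struct toy_chunkfile cf = { {0}, {0}, 0 };
	uint64_t offsets[TOY_MAX_CHUNKS + 1];

	toy_add_chunk(&cf, 0x504e414d, 16);     /* pack names, padded */
	toy_add_chunk(&cf, 0x4f494446, 1024);   /* fanout: 256 * 4 bytes */
	toy_add_chunk(&cf, 0x4f49444c, 2 * 20); /* two 20-byte object ids */
	toy_add_chunk(&cf, 0x4f4f4646, 2 * 8);  /* two 8-byte offset entries */
	if (large_offsets_needed)
		toy_add_chunk(&cf, 0x4c4f4646, 1 * 8); /* one large offset */

	toy_compute_offsets(&cf, 12, offsets); /* assumed 12-byte header */
	return offsets[cf.nr];
}
```

The optional fifth chunk grows the table of contents by one entry, so
every later offset shifts by 12 bytes without the caller recomputing
anything.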

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 104 +++++++++++----------------------------------------------
 1 file changed, 19 insertions(+), 85 deletions(-)

diff --git a/midx.c b/midx.c
index 6ee262aab79..3585e04a706 100644
--- a/midx.c
+++ b/midx.c
@@ -11,6 +11,7 @@
 #include "trace2.h"
 #include "run-command.h"
 #include "repository.h"
+#include "chunk-format.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -799,18 +800,15 @@ static int write_midx_large_offsets(struct hashfile *f,
 static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
 			       struct string_list *packs_to_drop, unsigned flags)
 {
-	unsigned char cur_chunk, num_chunks = 0;
 	char *midx_name;
 	uint32_t i;
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
-	uint64_t header_size = 0;
-	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
-	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
+	struct chunkfile *cf;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name))
@@ -923,98 +921,34 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	if (ctx.m)
 		close_midx(ctx.m);
 
-	cur_chunk = 0;
-	num_chunks = ctx.large_offsets_needed ? 5 : 4;
-
 	if (ctx.nr - dropped_packs == 0) {
 		error(_("no pack files to index."));
 		result = 1;
 		goto cleanup;
 	}
 
-	header_size = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
-
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
-	chunk_offsets[cur_chunk] = header_size + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
-
-	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
-
-	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + MIDX_CHUNK_FANOUT_SIZE;
-
-	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OBJECTOFFSETS;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * the_hash_algo->rawsz;
-
-	cur_chunk++;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
-	if (ctx.large_offsets_needed) {
-		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
-
-		cur_chunk++;
-		chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
-					   ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
-	}
-
-	chunk_ids[cur_chunk] = 0;
-
-	for (i = 0; i <= num_chunks; i++) {
-		if (i && chunk_offsets[i] < chunk_offsets[i - 1])
-			BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
-			    chunk_offsets[i - 1],
-			    chunk_offsets[i]);
-
-		if (chunk_offsets[i] % MIDX_CHUNK_ALIGNMENT)
-			BUG("chunk offset %"PRIu64" is not properly aligned",
-			    chunk_offsets[i]);
-
-		hashwrite_be32(f, chunk_ids[i]);
-		hashwrite_be64(f, chunk_offsets[i]);
-	}
-
-	for (i = 0; i < num_chunks; i++) {
-		if (f->total + f->offset != chunk_offsets[i])
-			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
-			    chunk_offsets[i],
-			    f->total + f->offset,
-			    chunk_ids[i]);
+	cf = init_chunkfile(f);
 
-		switch (chunk_ids[i]) {
-			case MIDX_CHUNKID_PACKNAMES:
-				write_midx_pack_names(f, &ctx);
-				break;
+	add_chunk(cf, MIDX_CHUNKID_PACKNAMES,
+		  write_midx_pack_names, pack_name_concat_len);
+	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT,
+		  write_midx_oid_fanout, MIDX_CHUNK_FANOUT_SIZE);
+	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
+		  write_midx_oid_lookup, ctx.entries_nr * the_hash_algo->rawsz);
+	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
+		  write_midx_object_offsets,
+		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH);
 
-			case MIDX_CHUNKID_OIDFANOUT:
-				write_midx_oid_fanout(f, &ctx);
-				break;
-
-			case MIDX_CHUNKID_OIDLOOKUP:
-				write_midx_oid_lookup(f, &ctx);
-				break;
-
-			case MIDX_CHUNKID_OBJECTOFFSETS:
-				write_midx_object_offsets(f, &ctx);
-				break;
-
-			case MIDX_CHUNKID_LARGEOFFSETS:
-				write_midx_large_offsets(f, &ctx);
-				break;
-
-			default:
-				BUG("trying to write unknown chunk id %"PRIx32,
-				    chunk_ids[i]);
-		}
-	}
+	if (ctx.large_offsets_needed)
+		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
+			write_midx_large_offsets,
+			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
 
-	if (f->total + f->offset != chunk_offsets[num_chunks])
-		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
-		    f->total + f->offset,
-		    chunk_offsets[num_chunks]);
+	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
+	write_chunkfile(cf, &ctx);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
+	free_chunkfile(cf);
 	commit_lock_file(&lk);
 
 cleanup:
-- 
gitgitgadget



* [PATCH v2 12/17] chunk-format: create read chunk API
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (10 preceding siblings ...)
  2021-01-27 15:01   ` [PATCH v2 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
@ 2021-01-27 15:01   ` Derrick Stolee via GitGitGadget
  2021-02-04 23:40     ` Junio C Hamano
  2021-01-27 15:01   ` [PATCH v2 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
                     ` (7 subsequent siblings)
  19 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add the capability to read the table of contents, then pair the chunks
with necessary logic using read_chunk_fn pointers. Callers will be added
in future changes, but the typical outline will be:

 1. initialize a 'struct chunkfile' with init_chunkfile(NULL).
 2. call read_table_of_contents().
 3. for each chunk to parse,
    a. call pair_chunk() to assign a pointer with the chunk position, or
    b. call read_chunk() to run a callback on the chunk start and size.
 4. call free_chunkfile() to clear the 'struct chunkfile' data.

We are re-using the anonymous 'struct chunkfile' data, as it is internal
to the chunk-format API. This gives it essentially two modes: write and
read. If the same struct instance were used for both reads and writes,
then there would be failures.
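
The outline above can be sketched in a self-contained form as follows.
This is a simplified model, not the implementation in this patch: the
toy_* names are hypothetical, the chunk array is fixed-size, and the
bounds check ignores the trailing hash that the real code subtracts from
the file size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_TOC_WIDTH 12 /* 4-byte id + 8-byte offset */
#define TOY_MAX 8
#define TOY_NOT_FOUND (-2)

struct toy_chunk {
	uint32_t id;
	const unsigned char *start;
	uint64_t size;
};

struct toy_toc {
	struct toy_chunk chunks[TOY_MAX];
	size_t nr;
};

static uint32_t toy_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t toy_be64(const unsigned char *p)
{
	return ((uint64_t)toy_be32(p) << 32) | toy_be32(p + 4);
}

/* Parse 'nr' entries plus the terminating entry; 0 on success. */
static int toy_read_toc(struct toy_toc *t, const unsigned char *file,
			size_t file_size, size_t toc_offset, size_t nr)
{
	const unsigned char *p = file + toc_offset;

	if (nr > TOY_MAX || toc_offset > file_size ||
	    (nr + 1) * TOY_TOC_WIDTH > file_size - toc_offset)
		return -1;

	t->nr = 0;
	while (nr--) {
		uint32_t id = toy_be32(p);
		uint64_t off = toy_be64(p + 4), next;

		if (!id)
			return -1; /* terminator appeared too early */
		p += TOY_TOC_WIDTH;
		next = toy_be64(p + 4);
		if (next < off || next > file_size)
			return -1; /* offsets must grow and stay in bounds */

		t->chunks[t->nr].id = id;
		t->chunks[t->nr].start = file + off;
		t->chunks[t->nr].size = next - off;
		t->nr++;
	}
	return toy_be32(p) ? -1 : 0; /* final entry must have id 0 */
}

/* Point *p at the chunk with the given id, if it exists. */
static int toy_pair_chunk(const struct toy_toc *t, uint32_t id,
			  const unsigned char **p)
{
	size_t i;

	for (i = 0; i < t->nr; i++) {
		if (t->chunks[i].id == id) {
			*p = t->chunks[i].start;
			return 0;
		}
	}
	return TOY_NOT_FOUND;
}

static void toy_put_entry(unsigned char *p, uint32_t id, uint64_t off)
{
	int i;

	for (i = 0; i < 4; i++)
		p[i] = (unsigned char)(id >> (24 - 8 * i));
	for (i = 0; i < 8; i++)
		p[4 + i] = (unsigned char)(off >> (56 - 8 * i));
}

/* Build a 46-byte file with two chunks; return the wanted chunk's offset. */
int toy_read_demo(uint32_t wanted)
{
	unsigned char file[46];
	struct toy_toc t;
	const unsigned char *p = NULL;

	memset(file, 0, sizeof(file));
	toy_put_entry(file + 0, 0x41414141, 36);  /* chunk "AAAA", 4 bytes */
	toy_put_entry(file + 12, 0x42424242, 40); /* chunk "BBBB", 6 bytes */
	toy_put_entry(file + 24, 0, 46);          /* terminating entry */

	if (toy_read_toc(&t, file, sizeof(file), 0, 2))
		return -1;
	if (toy_pair_chunk(&t, wanted, &p) == TOY_NOT_FOUND)
		return TOY_NOT_FOUND;
	return (int)(p - file);
}
```

A caller that treats a chunk as required checks for TOY_NOT_FOUND, while
a caller that treats it as optional can ignore the return value and
leave the pointer NULL.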

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 chunk-format.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++
 chunk-format.h | 33 +++++++++++++++++++++
 2 files changed, 113 insertions(+)

diff --git a/chunk-format.c b/chunk-format.c
index ab914c55856..74501084cf8 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -12,6 +12,8 @@ struct chunk_info {
 	uint32_t id;
 	uint64_t size;
 	chunk_write_fn write_fn;
+
+	const void *start;
 };
 
 struct chunkfile {
@@ -89,3 +91,81 @@ int write_chunkfile(struct chunkfile *cf, void *data)
 
 	return 0;
 }
+
+int read_table_of_contents(struct chunkfile *cf,
+			   const unsigned char *mfile,
+			   size_t mfile_size,
+			   uint64_t toc_offset,
+			   int toc_length)
+{
+	uint32_t chunk_id;
+	const unsigned char *table_of_contents = mfile + toc_offset;
+
+	ALLOC_GROW(cf->chunks, toc_length, cf->chunks_alloc);
+
+	while (toc_length--) {
+		uint64_t chunk_offset, next_chunk_offset;
+
+		chunk_id = get_be32(table_of_contents);
+		chunk_offset = get_be64(table_of_contents + 4);
+
+		if (!chunk_id) {
+			error(_("terminating chunk id appears earlier than expected"));
+			return 1;
+		}
+
+		table_of_contents += CHUNK_LOOKUP_WIDTH;
+		next_chunk_offset = get_be64(table_of_contents + 4);
+
+		if (next_chunk_offset < chunk_offset ||
+		    next_chunk_offset > mfile_size - the_hash_algo->rawsz) {
+			error(_("improper chunk offset(s) %"PRIx64" and %"PRIx64""),
+			      chunk_offset, next_chunk_offset);
+			return -1;
+		}
+
+		cf->chunks[cf->chunks_nr].id = chunk_id;
+		cf->chunks[cf->chunks_nr].start = mfile + chunk_offset;
+		cf->chunks[cf->chunks_nr].size = next_chunk_offset - chunk_offset;
+		cf->chunks_nr++;
+	}
+
+	chunk_id = get_be32(table_of_contents);
+	if (chunk_id) {
+		error(_("final chunk has non-zero id %"PRIx32""), chunk_id);
+		return -1;
+	}
+
+	return 0;
+}
+
+int pair_chunk(struct chunkfile *cf,
+	       uint32_t chunk_id,
+	       const unsigned char **p)
+{
+	int i;
+
+	for (i = 0; i < cf->chunks_nr; i++) {
+		if (cf->chunks[i].id == chunk_id) {
+			*p = cf->chunks[i].start;
+			return 0;
+		}
+	}
+
+	return CHUNK_NOT_FOUND;
+}
+
+int read_chunk(struct chunkfile *cf,
+	       uint32_t chunk_id,
+	       chunk_read_fn fn,
+	       void *data)
+{
+	int i;
+
+	for (i = 0; i < cf->chunks_nr; i++) {
+		if (cf->chunks[i].id == chunk_id)
+			return fn(cf->chunks[i].start, cf->chunks[i].size, data);
+	}
+
+	return CHUNK_NOT_FOUND;
+}
diff --git a/chunk-format.h b/chunk-format.h
index bfaed672813..b62c9bf8ba1 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -17,4 +17,37 @@ void add_chunk(struct chunkfile *cf,
 	       size_t size);
 int write_chunkfile(struct chunkfile *cf, void *data);
 
+int read_table_of_contents(struct chunkfile *cf,
+			   const unsigned char *mfile,
+			   size_t mfile_size,
+			   uint64_t toc_offset,
+			   int toc_length);
+
+#define CHUNK_NOT_FOUND (-2)
+
+/*
+ * Find 'chunk_id' in the given chunkfile and assign the
+ * given pointer to the position in the mmap'd file where
+ * that chunk begins.
+ *
+ * Returns CHUNK_NOT_FOUND if the chunk does not exist.
+ */
+int pair_chunk(struct chunkfile *cf,
+	       uint32_t chunk_id,
+	       const unsigned char **p);
+
+typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
+			     size_t chunk_size, void *data);
+/*
+ * Find 'chunk_id' in the given chunkfile and call the
+ * given chunk_read_fn method with the information for
+ * that chunk.
+ *
+ * Returns CHUNK_NOT_FOUND if the chunk does not exist.
+ */
+int read_chunk(struct chunkfile *cf,
+	       uint32_t chunk_id,
+	       chunk_read_fn fn,
+	       void *data);
+
 #endif
-- 
gitgitgadget



* [PATCH v2 13/17] commit-graph: use chunk-format read API
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (11 preceding siblings ...)
  2021-01-27 15:01   ` [PATCH v2 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
@ 2021-01-27 15:01   ` Derrick Stolee via GitGitGadget
  2021-01-27 15:01   ` [PATCH v2 14/17] midx: " Derrick Stolee via GitGitGadget
                     ` (6 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Instead of parsing the table of contents directly, use the chunk-format
API methods read_table_of_contents() and pair_chunk(). While the current
implementation loses the duplicate-chunk detection, that will be added
in a future change.
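
For reference, duplicate detection over a parsed table of contents can
be as small as the sketch below. This is only one possible shape, with a
hypothetical function name, and it is not the check that the later
change in this series adds. The quadratic scan is acceptable here
because a file holds only a handful of chunks.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if any chunk id occurs twice in ids[0..nr), else 0. */
int toy_has_duplicate_chunk(const uint32_t *ids, size_t nr)
{
	size_t i, j;

	for (i = 0; i < nr; i++)
		for (j = i + 1; j < nr; j++)
			if (ids[i] == ids[j])
				return 1;
	return 0;
}
```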

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 154 ++++++++++++++--------------------------
 t/t5318-commit-graph.sh |   2 +-
 2 files changed, 53 insertions(+), 103 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index ba33777dcb8..8aa4881d85d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -299,15 +299,43 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	return 0;
 }
 
+static int graph_read_oid_lookup(const unsigned char *chunk_start,
+				 size_t chunk_size, void *data)
+{
+	struct commit_graph *g = data;
+	g->chunk_oid_lookup = chunk_start;
+	g->num_commits = chunk_size / g->hash_len;
+	return 0;
+}
+
+static int graph_read_bloom_data(const unsigned char *chunk_start,
+				  size_t chunk_size, void *data)
+{
+	struct commit_graph *g = data;
+	uint32_t hash_version;
+	g->chunk_bloom_data = chunk_start;
+	hash_version = get_be32(chunk_start);
+
+	if (hash_version != 1)
+		return 0;
+
+	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
+	g->bloom_filter_settings->hash_version = hash_version;
+	g->bloom_filter_settings->num_hashes = get_be32(chunk_start + 4);
+	g->bloom_filter_settings->bits_per_entry = get_be32(chunk_start + 8);
+	g->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;
+
+	return 0;
+}
+
 struct commit_graph *parse_commit_graph(struct repository *r,
 					void *graph_map, size_t graph_size)
 {
-	const unsigned char *data, *chunk_lookup;
-	uint32_t i;
+	const unsigned char *data;
 	struct commit_graph *graph;
-	uint64_t next_chunk_offset;
 	uint32_t graph_signature;
 	unsigned char graph_version, hash_version;
+	struct chunkfile *cf = NULL;
 
 	if (!graph_map)
 		return NULL;
@@ -356,108 +384,28 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		return NULL;
 	}
 
-	chunk_lookup = data + 8;
-	next_chunk_offset = get_be64(chunk_lookup + 4);
-	for (i = 0; i < graph->num_chunks; i++) {
-		uint32_t chunk_id;
-		uint64_t chunk_offset = next_chunk_offset;
-		int chunk_repeated = 0;
-
-		chunk_id = get_be32(chunk_lookup + 0);
-
-		chunk_lookup += GRAPH_CHUNKLOOKUP_WIDTH;
-		next_chunk_offset = get_be64(chunk_lookup + 4);
-
-		if (chunk_offset > graph_size - the_hash_algo->rawsz) {
-			error(_("commit-graph improper chunk offset %08x%08x"), (uint32_t)(chunk_offset >> 32),
-			      (uint32_t)chunk_offset);
-			goto free_and_return;
-		}
-
-		switch (chunk_id) {
-		case GRAPH_CHUNKID_OIDFANOUT:
-			if (graph->chunk_oid_fanout)
-				chunk_repeated = 1;
-			else
-				graph->chunk_oid_fanout = (uint32_t*)(data + chunk_offset);
-			break;
-
-		case GRAPH_CHUNKID_OIDLOOKUP:
-			if (graph->chunk_oid_lookup)
-				chunk_repeated = 1;
-			else {
-				graph->chunk_oid_lookup = data + chunk_offset;
-				graph->num_commits = (next_chunk_offset - chunk_offset)
-						     / graph->hash_len;
-			}
-			break;
+	cf = init_chunkfile(NULL);
 
-		case GRAPH_CHUNKID_DATA:
-			if (graph->chunk_commit_data)
-				chunk_repeated = 1;
-			else
-				graph->chunk_commit_data = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_GENERATION_DATA:
-			if (graph->chunk_generation_data)
-				chunk_repeated = 1;
-			else
-				graph->chunk_generation_data = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
-			if (graph->chunk_generation_data_overflow)
-				chunk_repeated = 1;
-			else
-				graph->chunk_generation_data_overflow = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_EXTRAEDGES:
-			if (graph->chunk_extra_edges)
-				chunk_repeated = 1;
-			else
-				graph->chunk_extra_edges = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_BASE:
-			if (graph->chunk_base_graphs)
-				chunk_repeated = 1;
-			else
-				graph->chunk_base_graphs = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_BLOOMINDEXES:
-			if (graph->chunk_bloom_indexes)
-				chunk_repeated = 1;
-			else if (r->settings.commit_graph_read_changed_paths)
-				graph->chunk_bloom_indexes = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_BLOOMDATA:
-			if (graph->chunk_bloom_data)
-				chunk_repeated = 1;
-			else if (r->settings.commit_graph_read_changed_paths) {
-				uint32_t hash_version;
-				graph->chunk_bloom_data = data + chunk_offset;
-				hash_version = get_be32(data + chunk_offset);
-
-				if (hash_version != 1)
-					break;
+	if (read_table_of_contents(cf, graph->data, graph_size,
+				   GRAPH_HEADER_SIZE, graph->num_chunks))
+		goto free_and_return;
 
-				graph->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
-				graph->bloom_filter_settings->hash_version = hash_version;
-				graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
-				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
-				graph->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;
-			}
-			break;
-		}
+	pair_chunk(cf, GRAPH_CHUNKID_OIDFANOUT,
+		   (const unsigned char **)&graph->chunk_oid_fanout);
+	read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);
+	pair_chunk(cf, GRAPH_CHUNKID_DATA, &graph->chunk_commit_data);
+	pair_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges);
+	pair_chunk(cf, GRAPH_CHUNKID_BASE, &graph->chunk_base_graphs);
+	pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
+		   &graph->chunk_generation_data);
+	pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
+		   &graph->chunk_generation_data_overflow);
 
-		if (chunk_repeated) {
-			error(_("commit-graph chunk id %08x appears multiple times"), chunk_id);
-			goto free_and_return;
-		}
+	if (r->settings.commit_graph_read_changed_paths) {
+		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
+			   &graph->chunk_bloom_indexes);
+		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
+			   graph_read_bloom_data, graph);
 	}
 
 	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
@@ -474,9 +422,11 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 	if (verify_commit_graph_lite(graph))
 		goto free_and_return;
 
+	free_chunkfile(cf);
 	return graph;
 
 free_and_return:
+	free_chunkfile(cf);
 	free(graph->bloom_filter_settings);
 	free(graph);
 	return NULL;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index fa27df579a5..c7da741284e 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -564,7 +564,7 @@ test_expect_success 'detect bad hash version' '
 
 test_expect_success 'detect low chunk count' '
 	corrupt_graph_and_verify $GRAPH_BYTE_CHUNK_COUNT "\01" \
-		"missing the .* chunk"
+		"final chunk has non-zero id"
 '
 
 test_expect_success 'detect missing OID fanout chunk' '
-- 
gitgitgadget



* [PATCH v2 14/17] midx: use chunk-format read API
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (12 preceding siblings ...)
  2021-01-27 15:01   ` [PATCH v2 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
@ 2021-01-27 15:01   ` Derrick Stolee via GitGitGadget
  2021-01-27 15:01   ` [PATCH v2 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
                     ` (5 subsequent siblings)
  19 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Instead of parsing the table of contents directly, use the chunk-format
API methods read_table_of_contents() and pair_chunk(). In particular, we
can use the return value of pair_chunk() to generate an error when a
required chunk is missing.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 71 +++++++++++++------------------------
 t/t5319-multi-pack-index.sh |  6 ++--
 2 files changed, 28 insertions(+), 49 deletions(-)

diff --git a/midx.c b/midx.c
index 3585e04a706..e94dcd34b7f 100644
--- a/midx.c
+++ b/midx.c
@@ -54,6 +54,19 @@ static char *get_midx_filename(const char *object_dir)
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
 
+static int midx_read_oid_fanout(const unsigned char *chunk_start,
+				size_t chunk_size, void *data)
+{
+	struct multi_pack_index *m = data;
+	m->chunk_oid_fanout = (uint32_t *)chunk_start;
+
+	if (chunk_size != 4 * 256) {
+		error(_("multi-pack-index OID fanout is of the wrong size"));
+		return 1;
+	}
+	return 0;
+}
+
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local)
 {
 	struct multi_pack_index *m = NULL;
@@ -65,6 +78,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 	char *midx_name = get_midx_filename(object_dir);
 	uint32_t i;
 	const char *cur_pack_name;
+	struct chunkfile *cf = NULL;
 
 	fd = git_open(midx_name);
 
@@ -114,58 +128,23 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 
 	m->num_packs = get_be32(m->data + MIDX_BYTE_NUM_PACKS);
 
-	for (i = 0; i < m->num_chunks; i++) {
-		uint32_t chunk_id = get_be32(m->data + MIDX_HEADER_SIZE +
-					     MIDX_CHUNKLOOKUP_WIDTH * i);
-		uint64_t chunk_offset = get_be64(m->data + MIDX_HEADER_SIZE + 4 +
-						 MIDX_CHUNKLOOKUP_WIDTH * i);
-
-		if (chunk_offset >= m->data_len)
-			die(_("invalid chunk offset (too large)"));
-
-		switch (chunk_id) {
-			case MIDX_CHUNKID_PACKNAMES:
-				m->chunk_pack_names = m->data + chunk_offset;
-				break;
-
-			case MIDX_CHUNKID_OIDFANOUT:
-				m->chunk_oid_fanout = (uint32_t *)(m->data + chunk_offset);
-				break;
-
-			case MIDX_CHUNKID_OIDLOOKUP:
-				m->chunk_oid_lookup = m->data + chunk_offset;
-				break;
-
-			case MIDX_CHUNKID_OBJECTOFFSETS:
-				m->chunk_object_offsets = m->data + chunk_offset;
-				break;
-
-			case MIDX_CHUNKID_LARGEOFFSETS:
-				m->chunk_large_offsets = m->data + chunk_offset;
-				break;
-
-			case 0:
-				die(_("terminating multi-pack-index chunk id appears earlier than expected"));
-				break;
-
-			default:
-				/*
-				 * Do nothing on unrecognized chunks, allowing future
-				 * extensions to add optional chunks.
-				 */
-				break;
-		}
-	}
+	cf = init_chunkfile(NULL);
 
-	if (!m->chunk_pack_names)
+	if (read_table_of_contents(cf, m->data, midx_size,
+				   MIDX_HEADER_SIZE, m->num_chunks))
+		goto cleanup_fail;
+
+	if (pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required pack-name chunk"));
-	if (!m->chunk_oid_fanout)
+	if (read_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required OID fanout chunk"));
-	if (!m->chunk_oid_lookup)
+	if (pair_chunk(cf, MIDX_CHUNKID_OIDLOOKUP, &m->chunk_oid_lookup) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required OID lookup chunk"));
-	if (!m->chunk_object_offsets)
+	if (pair_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, &m->chunk_object_offsets) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required object offsets chunk"));
 
+	pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets);
+
 	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
 
 	m->pack_names = xcalloc(m->num_packs, sizeof(*m->pack_names));
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 297de502a94..ad4e878b65b 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -314,12 +314,12 @@ test_expect_success 'verify bad OID version' '
 
 test_expect_success 'verify truncated chunk count' '
 	corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\01" $objdir \
-		"missing required"
+		"final chunk has non-zero id"
 '
 
 test_expect_success 'verify extended chunk count' '
 	corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\07" $objdir \
-		"terminating multi-pack-index chunk id appears earlier than expected"
+		"terminating chunk id appears earlier than expected"
 '
 
 test_expect_success 'verify missing required chunk' '
@@ -329,7 +329,7 @@ test_expect_success 'verify missing required chunk' '
 
 test_expect_success 'verify invalid chunk offset' '
 	corrupt_midx_and_verify $MIDX_BYTE_CHUNK_OFFSET "\01" $objdir \
-		"invalid chunk offset (too large)"
+		"improper chunk offset(s)"
 '
 
 test_expect_success 'verify packnames out of order' '
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v2 15/17] midx: use 64-bit multiplication for chunk sizes
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (13 preceding siblings ...)
  2021-01-27 15:01   ` [PATCH v2 14/17] midx: " Derrick Stolee via GitGitGadget
@ 2021-01-27 15:01   ` Derrick Stolee via GitGitGadget
  2021-02-05  0:00     ` Junio C Hamano
  2021-01-27 15:01   ` [PATCH v2 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
                     ` (4 subsequent siblings)
  19 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When calculating the sizes of certain chunks, we should use 64-bit
multiplication always. This allows us to properly predict the chunk
sizes without risk of overflow.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/midx.c b/midx.c
index e94dcd34b7f..a365dac6bbc 100644
--- a/midx.c
+++ b/midx.c
@@ -913,7 +913,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT,
 		  write_midx_oid_fanout, MIDX_CHUNK_FANOUT_SIZE);
 	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
-		  write_midx_oid_lookup, ctx.entries_nr * the_hash_algo->rawsz);
+		  write_midx_oid_lookup, (uint64_t)ctx.entries_nr * the_hash_algo->rawsz);
 	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
 		  write_midx_object_offsets,
 		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH);
@@ -921,7 +921,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	if (ctx.large_offsets_needed)
 		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
 			write_midx_large_offsets,
-			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
+			(uint64_t)ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
 
 	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
 	write_chunkfile(cf, &ctx);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v2 16/17] chunk-format: restore duplicate chunk checks
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (14 preceding siblings ...)
  2021-01-27 15:01   ` [PATCH v2 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
@ 2021-01-27 15:01   ` Derrick Stolee via GitGitGadget
  2021-02-05  0:05     ` Junio C Hamano
  2021-01-27 15:01   ` [PATCH v2 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  19 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Before refactoring into the chunk-format API, the commit-graph parsing
logic included checks for duplicate chunks. It is unlikely that we would
desire a chunk-based file format that allows duplicate chunk IDs in the
table of contents, so add duplicate checks into
read_table_of_contents().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 chunk-format.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/chunk-format.c b/chunk-format.c
index 74501084cf8..1ee875df423 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -14,6 +14,7 @@ struct chunk_info {
 	chunk_write_fn write_fn;
 
 	const void *start;
+	unsigned found:1;
 };
 
 struct chunkfile {
@@ -98,6 +99,7 @@ int read_table_of_contents(struct chunkfile *cf,
 			   uint64_t toc_offset,
 			   int toc_length)
 {
+	int i;
 	uint32_t chunk_id;
 	const unsigned char *table_of_contents = mfile + toc_offset;
 
@@ -124,6 +126,14 @@ int read_table_of_contents(struct chunkfile *cf,
 			return -1;
 		}
 
+		for (i = 0; i < cf->chunks_nr; i++) {
+			if (cf->chunks[i].id == chunk_id) {
+				error(_("duplicate chunk ID %"PRIx32" found"),
+					chunk_id);
+				return -1;
+			}
+		}
+
 		cf->chunks[cf->chunks_nr].id = chunk_id;
 		cf->chunks[cf->chunks_nr].start = mfile + chunk_offset;
 		cf->chunks[cf->chunks_nr].size = next_chunk_offset - chunk_offset;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v2 17/17] chunk-format: add technical docs
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (15 preceding siblings ...)
  2021-01-27 15:01   ` [PATCH v2 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
@ 2021-01-27 15:01   ` Derrick Stolee via GitGitGadget
  2021-02-05  0:15     ` Junio C Hamano
  2021-01-27 16:03   ` [PATCH v2 00/17] Refactor chunk-format into an API Taylor Blau
                     ` (2 subsequent siblings)
  19 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-27 15:01 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The chunk-based file format is now an API in the code, but we should
also take time to document it as a file format. Specifically, it matches
the CHUNK LOOKUP sections of the commit-graph and multi-pack-index
files, but there are some commonalities that should be grouped in this
document.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/chunk-format.txt      | 54 +++++++++++++++++++
 .../technical/commit-graph-format.txt         |  3 ++
 Documentation/technical/pack-format.txt       |  3 ++
 3 files changed, 60 insertions(+)
 create mode 100644 Documentation/technical/chunk-format.txt

diff --git a/Documentation/technical/chunk-format.txt b/Documentation/technical/chunk-format.txt
new file mode 100644
index 00000000000..3db3792dea2
--- /dev/null
+++ b/Documentation/technical/chunk-format.txt
@@ -0,0 +1,54 @@
+Chunk-based file formats
+========================
+
+Some file formats in Git use a common concept of "chunks" to describe
+sections of the file. This allows structured access to a large file by
+scanning a small "table of contents" for the remaining data. This common
+format is used by the `commit-graph` and `multi-pack-index` files. See
+link:technical/pack-format.html[the `multi-pack-index` format] and
+link:technical/commit-graph-format.html[the `commit-graph` format] for
+how they use the chunks to describe structured data.
+
+A chunk-based file format begins with some header information custom to
+that format. That header should include enough information to identify
+the file type, format version, and number of chunks in the file. From this
+information, a reader can determine the start of the chunk-based region.
+
+The chunk-based region starts with a table of contents describing where
+each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
+where C is the number of chunks. Consider the following table:
+
+  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
+  |--------------------|------------------------|
+  | ID[0]              | OFFSET[0]              |
+  | ...                | ...                    |
+  | ID[C-1]            | OFFSET[C-1]            |
+  | 0x0000             | OFFSET[C]              |
+
+Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
+Each integer is stored in network-byte order.
+
+The chunk identifier `ID[i]` is a label for the data stored within this
+file from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
+size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
+and `OFFSET[i]`. This requires that the chunk data appears contiguously
+in the same order as the table of contents.
+
+The final entry in the table of contents must have an ID of four zero
+bytes. This confirms that the table of contents is ending and provides
+the offset for the end of the chunk-based data.
+
+Note: The chunk-based format expects that the file contains _at least_ a
+trailing hash after `OFFSET[C]`.
+
+Functions for working with chunk-based file formats are declared in
+`chunk-format.h`. Using these methods provides extra checks that assist
+developers when creating new file formats, including:
+
+ 1. Writing and reading the table of contents.
+
+ 2. Verifying that the data written in a chunk matches the expected size
+    that was recorded in the table of contents.
+
+ 3. Checking that a table of contents describes offsets properly within
+    the file boundaries.
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index b6658eff188..87971c27dd7 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -61,6 +61,9 @@ CHUNK LOOKUP:
       the length using the next chunk position if necessary.) Each chunk
       ID appears at most once.
 
+  The CHUNK LOOKUP matches the table of contents from
+  link:technical/chunk-format.html[the chunk-based file format].
+
   The remaining data in the body is described one chunk at a time, and
   these chunks may be given in any order. Chunks are required unless
   otherwise specified.
diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index f96b2e605f3..2fb1e60d29e 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -301,6 +301,9 @@ CHUNK LOOKUP:
 	    (Chunks are provided in file-order, so you can infer the length
 	    using the next chunk position if necessary.)
 
+	The CHUNK LOOKUP matches the table of contents from
+	link:technical/chunk-format.html[the chunk-based file format].
+
 	The remaining data in the body is described one chunk at a time, and
 	these chunks may be given in any order. Chunks are required unless
 	otherwise specified.
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 00/17] Refactor chunk-format into an API
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (16 preceding siblings ...)
  2021-01-27 15:01   ` [PATCH v2 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
@ 2021-01-27 16:03   ` Taylor Blau
  2021-02-05  2:08   ` Junio C Hamano
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  19 siblings, 0 replies; 120+ messages in thread
From: Taylor Blau @ 2021-01-27 16:03 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee

On Wed, Jan 27, 2021 at 03:01:39PM +0000, Derrick Stolee via GitGitGadget wrote:
> Updates in V2
> =============
>
>  * The method pair_chunk() now automatically sets a pointer while
>    read_chunk() uses the callback. This greatly reduces the code size.
>
>  * Pointer casts are now implicit instead of explicit.
>
>  * Extra care is taken to not overflow when verifying chunk sizes on write.

Thanks, I read the range-diff between this version and the last and
appreciate you taking the time to address all of my concerns.

I think that this is ready to go, so please have my:

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 02/17] chunk-format: create chunk format write API
  2021-01-27 15:01   ` [PATCH v2 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
@ 2021-02-04 21:24     ` Junio C Hamano
  2021-02-04 22:40       ` Junio C Hamano
  2021-02-05 11:37       ` Derrick Stolee
  0 siblings, 2 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-02-04 21:24 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +/*
> + * When writing a chunk-based file format, collect the chunks in
> + * an array of chunk_info structs. The size stores the _expected_
> + * amount of data that will be written by write_fn.
> + */
> +struct chunk_info {
> +	uint32_t id;
> +	uint64_t size;
> +	chunk_write_fn write_fn;
> +};
> +...
> +void add_chunk(struct chunkfile *cf,
> +	       uint64_t id,
> +	       chunk_write_fn fn,
> +	       size_t size)
> +{
> +	ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc);
> +
> +	cf->chunks[cf->chunks_nr].id = id;
> +	cf->chunks[cf->chunks_nr].write_fn = fn;
> +	cf->chunks[cf->chunks_nr].size = size;
> +	cf->chunks_nr++;
> +}

Somebody somewhere between the caller in the higher part of the
callchain (that has to work with platform native types) and the
on-disk format at the bottom of the callchain (that has to work
with fixed size data fields) needs to make sure that the size that
the higher level caller has fits on-disk data structure we define,
and the data we read from disk fits the in-core structure our callers
use on the reading side.

If there is a function at the one level closer to the disk than
"struct chunk_info" that takes a "struct chunk_info" and writes the
id and size to disk (and fills "struct chunk_info" from what is read
from the disk, on the reading side), it would be a good place to do
the size_t to uint64_t conversion.

It is OK to do the conversion-while-checking in add_chunk(), too.

But a type conversion from size_t to uint64_t done silently by
assignment bothers me.  Also, I think you meant to make the incoming
ID uint32_t; am I missing something, or did nobody notice it in the
review of the previous round?
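
To illustrate what a conversion-while-checking could look like at that
boundary, here is a minimal sketch.  The helper names u64_to_size() and
size_to_u64() are hypothetical and not part of this series; the sketch
only assumes that both types are unsigned.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper for the reading side: store 'v' in '*out' only
 * when a fixed-width on-disk value fits the platform's size_t.
 * Returns 0 on success, -1 when the value would be truncated.
 */
int u64_to_size(uint64_t v, size_t *out)
{
	if (v > SIZE_MAX)
		return -1;
	*out = (size_t)v;
	return 0;
}

/*
 * Hypothetical helper for the writing side: store 'v' in '*out' only
 * when a platform-native size fits the 8-byte on-disk field.
 */
int size_to_u64(size_t v, uint64_t *out)
{
	if ((uintmax_t)v > UINT64_MAX)
		return -1;
	*out = (uint64_t)v;
	return 0;
}
```

On a platform where size_t is 64 bits wide both checks can never fail,
but spelling them out documents where the two type families meet.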

> +int write_chunkfile(struct chunkfile *cf, void *data)
> +{
> +	int i;
> +	size_t cur_offset = cf->f->offset + cf->f->total;

That ought to be off_t, as it is a seek position inside a file
(struct hashfile.total is already off_t).

Use csum-file.h::hashfile_total() instead, perhaps?  .offset member
is an implementation detail of the hashfile API (i.e. some leftover
bits are kept in in-core buffer, until we accumulate enough to make
it worth flushing to the disk), and by using the helper, this code
does not have to know about it.

> +	/* Add the table of contents to the current offset */
> +	cur_offset += (cf->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH;

Is that 12 == sizeof(chunk_info.id) + sizeof(chunk_info.size)?
If so, this makes sense.

> +	for (i = 0; i < cf->chunks_nr; i++) {
> +		hashwrite_be32(cf->f, cf->chunks[i].id);
> +		hashwrite_be64(cf->f, cur_offset);
> +
> +		cur_offset += cf->chunks[i].size;
> +	}
> +
> +	/* Trailing entry marks the end of the chunks */
> +	hashwrite_be32(cf->f, 0);
> +	hashwrite_be64(cf->f, cur_offset);

OK.  This helper does not tell us anything about what comes in the
on-disk file before this point, but we write a table of contents
that says "chunk with this ID has this size, chunk with that ID has
that size, ...", concluded by something that looks like another
entry with chunk ID 0 that records the current offset as its size.
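
Going by the loop quoted above, the 8-byte field of each row holds a
starting offset (cur_offset) rather than a size, so a size is recovered
as the difference between two consecutive rows.  A minimal sketch of
that arithmetic follows; toc_row_offset() and its toc_start parameter
are hypothetical names, and the value 12 for CHUNK_LOOKUP_WIDTH is an
assumption taken from the 4-byte id plus 8-byte offset layout.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK_LOOKUP_WIDTH 12 /* assumed: 4-byte id + 8-byte offset */

/*
 * Hypothetical helper: the offset recorded in row 'row' of the table
 * of contents, for 0 <= row <= nr.  Row 'nr' is the terminating entry.
 * 'toc_start' is the file position where the table itself begins, and
 * 'sizes' holds the expected size of each of the 'nr' chunks.
 */
uint64_t toc_row_offset(uint64_t toc_start, const uint64_t *sizes,
			size_t nr, size_t row)
{
	/* The first chunk starts right after the (nr + 1) table rows. */
	uint64_t offset = toc_start + (uint64_t)(nr + 1) * CHUNK_LOOKUP_WIDTH;
	size_t i;

	for (i = 0; i < row && i < nr; i++)
		offset += sizes[i];
	return offset;
}
```

With a 12-byte header and two chunks of 1024 and 40 bytes, the rows
would record 48, 1072 and 1112; the terminating row gives the end of
the last chunk.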

> +	for (i = 0; i < cf->chunks_nr; i++) {
> +		uint64_t start_offset = cf->f->total + cf->f->offset;

Likewise about the type and use of hashfile_total().

> +		int result = cf->chunks[i].write_fn(cf->f, data);
> +
> +		if (result)
> +			return result;
> +
> +		if (cf->f->total + cf->f->offset - start_offset != cf->chunks[i].size)
> +			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
> +			    cf->chunks[i].size, cf->chunks[i].id,
> +			    cf->f->total + cf->f->offset - start_offset);

I won't complain, as this apparently is sufficient to abstract out
the two existing chunked format files, but it somehow feels a bit
limiting that the one that calls add_chunk() is required to know
what the size of generated data would be, way before .write_fn() is
called to produce the actual data here.

> +	}
> +
> +	return 0;
> +}
> diff --git a/chunk-format.h b/chunk-format.h
> new file mode 100644
> index 00000000000..bfaed672813
> --- /dev/null
> +++ b/chunk-format.h
> @@ -0,0 +1,20 @@
> +#ifndef CHUNK_FORMAT_H
> +#define CHUNK_FORMAT_H
> +
> +#include "git-compat-util.h"
> +
> +struct hashfile;
> +struct chunkfile;
> +
> +struct chunkfile *init_chunkfile(struct hashfile *f);
> +void free_chunkfile(struct chunkfile *cf);
> +int get_num_chunks(struct chunkfile *cf);
> +typedef int (*chunk_write_fn)(struct hashfile *f,
> +			      void *data);

Write this on a single line.

> +void add_chunk(struct chunkfile *cf,
> +	       uint64_t id,
> +	       chunk_write_fn fn,
> +	       size_t size);

Shouldn't this match the order of members in chunk_info struct?

> +int write_chunkfile(struct chunkfile *cf, void *data);
> +
> +#endif

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 02/17] chunk-format: create chunk format write API
  2021-02-04 21:24     ` Junio C Hamano
@ 2021-02-04 22:40       ` Junio C Hamano
  2021-02-05 11:37       ` Derrick Stolee
  1 sibling, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-02-04 22:40 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

>> +void add_chunk(struct chunkfile *cf,
>> +	       uint64_t id,
>> +	       chunk_write_fn fn,
>> +	       size_t size);
>
> Shouldn't this match the order of members in chunk_info struct?

Nah.  Unless there are other reasons to touch numerous add_chunk()
calls that have already been written (like we find a need to add a
new parameter to the call), I do not think this matters too much.

Thanks.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 09/17] midx: return success/failure in chunk write methods
  2021-01-27 15:01   ` [PATCH v2 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
@ 2021-02-04 22:59     ` Junio C Hamano
  2021-02-05 11:42       ` Derrick Stolee
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2021-02-04 22:59 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> Historically, the chunk-writing methods in midx.c have returned the
> amount of data written so the writer method could compare this with the
> table of contents. This presents with some interesting issues:
>
> 1. If a chunk writing method has a bug that miscalculates the written
>    bytes, then we can satisfy the table of contents without actually
>    writing the right amount of data to the hashfile. The commit-graph
>    writing code checks the hashfile struct directly for a more robust
>    verification.
>
> 2. There is no way for a chunk writing method to gracefully fail.
>    Returning an int presents an opportunity to fail without a die().
>
> 3. The current pattern doesn't match chunk_write_fn type exactly, so we
>    cannot share code with commit-graph.c
>
> For these reasons, convert the midx chunk writer methods to return an
> 'int'. Since none of them fail at the moment, they all return 0.

OK, that makes sense.

In this step I see the same "f->total + f->offset" sprinkled in some
places, but they will all go away in the "midx: use chunk-format API
in write_midx_internal()" step, so it is not such a huge deal.

Thanks.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 12/17] chunk-format: create read chunk API
  2021-01-27 15:01   ` [PATCH v2 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
@ 2021-02-04 23:40     ` Junio C Hamano
  2021-02-05 12:19       ` Derrick Stolee
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2021-02-04 23:40 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> We are re-using the anonymous 'struct chunkfile' data, as it is internal
> to the chunk-format API. This gives it essentially two modes: write and
> read. If the same struct instance was used for both reads and writes,
> then there would be failures.

Writing it here won't help future developers very much.  I think
that belongs to API/function doc for init_chunkfile().

> diff --git a/chunk-format.c b/chunk-format.c
> index ab914c55856..74501084cf8 100644
> --- a/chunk-format.c
> +++ b/chunk-format.c
> @@ -12,6 +12,8 @@ struct chunk_info {
>  	uint32_t id;
>  	uint64_t size;
>  	chunk_write_fn write_fn;
> +
> +	const void *start;
>  };
>  
>  struct chunkfile {
> @@ -89,3 +91,81 @@ int write_chunkfile(struct chunkfile *cf, void *data)
>  
>  	return 0;
>  }
> +
> +int read_table_of_contents(struct chunkfile *cf,
> +			   const unsigned char *mfile,
> +			   size_t mfile_size,
> +			   uint64_t toc_offset,
> +			   int toc_length)

It's a bit of mystery, having seen how the table-of-contents is laid
out by reading the writing side of the code, how toc_offset and
toc_length are discovered by the caller.  IIRC, the size to cover
everything from the beginning of the file to the end of the
table-of-contents was recorded as the length of a non-existent chunk
with id 0, but we need to be able to somehow find it to use that as
a way to get to the (end of) table-of-contents from the beginning of
the file.   I guess we'll learn enough when we get to read the code
that calls this function.

> +{
> +	uint32_t chunk_id;
> +	const unsigned char *table_of_contents = mfile + toc_offset;
> +
> +	ALLOC_GROW(cf->chunks, toc_length, cf->chunks_alloc);
> +
> +	while (toc_length--) {
> +		uint64_t chunk_offset, next_chunk_offset;
> +
> +		chunk_id = get_be32(table_of_contents);
> +		chunk_offset = get_be64(table_of_contents + 4);
> +
> +		if (!chunk_id) {
> +			error(_("terminating chunk id appears earlier than expected"));
> +			return 1;
> +		}
> +
> +		table_of_contents += CHUNK_LOOKUP_WIDTH;
> +		next_chunk_offset = get_be64(table_of_contents + 4);
> +
> +		if (next_chunk_offset < chunk_offset ||
> +		    next_chunk_offset > mfile_size - the_hash_algo->rawsz) {

The chunks have to be recorded in toc in the order they appear, and
there must be enough room to store the hashfile trailer checksum
after the last chunk.  OK.
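
The same two conditions, restated over a plain array of already-decoded
offsets, might look like the sketch below.  check_toc_offsets() is a
hypothetical name, and trailer_len stands in for
the_hash_algo->rawsz; the extra guard against file_size being smaller
than the trailer is my addition, not something the quoted hunk does.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check: 'offsets' holds the nr_rows offsets of a table
 * of contents, terminating row included.  Returns 0 when each offset
 * is no smaller than the one before it and every chunk ends before the
 * trailing checksum; returns -1 otherwise.
 */
int check_toc_offsets(const uint64_t *offsets, size_t nr_rows,
		      uint64_t file_size, uint64_t trailer_len)
{
	size_t i;

	/* Avoid unsigned wrap-around in the subtraction below. */
	if (file_size < trailer_len)
		return -1;

	for (i = 0; i + 1 < nr_rows; i++) {
		if (offsets[i + 1] < offsets[i] ||
		    offsets[i + 1] > file_size - trailer_len)
			return -1;
	}
	return 0;
}
```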

> +			error(_("improper chunk offset(s) %"PRIx64" and %"PRIx64""),
> +			      chunk_offset, next_chunk_offset);
> +			return -1;
> +		}
> +
> +		cf->chunks[cf->chunks_nr].id = chunk_id;
> +		cf->chunks[cf->chunks_nr].start = mfile + chunk_offset;
> +		cf->chunks[cf->chunks_nr].size = next_chunk_offset - chunk_offset;
> +		cf->chunks_nr++;
> +	}
> +
> +	chunk_id = get_be32(table_of_contents);
> +	if (chunk_id) {
> +		error(_("final chunk has non-zero id %"PRIx32""), chunk_id);
> +		return -1;
> +	}

Shouldn't we be validating the size component associated with this
"id=0" fake chunk that appears at the end as well?  IIRC, the size
field should be pointing at the byte immediately after the TOC entry
for the last true chunk, immediately before this zero chunk id?

> +int pair_chunk(struct chunkfile *cf,
> +	       uint32_t chunk_id,
> +	       const unsigned char **p)
> +{
> +	int i;
> +
> +	for (i = 0; i < cf->chunks_nr; i++) {
> +		if (cf->chunks[i].id == chunk_id) {
> +			*p = cf->chunks[i].start;
> +			return 0;
> +		}

OK, the assumption here is that there will be at most one chunk that
has the chunk_id we seek to find (or putting it differently, second
and subsequent chunks with the same ID are ignored).  We may want to
write it down somewhere.

> +	}
> +
> +	return CHUNK_NOT_FOUND;
> +}
> +
> +int read_chunk(struct chunkfile *cf,
> +	       uint32_t chunk_id,
> +	       chunk_read_fn fn,
> +	       void *data)
> +{
> +	int i;
> +
> +	for (i = 0; i < cf->chunks_nr; i++) {
> +		if (cf->chunks[i].id == chunk_id)
> +			return fn(cf->chunks[i].start, cf->chunks[i].size, data);

It is curious why pair_chunk() exists in the first place.  With
something like this:

        int pair_chunk_fn(const unsigned char *chunk_start,
                          size_t chunk_size,
                          void *data)
        {
                const unsigned char **p = data;
                *p = chunk_start;
                return 0;
        }

instead of

	const unsigned char *location;

	pair_chunk(cf, chunk_id, &location);

we can write

	const unsigned char *location;

	read_chunk(cf, chunk_id, pair_chunk_fn, &location);

no?  That would allow us to reorganize the in-core TOC more easily
if it turns out to be necessary in the future.
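
To make the idea concrete, here is a self-contained sketch in which
pair_chunk() is a thin wrapper around read_chunk().  The struct toc and
struct toc_entry types are simplified stand-ins invented for this
example (the real struct chunkfile is private to chunk-format.c), so
the signatures differ from the ones in the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK_NOT_FOUND (-2)

typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
			     size_t chunk_size, void *data);

/* Simplified stand-in for the in-core table of contents. */
struct toc_entry {
	uint32_t id;
	const unsigned char *start;
	size_t size;
};

struct toc {
	const struct toc_entry *entries;
	size_t nr;
};

/* Call 'fn' on the first entry whose id matches 'chunk_id'. */
int read_chunk(const struct toc *t, uint32_t chunk_id,
	       chunk_read_fn fn, void *data)
{
	size_t i;

	for (i = 0; i < t->nr; i++) {
		if (t->entries[i].id == chunk_id)
			return fn(t->entries[i].start,
				  t->entries[i].size, data);
	}
	return CHUNK_NOT_FOUND;
}

/* Callback that only records where the chunk begins. */
int pair_chunk_fn(const unsigned char *chunk_start,
		  size_t chunk_size, void *data)
{
	const unsigned char **p = data;

	(void)chunk_size; /* unused: only the start is recorded */
	*p = chunk_start;
	return 0;
}

/* pair_chunk() expressed in terms of read_chunk(). */
int pair_chunk(const struct toc *t, uint32_t chunk_id,
	       const unsigned char **p)
{
	return read_chunk(t, chunk_id, pair_chunk_fn, p);
}
```

With this shape only read_chunk() walks the table, so a later change
to the lookup (say, a sorted table) would touch one function.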

> diff --git a/chunk-format.h b/chunk-format.h
> index bfaed672813..b62c9bf8ba1 100644
> --- a/chunk-format.h
> +++ b/chunk-format.h
> @@ -17,4 +17,37 @@ void add_chunk(struct chunkfile *cf,
>  	       size_t size);
>  int write_chunkfile(struct chunkfile *cf, void *data);
>  
> +int read_table_of_contents(struct chunkfile *cf,
> +			   const unsigned char *mfile,
> +			   size_t mfile_size,
> +			   uint64_t toc_offset,
> +			   int toc_length);
> +
> +#define CHUNK_NOT_FOUND (-2)
> +
> +/*
> + * Find 'chunk_id' in the given chunkfile and assign the
> + * given pointer to the position in the mmap'd file where
> + * that chunk begins.
> + *
> + * Returns CHUNK_NOT_FOUND if the chunk does not exist.
> + */
> +int pair_chunk(struct chunkfile *cf,
> +	       uint32_t chunk_id,
> +	       const unsigned char **p);
> +
> +typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
> +			     size_t chunk_size, void *data);

Is the assumption throughout the chunked file API that reading is
actually not done as a "seek+read+process", but as a normal memory
access over mmapped region?  The reason I ask is because the answer
affects the choice of the right type for the offset.  The function
signature of read_table_of_contents() uses size_t to represent the
length of the entire mmapped region that holds the data from the
file, and that is better than off_t, especially if size_t were
smaller than off_t (i.e. we may not be able to mmap a huge size that
filesystem can handle and let us access with seek+read).

But the assumption that the whole mfile can be mmapped in as a whole
is only in read_table_of_contents(), and users of read_chunk() API
can be oblivious, I think---IOW, we could "page in" the chunk's data
in read_chunk() while the callback function works on it in-core, and
then discard it, if we wanted to change the implementation [*].

	Side note: for that to work, the API must say that the
	callback function MUST NOT assume that the memory region
	starting at chunk_start it is given will stay in memory
	after it returns.  Otherwise, we cannot "page in" and "page
	out".

I am not advocating that we should not assume the entire file can be
mapped in.  I would however advocate to be explicit in documenting
what the users of API can and cannot do (e.g. if we want the "read"
callbacks to take advantage of the fact that mfile will stay mapped
until the chunkfile is discarded, we should say so, so that they will
not make unnecessary copies out of the mmapped region).

> +/*
> + * Find 'chunk_id' in the given chunkfile and call the
> + * given chunk_read_fn method with the information for
> + * that chunk.
> + *
> + * Returns CHUNK_NOT_FOUND if the chunk does not exist.
> + */
> +int read_chunk(struct chunkfile *cf,
> +	       uint32_t chunk_id,
> +	       chunk_read_fn fn,
> +	       void *data);

Did I miss an update to free_chunkfile() to release resources used
to read this file?  For some reason, unlike the writing side, the
reading side of this API feels a bit incomplete to me.

Thanks.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 15/17] midx: use 64-bit multiplication for chunk sizes
  2021-01-27 15:01   ` [PATCH v2 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
@ 2021-02-05  0:00     ` Junio C Hamano
  2021-02-05 10:59       ` Chris Torek
                         ` (2 more replies)
  0 siblings, 3 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-02-05  0:00 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> When calculating the sizes of certain chunks, we should use 64-bit
> multiplication always. This allows us to properly predict the chunk
> sizes without risk of overflow.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  midx.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

This one I find somewhat questionable for multiple reasons.

 * the fourth parameter of add_chunk() is of size_t, not uint64_t;
   shouldn't the multiplication be done in type size_t instead?

 * these multiplications were introduced in "midx: use chunk-format
   API in write_midx_internal()"; that step should use the
   arithmetic with cast (if necessary) from the start, no?

 * There is "ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH" passed to
   add_chunk(), in the post-context of the first hunk.  Shouldn't
   that be covered as well?  I didn't grep for all uses of
   add_chunk(), but I wouldn't be surprised if this patch missed
   some of the calls that need the same treatment.
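
As an illustration of why the type in which the product is computed
matters, here is a small sketch.  The helper names are made up for this
example and are not from midx.c; it assumes a platform where int is 32
bits wide.  The checked variant is similar in spirit to the st_mult()
helper in git-compat-util.h, except that it reports the overflow instead
of dying.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Product computed in 32 bits: it wraps before any widening happens. */
uint64_t size_narrow(uint32_t nr, uint32_t width)
{
	return (uint32_t)(nr * width);
}

/* Widen one operand first so the product is computed in 64 bits. */
uint64_t size_wide(uint32_t nr, uint32_t width)
{
	return (uint64_t)nr * width;
}

/*
 * Checked variant for a size_t parameter: stores the product and
 * returns 0 on success, or returns -1 if it does not fit in size_t.
 */
int size_checked(size_t nr, size_t width, size_t *out)
{
	assert(out != NULL);
	if (width && nr > SIZE_MAX / width)
		return -1;
	*out = nr * width;
	return 0;
}
```

With 300,000,000 entries of 20 bytes each, size_wide() yields
6,000,000,000 while size_narrow() yields 1,705,032,704, which is the
true product reduced modulo 2^32.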

> diff --git a/midx.c b/midx.c
> index e94dcd34b7f..a365dac6bbc 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -913,7 +913,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT,
>  		  write_midx_oid_fanout, MIDX_CHUNK_FANOUT_SIZE);
>  	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
> -		  write_midx_oid_lookup, ctx.entries_nr * the_hash_algo->rawsz);
> +		  write_midx_oid_lookup, (uint64_t)ctx.entries_nr * the_hash_algo->rawsz);
>  	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
>  		  write_midx_object_offsets,
>  		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH);
> @@ -921,7 +921,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  	if (ctx.large_offsets_needed)
>  		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
>  			write_midx_large_offsets,
> -			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
> +			(uint64_t)ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
>  
>  	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
>  	write_chunkfile(cf, &ctx);


* Re: [PATCH v2 16/17] chunk-format: restore duplicate chunk checks
  2021-01-27 15:01   ` [PATCH v2 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
@ 2021-02-05  0:05     ` Junio C Hamano
  2021-02-05 12:31       ` Derrick Stolee
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2021-02-05  0:05 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> Before refactoring into the chunk-format API, the commit-graph parsing
> logic included checks for duplicate chunks. It is unlikely that we would
> desire a chunk-based file format that allows duplicate chunk IDs in the
> table of contents, so add duplicate checks into
> read_table_of_contents().

Makes sense.  This answers a question I had while reading one of the
previous steps about the design, I think.

However...

> diff --git a/chunk-format.c b/chunk-format.c
> index 74501084cf8..1ee875df423 100644
> --- a/chunk-format.c
> +++ b/chunk-format.c
> @@ -14,6 +14,7 @@ struct chunk_info {
>  	chunk_write_fn write_fn;
>  
>  	const void *start;
> +	unsigned found:1;

This defines a .found member ...

> @@ -98,6 +99,7 @@ int read_table_of_contents(struct chunkfile *cf,
>  			   uint64_t toc_offset,
>  			   int toc_length)
>  {
> +	int i;
>  	uint32_t chunk_id;
>  	const unsigned char *table_of_contents = mfile + toc_offset;
>  
> @@ -124,6 +126,14 @@ int read_table_of_contents(struct chunkfile *cf,
>  			return -1;
>  		}
>  
> +		for (i = 0; i < cf->chunks_nr; i++) {
> +			if (cf->chunks[i].id == chunk_id) {
> +				error(_("duplicate chunk ID %"PRIx32" found"),
> +					chunk_id);
> +				return -1;
> +			}
> +		}
> +
>  		cf->chunks[cf->chunks_nr].id = chunk_id;
>  		cf->chunks[cf->chunks_nr].start = mfile + chunk_offset;
>  		cf->chunks[cf->chunks_nr].size = next_chunk_offset - chunk_offset;

... and no new code touches it.

The way a duplicate is found is by having an inner loop that checks the
IDs of chunks we've seen so far (quadratic, but presumably that
would not matter as long as we'd be dealing with just half a dozen
chunk types).

Is the .found bit used for something else and needs to be added in a
different step?
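
For reference, the same quadratic scan can be written without any
per-chunk flag at all.  This is a standalone sketch over a plain array
of IDs; find_duplicate_id() is a hypothetical name, not something in
chunk-format.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns the index of the first entry whose ID repeats an earlier
 * one, or -1 if all IDs are distinct.  Quadratic in 'nr', which is
 * fine for a handful of chunk types.
 */
int find_duplicate_id(const uint32_t *ids, size_t nr)
{
	size_t i, j;

	assert(ids != NULL || nr == 0);
	for (i = 1; i < nr; i++)
		for (j = 0; j < i; j++)
			if (ids[i] == ids[j])
				return (int)i;
	return -1;
}
```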


* Re: [PATCH v2 17/17] chunk-format: add technical docs
  2021-01-27 15:01   ` [PATCH v2 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
@ 2021-02-05  0:15     ` Junio C Hamano
  0 siblings, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-02-05  0:15 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +Chunk-based file formats
> +========================
> +
> +Some file formats in Git use a common concept of "chunks" to describe
> +sections of the file. This allows structured access to a large file by
> +scanning a small "table of contents" for the remaining data. This common
> +format is used by the `commit-graph` and `multi-pack-index` files. See
> +link:technical/pack-format.html[the `multi-pack-index` format] and
> +link:technical/commit-graph-format.html[the `commit-graph` format] for
> +how they use the chunks to describe structured data.
> +
> +A chunk-based file format begins with some header information custom to
> +that format. That header should include enough information to identify
> +the file type, format version, and number of chunks in the file. From this
> +information, that file can determine the start of the chunk-based region.
> +
> +The chunk-based region starts with a table of contents describing where
> +each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
> +where C is the number of chunks. Consider the following table:
> +
> +  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
> +  |--------------------|------------------------|
> +  | ID[0]              | OFFSET[0]              |
> +  | ...                | ...                    |
> +  | ID[C]              | OFFSET[C]              |
> +  | 0x0000             | OFFSET[C+1]            |
> +
> +Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
> +Each integer is stored in network-byte order.
> +
> +The chunk identifier `ID[i]` is a label for the data stored within this
> +file from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
> +size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
> +and `OFFSET[i]`. This requires that the chunk data appears contiguously
> +in the same order as the table of contents.
> +
> +The final entry in the table of contents must be four zero bytes. This
> +confirms that the table of contents is ending and provides the offset for
> +the end of the chunk-based data.
> +
> +Note: The chunk-based format expects that the file contains _at least_ a
> +trailing hash after `OFFSET[C+1]`.

I think the above describes what I saw in the writing side of the
code quite clearly and very well.  I misread that the OFFSET[C+1]
was pointing elsewhere in my review of [2/17] somehow, but the code
is clear that it points at the end of the last chunk, and the above
documents it well.
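
A reader following only the documented layout could decode the table
along these lines.  This is a sketch under the assumptions stated in the
doc (12-byte rows, network byte order, zero ID in the final row); struct
toc_entry, parse_toc() and the load_be*() helpers are names invented
here, not the chunk-format API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_ROW_WIDTH 12 /* 4-byte ID + 8-byte offset */

struct toc_entry {
	uint32_t id;
	uint64_t offset;
	uint64_t size;
};

static uint32_t load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t load_be64(const unsigned char *p)
{
	return ((uint64_t)load_be32(p) << 32) | load_be32(p + 4);
}

/*
 * Decode 'nr' chunk rows plus the terminating row from 'toc'.
 * Returns 0 on success, or -1 if an offset decreases from one row to
 * the next or the terminating ID is non-zero.
 */
int parse_toc(const unsigned char *toc, size_t nr, struct toc_entry *out)
{
	size_t i;

	assert(toc != NULL && (out != NULL || nr == 0));
	for (i = 0; i < nr; i++) {
		const unsigned char *row = toc + i * TOC_ROW_WIDTH;
		/* The next row's offset marks the end of this chunk. */
		uint64_t next = load_be64(row + TOC_ROW_WIDTH + 4);

		out[i].id = load_be32(row);
		out[i].offset = load_be64(row + 4);
		if (next < out[i].offset)
			return -1;
		out[i].size = next - out[i].offset;
	}
	return load_be32(toc + nr * TOC_ROW_WIDTH) ? -1 : 0;
}
```

Note that the size of the last real chunk comes from the offset stored
in the terminating row, which is why that row cannot be omitted.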

My comments on the need to document the reading side API, on what
the read_chunk callback should be able to assume (namely, the whole
thing stays in memory until the caller that decided to use chunkfile
API decides to discard it), still stand, I would think.

Thanks.



* Re: [PATCH v2 00/17] Refactor chunk-format into an API
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (17 preceding siblings ...)
  2021-01-27 16:03   ` [PATCH v2 00/17] Refactor chunk-format into an API Taylor Blau
@ 2021-02-05  2:08   ` Junio C Hamano
  2021-02-05  2:27     ` Derrick Stolee
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  19 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2021-02-05  2:08 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, szeder.dev, Chris Torek, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This is a restart on the topic previously submitted [1] but dropped because
> ak/corrected-commit-date was still in progress. This version is based on
> that branch.

I've read the topic through, and found it a pleasant read.  There
are some questionable uses of integer types, some uneven application
of casts, and the reading side API is somewhat underdocumented, but
the overall direction looked quite sane.

I am undecided if I should expect a reroll, or declare what we have
"already good enough" for 'next' and expect incremental refinements,
though.  A reroll that catches all nits would certainly make the
resulting topic's history nicer for future developers to work with,
but it's also a pain to re-read essentially the same patches again,
so...




* Re: [PATCH v2 00/17] Refactor chunk-format into an API
  2021-02-05  2:08   ` Junio C Hamano
@ 2021-02-05  2:27     ` Derrick Stolee
  0 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee @ 2021-02-05  2:27 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, szeder.dev, Chris Torek, Derrick Stolee

On 2/4/2021 9:08 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> This is a restart on the topic previously submitted [1] but dropped because
>> ak/corrected-commit-date was still in progress. This version is based on
>> that branch.
> 
> I've read the topic through, and found it a pleasant read.  There
> are some questionable uses of integer types, some uneven application
> of casts, and the reading side API is somewhat underdocumented, but
> the overall direction looked quite sane.
> 
> I am undecided if I should expect a reroll, or declare what we have
> "already good enough" for 'next' and expect incremental refinements,
> though.  A reroll that catches all nits would certainly make the
> resulting topic's history nicer for future developers to work with,
> but it's also a pain to re-read essentially the same patches again,
> so...

I appreciate the willingness to take the topic as-is, but I think
you've given me enough interesting comments to deserve a re-roll.
I will provide one tomorrow and hope that the range-diff is a
pleasant read ;).

Thanks,
-Stolee


* Re: [PATCH v2 15/17] midx: use 64-bit multiplication for chunk sizes
  2021-02-05  0:00     ` Junio C Hamano
@ 2021-02-05 10:59       ` Chris Torek
  2021-02-05 20:41         ` Junio C Hamano
  2021-02-05 12:30       ` Derrick Stolee
  2021-02-07 19:50       ` SZEDER Gábor
  2 siblings, 1 reply; 120+ messages in thread
From: Chris Torek @ 2021-02-05 10:59 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, Git List, Taylor Blau,
	René Scharfe, szeder.dev, Derrick Stolee, Derrick Stolee,
	Derrick Stolee

On Thu, Feb 4, 2021 at 4:00 PM Junio C Hamano <gitster@pobox.com> wrote:
>  * the fourth parameter of add_chunk() is of size_t, not uint64_t;
>    shouldn't the multiplication be done in type size_t instead?

There are (still) systems with 32-bit size_t (but 64-bit
off_t / file sizes), so ... probably not.  Is size_t ever more than
64 bits these days?

Chris


* Re: [PATCH v2 02/17] chunk-format: create chunk format write API
  2021-02-04 21:24     ` Junio C Hamano
  2021-02-04 22:40       ` Junio C Hamano
@ 2021-02-05 11:37       ` Derrick Stolee
  2021-02-05 19:25         ` Junio C Hamano
  1 sibling, 1 reply; 120+ messages in thread
From: Derrick Stolee @ 2021-02-05 11:37 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, szeder.dev, Chris Torek, Derrick Stolee, Derrick Stolee

On 2/4/2021 4:24 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> +/*
>> + * When writing a chunk-based file format, collect the chunks in
>> + * an array of chunk_info structs. The size stores the _expected_
>> + * amount of data that will be written by write_fn.
>> + */
>> +struct chunk_info {
>> +	uint32_t id;
>> +	uint64_t size;
>> +	chunk_write_fn write_fn;
>> +};
>> +...
>> +void add_chunk(struct chunkfile *cf,
>> +	       uint64_t id,
>> +	       chunk_write_fn fn,
>> +	       size_t size)
>> +{
>> +	ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc);
>> +
>> +	cf->chunks[cf->chunks_nr].id = id;
>> +	cf->chunks[cf->chunks_nr].write_fn = fn;
>> +	cf->chunks[cf->chunks_nr].size = size;
>> +	cf->chunks_nr++;
>> +}
> 
> Somebody somewhere between the caller in the higher part of the
> callchain (that has to work with platform native types) and the
> on-disk format at the bottom of the callchain (that has to work
> with fixed size data fields) needs to make sure that the size that
> the higher level caller has fits on-disk data structure we define,
> and the data we read from disk fits the in-core structure our caller
> use on the reading side.
> 
> If there is a function at the one level closer to the disk than
> "struct chunk_info" that takes a "struct chunk_info" and writes the
> id and size to disk (and fills "struct chunk_info" from what is read
> from the disk, on the reading side), it would be a good place to do
> the size_t to uint64_t conversion.

I'm fine with keeping the external interface focused on size_t
instead of uint64_t.

> It is OK to do the conversion-while-checking in add_chunk(), too.
> 
> But a silent type casting from size_t to uint64_t done silently by
> assignment bothers me. 

Does this bother you only because it's part of the external interface?
If I understand correctly, uint64_t will always be at least as big
as size_t, so this doesn't need any protections for overflow or
anything. Is there something I should be doing before casting?

> Also, I think you meant to make the incoming
> ID uint32_t; am I missing something, or did nobody notice it in the
> review of the previous round?

Yes, this should be 32-bits. Will fix.
 
>> +int write_chunkfile(struct chunkfile *cf, void *data)
>> +{
>> +	int i;
>> +	size_t cur_offset = cf->f->offset + cf->f->total;
> 
> That ought to be off_t, as it is a seek position inside a file
> (struct hashfile.total is already off_t).

I can use off_t for the other offsets in this computation, but
cur_offset will be used in hashwrite_be64(), so maybe it is best
to use uint64_t here?

> Use csum-file.h::hashfile_total() instead, perhaps?  .offset member
> is an implementation detail of the hashfile API (i.e. some leftover
> bits are kept in in-core buffer, until we accumulate enough to make
> it worth flushing to the disk), and by using the helper, this code
> does not have to know about it.

Thanks! This is cleaner.

>> +	/* Add the table of contents to the current offset */
>> +	cur_offset += (cf->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH;
> 
> Is that 12 == sizeof(chunk_info.id) + sizeof(chunk_info.size)?
> If so, this makes sense.

Yes.

>> +	for (i = 0; i < cf->chunks_nr; i++) {
>> +		hashwrite_be32(cf->f, cf->chunks[i].id);
>> +		hashwrite_be64(cf->f, cur_offset);
>> +
>> +		cur_offset += cf->chunks[i].size;
>> +	}
>> +
>> +	/* Trailing entry marks the end of the chunks */
>> +	hashwrite_be32(cf->f, 0);
>> +	hashwrite_be64(cf->f, cur_offset);
> 
> OK.  This helper does not tell us anything about what comes in the
> on-disk file before this point, but we write a table of contents
> that says "chunk with this ID has this size, chunk with that ID has
> that size, ...", concluded by something that looks like another
> entry with chunk ID 0 that records the current offset as its size.

Right. The table of contents gives us enough information to find
the start _and_ end of each chunk (and hence compute their size).
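
As a standalone illustration of that property, the sketch below
serializes a table of contents from a list of expected sizes, mirroring
the running-offset loop in write_chunkfile().  It writes into a plain
buffer instead of a hashfile, and write_toc() and the store_be*()
helpers are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_ROW_WIDTH 12 /* 4-byte ID + 8-byte offset */

static void store_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

static void store_be64(unsigned char *p, uint64_t v)
{
	store_be32(p, (uint32_t)(v >> 32));
	store_be32(p + 4, (uint32_t)v);
}

/*
 * Serialize a table of contents for 'nr' chunks into 'out', which
 * must hold (nr + 1) * TOC_ROW_WIDTH bytes.  'toc_start' is the file
 * offset at which the table itself begins.  Returns the offset just
 * past the last chunk, which is also stored in the terminating row.
 */
uint64_t write_toc(unsigned char *out, uint64_t toc_start,
		   const uint32_t *ids, const uint64_t *sizes, size_t nr)
{
	/* The first chunk starts right after the table. */
	uint64_t cur = toc_start + (uint64_t)(nr + 1) * TOC_ROW_WIDTH;
	size_t i;

	assert(out != NULL);
	for (i = 0; i < nr; i++) {
		store_be32(out + i * TOC_ROW_WIDTH, ids[i]);
		store_be64(out + i * TOC_ROW_WIDTH + 4, cur);
		cur += sizes[i];
	}

	/* The trailing entry marks the end of the chunks. */
	store_be32(out + nr * TOC_ROW_WIDTH, 0);
	store_be64(out + nr * TOC_ROW_WIDTH + 4, cur);
	return cur;
}
```

With an 8-byte header and two chunks of 30 and 70 bytes, the rows record
offsets 44 and 74, and the terminating row records 144.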

>> +	for (i = 0; i < cf->chunks_nr; i++) {
>> +		uint64_t start_offset = cf->f->total + cf->f->offset;
> 
> Likewise about the type and use of hashfile_total().

This one can definitely be off_t.

>> +		int result = cf->chunks[i].write_fn(cf->f, data);
>> +
>> +		if (result)
>> +			return result;
>> +
>> +		if (cf->f->total + cf->f->offset - start_offset != cf->chunks[i].size)
>> +			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
>> +			    cf->chunks[i].size, cf->chunks[i].id,
>> +			    cf->f->total + cf->f->offset - start_offset);
> 
> I won't complain, as this apparently is sufficient to abstract out
> the two existing chunked format files, but it somehow feels a bit
> limiting that the one that calls add_chunk() is required to know
> what the size of generated data would be, way before .write_fn() is
> called to produce the actual data here.

This was pointed out earlier, but it _is_ part of the existing users
of the format. The table of contents is written at the start of the
file instead of the end (such as in the .zip format).

The current chunk format API makes the same assumption (ToC comes
first) but could be adjusted later to let this part of the method
dynamically compute the chunk sizes and fill a ToC at the end. The
way to modify this API would be to add a 'flags' parameter.

So far, this has not been necessary, but might be in the future.

>> +	}
>> +
>> +	return 0;
>> +}
>> diff --git a/chunk-format.h b/chunk-format.h
>> new file mode 100644
>> index 00000000000..bfaed672813
>> --- /dev/null
>> +++ b/chunk-format.h
>> @@ -0,0 +1,20 @@
>> +#ifndef CHUNK_FORMAT_H
>> +#define CHUNK_FORMAT_H
>> +
>> +#include "git-compat-util.h"
>> +
>> +struct hashfile;
>> +struct chunkfile;
>> +
>> +struct chunkfile *init_chunkfile(struct hashfile *f);
>> +void free_chunkfile(struct chunkfile *cf);
>> +int get_num_chunks(struct chunkfile *cf);
>> +typedef int (*chunk_write_fn)(struct hashfile *f,
>> +			      void *data);
> 
> Write this on a single line.

Will do.

Thanks,
-Stolee


* Re: [PATCH v2 09/17] midx: return success/failure in chunk write methods
  2021-02-04 22:59     ` Junio C Hamano
@ 2021-02-05 11:42       ` Derrick Stolee
  0 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee @ 2021-02-05 11:42 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, szeder.dev, Chris Torek, Derrick Stolee, Derrick Stolee

On 2/4/2021 5:59 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> In this step I see the same "f->total + f->offset" sprinkled in some
> places, but they will all go away in the "midx: use chunk-format API
> in write_midx_internal()" step, so it is not such a huge deal.

I'll move the ones that are introduced by this patch, but others
already exist that also get removed by a later patch. I won't
touch those until they are deleted.

Thanks,
-Stolee


* Re: [PATCH v2 12/17] chunk-format: create read chunk API
  2021-02-04 23:40     ` Junio C Hamano
@ 2021-02-05 12:19       ` Derrick Stolee
  2021-02-05 19:37         ` Junio C Hamano
  0 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee @ 2021-02-05 12:19 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, szeder.dev, Chris Torek, Derrick Stolee, Derrick Stolee

On 2/4/2021 6:40 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> We are re-using the anonymous 'struct chunkfile' data, as it is internal
>> to the chunk-format API. This gives it essentially two modes: write and
>> read. If the same struct instance was used for both reads and writes,
>> then there would be failures.
> 
> Writing it here won't help future developers very much.  I think
> that belongs to API/function doc for init_chunkfile().

I will add a clear comment there. Thanks.

>> +int read_table_of_contents(struct chunkfile *cf,
>> +			   const unsigned char *mfile,
>> +			   size_t mfile_size,
>> +			   uint64_t toc_offset,
>> +			   int toc_length)
> 
> It's a bit of mystery, having seen how the table-of-contents is laid
> out by reading the writing side of the code, how toc_offset and
> toc_length are discovered by the caller.  IIRC, the size to cover
> everything from the beginning of the file to the end of the
> table-of-contents was recorded as the length of a non-existent chunk
> with id 0, but we need to be able to somehow find it to use that as
> a way to get to the (end of) table-of-contents from the beginning of
> the file.   I guess we'll learn enough when we get to read the code
> that calls this function.

The existing formats have a byte in their header specifying how
many chunks are in the table of contents. That's how this information
is known in advance.

If we want to instead rely on the terminating chunk id with value 0,
then this method could be modified in the future. It does complicate
the allocation of cf->chunks slightly.

>> +	chunk_id = get_be32(table_of_contents);
>> +	if (chunk_id) {
>> +		error(_("final chunk has non-zero id %"PRIx32""), chunk_id);
>> +		return -1;
>> +	}
> 
> Shouldn't we be validating the size component associated with this
> "id=0" fake chunk that appears at the end as well?  IIRC, the size
> field should be pointing at the byte immediately after the TOC entry
> for the last true chunk, immediately before this zero chunk id?

During the loop, we scanned ahead to find the offset of the
terminating chunk and compute the size of the last "real" chunk.

Any size validation here would be to check that the offset points
_exactly_ to the terminating hash, but that might be too restrictive
on the format. (Who knows if there is a legit reason to have non-
chunked data between the chunked data and the trailing hash?)

As I re-roll, I plan to skip this final check. It would be easy to
add it in a forward fix, though.

>> +int pair_chunk(struct chunkfile *cf,
>> +	       uint32_t chunk_id,
>> +	       const unsigned char **p)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < cf->chunks_nr; i++) {
>> +		if (cf->chunks[i].id == chunk_id) {
>> +			*p = cf->chunks[i].start;
>> +			return 0;
>> +		}
> 
> OK, the assumption here is that there will be at most one chunk that
> has the chunk_id we seek to find (or putting it differently, second
> and subsequent chunks with the same ID are ignored).  We may want to
> write it down somewhere.

This is enforced by a later patch, as you discovered.

>> +	for (i = 0; i < cf->chunks_nr; i++) {
>> +		if (cf->chunks[i].id == chunk_id)
>> +			return fn(cf->chunks[i].start, cf->chunks[i].size, data);
> 
> It is curious why pair_chunk() exists in the first place.  With
> something like this:
> 
>         int pair_chunk_fn(const unsigned char *chunk_start,
>                           size_t chunk_size,
>                           void *data)
>         {
>                 const unsigned char **p = data;
>                 *p = chunk_start;
>                 return 0;
>         }
> 
> instead of
> 
> 	const unsigned char *location;
> 
> 	pair_chunk(cf, chunk_id, &location);
> 
> we can write
> 
> 	const unsigned char *location;
> 
> 	read_chunk(cf, chunk_id, pair_chunk_fn, &location);
> 
> no?  That would allow us to reorganize the in-core TOC more easily
> if it turns out to be necessary in the future.

I like this, but why not just use pair_chunk_fn inside of
the implementation of pair_chunk() so callers have an easy
interface. This results in the implementation:

static int pair_chunk_fn(const unsigned char *chunk_start,
			 size_t chunk_size,
			 void *data)
{
	const unsigned char **p = data;
	*p = chunk_start;
	return 0;
}

int pair_chunk(struct chunkfile *cf,
	       uint32_t chunk_id,
	       const unsigned char **p)
{
	return read_chunk(cf, chunk_id, pair_chunk_fn, p);
}
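
To show how the two entry points behave together, including the
not-found path, here is a self-contained mock.  Since struct chunkfile
is opaque, the mini_* types and functions below are stand-ins invented
for this example, and the value of MINI_CHUNK_NOT_FOUND is an
assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MINI_CHUNK_NOT_FOUND (-2) /* assumed value, for illustration */

typedef int (*mini_read_fn)(const unsigned char *start,
			    size_t size, void *data);

struct mini_chunk {
	uint32_t id;
	const unsigned char *start;
	size_t size;
};

struct mini_chunkfile {
	const struct mini_chunk *chunks;
	size_t nr;
};

/* Look up 'id' and hand the chunk to 'fn'; the first match wins. */
int mini_read_chunk(const struct mini_chunkfile *cf, uint32_t id,
		    mini_read_fn fn, void *data)
{
	size_t i;

	assert(cf != NULL && fn != NULL);
	for (i = 0; i < cf->nr; i++)
		if (cf->chunks[i].id == id)
			return fn(cf->chunks[i].start,
				  cf->chunks[i].size, data);
	return MINI_CHUNK_NOT_FOUND;
}

/* Callback that only records where the chunk starts. */
static int mini_pair_fn(const unsigned char *start,
			size_t size, void *data)
{
	const unsigned char **p = data;

	(void)size; /* unused: the pairing ignores the length */
	*p = start;
	return 0;
}

/* Convenience wrapper: the caller gets a pointer, not a callback. */
int mini_pair_chunk(const struct mini_chunkfile *cf, uint32_t id,
		    const unsigned char **p)
{
	return mini_read_chunk(cf, id, mini_pair_fn, p);
}
```

One consequence of this layering is that a missing chunk leaves the
caller's pointer untouched, so callers can pre-initialize it to NULL and
treat optional chunks uniformly.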

>> +typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
>> +			     size_t chunk_size, void *data);
> 
> Is the assumption throughout the chunked file API that reading is
> actually not done as a "seek+read+process", but as a normal memory
> access over mmapped region?  The reason I ask is because the answer
> affects the choice of the right type for the offset.  The function
> signature of read_table_of_contents() uses size_t to represent the
> length of the entire mmapped region that holds the data from the
> file, and that is better than off_t, especially if size_t were
> smaller than off_t (i.e. we may not be able to mmap a huge size that
> filesystem can handle and let us access with seek+read).
> 
> But the assumption that the whole mfile can be mmapped in as a whole
> is only in read_table_of_contents(), and users of read_chunk() API
> can be oblivious, I think---IOW, we could "page in" the chunk's data
> in read_chunk() while the callback function works on it in-core, and
> then discard it, if we wanted to change the implementation [*].
> 
> 	Side note: for that to work, the API must say that the
> 	callback function MUST NOT assume that the memory region
> 	starting at chunk_start it is given will stay in memory
> 	after it returns.  Otherwise, we cannot "page in" and "page
> 	out".
> 
> I am not advocating that we should not assume the entire file can be
> mapped in.  I would however advocate to be explicit in documenting
> what the users of API can and cannot do (e.g. if we want the "read"
> callbacks to take advantage of the fact that mfile will stay mapped
> until the chunkfile is discarded, we should say so, so that they will
> not make unnecessary copies out of the mmapped region).

I will add this expectation to the documentation.

>> +/*
>> + * Find 'chunk_id' in the given chunkfile and call the
>> + * given chunk_read_fn method with the information for
>> + * that chunk.
>> + *
>> + * Returns CHUNK_NOT_FOUND if the chunk does not exist.
>> + */
>> +int read_chunk(struct chunkfile *cf,
>> +	       uint32_t chunk_id,
>> +	       chunk_read_fn fn,
>> +	       void *data);
> 
> Did I miss an update to free_chunkfile() to release resources used
> to read this file?  For some reason, unlike the writing side, the
> reading side of this API feels a bit incomplete to me.

free_chunkfile(cf) already frees cf->chunks (and cf itself). No
other resources are allocated during the read. The caller is
responsible for the mmap'd file resource.

The reading side could use more documentation. I will think about
this.

Thanks,
-Stolee


* Re: [PATCH v2 15/17] midx: use 64-bit multiplication for chunk sizes
  2021-02-05  0:00     ` Junio C Hamano
  2021-02-05 10:59       ` Chris Torek
@ 2021-02-05 12:30       ` Derrick Stolee
  2021-02-05 19:42         ` Junio C Hamano
  2021-02-07 19:50       ` SZEDER Gábor
  2 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee @ 2021-02-05 12:30 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, szeder.dev, Chris Torek, Derrick Stolee, Derrick Stolee

On 2/4/2021 7:00 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> When calculating the sizes of certain chunks, we should use 64-bit
>> multiplication always. This allows us to properly predict the chunk
>> sizes without risk of overflow.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  midx.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> This one I find somewhat questionable for multiple reasons.
> 
>  * the fourth parameter of add_chunk() is of size_t, not uint64_t;
>    shouldn't the multiplication be done in type size_t instead?

This is probably appropriate because we will truncate to size_t if
it is smaller than uint64_t.

>  * these multiplications were introduced in "midx: use chunk-format
>    API in write_midx_internal()"; that step should use the
>    arithmetic with cast (if necessary) from the start, no?

I wanted to isolate these changes specifically so we could be
careful about the multiplications and not be distracted by them
when converting to the chunk-format API. The multiplications were
"moved" by that patch, not "introduced".

>  * There is "ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH" passed to
>    add_chunk(), in the post-context of the first hunk.  Shouldn't
>    that be covered as well?  I didn't grep for all uses of
>    add_chunk(), but I wouldn't be surprised if this patch missed
>    some of the calls that need the same treatment.

And here is a great example of why it was good to call out these
multiplications in their own patch.

I did a full inspection of all multiplications in midx.c and
found a few more instances of possible overflow. Two are on the
read side, but they require the object lookup chunk to have size
4 GB or larger. This is not _that_ far off from possibility! My
multi-pack-index for the Windows repository is currently ~1.6 GB
(in total, including the other chunks).

Thanks,
-Stolee


* Re: [PATCH v2 16/17] chunk-format: restore duplicate chunk checks
  2021-02-05  0:05     ` Junio C Hamano
@ 2021-02-05 12:31       ` Derrick Stolee
  0 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee @ 2021-02-05 12:31 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, szeder.dev, Chris Torek, Derrick Stolee, Derrick Stolee

On 2/4/2021 7:05 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>  	const void *start;
>> +	unsigned found:1;
> 
> This defines a .found member ...
> ... and no new code touches it.
> 
> The way a duplicate is found is by having an inner loop that checks the
> IDs of chunks we've seen so far (quadratic, but presumably that
> would not matter as long as we'd be dealing with just half a dozen
> chunk types).
> 
> Is the .found bit used for something else and needs to be added in a
> different step?

Nope. It is just noise that I should have caught and deleted.

Thanks,
-Stolee



* [PATCH v3 00/17] Refactor chunk-format into an API
  2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (18 preceding siblings ...)
  2021-02-05  2:08   ` Junio C Hamano
@ 2021-02-05 14:30   ` Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
                       ` (17 more replies)
  19 siblings, 18 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee

This is a restart on the topic previously submitted [1] but dropped because
ak/corrected-commit-date was still in progress. This version is based on
that branch.

[1]
https://lore.kernel.org/git/pull.804.git.1607012215.gitgitgadget@gmail.com/

This version also changes the approach to use a more dynamic interaction
with a struct chunkfile pointer. This idea is credited to Taylor Blau [2],
but I started again from scratch. I also go further to make struct chunkfile
anonymous to API consumers. It is defined only in chunk-format.c, which
should hopefully deter future users from interacting with that data
directly.

[2] https://lore.kernel.org/git/X8%2FI%2FRzXZksio+ri@nand.local/

This combined API is beneficial to reduce duplicated logic. Or rather, to
ensure that similar file formats have similar protections against bad data.
The multi-pack-index code did not have as many guards as the commit-graph
code did, but now they both share a common base that checks for things like
duplicate chunks or offsets outside the size of the file.
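
As a rough illustration of the kind of shared check described above (a
minimal sketch, not the code from this series), the following walks an
already-parsed table of contents and rejects duplicate chunk IDs and
offsets that fall outside the file. The struct toc_entry type and
validate_toc() are hypothetical names used only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one parsed table-of-contents row. */
struct toc_entry {
	uint32_t id;
	uint64_t offset;
};

/*
 * Return 0 if every chunk starts inside the file and no ID repeats,
 * -1 otherwise. The duplicate scan is quadratic, which is tolerable
 * for the handful of chunk types these formats use.
 */
int validate_toc(const struct toc_entry *toc, size_t nr,
		 uint64_t file_size)
{
	size_t i, j;

	for (i = 0; i < nr; i++) {
		if (toc[i].offset > file_size)
			return -1; /* offset outside the file */
		for (j = 0; j < i; j++)
			if (toc[j].id == toc[i].id)
				return -1; /* duplicate chunk ID */
	}
	return 0;
}
```

A caller would run this once after parsing the table of contents and
refuse to use the file on a non-zero return.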

Here are some stats for the end-to-end change:

 * 570 insertions(+), 456 deletions(-).
 * commit-graph.c: 107 insertions(+), 192 deletions(-)
 * midx.c: 164 insertions(+), 260 deletions(-)

While there is an overall increase to the code size, the consumers do get
smaller. Boilerplate things like abstracting methods to match chunk_write_fn
and chunk_read_fn make up a lot of these insertions. The "interesting" code
gets a lot smaller and cleaner.


Updates in V3
=============

 * API methods use better types, and their parameters are reordered to
   match the internal data more closely.

 * Use hashfile_total() instead of internal data values.

 * The implementation of pair_chunk() uses read_chunk().

 * init_chunkfile() has an in-code doc comment warning against using the
   same struct chunkfile for reads and writes.

 * More multiplications are correctly cast in midx.c.

 * The chunk-format technical docs are expanded.
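
To illustrate why the multiplication casts matter: when both operands
are 32-bit, the product is computed in 32 bits and can wrap before it is
widened. The sketch below uses made-up function names and a fixed-width
uint64_t cast to keep the example platform-independent (the patch itself
casts to size_t or off_t); it assumes unsigned int is 32 bits wide.

```c
#include <stdint.h>

/*
 * Both operands are 32-bit, so (assuming a 32-bit unsigned int) the
 * product wraps modulo 2^32 before it is converted to the 64-bit
 * return type.
 */
uint64_t chunk_size_narrow(uint32_t nr, uint32_t width)
{
	return nr * width;
}

/* Casting one operand first makes the whole multiplication 64-bit. */
uint64_t chunk_size_wide(uint32_t nr, uint32_t width)
{
	return (uint64_t)nr * width;
}
```

With nr = 0x20000000 and width = 8, the true product is 2^32, which the
narrow version cannot represent in its 32-bit intermediate result.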


Updates in V2
=============

 * The method pair_chunk() now automatically sets a pointer while
   read_chunk() uses the callback. This greatly reduces the code size.

 * Pointer casts are now implicit instead of explicit.

 * Extra care is taken to not overflow when verifying chunk sizes on write.
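
The relationship between the two read methods can be sketched with a
toy model: pairing a pointer is just the callback form with a trivial
callback. Everything prefixed with toy_ or TOY_ below is hypothetical
and simplified for illustration; it is not the chunk-format.c code.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_CHUNK_NOT_FOUND (-2) /* arbitrary value for this sketch */

typedef int (*toy_read_fn)(const unsigned char *start, size_t size,
			   void *data);

struct toy_chunk {
	uint32_t id;
	const unsigned char *start;
	size_t size;
};

struct toy_chunkfile {
	const struct toy_chunk *chunks;
	size_t nr;
};

/* Find the chunk by ID and hand its location and size to the callback. */
int toy_read_chunk(const struct toy_chunkfile *cf, uint32_t id,
		   toy_read_fn fn, void *data)
{
	size_t i;

	for (i = 0; i < cf->nr; i++)
		if (cf->chunks[i].id == id)
			return fn(cf->chunks[i].start,
				  cf->chunks[i].size, data);
	return TOY_CHUNK_NOT_FOUND;
}

/* Callback that only records where the chunk starts. */
static int toy_pair_fn(const unsigned char *start, size_t size, void *data)
{
	const unsigned char **p = data;

	(void)size; /* pairing ignores the chunk length */
	*p = start;
	return 0;
}

/* Pairing delegates to the callback-based reader. */
int toy_pair_chunk(const struct toy_chunkfile *cf, uint32_t id,
		   const unsigned char **p)
{
	return toy_read_chunk(cf, id, toy_pair_fn, p);
}
```

When the chunk is missing, the callback never runs, so the caller's
pointer is left untouched.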

Thanks, -Stolee

Derrick Stolee (17):
  commit-graph: anonymize data in chunk_write_fn
  chunk-format: create chunk format write API
  commit-graph: use chunk-format write API
  midx: rename pack_info to write_midx_context
  midx: use context in write_midx_pack_names()
  midx: add entries to write_midx_context
  midx: add pack_perm to write_midx_context
  midx: add num_large_offsets to write_midx_context
  midx: return success/failure in chunk write methods
  midx: drop chunk progress during write
  midx: use chunk-format API in write_midx_internal()
  chunk-format: create read chunk API
  commit-graph: use chunk-format read API
  midx: use chunk-format read API
  midx: use 64-bit multiplication for chunk sizes
  chunk-format: restore duplicate chunk checks
  chunk-format: add technical docs

 Documentation/technical/chunk-format.txt      | 116 +++++
 .../technical/commit-graph-format.txt         |   3 +
 Documentation/technical/pack-format.txt       |   3 +
 Makefile                                      |   1 +
 chunk-format.c                                | 180 ++++++++
 chunk-format.h                                |  65 +++
 commit-graph.c                                | 299 +++++-------
 midx.c                                        | 431 +++++++-----------
 t/t5318-commit-graph.sh                       |   2 +-
 t/t5319-multi-pack-index.sh                   |   6 +-
 10 files changed, 648 insertions(+), 458 deletions(-)
 create mode 100644 Documentation/technical/chunk-format.txt
 create mode 100644 chunk-format.c
 create mode 100644 chunk-format.h


base-commit: 5a3b130cad0d5c770f766e3af6d32b41766374c0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-848%2Fderrickstolee%2Fchunk-format%2Frefactor-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-848/derrickstolee/chunk-format/refactor-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/848

Range-diff vs v2:

  1:  243dcec94368 =  1:  243dcec94368 commit-graph: anonymize data in chunk_write_fn
  2:  814512f21671 !  2:  16c37d2370cf chunk-format: create chunk format write API
     @@ Commit message
           5. free the chunkfile struct using free_chunkfile().
      
          Helped-by: Taylor Blau <me@ttaylorr.com>
     +    Helped-by: Junio C Hamano <gitster@pobox.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## Makefile ##
     @@ chunk-format.c (new)
      +}
      +
      +void add_chunk(struct chunkfile *cf,
     -+	       uint64_t id,
     -+	       chunk_write_fn fn,
     -+	       size_t size)
     ++	       uint32_t id,
     ++	       size_t size,
     ++	       chunk_write_fn fn)
      +{
      +	ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc);
      +
     @@ chunk-format.c (new)
      +int write_chunkfile(struct chunkfile *cf, void *data)
      +{
      +	int i;
     -+	size_t cur_offset = cf->f->offset + cf->f->total;
     ++	uint64_t cur_offset = hashfile_total(cf->f);
      +
      +	/* Add the table of contents to the current offset */
      +	cur_offset += (cf->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH;
     @@ chunk-format.c (new)
      +	hashwrite_be64(cf->f, cur_offset);
      +
      +	for (i = 0; i < cf->chunks_nr; i++) {
     -+		uint64_t start_offset = cf->f->total + cf->f->offset;
     ++		off_t start_offset = hashfile_total(cf->f);
      +		int result = cf->chunks[i].write_fn(cf->f, data);
      +
      +		if (result)
      +			return result;
      +
     -+		if (cf->f->total + cf->f->offset - start_offset != cf->chunks[i].size)
     ++		if (hashfile_total(cf->f) - start_offset != cf->chunks[i].size)
      +			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
      +			    cf->chunks[i].size, cf->chunks[i].id,
     -+			    cf->f->total + cf->f->offset - start_offset);
     ++			    hashfile_total(cf->f) - start_offset);
      +	}
      +
      +	return 0;
     @@ chunk-format.h (new)
      +struct chunkfile *init_chunkfile(struct hashfile *f);
      +void free_chunkfile(struct chunkfile *cf);
      +int get_num_chunks(struct chunkfile *cf);
     -+typedef int (*chunk_write_fn)(struct hashfile *f,
     -+			      void *data);
     ++typedef int (*chunk_write_fn)(struct hashfile *f, void *data);
      +void add_chunk(struct chunkfile *cf,
     -+	       uint64_t id,
     -+	       chunk_write_fn fn,
     -+	       size_t size);
     ++	       uint32_t id,
     ++	       size_t size,
     ++	       chunk_write_fn fn);
      +int write_chunkfile(struct chunkfile *cf, void *data);
      +
      +#endif
  3:  70af6e3083f4 !  3:  e549e24d79af commit-graph: use chunk-format write API
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
      -	chunks[2].write_fn = write_graph_chunk_data;
      +	cf = init_chunkfile(f);
      +
     -+	add_chunk(cf, GRAPH_CHUNKID_OIDFANOUT,
     -+		  write_graph_chunk_fanout, GRAPH_FANOUT_SIZE);
     -+	add_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP,
     -+		  write_graph_chunk_oids, hashsz * ctx->commits.nr);
     -+	add_chunk(cf, GRAPH_CHUNKID_DATA,
     -+		  write_graph_chunk_data, (hashsz + 16) * ctx->commits.nr);
     ++	add_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, GRAPH_FANOUT_SIZE,
     ++		  write_graph_chunk_fanout);
     ++	add_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, hashsz * ctx->commits.nr,
     ++		  write_graph_chunk_oids);
     ++	add_chunk(cf, GRAPH_CHUNKID_DATA, (hashsz + 16) * ctx->commits.nr,
     ++		  write_graph_chunk_data);
       
       	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
       		ctx->write_generation_data = 0;
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
      -	}
      +	if (ctx->write_generation_data)
      +		add_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
     -+			  write_graph_chunk_generation_data,
     -+			  sizeof(uint32_t) * ctx->commits.nr);
     ++			  sizeof(uint32_t) * ctx->commits.nr,
     ++			  write_graph_chunk_generation_data);
      +	if (ctx->num_generation_data_overflows)
      +		add_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
     -+			  write_graph_chunk_generation_data_overflow,
     -+			  sizeof(timestamp_t) * ctx->num_generation_data_overflows);
     ++			  sizeof(timestamp_t) * ctx->num_generation_data_overflows,
     ++			  write_graph_chunk_generation_data_overflow);
      +	if (ctx->num_extra_edges)
      +		add_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES,
     -+			  write_graph_chunk_extra_edges,
     -+			  4 * ctx->num_extra_edges);
     ++			  4 * ctx->num_extra_edges,
     ++			  write_graph_chunk_extra_edges);
       	if (ctx->changed_paths) {
      -		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
      -		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
      -	chunks[num_chunks].id = 0;
      -	chunks[num_chunks].size = 0;
      +		add_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
     -+			  write_graph_chunk_bloom_indexes,
     -+			  sizeof(uint32_t) * ctx->commits.nr);
     ++			  sizeof(uint32_t) * ctx->commits.nr,
     ++			  write_graph_chunk_bloom_indexes);
      +		add_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
     -+			  write_graph_chunk_bloom_data,
      +			  sizeof(uint32_t) * 3
     -+				+ ctx->total_bloom_filter_data_size);
     ++				+ ctx->total_bloom_filter_data_size,
     ++			  write_graph_chunk_bloom_data);
      +	}
      +	if (ctx->num_commit_graphs_after > 1)
      +		add_chunk(cf, GRAPH_CHUNKID_BASE,
     -+			  write_graph_chunk_base,
     -+			  hashsz * (ctx->num_commit_graphs_after - 1));
     ++			  hashsz * (ctx->num_commit_graphs_after - 1),
     ++			  write_graph_chunk_base);
       
       	hashwrite_be32(f, GRAPH_SIGNATURE);
       
  4:  0cac7890bed7 =  4:  66ff49ed9309 midx: rename pack_info to write_midx_context
  5:  4a4e90b129ae =  5:  1d7484c0cffa midx: use context in write_midx_pack_names()
  6:  30ad423997b7 =  6:  ea0e7d40e537 midx: add entries to write_midx_context
  7:  2f1c496f3ab5 =  7:  b283a38fb775 midx: add pack_perm to write_midx_context
  8:  c4939548e51c =  8:  e7064512ab7f midx: add num_large_offsets to write_midx_context
  9:  b3cc73c22567 !  9:  7aa3242e15b7 midx: return success/failure in chunk write methods
     @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
       	stop_progress(&progress);
       
      -	if (written != chunk_offsets[num_chunks])
     -+	if (f->total + f->offset != chunk_offsets[num_chunks])
     ++	if (hashfile_total(f) != chunk_offsets[num_chunks])
       		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
      -		    written,
     -+		    f->total + f->offset,
     ++		    hashfile_total(f),
       		    chunk_offsets[num_chunks]);
       
       	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 10:  78744d3b7016 ! 10:  70f68c95e479 midx: drop chunk progress during write
     @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
       	}
      -	stop_progress(&progress);
       
     - 	if (f->total + f->offset != chunk_offsets[num_chunks])
     + 	if (hashfile_total(f) != chunk_offsets[num_chunks])
       		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
 11:  07dc0cf8c683 ! 11:  787cd7f18d2e midx: use chunk-format API in write_midx_internal()
     @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
      -			case MIDX_CHUNKID_PACKNAMES:
      -				write_midx_pack_names(f, &ctx);
      -				break;
     -+	add_chunk(cf, MIDX_CHUNKID_PACKNAMES,
     -+		  write_midx_pack_names, pack_name_concat_len);
     -+	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT,
     -+		  write_midx_oid_fanout, MIDX_CHUNK_FANOUT_SIZE);
     ++	add_chunk(cf, MIDX_CHUNKID_PACKNAMES, pack_name_concat_len,
     ++		  write_midx_pack_names);
     ++	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT, MIDX_CHUNK_FANOUT_SIZE,
     ++		  write_midx_oid_fanout);
      +	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
     -+		  write_midx_oid_lookup, ctx.entries_nr * the_hash_algo->rawsz);
     ++		  ctx.entries_nr * the_hash_algo->rawsz,
     ++		  write_midx_oid_lookup);
      +	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
     -+		  write_midx_object_offsets,
     -+		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH);
     ++		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH,
     ++		  write_midx_object_offsets);
       
      -			case MIDX_CHUNKID_OIDFANOUT:
      -				write_midx_oid_fanout(f, &ctx);
     @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack
      -	}
      +	if (ctx.large_offsets_needed)
      +		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
     -+			write_midx_large_offsets,
     -+			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
     ++			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH,
     ++			write_midx_large_offsets);
       
     --	if (f->total + f->offset != chunk_offsets[num_chunks])
     +-	if (hashfile_total(f) != chunk_offsets[num_chunks])
      -		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
     --		    f->total + f->offset,
     +-		    hashfile_total(f),
      -		    chunk_offsets[num_chunks]);
      +	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
      +	write_chunkfile(cf, &ctx);
 12:  d8d8e9e2aa3f ! 12:  366eb2afee83 chunk-format: create read chunk API
     @@ Commit message
          read. If the same struct instance was used for both reads and writes,
          then there would be failures.
      
     +    Helped-by: Junio C Hamano <gitster@pobox.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## chunk-format.c ##
     @@ chunk-format.c: int write_chunkfile(struct chunkfile *cf, void *data)
      +	return 0;
      +}
      +
     ++static int pair_chunk_fn(const unsigned char *chunk_start,
     ++			 size_t chunk_size,
     ++			 void *data)
     ++{
     ++	const unsigned char **p = data;
     ++	*p = chunk_start;
     ++	return 0;
     ++}
     ++
      +int pair_chunk(struct chunkfile *cf,
      +	       uint32_t chunk_id,
      +	       const unsigned char **p)
      +{
     -+	int i;
     -+
     -+	for (i = 0; i < cf->chunks_nr; i++) {
     -+		if (cf->chunks[i].id == chunk_id) {
     -+			*p = cf->chunks[i].start;
     -+			return 0;
     -+		}
     -+	}
     -+
     -+	return CHUNK_NOT_FOUND;
     ++	return read_chunk(cf, chunk_id, pair_chunk_fn, p);
      +}
      +
      +int read_chunk(struct chunkfile *cf,
     @@ chunk-format.c: int write_chunkfile(struct chunkfile *cf, void *data)
      +}
      
       ## chunk-format.h ##
     +@@
     + struct hashfile;
     + struct chunkfile;
     + 
     ++/*
     ++ * Initialize a 'struct chunkfile' for writing _or_ reading a file
     ++ * with the chunk format.
     ++ *
     ++ * If writing a file, supply a non-NULL 'struct hashfile *' that will
     ++ * be used to write.
     ++ *
     ++ * If reading a file, then supply the memory-mapped data to the
     ++ * pair_chunk() or read_chunk() methods, as appropriate.
     ++ *
     ++ * DO NOT MIX THESE MODES. Use different 'struct chunkfile' instances
     ++ * for reading and writing.
     ++ */
     + struct chunkfile *init_chunkfile(struct hashfile *f);
     + void free_chunkfile(struct chunkfile *cf);
     + int get_num_chunks(struct chunkfile *cf);
      @@ chunk-format.h: void add_chunk(struct chunkfile *cf,
     - 	       size_t size);
     + 	       chunk_write_fn fn);
       int write_chunkfile(struct chunkfile *cf, void *data);
       
      +int read_table_of_contents(struct chunkfile *cf,
 13:  8744d2785965 = 13:  7838ad32e2e0 commit-graph: use chunk-format read API
 14:  750c03253c95 ! 14:  6bddd9e63b9b midx: use chunk-format read API
     @@ midx.c: struct multi_pack_index *load_multi_pack_index(const char *object_dir, i
       	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
       
       	m->pack_names = xcalloc(m->num_packs, sizeof(*m->pack_names));
     +@@ midx.c: struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
     + cleanup_fail:
     + 	free(m);
     + 	free(midx_name);
     ++	free(cf);
     + 	if (midx_map)
     + 		munmap(midx_map, midx_size);
     + 	if (0 <= fd)
      
       ## t/t5319-multi-pack-index.sh ##
      @@ t/t5319-multi-pack-index.sh: test_expect_success 'verify bad OID version' '
 15:  83d292532a0f ! 15:  3cd97f389f1f midx: use 64-bit multiplication for chunk sizes
     @@ Commit message
          multiplication always. This allows us to properly predict the chunk
          sizes without risk of overflow.
      
     +    Other possible overflows were discovered by evaluating each
     +    multiplication in midx.c and ensuring that at least one side of the
     +    operator was of type size_t or off_t.
     +
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## midx.c ##
     +@@ midx.c: static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
     + 	const unsigned char *offset_data;
     + 	uint32_t offset32;
     + 
     +-	offset_data = m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH;
     ++	offset_data = m->chunk_object_offsets + (off_t)pos * MIDX_CHUNK_OFFSET_WIDTH;
     + 	offset32 = get_be32(offset_data + sizeof(uint32_t));
     + 
     + 	if (m->chunk_large_offsets && offset32 & MIDX_LARGE_OFFSET_NEEDED) {
     +@@ midx.c: static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
     + 
     + static uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos)
     + {
     +-	return get_be32(m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH);
     ++	return get_be32(m->chunk_object_offsets +
     ++			(off_t)pos * MIDX_CHUNK_OFFSET_WIDTH);
     + }
     + 
     + static int nth_midxed_pack_entry(struct repository *r,
      @@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack_index *
     - 	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT,
     - 		  write_midx_oid_fanout, MIDX_CHUNK_FANOUT_SIZE);
     + 	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT, MIDX_CHUNK_FANOUT_SIZE,
     + 		  write_midx_oid_fanout);
       	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
     --		  write_midx_oid_lookup, ctx.entries_nr * the_hash_algo->rawsz);
     -+		  write_midx_oid_lookup, (uint64_t)ctx.entries_nr * the_hash_algo->rawsz);
     +-		  ctx.entries_nr * the_hash_algo->rawsz,
     ++		  (size_t)ctx.entries_nr * the_hash_algo->rawsz,
     + 		  write_midx_oid_lookup);
       	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
     - 		  write_midx_object_offsets,
     - 		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH);
     -@@ midx.c: static int write_midx_internal(const char *object_dir, struct multi_pack_index *
     +-		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH,
     ++		  (size_t)ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH,
     + 		  write_midx_object_offsets);
     + 
       	if (ctx.large_offsets_needed)
       		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
     - 			write_midx_large_offsets,
     --			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
     -+			(uint64_t)ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH);
     +-			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH,
     ++			(size_t)ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH,
     + 			write_midx_large_offsets);
       
       	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
     - 	write_chunkfile(cf, &ctx);
 16:  669eeec707ab ! 16:  b9a1bddf615f chunk-format: restore duplicate chunk checks
     @@ Commit message
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## chunk-format.c ##
     -@@ chunk-format.c: struct chunk_info {
     - 	chunk_write_fn write_fn;
     - 
     - 	const void *start;
     -+	unsigned found:1;
     - };
     - 
     - struct chunkfile {
      @@ chunk-format.c: int read_table_of_contents(struct chunkfile *cf,
       			   uint64_t toc_offset,
       			   int toc_length)
 17:  8f3985ab5df3 ! 17:  4c7d751f1e39 chunk-format: add technical docs
     @@ Documentation/technical/chunk-format.txt (new)
      +
      +Functions for working with chunk-based file formats are declared in
      +`chunk-format.h`. Using these methods provide extra checks that assist
     -+developers when creating new file formats, including:
     ++developers when creating new file formats.
      +
     -+ 1. Writing and reading the table of contents.
     ++Writing chunk-based file formats
     ++--------------------------------
      +
     -+ 2. Verifying that the data written in a chunk matches the expected size
     -+    that was recorded in the table of contents.
     ++To write a chunk-based file format, create a `struct chunkfile` by
     ++calling `init_chunkfile()` and pass a `struct hashfile` pointer. The
     ++caller is responsible for opening the `hashfile` and writing header
     ++information so the file format is identifiable before the chunk-based
     ++format begins.
      +
     -+ 3. Checking that a table of contents describes offsets properly within
     -+    the file boundaries.
     ++Then, call `add_chunk()` for each chunk that is intended for write. This
     ++populates the `chunkfile` with information about the order and size of
     ++each chunk to write. Provide a `chunk_write_fn` function pointer to
     ++perform the write of the chunk data upon request.
     ++
     ++Call `write_chunkfile()` to write the table of contents to the `hashfile`
     ++followed by each of the chunks. This will verify that each chunk wrote
     ++the expected amount of data so the table of contents is correct.
     ++
     ++Finally, call `free_chunkfile()` to clear the `struct chunkfile` data. The
     ++caller is responsible for finalizing the `hashfile` by writing the trailing
     ++hash and closing the file.
     ++
     ++Reading chunk-based file formats
     ++--------------------------------
     ++
     ++To read a chunk-based file format, the file must be opened as a
     ++memory-mapped region. The chunk-format API expects that the entire file
     ++is mapped as a contiguous memory region.
     ++
     ++Initialize a `struct chunkfile` pointer with `init_chunkfile(NULL)`.
     ++
     ++After reading the header information from the beginning of the file,
     ++including the chunk count, call `read_table_of_contents()` to populate
     ++the `struct chunkfile` with the list of chunks, their offsets, and their
     ++sizes.
     ++
     ++Extract the data information for each chunk using `pair_chunk()` or
     ++`read_chunk()`:
     ++
     ++* `pair_chunk()` assigns a given pointer with the location inside the
     ++  memory-mapped file corresponding to that chunk's offset. If the chunk
     ++  does not exist, then the pointer is not modified.
     ++
     ++* `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
     ++  with the appropriate initial pointer and size information. The function
     ++  is not called if the chunk does not exist. Use this method to read chunks
     ++  if you need to perform immediate parsing or if you need to execute logic
     ++  based on the size of the chunk.
     ++
     ++After calling these methods, call `free_chunkfile()` to clear the
     ++`struct chunkfile` data. This will not close the memory-mapped region.
     ++Callers are expected to own that data for the timeframe the pointers into
     ++the region are needed.
     ++
     ++Examples
     ++--------
     ++
     ++These file formats use the chunk-format API, and can be used as examples
     ++for future formats:
     ++
     ++* *commit-graph:* see `write_commit_graph_file()` and `parse_commit_graph()`
     ++  in `commit-graph.c` for how the chunk-format API is used to write and
     ++  parse the commit-graph file format documented in
     ++  link:technical/commit-graph-format.html[the commit-graph file format].
     ++
     ++* *multi-pack-index:* see `write_midx_internal()` and `load_multi_pack_index()`
     ++  in `midx.c` for how the chunk-format API is used to write and
     ++  parse the multi-pack-index file format documented in
     ++  link:technical/pack-format.html[the multi-pack-index file format].
      
       ## Documentation/technical/commit-graph-format.txt ##
      @@ Documentation/technical/commit-graph-format.txt: CHUNK LOOKUP:

-- 
gitgitgadget


* [PATCH v3 01/17] commit-graph: anonymize data in chunk_write_fn
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
@ 2021-02-05 14:30     ` Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
                       ` (16 subsequent siblings)
  17 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In preparation for creating an API around file formats using chunks and
tables of contents, prepare the commit-graph write code to use
prototypes that will match this new API.

Specifically, convert chunk_write_fn to take a "void *data" parameter
instead of the commit-graph-specific "struct write_commit_graph_context"
pointer.
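
The pattern can be sketched in isolation as follows. The example_*
names and run_writer() are hypothetical stand-ins, with struct
example_out playing the role of the hashfile; this is an illustration
of the "void *data" idiom, not code from commit-graph.c.

```c
#include <stddef.h>

/* Stand-in for the output handle (the real code uses struct hashfile). */
struct example_out {
	size_t written;
};

/* Stand-in for a format-specific context struct. */
struct example_ctx {
	size_t nr;
};

/* Generic signature: the context travels as an untyped pointer. */
typedef int (*example_write_fn)(struct example_out *f, void *data);

/* Each callback recovers its own context type on entry. */
int write_example_chunk(struct example_out *f, void *data)
{
	struct example_ctx *ctx = data;

	f->written += 4 * ctx->nr; /* pretend to write nr 4-byte entries */
	return 0;
}

/* The generic caller never needs to know the context type. */
int run_writer(example_write_fn fn, struct example_out *f, void *data)
{
	return fn(f, data);
}
```

The cost of this flexibility is that the compiler can no longer check
that the caller passes the context type the callback expects.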

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index f3bde2ad95a1..fae7d1b63931 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1040,8 +1040,9 @@ struct write_commit_graph_context {
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
-				    struct write_commit_graph_context *ctx)
+				    void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	int i, count = 0;
 	struct commit **list = ctx->commits.list;
 
@@ -1066,8 +1067,9 @@ static int write_graph_chunk_fanout(struct hashfile *f,
 }
 
 static int write_graph_chunk_oids(struct hashfile *f,
-				  struct write_commit_graph_context *ctx)
+				  void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	struct commit **list = ctx->commits.list;
 	int count;
 	for (count = 0; count < ctx->commits.nr; count++, list++) {
@@ -1085,8 +1087,9 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
 }
 
 static int write_graph_chunk_data(struct hashfile *f,
-				  struct write_commit_graph_context *ctx)
+				  void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	uint32_t num_extra_edges = 0;
@@ -1187,8 +1190,9 @@ static int write_graph_chunk_data(struct hashfile *f,
 }
 
 static int write_graph_chunk_generation_data(struct hashfile *f,
-					      struct write_commit_graph_context *ctx)
+					     void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	int i, num_generation_data_overflows = 0;
 
 	for (i = 0; i < ctx->commits.nr; i++) {
@@ -1208,8 +1212,9 @@ static int write_graph_chunk_generation_data(struct hashfile *f,
 }
 
 static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
-						       struct write_commit_graph_context *ctx)
+						      void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	int i;
 	for (i = 0; i < ctx->commits.nr; i++) {
 		struct commit *c = ctx->commits.list[i];
@@ -1226,8 +1231,9 @@ static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
 }
 
 static int write_graph_chunk_extra_edges(struct hashfile *f,
-					 struct write_commit_graph_context *ctx)
+					 void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	struct commit_list *parent;
@@ -1280,8 +1286,9 @@ static int write_graph_chunk_extra_edges(struct hashfile *f,
 }
 
 static int write_graph_chunk_bloom_indexes(struct hashfile *f,
-					   struct write_commit_graph_context *ctx)
+					   void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	uint32_t cur_pos = 0;
@@ -1315,8 +1322,9 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
 }
 
 static int write_graph_chunk_bloom_data(struct hashfile *f,
-					struct write_commit_graph_context *ctx)
+					void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 
@@ -1737,8 +1745,9 @@ static int write_graph_chunk_base_1(struct hashfile *f,
 }
 
 static int write_graph_chunk_base(struct hashfile *f,
-				  struct write_commit_graph_context *ctx)
+				    void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	int num = write_graph_chunk_base_1(f, ctx->new_base_graph);
 
 	if (num != ctx->num_commit_graphs_after - 1) {
@@ -1750,7 +1759,7 @@ static int write_graph_chunk_base(struct hashfile *f,
 }
 
 typedef int (*chunk_write_fn)(struct hashfile *f,
-			      struct write_commit_graph_context *ctx);
+			      void *data);
 
 struct chunk_info {
 	uint32_t id;
-- 
gitgitgadget



* [PATCH v3 02/17] chunk-format: create chunk format write API
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
@ 2021-02-05 14:30     ` Derrick Stolee via GitGitGadget
  2021-02-07 21:13       ` SZEDER Gábor
  2021-02-05 14:30     ` [PATCH v3 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
                       ` (15 subsequent siblings)
  17 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In anticipation of combining the logic from the commit-graph and
multi-pack-index file formats, create a new chunk-format API. Use a
'struct chunkfile' pointer to keep track of data that has been
registered for writes. This struct is anonymous outside of
chunk-format.c to ensure no user attempts to interfere with the data.

The next change will use this API in commit-graph.c, but the general
approach is:

 1. initialize the chunkfile with init_chunkfile(f).
 2. add chunks in the intended writing order with add_chunk().
 3. write any header information to the hashfile f.
 4. write the chunkfile data using write_chunkfile().
 5. free the chunkfile struct using free_chunkfile().
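
The table of contents emitted in step 4 can be modelled on its own. The
sketch below mirrors the offset arithmetic of write_chunkfile() in the
patch that follows (12 bytes per row, one terminating row with ID 0),
but toy_plan_toc() and struct toy_row are hypothetical names, and
nothing is written to a file here.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_LOOKUP_WIDTH 12 /* 4-byte ID + 8-byte offset per row */

struct toy_row {
	uint32_t id;
	uint64_t offset;
};

/*
 * Fill rows[0..nr] (nr chunk rows plus one terminating row), given the
 * number of bytes already written before the table of contents.
 * Returns the offset at which the last chunk is expected to end.
 */
uint64_t toy_plan_toc(uint64_t header_len,
		      const uint32_t *ids, const uint64_t *sizes,
		      size_t nr, struct toy_row *rows)
{
	/* The first chunk starts right after the table of contents. */
	uint64_t cur = header_len + (uint64_t)(nr + 1) * TOY_LOOKUP_WIDTH;
	size_t i;

	for (i = 0; i < nr; i++) {
		rows[i].id = ids[i];
		rows[i].offset = cur;
		cur += sizes[i];
	}

	/* Terminating row: ID 0 and the offset where the last chunk ends. */
	rows[nr].id = 0;
	rows[nr].offset = cur;
	return cur;
}
```

Because the offsets are computed from the sizes declared up front, the
writer has to verify afterwards that each chunk wrote exactly the
number of bytes it announced.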

Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile       |  1 +
 chunk-format.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++
 chunk-format.h | 19 +++++++++++
 3 files changed, 111 insertions(+)
 create mode 100644 chunk-format.c
 create mode 100644 chunk-format.h

diff --git a/Makefile b/Makefile
index 7b64106930a6..50a7663841e9 100644
--- a/Makefile
+++ b/Makefile
@@ -854,6 +854,7 @@ LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o
 LIB_OBJS += chdir-notify.o
 LIB_OBJS += checkout.o
+LIB_OBJS += chunk-format.o
 LIB_OBJS += color.o
 LIB_OBJS += column.o
 LIB_OBJS += combine-diff.o
diff --git a/chunk-format.c b/chunk-format.c
new file mode 100644
index 000000000000..6e0f1900213e
--- /dev/null
+++ b/chunk-format.c
@@ -0,0 +1,91 @@
+#include "cache.h"
+#include "chunk-format.h"
+#include "csum-file.h"
+#define CHUNK_LOOKUP_WIDTH 12
+
+/*
+ * When writing a chunk-based file format, collect the chunks in
+ * an array of chunk_info structs. The size stores the _expected_
+ * amount of data that will be written by write_fn.
+ */
+struct chunk_info {
+	uint32_t id;
+	uint64_t size;
+	chunk_write_fn write_fn;
+};
+
+struct chunkfile {
+	struct hashfile *f;
+
+	struct chunk_info *chunks;
+	size_t chunks_nr;
+	size_t chunks_alloc;
+};
+
+struct chunkfile *init_chunkfile(struct hashfile *f)
+{
+	struct chunkfile *cf = xcalloc(1, sizeof(*cf));
+	cf->f = f;
+	return cf;
+}
+
+void free_chunkfile(struct chunkfile *cf)
+{
+	if (!cf)
+		return;
+	free(cf->chunks);
+	free(cf);
+}
+
+int get_num_chunks(struct chunkfile *cf)
+{
+	return cf->chunks_nr;
+}
+
+void add_chunk(struct chunkfile *cf,
+	       uint32_t id,
+	       size_t size,
+	       chunk_write_fn fn)
+{
+	ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc);
+
+	cf->chunks[cf->chunks_nr].id = id;
+	cf->chunks[cf->chunks_nr].write_fn = fn;
+	cf->chunks[cf->chunks_nr].size = size;
+	cf->chunks_nr++;
+}
+
+int write_chunkfile(struct chunkfile *cf, void *data)
+{
+	int i;
+	uint64_t cur_offset = hashfile_total(cf->f);
+
+	/* Add the table of contents to the current offset */
+	cur_offset += (cf->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH;
+
+	for (i = 0; i < cf->chunks_nr; i++) {
+		hashwrite_be32(cf->f, cf->chunks[i].id);
+		hashwrite_be64(cf->f, cur_offset);
+
+		cur_offset += cf->chunks[i].size;
+	}
+
+	/* Trailing entry marks the end of the chunks */
+	hashwrite_be32(cf->f, 0);
+	hashwrite_be64(cf->f, cur_offset);
+
+	for (i = 0; i < cf->chunks_nr; i++) {
+		off_t start_offset = hashfile_total(cf->f);
+		int result = cf->chunks[i].write_fn(cf->f, data);
+
+		if (result)
+			return result;
+
+		if (hashfile_total(cf->f) - start_offset != cf->chunks[i].size)
+			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
+			    cf->chunks[i].size, cf->chunks[i].id,
+			    hashfile_total(cf->f) - start_offset);
+	}
+
+	return 0;
+}
diff --git a/chunk-format.h b/chunk-format.h
new file mode 100644
index 000000000000..9a1d770accec
--- /dev/null
+++ b/chunk-format.h
@@ -0,0 +1,19 @@
+#ifndef CHUNK_FORMAT_H
+#define CHUNK_FORMAT_H
+
+#include "git-compat-util.h"
+
+struct hashfile;
+struct chunkfile;
+
+struct chunkfile *init_chunkfile(struct hashfile *f);
+void free_chunkfile(struct chunkfile *cf);
+int get_num_chunks(struct chunkfile *cf);
+typedef int (*chunk_write_fn)(struct hashfile *f, void *data);
+void add_chunk(struct chunkfile *cf,
+	       uint32_t id,
+	       size_t size,
+	       chunk_write_fn fn);
+int write_chunkfile(struct chunkfile *cf, void *data);
+
+#endif
-- 
gitgitgadget


* [PATCH v3 03/17] commit-graph: use chunk-format write API
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
@ 2021-02-05 14:30     ` Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
                       ` (14 subsequent siblings)
  17 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The commit-graph write logic is ready to make use of the chunk-format
write API. Each chunk write method already has the correct prototype,
so we only need to use the 'struct chunkfile' pointer and the correct
API calls.
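
One detail this conversion relies on is that the hand-rolled
table-of-contents loop removed below (three htonl() words per entry) and
the hashwrite_be32()/hashwrite_be64() pair in write_chunkfile() produce
the same 12 bytes. A minimal standalone check of that equivalence,
using hypothetical toy_* helpers rather than git's own byte-order
functions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store v at p in big-endian (network) byte order. */
static void toy_put_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/* Same idea for 64 bits, written byte by byte from the low end. */
static void toy_put_be64(unsigned char *p, uint64_t v)
{
	int i;

	for (i = 7; i >= 0; i--) {
		p[i] = (unsigned char)(v & 0xff);
		v >>= 8;
	}
}

/* Old style: id, high half of the offset, low half of the offset. */
static void toy_toc_entry_old(unsigned char out[12], uint32_t id,
			      uint64_t offset)
{
	toy_put_be32(out, id);
	toy_put_be32(out + 4, (uint32_t)(offset >> 32));
	toy_put_be32(out + 8, (uint32_t)(offset & 0xffffffff));
}

/* New style: a 32-bit id followed by a single 64-bit offset. */
static void toy_toc_entry_new(unsigned char out[12], uint32_t id,
			      uint64_t offset)
{
	toy_put_be32(out, id);
	toy_put_be64(out + 4, offset);
}
```

Both helpers fill the same 12-byte buffer for any id and offset, which
is why the on-disk format does not change with this patch.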

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 118 ++++++++++++++++---------------------------------
 1 file changed, 37 insertions(+), 81 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index fae7d1b63931..7c607d23b29f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -19,6 +19,7 @@
 #include "shallow.h"
 #include "json-writer.h"
 #include "trace2.h"
+#include "chunk-format.h"
 
 void git_test_write_commit_graph_or_die(void)
 {
@@ -1758,27 +1759,17 @@ static int write_graph_chunk_base(struct hashfile *f,
 	return 0;
 }
 
-typedef int (*chunk_write_fn)(struct hashfile *f,
-			      void *data);
-
-struct chunk_info {
-	uint32_t id;
-	uint64_t size;
-	chunk_write_fn write_fn;
-};
-
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
 	int fd;
 	struct hashfile *f;
 	struct lock_file lk = LOCK_INIT;
-	struct chunk_info chunks[MAX_NUM_CHUNKS + 1];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = 3;
-	uint64_t chunk_offset;
 	struct object_id file_hash;
+	struct chunkfile *cf;
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
@@ -1824,76 +1815,50 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	}
 
-	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
-	chunks[0].size = GRAPH_FANOUT_SIZE;
-	chunks[0].write_fn = write_graph_chunk_fanout;
-	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
-	chunks[1].size = hashsz * ctx->commits.nr;
-	chunks[1].write_fn = write_graph_chunk_oids;
-	chunks[2].id = GRAPH_CHUNKID_DATA;
-	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
-	chunks[2].write_fn = write_graph_chunk_data;
+	cf = init_chunkfile(f);
+
+	add_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, GRAPH_FANOUT_SIZE,
+		  write_graph_chunk_fanout);
+	add_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, hashsz * ctx->commits.nr,
+		  write_graph_chunk_oids);
+	add_chunk(cf, GRAPH_CHUNKID_DATA, (hashsz + 16) * ctx->commits.nr,
+		  write_graph_chunk_data);
 
 	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
 		ctx->write_generation_data = 0;
-	if (ctx->write_generation_data) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
-		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
-		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
-		num_chunks++;
-	}
-	if (ctx->num_generation_data_overflows) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW;
-		chunks[num_chunks].size = sizeof(timestamp_t) * ctx->num_generation_data_overflows;
-		chunks[num_chunks].write_fn = write_graph_chunk_generation_data_overflow;
-		num_chunks++;
-	}
-	if (ctx->num_extra_edges) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
-		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
-		chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
-		num_chunks++;
-	}
+	if (ctx->write_generation_data)
+		add_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
+			  sizeof(uint32_t) * ctx->commits.nr,
+			  write_graph_chunk_generation_data);
+	if (ctx->num_generation_data_overflows)
+		add_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
+			  sizeof(timestamp_t) * ctx->num_generation_data_overflows,
+			  write_graph_chunk_generation_data_overflow);
+	if (ctx->num_extra_edges)
+		add_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES,
+			  4 * ctx->num_extra_edges,
+			  write_graph_chunk_extra_edges);
 	if (ctx->changed_paths) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
-		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
-		chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
-		num_chunks++;
-		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
-		chunks[num_chunks].size = sizeof(uint32_t) * 3
-					  + ctx->total_bloom_filter_data_size;
-		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
-		num_chunks++;
-	}
-	if (ctx->num_commit_graphs_after > 1) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
-		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
-		chunks[num_chunks].write_fn = write_graph_chunk_base;
-		num_chunks++;
-	}
-
-	chunks[num_chunks].id = 0;
-	chunks[num_chunks].size = 0;
+		add_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
+			  sizeof(uint32_t) * ctx->commits.nr,
+			  write_graph_chunk_bloom_indexes);
+		add_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
+			  sizeof(uint32_t) * 3
+				+ ctx->total_bloom_filter_data_size,
+			  write_graph_chunk_bloom_data);
+	}
+	if (ctx->num_commit_graphs_after > 1)
+		add_chunk(cf, GRAPH_CHUNKID_BASE,
+			  hashsz * (ctx->num_commit_graphs_after - 1),
+			  write_graph_chunk_base);
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
 	hashwrite_u8(f, oid_version());
-	hashwrite_u8(f, num_chunks);
+	hashwrite_u8(f, get_num_chunks(cf));
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
-	chunk_offset = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
-	for (i = 0; i <= num_chunks; i++) {
-		uint32_t chunk_write[3];
-
-		chunk_write[0] = htonl(chunks[i].id);
-		chunk_write[1] = htonl(chunk_offset >> 32);
-		chunk_write[2] = htonl(chunk_offset & 0xffffffff);
-		hashwrite(f, chunk_write, 12);
-
-		chunk_offset += chunks[i].size;
-	}
-
 	if (ctx->report_progress) {
 		strbuf_addf(&progress_title,
 			    Q_("Writing out commit graph in %d pass",
@@ -1905,17 +1870,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			num_chunks * ctx->commits.nr);
 	}
 
-	for (i = 0; i < num_chunks; i++) {
-		uint64_t start_offset = f->total + f->offset;
-
-		if (chunks[i].write_fn(f, ctx))
-			return -1;
-
-		if (f->total + f->offset != start_offset + chunks[i].size)
-			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
-			    chunks[i].size, chunks[i].id,
-			    f->total + f->offset - start_offset);
-	}
+	write_chunkfile(cf, ctx);
 
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
@@ -1932,6 +1887,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 	close_commit_graph(ctx->r->objects);
 	finalize_hashfile(f, file_hash.hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
+	free_chunkfile(cf);
 
 	if (ctx->split) {
 		FILE *chainf = fdopen_lock_file(&lk, "w");
-- 
gitgitgadget


* [PATCH v3 04/17] midx: rename pack_info to write_midx_context
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (2 preceding siblings ...)
  2021-02-05 14:30     ` [PATCH v3 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
@ 2021-02-05 14:30     ` Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
                       ` (13 subsequent siblings)
  17 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to streamline our chunk-based file formats, align some of
the code structure in write_midx_internal() to be similar to the
patterns in write_commit_graph_file().

Specifically, let's create a "struct write_midx_context" that can be
used as a data parameter to abstract function types.

This change only renames "struct pack_list" to "struct
write_midx_context" and renames its instances from "packs" to "ctx". In
future changes, we will expand the data inside "struct
write_midx_context" and align our chunk-writing method with the
chunk-format API.
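
The "data parameter" pattern mentioned above can be sketched as
follows. This is a simplified illustration: 'struct toy_ctx' and the
toy_* functions are hypothetical and carry far less state than the
real context or for_each_file_in_pack_dir().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down context; the real struct has more fields. */
struct toy_ctx {
	const char *names[4];
	size_t nr;
	unsigned checked;
};

typedef void (*toy_each_file_fn)(const char *file_name, void *data);

static int toy_ends_with(const char *s, const char *suffix)
{
	size_t n = strlen(s), m = strlen(suffix);

	return n >= m && !strcmp(s + n - m, suffix);
}

/* Callback: recover the typed context from the untyped pointer. */
static void toy_add_pack(const char *file_name, void *data)
{
	struct toy_ctx *ctx = data;

	if (!toy_ends_with(file_name, ".idx"))
		return;
	ctx->checked++;
	if (ctx->nr < 4)
		ctx->names[ctx->nr++] = file_name;
}

/* Generic iterator that knows nothing about struct toy_ctx. */
static void toy_for_each_file(const char **files, size_t nr,
			      toy_each_file_fn fn, void *data)
{
	size_t i;

	for (i = 0; i < nr; i++)
		fn(files[i], data);
}
```

Because the iterator only sees "void *data", the context can grow new
fields in later patches without touching any function signature.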

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 130 ++++++++++++++++++++++++++++-----------------------------
 1 file changed, 65 insertions(+), 65 deletions(-)

diff --git a/midx.c b/midx.c
index 79c282b070d2..561f65a63a5b 100644
--- a/midx.c
+++ b/midx.c
@@ -451,7 +451,7 @@ static int pack_info_compare(const void *_a, const void *_b)
 	return strcmp(a->pack_name, b->pack_name);
 }
 
-struct pack_list {
+struct write_midx_context {
 	struct pack_info *info;
 	uint32_t nr;
 	uint32_t alloc;
@@ -463,37 +463,37 @@ struct pack_list {
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 			     const char *file_name, void *data)
 {
-	struct pack_list *packs = (struct pack_list *)data;
+	struct write_midx_context *ctx = data;
 
 	if (ends_with(file_name, ".idx")) {
-		display_progress(packs->progress, ++packs->pack_paths_checked);
-		if (packs->m && midx_contains_pack(packs->m, file_name))
+		display_progress(ctx->progress, ++ctx->pack_paths_checked);
+		if (ctx->m && midx_contains_pack(ctx->m, file_name))
 			return;
 
-		ALLOC_GROW(packs->info, packs->nr + 1, packs->alloc);
+		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
 
-		packs->info[packs->nr].p = add_packed_git(full_path,
-							  full_path_len,
-							  0);
+		ctx->info[ctx->nr].p = add_packed_git(full_path,
+						      full_path_len,
+						      0);
 
-		if (!packs->info[packs->nr].p) {
+		if (!ctx->info[ctx->nr].p) {
 			warning(_("failed to add packfile '%s'"),
 				full_path);
 			return;
 		}
 
-		if (open_pack_index(packs->info[packs->nr].p)) {
+		if (open_pack_index(ctx->info[ctx->nr].p)) {
 			warning(_("failed to open pack-index '%s'"),
 				full_path);
-			close_pack(packs->info[packs->nr].p);
-			FREE_AND_NULL(packs->info[packs->nr].p);
+			close_pack(ctx->info[ctx->nr].p);
+			FREE_AND_NULL(ctx->info[ctx->nr].p);
 			return;
 		}
 
-		packs->info[packs->nr].pack_name = xstrdup(file_name);
-		packs->info[packs->nr].orig_pack_int_id = packs->nr;
-		packs->info[packs->nr].expired = 0;
-		packs->nr++;
+		ctx->info[ctx->nr].pack_name = xstrdup(file_name);
+		ctx->info[ctx->nr].orig_pack_int_id = ctx->nr;
+		ctx->info[ctx->nr].expired = 0;
+		ctx->nr++;
 	}
 }
 
@@ -801,7 +801,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint32_t i;
 	struct hashfile *f = NULL;
 	struct lock_file lk;
-	struct pack_list packs;
+	struct write_midx_context ctx = { 0 };
 	uint32_t *pack_perm = NULL;
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
@@ -820,40 +820,40 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 			  midx_name);
 
 	if (m)
-		packs.m = m;
+		ctx.m = m;
 	else
-		packs.m = load_multi_pack_index(object_dir, 1);
-
-	packs.nr = 0;
-	packs.alloc = packs.m ? packs.m->num_packs : 16;
-	packs.info = NULL;
-	ALLOC_ARRAY(packs.info, packs.alloc);
-
-	if (packs.m) {
-		for (i = 0; i < packs.m->num_packs; i++) {
-			ALLOC_GROW(packs.info, packs.nr + 1, packs.alloc);
-
-			packs.info[packs.nr].orig_pack_int_id = i;
-			packs.info[packs.nr].pack_name = xstrdup(packs.m->pack_names[i]);
-			packs.info[packs.nr].p = NULL;
-			packs.info[packs.nr].expired = 0;
-			packs.nr++;
+		ctx.m = load_multi_pack_index(object_dir, 1);
+
+	ctx.nr = 0;
+	ctx.alloc = ctx.m ? ctx.m->num_packs : 16;
+	ctx.info = NULL;
+	ALLOC_ARRAY(ctx.info, ctx.alloc);
+
+	if (ctx.m) {
+		for (i = 0; i < ctx.m->num_packs; i++) {
+			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
+
+			ctx.info[ctx.nr].orig_pack_int_id = i;
+			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
+			ctx.info[ctx.nr].p = NULL;
+			ctx.info[ctx.nr].expired = 0;
+			ctx.nr++;
 		}
 	}
 
-	packs.pack_paths_checked = 0;
+	ctx.pack_paths_checked = 0;
 	if (flags & MIDX_PROGRESS)
-		packs.progress = start_delayed_progress(_("Adding packfiles to multi-pack-index"), 0);
+		ctx.progress = start_delayed_progress(_("Adding packfiles to multi-pack-index"), 0);
 	else
-		packs.progress = NULL;
+		ctx.progress = NULL;
 
-	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
-	stop_progress(&packs.progress);
+	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
+	stop_progress(&ctx.progress);
 
-	if (packs.m && packs.nr == packs.m->num_packs && !packs_to_drop)
+	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop)
 		goto cleanup;
 
-	entries = get_sorted_entries(packs.m, packs.info, packs.nr, &nr_entries);
+	entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &nr_entries);
 
 	for (i = 0; i < nr_entries; i++) {
 		if (entries[i].offset > 0x7fffffff)
@@ -862,19 +862,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 			large_offsets_needed = 1;
 	}
 
-	QSORT(packs.info, packs.nr, pack_info_compare);
+	QSORT(ctx.info, ctx.nr, pack_info_compare);
 
 	if (packs_to_drop && packs_to_drop->nr) {
 		int drop_index = 0;
 		int missing_drops = 0;
 
-		for (i = 0; i < packs.nr && drop_index < packs_to_drop->nr; i++) {
-			int cmp = strcmp(packs.info[i].pack_name,
+		for (i = 0; i < ctx.nr && drop_index < packs_to_drop->nr; i++) {
+			int cmp = strcmp(ctx.info[i].pack_name,
 					 packs_to_drop->items[drop_index].string);
 
 			if (!cmp) {
 				drop_index++;
-				packs.info[i].expired = 1;
+				ctx.info[i].expired = 1;
 			} else if (cmp > 0) {
 				error(_("did not see pack-file %s to drop"),
 				      packs_to_drop->items[drop_index].string);
@@ -882,7 +882,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				missing_drops++;
 				i--;
 			} else {
-				packs.info[i].expired = 0;
+				ctx.info[i].expired = 0;
 			}
 		}
 
@@ -898,19 +898,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	 *
 	 * pack_perm[old_id] = new_id
 	 */
-	ALLOC_ARRAY(pack_perm, packs.nr);
-	for (i = 0; i < packs.nr; i++) {
-		if (packs.info[i].expired) {
+	ALLOC_ARRAY(pack_perm, ctx.nr);
+	for (i = 0; i < ctx.nr; i++) {
+		if (ctx.info[i].expired) {
 			dropped_packs++;
-			pack_perm[packs.info[i].orig_pack_int_id] = PACK_EXPIRED;
+			pack_perm[ctx.info[i].orig_pack_int_id] = PACK_EXPIRED;
 		} else {
-			pack_perm[packs.info[i].orig_pack_int_id] = i - dropped_packs;
+			pack_perm[ctx.info[i].orig_pack_int_id] = i - dropped_packs;
 		}
 	}
 
-	for (i = 0; i < packs.nr; i++) {
-		if (!packs.info[i].expired)
-			pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
+	for (i = 0; i < ctx.nr; i++) {
+		if (!ctx.info[i].expired)
+			pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
 	}
 
 	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
@@ -921,19 +921,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
-	if (packs.m)
-		close_midx(packs.m);
+	if (ctx.m)
+		close_midx(ctx.m);
 
 	cur_chunk = 0;
 	num_chunks = large_offsets_needed ? 5 : 4;
 
-	if (packs.nr - dropped_packs == 0) {
+	if (ctx.nr - dropped_packs == 0) {
 		error(_("no pack files to index."));
 		result = 1;
 		goto cleanup;
 	}
 
-	written = write_midx_header(f, num_chunks, packs.nr - dropped_packs);
+	written = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
 
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
 	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
@@ -990,7 +990,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, packs.info, packs.nr);
+				written += write_midx_pack_names(f, ctx.info, ctx.nr);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
@@ -1027,15 +1027,15 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	commit_lock_file(&lk);
 
 cleanup:
-	for (i = 0; i < packs.nr; i++) {
-		if (packs.info[i].p) {
-			close_pack(packs.info[i].p);
-			free(packs.info[i].p);
+	for (i = 0; i < ctx.nr; i++) {
+		if (ctx.info[i].p) {
+			close_pack(ctx.info[i].p);
+			free(ctx.info[i].p);
 		}
-		free(packs.info[i].pack_name);
+		free(ctx.info[i].pack_name);
 	}
 
-	free(packs.info);
+	free(ctx.info);
 	free(entries);
 	free(pack_perm);
 	free(midx_name);
-- 
gitgitgadget


* [PATCH v3 05/17] midx: use context in write_midx_pack_names()
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (3 preceding siblings ...)
  2021-02-05 14:30     ` [PATCH v3 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
@ 2021-02-05 14:30     ` Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
                       ` (12 subsequent siblings)
  17 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API,
start converting chunk writing methods to match chunk_write_fn. The
first case is to convert write_midx_pack_names() to take "void *data".
We already have the necessary data in "struct write_midx_context", so
this conversion is rather mechanical.
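
As a rough sketch of what the converted method does, the following
standalone function copies each non-expired pack name (with its NUL)
into a buffer and pads the result to the chunk alignment. The toy_*
names are hypothetical, the alignment of 4 is an assumption about
MIDX_CHUNK_ALIGNMENT, and the sort-order BUG() check is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_CHUNK_ALIGNMENT 4 /* assumed value of MIDX_CHUNK_ALIGNMENT */

struct toy_pack {
	const char *pack_name;
	unsigned expired:1;
};

struct toy_names_ctx {
	struct toy_pack *info;
	uint32_t nr;
};

/*
 * Append each live pack name, including its NUL, to buf and then
 * zero-pad to the alignment. Returns the number of bytes produced.
 * The caller must provide a large enough buffer.
 */
static size_t toy_write_pack_names(unsigned char *buf, void *data)
{
	struct toy_names_ctx *ctx = data;
	size_t written = 0;
	uint32_t i;

	for (i = 0; i < ctx->nr; i++) {
		size_t writelen;

		if (ctx->info[i].expired)
			continue;
		writelen = strlen(ctx->info[i].pack_name) + 1;
		memcpy(buf + written, ctx->info[i].pack_name, writelen);
		written += writelen;
	}
	while (written % TOY_CHUNK_ALIGNMENT)
		buf[written++] = 0;
	return written;
}
```

The only inputs are the output target and the context pointer, which is
the shape chunk_write_fn expects.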

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/midx.c b/midx.c
index 561f65a63a5b..88452b044337 100644
--- a/midx.c
+++ b/midx.c
@@ -643,27 +643,26 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 	return deduplicated_entries;
 }
 
-static size_t write_midx_pack_names(struct hashfile *f,
-				    struct pack_info *info,
-				    uint32_t num_packs)
+static size_t write_midx_pack_names(struct hashfile *f, void *data)
 {
+	struct write_midx_context *ctx = data;
 	uint32_t i;
 	unsigned char padding[MIDX_CHUNK_ALIGNMENT];
 	size_t written = 0;
 
-	for (i = 0; i < num_packs; i++) {
+	for (i = 0; i < ctx->nr; i++) {
 		size_t writelen;
 
-		if (info[i].expired)
+		if (ctx->info[i].expired)
 			continue;
 
-		if (i && strcmp(info[i].pack_name, info[i - 1].pack_name) <= 0)
+		if (i && strcmp(ctx->info[i].pack_name, ctx->info[i - 1].pack_name) <= 0)
 			BUG("incorrect pack-file order: %s before %s",
-			    info[i - 1].pack_name,
-			    info[i].pack_name);
+			    ctx->info[i - 1].pack_name,
+			    ctx->info[i].pack_name);
 
-		writelen = strlen(info[i].pack_name) + 1;
-		hashwrite(f, info[i].pack_name, writelen);
+		writelen = strlen(ctx->info[i].pack_name) + 1;
+		hashwrite(f, ctx->info[i].pack_name, writelen);
 		written += writelen;
 	}
 
@@ -990,7 +989,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, ctx.info, ctx.nr);
+				written += write_midx_pack_names(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
-- 
gitgitgadget


* [PATCH v3 06/17] midx: add entries to write_midx_context
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (4 preceding siblings ...)
  2021-02-05 14:30     ` [PATCH v3 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
@ 2021-02-05 14:30     ` Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
                       ` (11 subsequent siblings)
  17 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "struct pack_midx_entry *entries" list and its count
into the context.

Update write_midx_oid_fanout() and write_midx_oid_lookup() to take the
context directly, as these are easy conversions with this new data.

Only the callers of write_midx_object_offsets() and
write_midx_large_offsets() are updated here, since those methods need
additional data in the context before they can match chunk_write_fn.
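
For readers unfamiliar with the fanout chunk, here is a minimal sketch
of the computation once the entries live in the context. It assumes the
usual 256-entry cumulative layout and entries sorted by OID; the toy_*
types are hypothetical, and the result goes into an array where the
real method writes to the hashfile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_entry {
	unsigned char oid[20];
};

struct toy_fanout_ctx {
	struct toy_entry *entries;
	uint32_t entries_nr;
};

/*
 * Set fanout[b] to the number of entries whose first OID byte is <= b.
 * The entries must already be sorted by OID.
 */
static void toy_oid_fanout(uint32_t fanout[256], void *data)
{
	struct toy_fanout_ctx *ctx = data;
	struct toy_entry *list = ctx->entries;
	struct toy_entry *last = ctx->entries + ctx->entries_nr;
	uint32_t count = 0;
	uint32_t i;

	for (i = 0; i < 256; i++) {
		/* Consume every entry that starts with byte value i. */
		while (list < last && list->oid[0] == i) {
			count++;
			list++;
		}
		fanout[i] = count;
	}
}
```

The list and its length both come from the context, so no extra
parameters are needed beyond "void *data".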

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 49 ++++++++++++++++++++++++++-----------------------
 1 file changed, 26 insertions(+), 23 deletions(-)

diff --git a/midx.c b/midx.c
index 88452b044337..4520ef82b91b 100644
--- a/midx.c
+++ b/midx.c
@@ -458,6 +458,9 @@ struct write_midx_context {
 	struct multi_pack_index *m;
 	struct progress *progress;
 	unsigned pack_paths_checked;
+
+	struct pack_midx_entry *entries;
+	uint32_t entries_nr;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -678,11 +681,11 @@ static size_t write_midx_pack_names(struct hashfile *f, void *data)
 }
 
 static size_t write_midx_oid_fanout(struct hashfile *f,
-				    struct pack_midx_entry *objects,
-				    uint32_t nr_objects)
+				    void *data)
 {
-	struct pack_midx_entry *list = objects;
-	struct pack_midx_entry *last = objects + nr_objects;
+	struct write_midx_context *ctx = data;
+	struct pack_midx_entry *list = ctx->entries;
+	struct pack_midx_entry *last = ctx->entries + ctx->entries_nr;
 	uint32_t count = 0;
 	uint32_t i;
 
@@ -706,18 +709,19 @@ static size_t write_midx_oid_fanout(struct hashfile *f,
 	return MIDX_CHUNK_FANOUT_SIZE;
 }
 
-static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
-				    struct pack_midx_entry *objects,
-				    uint32_t nr_objects)
+static size_t write_midx_oid_lookup(struct hashfile *f,
+				    void *data)
 {
-	struct pack_midx_entry *list = objects;
+	struct write_midx_context *ctx = data;
+	unsigned char hash_len = the_hash_algo->rawsz;
+	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i;
 	size_t written = 0;
 
-	for (i = 0; i < nr_objects; i++) {
+	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
 
-		if (i < nr_objects - 1) {
+		if (i < ctx->entries_nr - 1) {
 			struct pack_midx_entry *next = list;
 			if (oidcmp(&obj->oid, &next->oid) >= 0)
 				BUG("OIDs not in order: %s >= %s",
@@ -805,8 +809,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
-	uint32_t nr_entries, num_large_offsets = 0;
-	struct pack_midx_entry *entries = NULL;
+	uint32_t num_large_offsets = 0;
 	struct progress *progress = NULL;
 	int large_offsets_needed = 0;
 	int pack_name_concat_len = 0;
@@ -852,12 +855,12 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop)
 		goto cleanup;
 
-	entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &nr_entries);
+	ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr);
 
-	for (i = 0; i < nr_entries; i++) {
-		if (entries[i].offset > 0x7fffffff)
+	for (i = 0; i < ctx.entries_nr; i++) {
+		if (ctx.entries[i].offset > 0x7fffffff)
 			num_large_offsets++;
-		if (entries[i].offset > 0xffffffff)
+		if (ctx.entries[i].offset > 0xffffffff)
 			large_offsets_needed = 1;
 	}
 
@@ -947,10 +950,10 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OBJECTOFFSETS;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * the_hash_algo->rawsz;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * the_hash_algo->rawsz;
 
 	cur_chunk++;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_CHUNK_OFFSET_WIDTH;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
 	if (large_offsets_needed) {
 		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
 
@@ -993,19 +996,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
-				written += write_midx_oid_fanout(f, entries, nr_entries);
+				written += write_midx_oid_fanout(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDLOOKUP:
-				written += write_midx_oid_lookup(f, the_hash_algo->rawsz, entries, nr_entries);
+				written += write_midx_oid_lookup(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, entries, nr_entries);
+				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, ctx.entries, ctx.entries_nr);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
-				written += write_midx_large_offsets(f, num_large_offsets, entries, nr_entries);
+				written += write_midx_large_offsets(f, num_large_offsets, ctx.entries, ctx.entries_nr);
 				break;
 
 			default:
@@ -1035,7 +1038,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	}
 
 	free(ctx.info);
-	free(entries);
+	free(ctx.entries);
 	free(pack_perm);
 	free(midx_name);
 	return result;
-- 
gitgitgadget


* [PATCH v3 07/17] midx: add pack_perm to write_midx_context
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (5 preceding siblings ...)
  2021-02-05 14:30     ` [PATCH v3 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
@ 2021-02-05 14:30     ` Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
                       ` (10 subsequent siblings)
  17 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "uint32_t *pack_perm" and large_offsets_needed bit
into the context.

Update write_midx_object_offsets() to match chunk_write_fn.
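
The role of the large_offsets_needed bit can be sketched in isolation.
The scan and the two branch conditions below follow the diff; the flag
value 0x80000000, the fallback of storing the low 32 bits, and all
toy_* names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_LARGE_OFFSET_NEEDED 0x80000000u /* assumed flag value */

struct toy_offsets_ctx {
	const uint64_t *offsets;
	uint32_t nr;
	uint32_t num_large_offsets;
	unsigned large_offsets_needed:1;
};

/* Mirrors the scan over the entries in write_midx_internal(). */
static void toy_scan_offsets(struct toy_offsets_ctx *ctx)
{
	uint32_t i;

	ctx->num_large_offsets = 0;
	ctx->large_offsets_needed = 0;
	for (i = 0; i < ctx->nr; i++) {
		if (ctx->offsets[i] > 0x7fffffff)
			ctx->num_large_offsets++;
		if (ctx->offsets[i] > 0xffffffff)
			ctx->large_offsets_needed = 1;
	}
}

/*
 * Compute the 32-bit value stored for entry i. *next_large is the
 * running index into the large-offset chunk. Returns 0 on success and
 * -1 where the real code would call BUG().
 */
static int toy_encode_offset(const struct toy_offsets_ctx *ctx, uint32_t i,
			     uint32_t *next_large, uint32_t *out)
{
	uint64_t offset = ctx->offsets[i];

	if (ctx->large_offsets_needed && offset >> 31)
		*out = TOY_LARGE_OFFSET_NEEDED | (*next_large)++;
	else if (!ctx->large_offsets_needed && offset >> 32)
		return -1;
	else
		*out = (uint32_t)offset;
	return 0;
}
```

Keeping the bit in the context lets the writer decide per entry without
a separate function argument.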

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 40 +++++++++++++++++++++-------------------
 1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/midx.c b/midx.c
index 4520ef82b91b..cd994e333ecb 100644
--- a/midx.c
+++ b/midx.c
@@ -461,6 +461,9 @@ struct write_midx_context {
 
 	struct pack_midx_entry *entries;
 	uint32_t entries_nr;
+
+	uint32_t *pack_perm;
+	unsigned large_offsets_needed:1;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -736,27 +739,27 @@ static size_t write_midx_oid_lookup(struct hashfile *f,
 	return written;
 }
 
-static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_needed,
-					uint32_t *perm,
-					struct pack_midx_entry *objects, uint32_t nr_objects)
+static size_t write_midx_object_offsets(struct hashfile *f,
+					void *data)
 {
-	struct pack_midx_entry *list = objects;
+	struct write_midx_context *ctx = data;
+	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i, nr_large_offset = 0;
 	size_t written = 0;
 
-	for (i = 0; i < nr_objects; i++) {
+	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
 
-		if (perm[obj->pack_int_id] == PACK_EXPIRED)
+		if (ctx->pack_perm[obj->pack_int_id] == PACK_EXPIRED)
 			BUG("object %s is in an expired pack with int-id %d",
 			    oid_to_hex(&obj->oid),
 			    obj->pack_int_id);
 
-		hashwrite_be32(f, perm[obj->pack_int_id]);
+		hashwrite_be32(f, ctx->pack_perm[obj->pack_int_id]);
 
-		if (large_offset_needed && obj->offset >> 31)
+		if (ctx->large_offsets_needed && obj->offset >> 31)
 			hashwrite_be32(f, MIDX_LARGE_OFFSET_NEEDED | nr_large_offset++);
-		else if (!large_offset_needed && obj->offset >> 32)
+		else if (!ctx->large_offsets_needed && obj->offset >> 32)
 			BUG("object %s requires a large offset (%"PRIx64") but the MIDX is not writing large offsets!",
 			    oid_to_hex(&obj->oid),
 			    obj->offset);
@@ -805,13 +808,11 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
-	uint32_t *pack_perm = NULL;
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 	uint32_t num_large_offsets = 0;
 	struct progress *progress = NULL;
-	int large_offsets_needed = 0;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
@@ -857,11 +858,12 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr);
 
+	ctx.large_offsets_needed = 0;
 	for (i = 0; i < ctx.entries_nr; i++) {
 		if (ctx.entries[i].offset > 0x7fffffff)
 			num_large_offsets++;
 		if (ctx.entries[i].offset > 0xffffffff)
-			large_offsets_needed = 1;
+			ctx.large_offsets_needed = 1;
 	}
 
 	QSORT(ctx.info, ctx.nr, pack_info_compare);
@@ -900,13 +902,13 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	 *
 	 * pack_perm[old_id] = new_id
 	 */
-	ALLOC_ARRAY(pack_perm, ctx.nr);
+	ALLOC_ARRAY(ctx.pack_perm, ctx.nr);
 	for (i = 0; i < ctx.nr; i++) {
 		if (ctx.info[i].expired) {
 			dropped_packs++;
-			pack_perm[ctx.info[i].orig_pack_int_id] = PACK_EXPIRED;
+			ctx.pack_perm[ctx.info[i].orig_pack_int_id] = PACK_EXPIRED;
 		} else {
-			pack_perm[ctx.info[i].orig_pack_int_id] = i - dropped_packs;
+			ctx.pack_perm[ctx.info[i].orig_pack_int_id] = i - dropped_packs;
 		}
 	}
 
@@ -927,7 +929,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		close_midx(ctx.m);
 
 	cur_chunk = 0;
-	num_chunks = large_offsets_needed ? 5 : 4;
+	num_chunks = ctx.large_offsets_needed ? 5 : 4;
 
 	if (ctx.nr - dropped_packs == 0) {
 		error(_("no pack files to index."));
@@ -954,7 +956,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	cur_chunk++;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
-	if (large_offsets_needed) {
+	if (ctx.large_offsets_needed) {
 		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
 
 		cur_chunk++;
@@ -1004,7 +1006,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, ctx.entries, ctx.entries_nr);
+				written += write_midx_object_offsets(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
@@ -1039,7 +1041,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	free(ctx.info);
 	free(ctx.entries);
-	free(pack_perm);
+	free(ctx.pack_perm);
 	free(midx_name);
 	return result;
 }
-- 
gitgitgadget



* [PATCH v3 08/17] midx: add num_large_offsets to write_midx_context
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (6 preceding siblings ...)
  2021-02-05 14:30     ` [PATCH v3 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
@ 2021-02-05 14:30     ` Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
                       ` (9 subsequent siblings)
  17 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "uint32_t num_large_offsets" into the context. With
this new data, write_midx_large_offsets() now matches the
chunk_write_fn type.
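
As an illustration only, the context-passing pattern can be sketched
outside of Git. Every name below (sketch_sink, sketch_ctx, and so on) is
hypothetical and merely stands in for 'struct hashfile' and
'struct write_midx_context'; the point is that the writer receives one
opaque pointer and recovers all of its inputs from it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for git's 'struct hashfile'; it only counts bytes. */
struct sketch_sink {
	uint64_t total;
};

struct sketch_entry {
	uint64_t offset;
};

/* Hypothetical context, loosely modeled on 'struct write_midx_context'. */
struct sketch_ctx {
	struct sketch_entry *entries;
	uint32_t entries_nr;
	uint32_t num_large_offsets;
};

/* Count the entries whose offset does not fit in 31 bits. */
static void sketch_count_large_offsets(struct sketch_ctx *ctx)
{
	uint32_t i;

	ctx->num_large_offsets = 0;
	for (i = 0; i < ctx->entries_nr; i++)
		if (ctx->entries[i].offset > 0x7fffffff)
			ctx->num_large_offsets++;
}

/* The writer sees only (sink, void *) and reads its inputs from the context. */
static size_t sketch_write_large_offsets(struct sketch_sink *f, void *data)
{
	struct sketch_ctx *ctx = data;
	struct sketch_entry *list = ctx->entries;
	struct sketch_entry *end = ctx->entries + ctx->entries_nr;
	uint32_t remaining = ctx->num_large_offsets;
	size_t written = 0;

	while (remaining) {
		assert(list < end); /* the count must agree with the entries */
		if (!((list++)->offset >> 31))
			continue;
		f->total += sizeof(uint64_t); /* pretend to emit one 64-bit value */
		written += sizeof(uint64_t);
		remaining--;
	}
	return written;
}
```

With three entries at offsets 1, 0x80000000 and 0x1ffffffff, the counter
reports two large offsets and the writer accounts for 16 bytes.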

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/midx.c b/midx.c
index cd994e333ecb..5be081f229ad 100644
--- a/midx.c
+++ b/midx.c
@@ -464,6 +464,7 @@ struct write_midx_context {
 
 	uint32_t *pack_perm;
 	unsigned large_offsets_needed:1;
+	uint32_t num_large_offsets;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -772,11 +773,14 @@ static size_t write_midx_object_offsets(struct hashfile *f,
 	return written;
 }
 
-static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_offset,
-				       struct pack_midx_entry *objects, uint32_t nr_objects)
+static size_t write_midx_large_offsets(struct hashfile *f,
+				       void *data)
 {
-	struct pack_midx_entry *list = objects, *end = objects + nr_objects;
+	struct write_midx_context *ctx = data;
+	struct pack_midx_entry *list = ctx->entries;
+	struct pack_midx_entry *end = ctx->entries + ctx->entries_nr;
 	size_t written = 0;
+	uint32_t nr_large_offset = ctx->num_large_offsets;
 
 	while (nr_large_offset) {
 		struct pack_midx_entry *obj;
@@ -811,7 +815,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
-	uint32_t num_large_offsets = 0;
 	struct progress *progress = NULL;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
@@ -861,7 +864,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	ctx.large_offsets_needed = 0;
 	for (i = 0; i < ctx.entries_nr; i++) {
 		if (ctx.entries[i].offset > 0x7fffffff)
-			num_large_offsets++;
+			ctx.num_large_offsets++;
 		if (ctx.entries[i].offset > 0xffffffff)
 			ctx.large_offsets_needed = 1;
 	}
@@ -961,7 +964,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		cur_chunk++;
 		chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
-					   num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
+					   ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
 	}
 
 	chunk_ids[cur_chunk] = 0;
@@ -1010,7 +1013,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
-				written += write_midx_large_offsets(f, num_large_offsets, ctx.entries, ctx.entries_nr);
+				written += write_midx_large_offsets(f, &ctx);
 				break;
 
 			default:
-- 
gitgitgadget



* [PATCH v3 09/17] midx: return success/failure in chunk write methods
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (7 preceding siblings ...)
  2021-02-05 14:30     ` [PATCH v3 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
@ 2021-02-05 14:30     ` Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
                       ` (8 subsequent siblings)
  17 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Historically, the chunk-writing methods in midx.c have returned the
amount of data written so the writer method could compare this with the
table of contents. This presents some interesting issues:

1. If a chunk writing method has a bug that miscalculates the written
   bytes, then we can satisfy the table of contents without actually
   writing the right amount of data to the hashfile. The commit-graph
   writing code checks the hashfile struct directly for a more robust
   verification.

2. There is no way for a chunk writing method to gracefully fail.
   Returning an int presents an opportunity to fail without a die().

3. The current pattern doesn't match chunk_write_fn type exactly, so we
   cannot share code with commit-graph.c.

For these reasons, convert the midx chunk writer methods to return an
'int'. Since none of them fail at the moment, they all return 0.
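
A minimal sketch of the idea behind points 1 and 2, assuming a byte
counter on the output sink that the driver can inspect directly. All
names here are hypothetical; this is not the midx.c or chunk-format.c
code, only an illustration of trusting the sink instead of the
callback's own arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sink: the only source of truth for bytes written. */
struct sketch_sink {
	uint64_t total;
};

static void sketch_write(struct sketch_sink *f, size_t len)
{
	f->total += len;
}

/* Writers report success (0) or failure (nonzero), not a byte count. */
typedef int (*sketch_write_fn)(struct sketch_sink *f, void *data);

struct sketch_chunk {
	uint64_t size; /* size promised in the table of contents */
	sketch_write_fn fn;
};

/*
 * Returns 0 on success, -1 if a writer reports failure, and -2 if a
 * writer produced a different number of bytes than it promised.
 */
static int sketch_write_chunks(struct sketch_sink *f,
			       const struct sketch_chunk *chunks,
			       size_t nr, void *data)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		uint64_t start = f->total;

		if (chunks[i].fn(f, data))
			return -1;
		if (f->total - start != chunks[i].size)
			return -2;
	}
	return 0;
}

static int sketch_write_good(struct sketch_sink *f, void *data)
{
	(void)data;
	sketch_write(f, 16);
	return 0;
}

static int sketch_write_short(struct sketch_sink *f, void *data)
{
	(void)data;
	sketch_write(f, 12); /* bug: four bytes fewer than promised */
	return 0;
}

static int sketch_write_fail(struct sketch_sink *f, void *data)
{
	(void)f;
	(void)data;
	return 1; /* graceful failure, no die() needed */
}
```

Because the driver measures the sink, a writer that miscounts cannot
satisfy the table of contents by accident, and a writer that fails can
say so through its return value.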

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 63 +++++++++++++++++++++++++---------------------------------
 1 file changed, 27 insertions(+), 36 deletions(-)

diff --git a/midx.c b/midx.c
index 5be081f229ad..c92a6c47be01 100644
--- a/midx.c
+++ b/midx.c
@@ -650,7 +650,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 	return deduplicated_entries;
 }
 
-static size_t write_midx_pack_names(struct hashfile *f, void *data)
+static int write_midx_pack_names(struct hashfile *f, void *data)
 {
 	struct write_midx_context *ctx = data;
 	uint32_t i;
@@ -678,14 +678,13 @@ static size_t write_midx_pack_names(struct hashfile *f, void *data)
 	if (i < MIDX_CHUNK_ALIGNMENT) {
 		memset(padding, 0, sizeof(padding));
 		hashwrite(f, padding, i);
-		written += i;
 	}
 
-	return written;
+	return 0;
 }
 
-static size_t write_midx_oid_fanout(struct hashfile *f,
-				    void *data)
+static int write_midx_oid_fanout(struct hashfile *f,
+				 void *data)
 {
 	struct write_midx_context *ctx = data;
 	struct pack_midx_entry *list = ctx->entries;
@@ -710,17 +709,16 @@ static size_t write_midx_oid_fanout(struct hashfile *f,
 		list = next;
 	}
 
-	return MIDX_CHUNK_FANOUT_SIZE;
+	return 0;
 }
 
-static size_t write_midx_oid_lookup(struct hashfile *f,
-				    void *data)
+static int write_midx_oid_lookup(struct hashfile *f,
+				 void *data)
 {
 	struct write_midx_context *ctx = data;
 	unsigned char hash_len = the_hash_algo->rawsz;
 	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i;
-	size_t written = 0;
 
 	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
@@ -734,19 +732,17 @@ static size_t write_midx_oid_lookup(struct hashfile *f,
 		}
 
 		hashwrite(f, obj->oid.hash, (int)hash_len);
-		written += hash_len;
 	}
 
-	return written;
+	return 0;
 }
 
-static size_t write_midx_object_offsets(struct hashfile *f,
-					void *data)
+static int write_midx_object_offsets(struct hashfile *f,
+				     void *data)
 {
 	struct write_midx_context *ctx = data;
 	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i, nr_large_offset = 0;
-	size_t written = 0;
 
 	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
@@ -766,20 +762,17 @@ static size_t write_midx_object_offsets(struct hashfile *f,
 			    obj->offset);
 		else
 			hashwrite_be32(f, (uint32_t)obj->offset);
-
-		written += MIDX_CHUNK_OFFSET_WIDTH;
 	}
 
-	return written;
+	return 0;
 }
 
-static size_t write_midx_large_offsets(struct hashfile *f,
-				       void *data)
+static int write_midx_large_offsets(struct hashfile *f,
+				    void *data)
 {
 	struct write_midx_context *ctx = data;
 	struct pack_midx_entry *list = ctx->entries;
 	struct pack_midx_entry *end = ctx->entries + ctx->entries_nr;
-	size_t written = 0;
 	uint32_t nr_large_offset = ctx->num_large_offsets;
 
 	while (nr_large_offset) {
@@ -795,12 +788,12 @@ static size_t write_midx_large_offsets(struct hashfile *f,
 		if (!(offset >> 31))
 			continue;
 
-		written += hashwrite_be64(f, offset);
+		hashwrite_be64(f, offset);
 
 		nr_large_offset--;
 	}
 
-	return written;
+	return 0;
 }
 
 static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
@@ -812,7 +805,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
-	uint64_t written = 0;
+	uint64_t header_size = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 	struct progress *progress = NULL;
@@ -940,10 +933,10 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		goto cleanup;
 	}
 
-	written = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
+	header_size = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
 
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
-	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
+	chunk_offsets[cur_chunk] = header_size + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
@@ -981,39 +974,37 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		hashwrite_be32(f, chunk_ids[i]);
 		hashwrite_be64(f, chunk_offsets[i]);
-
-		written += MIDX_CHUNKLOOKUP_WIDTH;
 	}
 
 	if (flags & MIDX_PROGRESS)
 		progress = start_delayed_progress(_("Writing chunks to multi-pack-index"),
 					  num_chunks);
 	for (i = 0; i < num_chunks; i++) {
-		if (written != chunk_offsets[i])
+		if (f->total + f->offset != chunk_offsets[i])
 			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
 			    chunk_offsets[i],
-			    written,
+			    f->total + f->offset,
 			    chunk_ids[i]);
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, &ctx);
+				write_midx_pack_names(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
-				written += write_midx_oid_fanout(f, &ctx);
+				write_midx_oid_fanout(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDLOOKUP:
-				written += write_midx_oid_lookup(f, &ctx);
+				write_midx_oid_lookup(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, &ctx);
+				write_midx_object_offsets(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
-				written += write_midx_large_offsets(f, &ctx);
+				write_midx_large_offsets(f, &ctx);
 				break;
 
 			default:
@@ -1025,9 +1016,9 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	}
 	stop_progress(&progress);
 
-	if (written != chunk_offsets[num_chunks])
+	if (hashfile_total(f) != chunk_offsets[num_chunks])
 		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
-		    written,
+		    hashfile_total(f),
 		    chunk_offsets[num_chunks]);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
-- 
gitgitgadget



* [PATCH v3 10/17] midx: drop chunk progress during write
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (8 preceding siblings ...)
  2021-02-05 14:30     ` [PATCH v3 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
@ 2021-02-05 14:30     ` Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
                       ` (7 subsequent siblings)
  17 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Most expensive operations in write_midx_internal() use the context
struct's progress member, and these indicate the progress of the
expensive operations within the chunk writing methods. However, there is
a competing progress struct that counts the progress over all chunks.
This is not very helpful compared to the others, so drop it.

This also reduces our barriers to combining the chunk writing code with
chunk-format.c.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/midx.c b/midx.c
index c92a6c47be01..4f4aa351e60e 100644
--- a/midx.c
+++ b/midx.c
@@ -808,7 +808,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint64_t header_size = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
-	struct progress *progress = NULL;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
@@ -976,9 +975,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		hashwrite_be64(f, chunk_offsets[i]);
 	}
 
-	if (flags & MIDX_PROGRESS)
-		progress = start_delayed_progress(_("Writing chunks to multi-pack-index"),
-					  num_chunks);
 	for (i = 0; i < num_chunks; i++) {
 		if (f->total + f->offset != chunk_offsets[i])
 			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
@@ -1011,10 +1007,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				BUG("trying to write unknown chunk id %"PRIx32,
 				    chunk_ids[i]);
 		}
-
-		display_progress(progress, i + 1);
 	}
-	stop_progress(&progress);
 
 	if (hashfile_total(f) != chunk_offsets[num_chunks])
 		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
-- 
gitgitgadget



* [PATCH v3 11/17] midx: use chunk-format API in write_midx_internal()
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (9 preceding siblings ...)
  2021-02-05 14:30     ` [PATCH v3 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
@ 2021-02-05 14:30     ` Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
                       ` (6 subsequent siblings)
  17 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The chunk-format API allows writing the table of contents and all chunks
using the anonymous 'struct chunkfile' type. We only need to convert our
local chunk logic to this API for the multi-pack-index writes to share
that logic with the commit-graph file writes.
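
For readers unfamiliar with the layout, here is a rough sketch of the
offset bookkeeping that the removed hand-written code performed and that
a chunk registry can derive on its own. The names are hypothetical and
the 12-byte entry width is taken from the 4-byte id plus 8-byte offset
written per table-of-contents entry; this is not the chunk-format.c
implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TOC_ENTRY_WIDTH 12 /* 4-byte id + 8-byte offset */
#define SKETCH_MAX_CHUNKS 8

/* Hypothetical registry of chunks, in the order they will be written. */
struct sketch_chunkfile {
	uint32_t ids[SKETCH_MAX_CHUNKS];
	uint64_t sizes[SKETCH_MAX_CHUNKS];
	size_t nr;
};

/* Register one chunk; returns -1 if the registry is full. */
static int sketch_add_chunk(struct sketch_chunkfile *cf,
			    uint32_t id, uint64_t size)
{
	if (cf->nr == SKETCH_MAX_CHUNKS)
		return -1;
	cf->ids[cf->nr] = id;
	cf->sizes[cf->nr] = size;
	cf->nr++;
	return 0;
}

/*
 * Fill offsets[0..nr]. The first chunk starts right after the header
 * and the table of contents, which has nr entries plus a terminating
 * one. offsets[nr] is the end of the last chunk, which is what the
 * terminating entry records.
 */
static void sketch_compute_toc(const struct sketch_chunkfile *cf,
			       uint64_t header_size, uint64_t *offsets)
{
	uint64_t cur = header_size + (cf->nr + 1) * SKETCH_TOC_ENTRY_WIDTH;
	size_t i;

	for (i = 0; i < cf->nr; i++) {
		offsets[i] = cur;
		cur += cf->sizes[i];
	}
	offsets[cf->nr] = cur;
}
```

With an example 12-byte header and two chunks of 8 and 1024 bytes, the
offsets come out as 48, 56 and 1080: the caller only states sizes, and
the positions follow from the order of registration.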

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 105 +++++++++++----------------------------------------------
 1 file changed, 20 insertions(+), 85 deletions(-)

diff --git a/midx.c b/midx.c
index 4f4aa351e60e..d9c7411b083b 100644
--- a/midx.c
+++ b/midx.c
@@ -11,6 +11,7 @@
 #include "trace2.h"
 #include "run-command.h"
 #include "repository.h"
+#include "chunk-format.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -799,18 +800,15 @@ static int write_midx_large_offsets(struct hashfile *f,
 static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
 			       struct string_list *packs_to_drop, unsigned flags)
 {
-	unsigned char cur_chunk, num_chunks = 0;
 	char *midx_name;
 	uint32_t i;
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
-	uint64_t header_size = 0;
-	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
-	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
+	struct chunkfile *cf;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name))
@@ -923,98 +921,35 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	if (ctx.m)
 		close_midx(ctx.m);
 
-	cur_chunk = 0;
-	num_chunks = ctx.large_offsets_needed ? 5 : 4;
-
 	if (ctx.nr - dropped_packs == 0) {
 		error(_("no pack files to index."));
 		result = 1;
 		goto cleanup;
 	}
 
-	header_size = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
-
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
-	chunk_offsets[cur_chunk] = header_size + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
-
-	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
-
-	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + MIDX_CHUNK_FANOUT_SIZE;
-
-	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OBJECTOFFSETS;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * the_hash_algo->rawsz;
-
-	cur_chunk++;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
-	if (ctx.large_offsets_needed) {
-		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
-
-		cur_chunk++;
-		chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
-					   ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
-	}
-
-	chunk_ids[cur_chunk] = 0;
-
-	for (i = 0; i <= num_chunks; i++) {
-		if (i && chunk_offsets[i] < chunk_offsets[i - 1])
-			BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
-			    chunk_offsets[i - 1],
-			    chunk_offsets[i]);
-
-		if (chunk_offsets[i] % MIDX_CHUNK_ALIGNMENT)
-			BUG("chunk offset %"PRIu64" is not properly aligned",
-			    chunk_offsets[i]);
-
-		hashwrite_be32(f, chunk_ids[i]);
-		hashwrite_be64(f, chunk_offsets[i]);
-	}
-
-	for (i = 0; i < num_chunks; i++) {
-		if (f->total + f->offset != chunk_offsets[i])
-			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
-			    chunk_offsets[i],
-			    f->total + f->offset,
-			    chunk_ids[i]);
+	cf = init_chunkfile(f);
 
-		switch (chunk_ids[i]) {
-			case MIDX_CHUNKID_PACKNAMES:
-				write_midx_pack_names(f, &ctx);
-				break;
+	add_chunk(cf, MIDX_CHUNKID_PACKNAMES, pack_name_concat_len,
+		  write_midx_pack_names);
+	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT, MIDX_CHUNK_FANOUT_SIZE,
+		  write_midx_oid_fanout);
+	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
+		  ctx.entries_nr * the_hash_algo->rawsz,
+		  write_midx_oid_lookup);
+	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
+		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH,
+		  write_midx_object_offsets);
 
-			case MIDX_CHUNKID_OIDFANOUT:
-				write_midx_oid_fanout(f, &ctx);
-				break;
-
-			case MIDX_CHUNKID_OIDLOOKUP:
-				write_midx_oid_lookup(f, &ctx);
-				break;
-
-			case MIDX_CHUNKID_OBJECTOFFSETS:
-				write_midx_object_offsets(f, &ctx);
-				break;
-
-			case MIDX_CHUNKID_LARGEOFFSETS:
-				write_midx_large_offsets(f, &ctx);
-				break;
-
-			default:
-				BUG("trying to write unknown chunk id %"PRIx32,
-				    chunk_ids[i]);
-		}
-	}
+	if (ctx.large_offsets_needed)
+		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
+			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH,
+			write_midx_large_offsets);
 
-	if (hashfile_total(f) != chunk_offsets[num_chunks])
-		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
-		    hashfile_total(f),
-		    chunk_offsets[num_chunks]);
+	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
+	write_chunkfile(cf, &ctx);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
+	free_chunkfile(cf);
 	commit_lock_file(&lk);
 
 cleanup:
-- 
gitgitgadget



* [PATCH v3 12/17] chunk-format: create read chunk API
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (10 preceding siblings ...)
  2021-02-05 14:30     ` [PATCH v3 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
@ 2021-02-05 14:30     ` Derrick Stolee via GitGitGadget
  2021-02-07 20:20       ` SZEDER Gábor
  2021-02-05 14:30     ` [PATCH v3 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
                       ` (5 subsequent siblings)
  17 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add the capability to read the table of contents, then pair the chunks
with necessary logic using read_chunk_fn pointers. Callers will be added
in future changes, but the typical outline will be:

 1. initialize a 'struct chunkfile' with init_chunkfile(NULL).
 2. call read_table_of_contents().
 3. for each chunk to parse,
    a. call pair_chunk() to assign a pointer with the chunk position, or
    b. call read_chunk() to run a callback on the chunk start and size.
 4. call free_chunkfile() to clear the 'struct chunkfile' data.

We are re-using the anonymous 'struct chunkfile' data, as it is internal
to the chunk-format API. This gives it essentially two modes: write and
read. If the same struct instance were used for both reads and writes,
then there would be failures.
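
To make step 2 of the outline concrete, here is a self-contained sketch
of parsing such a table of contents from a memory buffer. It mirrors the
checks described in this patch (no early terminator, non-decreasing
offsets that stay in front of the trailing hash, a zero id at the end),
but the function names, the trailer_size parameter and the single -1
error code are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TOC_ENTRY_WIDTH 12 /* 4-byte id + 8-byte offset */

static uint32_t sketch_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t sketch_get_be64(const unsigned char *p)
{
	return ((uint64_t)sketch_get_be32(p) << 32) | sketch_get_be32(p + 4);
}

static void sketch_put_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

static void sketch_put_be64(unsigned char *p, uint64_t v)
{
	sketch_put_be32(p, (uint32_t)(v >> 32));
	sketch_put_be32(p + 4, (uint32_t)v);
}

struct sketch_chunk {
	uint32_t id;
	const unsigned char *start;
	uint64_t size;
};

/* Parse 'nr' entries plus the terminator; returns 0 on success, -1 on error. */
static int sketch_read_toc(const unsigned char *file, size_t file_size,
			   size_t toc_offset, size_t trailer_size,
			   struct sketch_chunk *out, size_t nr)
{
	const unsigned char *toc;
	size_t i;

	/* The whole table, terminator included, must fit inside the file. */
	if (file_size < trailer_size || toc_offset > file_size ||
	    (file_size - toc_offset) / SKETCH_TOC_ENTRY_WIDTH <= nr)
		return -1;

	toc = file + toc_offset;
	for (i = 0; i < nr; i++) {
		uint32_t id = sketch_get_be32(toc);
		uint64_t offset = sketch_get_be64(toc + 4);
		uint64_t next;

		if (!id)
			return -1; /* terminator appeared too early */

		toc += SKETCH_TOC_ENTRY_WIDTH;
		next = sketch_get_be64(toc + 4);
		if (next < offset || next > file_size - trailer_size)
			return -1; /* chunk runs backwards or into the trailer */

		out[i].id = id;
		out[i].start = file + offset;
		out[i].size = next - offset;
	}

	return sketch_get_be32(toc) ? -1 : 0; /* final id must be zero */
}
```

Each chunk's size is never stored; it is the distance to the next
entry's offset, which is why the terminating entry is required.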

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 chunk-format.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++
 chunk-format.h | 46 +++++++++++++++++++++++++++++
 2 files changed, 126 insertions(+)

diff --git a/chunk-format.c b/chunk-format.c
index 6e0f1900213e..bc9d4caf7276 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -12,6 +12,8 @@ struct chunk_info {
 	uint32_t id;
 	uint64_t size;
 	chunk_write_fn write_fn;
+
+	const void *start;
 };
 
 struct chunkfile {
@@ -89,3 +91,81 @@ int write_chunkfile(struct chunkfile *cf, void *data)
 
 	return 0;
 }
+
+int read_table_of_contents(struct chunkfile *cf,
+			   const unsigned char *mfile,
+			   size_t mfile_size,
+			   uint64_t toc_offset,
+			   int toc_length)
+{
+	uint32_t chunk_id;
+	const unsigned char *table_of_contents = mfile + toc_offset;
+
+	ALLOC_GROW(cf->chunks, toc_length, cf->chunks_alloc);
+
+	while (toc_length--) {
+		uint64_t chunk_offset, next_chunk_offset;
+
+		chunk_id = get_be32(table_of_contents);
+		chunk_offset = get_be64(table_of_contents + 4);
+
+		if (!chunk_id) {
+			error(_("terminating chunk id appears earlier than expected"));
+			return 1;
+		}
+
+		table_of_contents += CHUNK_LOOKUP_WIDTH;
+		next_chunk_offset = get_be64(table_of_contents + 4);
+
+		if (next_chunk_offset < chunk_offset ||
+		    next_chunk_offset > mfile_size - the_hash_algo->rawsz) {
+			error(_("improper chunk offset(s) %"PRIx64" and %"PRIx64""),
+			      chunk_offset, next_chunk_offset);
+			return -1;
+		}
+
+		cf->chunks[cf->chunks_nr].id = chunk_id;
+		cf->chunks[cf->chunks_nr].start = mfile + chunk_offset;
+		cf->chunks[cf->chunks_nr].size = next_chunk_offset - chunk_offset;
+		cf->chunks_nr++;
+	}
+
+	chunk_id = get_be32(table_of_contents);
+	if (chunk_id) {
+		error(_("final chunk has non-zero id %"PRIx32""), chunk_id);
+		return -1;
+	}
+
+	return 0;
+}
+
+static int pair_chunk_fn(const unsigned char *chunk_start,
+			 size_t chunk_size,
+			 void *data)
+{
+	const unsigned char **p = data;
+	*p = chunk_start;
+	return 0;
+}
+
+int pair_chunk(struct chunkfile *cf,
+	       uint32_t chunk_id,
+	       const unsigned char **p)
+{
+	return read_chunk(cf, chunk_id, pair_chunk_fn, p);
+}
+
+int read_chunk(struct chunkfile *cf,
+	       uint32_t chunk_id,
+	       chunk_read_fn fn,
+	       void *data)
+{
+	int i;
+
+	for (i = 0; i < cf->chunks_nr; i++) {
+		if (cf->chunks[i].id == chunk_id)
+			return fn(cf->chunks[i].start, cf->chunks[i].size, data);
+	}
+
+	return CHUNK_NOT_FOUND;
+}
diff --git a/chunk-format.h b/chunk-format.h
index 9a1d770accec..0edcc57db4e7 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -6,6 +6,19 @@
 struct hashfile;
 struct chunkfile;
 
+/*
+ * Initialize a 'struct chunkfile' for writing _or_ reading a file
+ * with the chunk format.
+ *
+ * If writing a file, supply a non-NULL 'struct hashfile *' that will
+ * be used to write.
+ *
+ * If reading a file, then supply the memory-mapped data to the
+ * pair_chunk() or read_chunk() methods, as appropriate.
+ *
+ * DO NOT MIX THESE MODES. Use different 'struct chunkfile' instances
+ * for reading and writing.
+ */
 struct chunkfile *init_chunkfile(struct hashfile *f);
 void free_chunkfile(struct chunkfile *cf);
 int get_num_chunks(struct chunkfile *cf);
@@ -16,4 +29,37 @@ void add_chunk(struct chunkfile *cf,
 	       chunk_write_fn fn);
 int write_chunkfile(struct chunkfile *cf, void *data);
 
+int read_table_of_contents(struct chunkfile *cf,
+			   const unsigned char *mfile,
+			   size_t mfile_size,
+			   uint64_t toc_offset,
+			   int toc_length);
+
+#define CHUNK_NOT_FOUND (-2)
+
+/*
+ * Find 'chunk_id' in the given chunkfile and assign the
+ * given pointer to the position in the mmap'd file where
+ * that chunk begins.
+ *
+ * Returns CHUNK_NOT_FOUND if the chunk does not exist.
+ */
+int pair_chunk(struct chunkfile *cf,
+	       uint32_t chunk_id,
+	       const unsigned char **p);
+
+typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
+			     size_t chunk_size, void *data);
+/*
+ * Find 'chunk_id' in the given chunkfile and call the
+ * given chunk_read_fn method with the information for
+ * that chunk.
+ *
+ * Returns CHUNK_NOT_FOUND if the chunk does not exist.
+ */
+int read_chunk(struct chunkfile *cf,
+	       uint32_t chunk_id,
+	       chunk_read_fn fn,
+	       void *data);
+
 #endif
-- 
gitgitgadget



* [PATCH v3 13/17] commit-graph: use chunk-format read API
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (11 preceding siblings ...)
  2021-02-05 14:30     ` [PATCH v3 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
@ 2021-02-05 14:30     ` Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 14/17] midx: " Derrick Stolee via GitGitGadget
                       ` (4 subsequent siblings)
  17 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Instead of parsing the table of contents directly, use the chunk-format
API methods read_table_of_contents() and pair_chunk(). While the current
implementation loses the duplicate-chunk detection, that will be added
in a future change.
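
The two lookup styles can be sketched as follows. This is a simplified
model with hypothetical names (sketch_read_chunk, sketch_graph, and so
on), not the chunk-format.c code: one callback merely stores the chunk
start, the other also derives a count from the chunk size, as the
OID lookup callback in this patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CHUNK_NOT_FOUND (-2)

typedef int (*sketch_read_fn)(const unsigned char *start,
			      size_t size, void *data);

struct sketch_chunk {
	uint32_t id;
	const unsigned char *start;
	size_t size;
};

/* Run 'fn' on the first chunk whose id matches, if there is one. */
static int sketch_read_chunk(const struct sketch_chunk *chunks, size_t nr,
			     uint32_t id, sketch_read_fn fn, void *data)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (chunks[i].id == id)
			return fn(chunks[i].start, chunks[i].size, data);

	return SKETCH_CHUNK_NOT_FOUND;
}

/* Simplest callback: remember where the chunk begins. */
static int sketch_pair_fn(const unsigned char *start, size_t size, void *data)
{
	const unsigned char **p = data;

	(void)size;
	*p = start;
	return 0;
}

/* Hypothetical reader state, loosely modeled on 'struct commit_graph'. */
struct sketch_graph {
	const unsigned char *oid_lookup;
	uint32_t num_commits;
	size_t hash_len;
};

/* Callback that needs the size too: entries = bytes / entry width. */
static int sketch_read_oid_lookup(const unsigned char *start,
				  size_t size, void *data)
{
	struct sketch_graph *g = data;

	g->oid_lookup = start;
	g->num_commits = (uint32_t)(size / g->hash_len);
	return 0;
}
```

A missing chunk is reported through a distinct return value rather than
an error, so the caller can decide whether that chunk is optional.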

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 154 ++++++++++++++--------------------------
 t/t5318-commit-graph.sh |   2 +-
 2 files changed, 53 insertions(+), 103 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 7c607d23b29f..32cf5091d2fb 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -299,15 +299,43 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	return 0;
 }
 
+static int graph_read_oid_lookup(const unsigned char *chunk_start,
+				 size_t chunk_size, void *data)
+{
+	struct commit_graph *g = data;
+	g->chunk_oid_lookup = chunk_start;
+	g->num_commits = chunk_size / g->hash_len;
+	return 0;
+}
+
+static int graph_read_bloom_data(const unsigned char *chunk_start,
+				  size_t chunk_size, void *data)
+{
+	struct commit_graph *g = data;
+	uint32_t hash_version;
+	g->chunk_bloom_data = chunk_start;
+	hash_version = get_be32(chunk_start);
+
+	if (hash_version != 1)
+		return 0;
+
+	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
+	g->bloom_filter_settings->hash_version = hash_version;
+	g->bloom_filter_settings->num_hashes = get_be32(chunk_start + 4);
+	g->bloom_filter_settings->bits_per_entry = get_be32(chunk_start + 8);
+	g->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;
+
+	return 0;
+}
+
 struct commit_graph *parse_commit_graph(struct repository *r,
 					void *graph_map, size_t graph_size)
 {
-	const unsigned char *data, *chunk_lookup;
-	uint32_t i;
+	const unsigned char *data;
 	struct commit_graph *graph;
-	uint64_t next_chunk_offset;
 	uint32_t graph_signature;
 	unsigned char graph_version, hash_version;
+	struct chunkfile *cf = NULL;
 
 	if (!graph_map)
 		return NULL;
@@ -356,108 +384,28 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		return NULL;
 	}
 
-	chunk_lookup = data + 8;
-	next_chunk_offset = get_be64(chunk_lookup + 4);
-	for (i = 0; i < graph->num_chunks; i++) {
-		uint32_t chunk_id;
-		uint64_t chunk_offset = next_chunk_offset;
-		int chunk_repeated = 0;
-
-		chunk_id = get_be32(chunk_lookup + 0);
-
-		chunk_lookup += GRAPH_CHUNKLOOKUP_WIDTH;
-		next_chunk_offset = get_be64(chunk_lookup + 4);
-
-		if (chunk_offset > graph_size - the_hash_algo->rawsz) {
-			error(_("commit-graph improper chunk offset %08x%08x"), (uint32_t)(chunk_offset >> 32),
-			      (uint32_t)chunk_offset);
-			goto free_and_return;
-		}
-
-		switch (chunk_id) {
-		case GRAPH_CHUNKID_OIDFANOUT:
-			if (graph->chunk_oid_fanout)
-				chunk_repeated = 1;
-			else
-				graph->chunk_oid_fanout = (uint32_t*)(data + chunk_offset);
-			break;
-
-		case GRAPH_CHUNKID_OIDLOOKUP:
-			if (graph->chunk_oid_lookup)
-				chunk_repeated = 1;
-			else {
-				graph->chunk_oid_lookup = data + chunk_offset;
-				graph->num_commits = (next_chunk_offset - chunk_offset)
-						     / graph->hash_len;
-			}
-			break;
+	cf = init_chunkfile(NULL);
 
-		case GRAPH_CHUNKID_DATA:
-			if (graph->chunk_commit_data)
-				chunk_repeated = 1;
-			else
-				graph->chunk_commit_data = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_GENERATION_DATA:
-			if (graph->chunk_generation_data)
-				chunk_repeated = 1;
-			else
-				graph->chunk_generation_data = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
-			if (graph->chunk_generation_data_overflow)
-				chunk_repeated = 1;
-			else
-				graph->chunk_generation_data_overflow = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_EXTRAEDGES:
-			if (graph->chunk_extra_edges)
-				chunk_repeated = 1;
-			else
-				graph->chunk_extra_edges = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_BASE:
-			if (graph->chunk_base_graphs)
-				chunk_repeated = 1;
-			else
-				graph->chunk_base_graphs = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_BLOOMINDEXES:
-			if (graph->chunk_bloom_indexes)
-				chunk_repeated = 1;
-			else if (r->settings.commit_graph_read_changed_paths)
-				graph->chunk_bloom_indexes = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_BLOOMDATA:
-			if (graph->chunk_bloom_data)
-				chunk_repeated = 1;
-			else if (r->settings.commit_graph_read_changed_paths) {
-				uint32_t hash_version;
-				graph->chunk_bloom_data = data + chunk_offset;
-				hash_version = get_be32(data + chunk_offset);
-
-				if (hash_version != 1)
-					break;
+	if (read_table_of_contents(cf, graph->data, graph_size,
+				   GRAPH_HEADER_SIZE, graph->num_chunks))
+		goto free_and_return;
 
-				graph->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
-				graph->bloom_filter_settings->hash_version = hash_version;
-				graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
-				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
-				graph->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;
-			}
-			break;
-		}
+	pair_chunk(cf, GRAPH_CHUNKID_OIDFANOUT,
+		   (const unsigned char **)&graph->chunk_oid_fanout);
+	read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);
+	pair_chunk(cf, GRAPH_CHUNKID_DATA, &graph->chunk_commit_data);
+	pair_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges);
+	pair_chunk(cf, GRAPH_CHUNKID_BASE, &graph->chunk_base_graphs);
+	pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
+		   &graph->chunk_generation_data);
+	pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
+		   &graph->chunk_generation_data_overflow);
 
-		if (chunk_repeated) {
-			error(_("commit-graph chunk id %08x appears multiple times"), chunk_id);
-			goto free_and_return;
-		}
+	if (r->settings.commit_graph_read_changed_paths) {
+		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
+			   &graph->chunk_bloom_indexes);
+		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
+			   graph_read_bloom_data, graph);
 	}
 
 	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
@@ -474,9 +422,11 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 	if (verify_commit_graph_lite(graph))
 		goto free_and_return;
 
+	free_chunkfile(cf);
 	return graph;
 
 free_and_return:
+	free_chunkfile(cf);
 	free(graph->bloom_filter_settings);
 	free(graph);
 	return NULL;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index fa27df579a57..c7da741284e5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -564,7 +564,7 @@ test_expect_success 'detect bad hash version' '
 
 test_expect_success 'detect low chunk count' '
 	corrupt_graph_and_verify $GRAPH_BYTE_CHUNK_COUNT "\01" \
-		"missing the .* chunk"
+		"final chunk has non-zero id"
 '
 
 test_expect_success 'detect missing OID fanout chunk' '
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 14/17] midx: use chunk-format read API
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (12 preceding siblings ...)
  2021-02-05 14:30     ` [PATCH v3 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
@ 2021-02-05 14:30     ` Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
                       ` (3 subsequent siblings)
  17 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Instead of parsing the table of contents directly, use the chunk-format
API methods read_table_of_contents() and pair_chunk(). In particular, we
can use the return value of pair_chunk() to generate an error when a
required chunk is missing.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 72 ++++++++++++++-----------------------
 t/t5319-multi-pack-index.sh |  6 ++--
 2 files changed, 29 insertions(+), 49 deletions(-)

diff --git a/midx.c b/midx.c
index d9c7411b083b..aee9ed832d52 100644
--- a/midx.c
+++ b/midx.c
@@ -54,6 +54,19 @@ static char *get_midx_filename(const char *object_dir)
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
 
+static int midx_read_oid_fanout(const unsigned char *chunk_start,
+				size_t chunk_size, void *data)
+{
+	struct multi_pack_index *m = data;
+	m->chunk_oid_fanout = (uint32_t *)chunk_start;
+
+	if (chunk_size != 4 * 256) {
+		error(_("multi-pack-index OID fanout is of the wrong size"));
+		return 1;
+	}
+	return 0;
+}
+
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local)
 {
 	struct multi_pack_index *m = NULL;
@@ -65,6 +78,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 	char *midx_name = get_midx_filename(object_dir);
 	uint32_t i;
 	const char *cur_pack_name;
+	struct chunkfile *cf = NULL;
 
 	fd = git_open(midx_name);
 
@@ -114,58 +128,23 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 
 	m->num_packs = get_be32(m->data + MIDX_BYTE_NUM_PACKS);
 
-	for (i = 0; i < m->num_chunks; i++) {
-		uint32_t chunk_id = get_be32(m->data + MIDX_HEADER_SIZE +
-					     MIDX_CHUNKLOOKUP_WIDTH * i);
-		uint64_t chunk_offset = get_be64(m->data + MIDX_HEADER_SIZE + 4 +
-						 MIDX_CHUNKLOOKUP_WIDTH * i);
-
-		if (chunk_offset >= m->data_len)
-			die(_("invalid chunk offset (too large)"));
-
-		switch (chunk_id) {
-			case MIDX_CHUNKID_PACKNAMES:
-				m->chunk_pack_names = m->data + chunk_offset;
-				break;
-
-			case MIDX_CHUNKID_OIDFANOUT:
-				m->chunk_oid_fanout = (uint32_t *)(m->data + chunk_offset);
-				break;
-
-			case MIDX_CHUNKID_OIDLOOKUP:
-				m->chunk_oid_lookup = m->data + chunk_offset;
-				break;
-
-			case MIDX_CHUNKID_OBJECTOFFSETS:
-				m->chunk_object_offsets = m->data + chunk_offset;
-				break;
-
-			case MIDX_CHUNKID_LARGEOFFSETS:
-				m->chunk_large_offsets = m->data + chunk_offset;
-				break;
-
-			case 0:
-				die(_("terminating multi-pack-index chunk id appears earlier than expected"));
-				break;
-
-			default:
-				/*
-				 * Do nothing on unrecognized chunks, allowing future
-				 * extensions to add optional chunks.
-				 */
-				break;
-		}
-	}
+	cf = init_chunkfile(NULL);
 
-	if (!m->chunk_pack_names)
+	if (read_table_of_contents(cf, m->data, midx_size,
+				   MIDX_HEADER_SIZE, m->num_chunks))
+		goto cleanup_fail;
+
+	if (pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required pack-name chunk"));
-	if (!m->chunk_oid_fanout)
+	if (read_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required OID fanout chunk"));
-	if (!m->chunk_oid_lookup)
+	if (pair_chunk(cf, MIDX_CHUNKID_OIDLOOKUP, &m->chunk_oid_lookup) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required OID lookup chunk"));
-	if (!m->chunk_object_offsets)
+	if (pair_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, &m->chunk_object_offsets) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required object offsets chunk"));
 
+	pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets);
+
 	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
 
 	m->pack_names = xcalloc(m->num_packs, sizeof(*m->pack_names));
@@ -191,6 +170,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 cleanup_fail:
 	free(m);
 	free(midx_name);
+	free_chunkfile(cf);
 	if (midx_map)
 		munmap(midx_map, midx_size);
 	if (0 <= fd)
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 297de502a94f..ad4e878b65b8 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -314,12 +314,12 @@ test_expect_success 'verify bad OID version' '
 
 test_expect_success 'verify truncated chunk count' '
 	corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\01" $objdir \
-		"missing required"
+		"final chunk has non-zero id"
 '
 
 test_expect_success 'verify extended chunk count' '
 	corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\07" $objdir \
-		"terminating multi-pack-index chunk id appears earlier than expected"
+		"terminating chunk id appears earlier than expected"
 '
 
 test_expect_success 'verify missing required chunk' '
@@ -329,7 +329,7 @@ test_expect_success 'verify missing required chunk' '
 
 test_expect_success 'verify invalid chunk offset' '
 	corrupt_midx_and_verify $MIDX_BYTE_CHUNK_OFFSET "\01" $objdir \
-		"invalid chunk offset (too large)"
+		"improper chunk offset(s)"
 '
 
 test_expect_success 'verify packnames out of order' '
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 15/17] midx: use 64-bit multiplication for chunk sizes
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (13 preceding siblings ...)
  2021-02-05 14:30     ` [PATCH v3 14/17] midx: " Derrick Stolee via GitGitGadget
@ 2021-02-05 14:30     ` Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
                       ` (2 subsequent siblings)
  17 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When calculating the sizes of certain chunks, we should use 64-bit
multiplication always. This allows us to properly predict the chunk
sizes without risk of overflow.

Other possible overflows were discovered by evaluating each
multiplication in midx.c and ensuring that at least one side of the
operator was of type size_t or off_t.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/midx.c b/midx.c
index aee9ed832d52..95648a1f368f 100644
--- a/midx.c
+++ b/midx.c
@@ -246,7 +246,7 @@ static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
 	const unsigned char *offset_data;
 	uint32_t offset32;
 
-	offset_data = m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH;
+	offset_data = m->chunk_object_offsets + (off_t)pos * MIDX_CHUNK_OFFSET_WIDTH;
 	offset32 = get_be32(offset_data + sizeof(uint32_t));
 
 	if (m->chunk_large_offsets && offset32 & MIDX_LARGE_OFFSET_NEEDED) {
@@ -262,7 +262,8 @@ static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
 
 static uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos)
 {
-	return get_be32(m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH);
+	return get_be32(m->chunk_object_offsets +
+			(off_t)pos * MIDX_CHUNK_OFFSET_WIDTH);
 }
 
 static int nth_midxed_pack_entry(struct repository *r,
@@ -914,15 +915,15 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT, MIDX_CHUNK_FANOUT_SIZE,
 		  write_midx_oid_fanout);
 	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
-		  ctx.entries_nr * the_hash_algo->rawsz,
+		  (size_t)ctx.entries_nr * the_hash_algo->rawsz,
 		  write_midx_oid_lookup);
 	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
-		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH,
+		  (size_t)ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH,
 		  write_midx_object_offsets);
 
 	if (ctx.large_offsets_needed)
 		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
-			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH,
+			(size_t)ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH,
 			write_midx_large_offsets);
 
 	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 16/17] chunk-format: restore duplicate chunk checks
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (14 preceding siblings ...)
  2021-02-05 14:30     ` [PATCH v3 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
@ 2021-02-05 14:30     ` Derrick Stolee via GitGitGadget
  2021-02-05 14:30     ` [PATCH v3 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
  17 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Before refactoring into the chunk-format API, the commit-graph parsing
logic included checks for duplicate chunks. It is unlikely that we would
desire a chunk-based file format that allows duplicate chunk IDs in the
table of contents, so add duplicate checks into
read_table_of_contents().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 chunk-format.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/chunk-format.c b/chunk-format.c
index bc9d4caf7276..e4889d9efcd1 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -98,6 +98,7 @@ int read_table_of_contents(struct chunkfile *cf,
 			   uint64_t toc_offset,
 			   int toc_length)
 {
+	int i;
 	uint32_t chunk_id;
 	const unsigned char *table_of_contents = mfile + toc_offset;
 
@@ -124,6 +125,14 @@ int read_table_of_contents(struct chunkfile *cf,
 			return -1;
 		}
 
+		for (i = 0; i < cf->chunks_nr; i++) {
+			if (cf->chunks[i].id == chunk_id) {
+				error(_("duplicate chunk ID %"PRIx32" found"),
+					chunk_id);
+				return -1;
+			}
+		}
+
 		cf->chunks[cf->chunks_nr].id = chunk_id;
 		cf->chunks[cf->chunks_nr].start = mfile + chunk_offset;
 		cf->chunks[cf->chunks_nr].size = next_chunk_offset - chunk_offset;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v3 17/17] chunk-format: add technical docs
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (15 preceding siblings ...)
  2021-02-05 14:30     ` [PATCH v3 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
@ 2021-02-05 14:30     ` Derrick Stolee via GitGitGadget
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
  17 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-05 14:30 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The chunk-based file format is now an API in the code, but we should
also take time to document it as a file format. Specifically, it matches
the CHUNK LOOKUP sections of the commit-graph and multi-pack-index
files, but there are some commonalities that should be grouped in this
document.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/chunk-format.txt      | 116 ++++++++++++++++++
 .../technical/commit-graph-format.txt         |   3 +
 Documentation/technical/pack-format.txt       |   3 +
 3 files changed, 122 insertions(+)
 create mode 100644 Documentation/technical/chunk-format.txt

diff --git a/Documentation/technical/chunk-format.txt b/Documentation/technical/chunk-format.txt
new file mode 100644
index 000000000000..593614fcedab
--- /dev/null
+++ b/Documentation/technical/chunk-format.txt
@@ -0,0 +1,116 @@
+Chunk-based file formats
+========================
+
+Some file formats in Git use a common concept of "chunks" to describe
+sections of the file. This allows structured access to a large file by
+scanning a small "table of contents" for the remaining data. This common
+format is used by the `commit-graph` and `multi-pack-index` files. See
+link:technical/pack-format.html[the `multi-pack-index` format] and
+link:technical/commit-graph-format.html[the `commit-graph` format] for
+how they use the chunks to describe structured data.
+
+A chunk-based file format begins with some header information custom to
+that format. That header should include enough information to identify
+the file type, format version, and number of chunks in the file. From this
+information, a reader can determine the start of the chunk-based region.
+
+The chunk-based region starts with a table of contents describing where
+each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
+where C is the number of chunks. Consider the following table:
+
+  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
+  |--------------------|------------------------|
+  | ID[0]              | OFFSET[0]              |
+  | ...                | ...                    |
+  | ID[C-1]            | OFFSET[C-1]            |
+  | 0x00000000         | OFFSET[C]              |
+
+Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
+Each integer is stored in network-byte order.
+
+The chunk identifier `ID[i]` is a label for the data stored within this
+file from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
+size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
+and `OFFSET[i]`. This requires that the chunk data appears contiguously
+in the same order as the table of contents.
+
+The final entry in the table of contents must be four zero bytes. This
+confirms that the table of contents is ending and provides the offset for
+the end of the chunk-based data.
+
+Note: The chunk-based format expects that the file contains _at least_ a
+trailing hash after `OFFSET[C]`.
+
+Functions for working with chunk-based file formats are declared in
+`chunk-format.h`. Using these methods provides extra checks that assist
+developers when creating new file formats.
+
+Writing chunk-based file formats
+--------------------------------
+
+To write a chunk-based file format, create a `struct chunkfile` by
+calling `init_chunkfile()` and pass a `struct hashfile` pointer. The
+caller is responsible for opening the `hashfile` and writing header
+information so the file format is identifiable before the chunk-based
+format begins.
+
+Then, call `add_chunk()` for each chunk that is intended for write. This
+populates the `chunkfile` with information about the order and size of
+each chunk to write. Provide a `chunk_write_fn` function pointer to
+perform the write of the chunk data upon request.
+
+Call `write_chunkfile()` to write the table of contents to the `hashfile`
+followed by each of the chunks. This will verify that each chunk wrote
+the expected amount of data so the table of contents is correct.
+
+Finally, call `free_chunkfile()` to clear the `struct chunkfile` data. The
+caller is responsible for finalizing the `hashfile` by writing the trailing
+hash and closing the file.
+
+Reading chunk-based file formats
+--------------------------------
+
+To read a chunk-based file format, the file must be opened as a
+memory-mapped region. The chunk-format API expects that the entire file
+is mapped as a contiguous memory region.
+
+Initialize a `struct chunkfile` pointer with `init_chunkfile(NULL)`.
+
+After reading the header information from the beginning of the file,
+including the chunk count, call `read_table_of_contents()` to populate
+the `struct chunkfile` with the list of chunks, their offsets, and their
+sizes.
+
+Extract the data information for each chunk using `pair_chunk()` or
+`read_chunk()`:
+
+* `pair_chunk()` assigns a given pointer with the location inside the
+  memory-mapped file corresponding to that chunk's offset. If the chunk
+  does not exist, then the pointer is not modified.
+
+* `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
+  with the appropriate initial pointer and size information. The function
+  is not called if the chunk does not exist. Use this method to read chunks
+  if you need to perform immediate parsing or if you need to execute logic
+  based on the size of the chunk.
+
+After calling these methods, call `free_chunkfile()` to clear the
+`struct chunkfile` data. This will not close the memory-mapped region.
+Callers are expected to own that data for the timeframe the pointers into
+the region are needed.
+
+Examples
+--------
+
+These file formats use the chunk-format API, and can be used as examples
+for future formats:
+
+* *commit-graph:* see `write_commit_graph_file()` and `parse_commit_graph()`
+  in `commit-graph.c` for how the chunk-format API is used to write and
+  parse the commit-graph file format documented in
+  link:technical/commit-graph-format.html[the commit-graph file format].
+
+* *multi-pack-index:* see `write_midx_internal()` and `load_multi_pack_index()`
+  in `midx.c` for how the chunk-format API is used to write and
+  parse the multi-pack-index file format documented in
+  link:technical/pack-format.html[the multi-pack-index file format].
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index b6658eff1882..87971c27dd73 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -61,6 +61,9 @@ CHUNK LOOKUP:
       the length using the next chunk position if necessary.) Each chunk
       ID appears at most once.
 
+  The CHUNK LOOKUP matches the table of contents from
+  link:technical/chunk-format.html[the chunk-based file format].
+
   The remaining data in the body is described one chunk at a time, and
   these chunks may be given in any order. Chunks are required unless
   otherwise specified.
diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index f96b2e605f34..2fb1e60d29ec 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -301,6 +301,9 @@ CHUNK LOOKUP:
 	    (Chunks are provided in file-order, so you can infer the length
 	    using the next chunk position if necessary.)
 
+	The CHUNK LOOKUP matches the table of contents from
+	link:technical/chunk-format.html[the chunk-based file format].
+
 	The remaining data in the body is described one chunk at a time, and
 	these chunks may be given in any order. Chunks are required unless
 	otherwise specified.
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 02/17] chunk-format: create chunk format write API
  2021-02-05 11:37       ` Derrick Stolee
@ 2021-02-05 19:25         ` Junio C Hamano
  0 siblings, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-02-05 19:25 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, l.s.r, szeder.dev,
	Chris Torek, Derrick Stolee, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

>> But a type cast from size_t to uint64_t done silently by
>> assignment bothers me.
>
> Does this bother you only because its part of the external interface?
> If I understand correctly, uint64_t will always be at least as big
> as size_t, so this doesn't need any protections for overflow or
> anything. Is there something I should be doing before casting?

I am OK to use uint64_t on the caller-facing side, as long as we
explicitly declare that uint64_t ought to be large enough for
everybody.  Struct members and variables that are closer to the
on-disk data need to be of sized type to avoid repeating the pain
caused by our early "unsigned long ought to be large enough for
everybody" attitude, but it is nicer to be working with more
abstract types in the layer higher up.  And if there is a risk of
truncation in either direction, we should be defensive.  That's all.
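
One way to be defensive about truncation in either direction is to
route every narrowing through a checked helper. The sketch below is
only an illustration; u64_to_size() and size_to_u64() are hypothetical
names, not functions from Git or from this series.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of Git): narrow a 64-bit on-disk value
 * to size_t. Returns 0 on success, -1 if the value does not fit.
 */
int u64_to_size(uint64_t value, size_t *out)
{
	if (value > (uint64_t)SIZE_MAX)
		return -1;
	*out = (size_t)value;
	return 0;
}

/*
 * Hypothetical helper: widen size_t to uint64_t. This can only fail on
 * a platform whose size_t is wider than 64 bits.
 */
int size_to_u64(size_t value, uint64_t *out)
{
#if SIZE_MAX > UINT64_MAX
	if (value > UINT64_MAX)
		return -1;
#endif
	*out = (uint64_t)value;
	return 0;
}
```

On a platform with a 64-bit size_t both checks can never fail, so the
cost is nil; with a 32-bit size_t the first helper rejects any value
of 4 GiB or more instead of silently dropping the high bits.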

>>> +int write_chunkfile(struct chunkfile *cf, void *data)
>>> +{
>>> +	int i;
>>> +	size_t cur_offset = cf->f->offset + cf->f->total;
>> 
>> That ought to be off_t, as it is a seek position inside a file
>> (struct hashfile.total is already off_t).
>
> I can use off_t for the other offsets in this computation, but
> cur_offset will be used in hashwrite_be64(), so maybe it is best
> to use uint64_t here?

As I discovered in the later parts, I think off_t makes less sense
than size_t in the context of this design, so size_t is fine, as
long as we keep the "users of chunkfile API work on a mmapped region
of contiguous memory" (which I think is OK).  uint64_t is also fine,
as long as this one is an internal implementation detail; iow,
callers of the API do not have to be responsible for casting their
more abstract and platform neutral types down to these fixed-sized
types even if we choose to use uint64_t here.

> The current chunk format API makes the same assumption (ToC comes
> first) but could be adjusted later to let this part of the method
> dynamically compute the chunk sizes and fill a ToC at the end. The
> way to modify this API would be to add a 'flags' parameter.
>
> So far, this has not been necessary, but might be in the future.

Yup, and I am happy with the current design for now.  Thanks for
clarifying the thinking behind it.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 12/17] chunk-format: create read chunk API
  2021-02-05 12:19       ` Derrick Stolee
@ 2021-02-05 19:37         ` Junio C Hamano
  2021-02-08 22:26           ` Junio C Hamano
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2021-02-05 19:37 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, l.s.r, szeder.dev,
	Chris Torek, Derrick Stolee, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

>>> +	chunk_id = get_be32(table_of_contents);
>>> +	if (chunk_id) {
>>> +		error(_("final chunk has non-zero id %"PRIx32""), chunk_id);
>>> +		return -1;
>>> +	}
>> 
>> Shouldn't we be validating the size component associated with this
>> "id=0" fake chunk that appears at the end as well?

No, please disregard this comment, which was based on my incorrect
understanding of the "size" field associated with this fake ID==0
chunk (I incorrectly thought the size had something to do with the
file header plus TOC, but it is not---it is to allow discovering the
size of the last chunk by being a sentinel that records the offset
of an extra chunk at the end that does not actually exist).
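
The sentinel role of the final row can be sketched as follows. This is
a simplified illustration and not the read_table_of_contents() code
from the series; toc_chunk_size(), read_be32() and read_be64() are
hypothetical names, and the 12-byte row layout (4-byte ID followed by
an 8-byte offset, both big-endian) is the one described in the
proposed technical document.

```c
#include <stddef.h>
#include <stdint.h>

#define TOC_ROW_WIDTH 12 /* 4-byte ID followed by 8-byte offset */

static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t read_be64(const unsigned char *p)
{
	return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/*
 * Hypothetical helper: compute the size of chunk `i` from a table of
 * contents holding `nr_chunks` real rows plus one terminating row.
 * Returns 0 on success, -1 if the table is malformed.
 */
int toc_chunk_size(const unsigned char *toc, uint32_t nr_chunks,
		   uint32_t i, uint64_t *size)
{
	uint64_t start, end;

	if (i >= nr_chunks)
		return -1;
	/* The row after the last real chunk must carry ID 0. */
	if (read_be32(toc + (size_t)nr_chunks * TOC_ROW_WIDTH) != 0)
		return -1;
	start = read_be64(toc + (size_t)i * TOC_ROW_WIDTH + 4);
	/* For the last chunk, the "next" row is the sentinel itself. */
	end = read_be64(toc + ((size_t)i + 1) * TOC_ROW_WIDTH + 4);
	if (end < start)
		return -1;
	*size = end - start;
	return 0;
}
```

The offset stored in the sentinel row is what bounds the last real
chunk, which is why that row needs no size validation of its own
beyond being a sensible end offset.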

> I like this, but why not just use pair_chunk_fn inside of
> the implementation of pair_chunk() so callers have an easy
> interface?

Yes, I didn't realize that earlier design iteration resulted in the
introduction of the "pair_chunk()" after discovering that it often
is necessary to just note the address where the data begins, so I
am OK to leave something like pair_chunk() as a public interface,
and implementing the pair_chunk() helper like you suggest would be a
perfectly fine way to do so.

It however is curious that the callers who use pair_chunk() do not
get the same quality of data as read_chunk() callers.

The users of pair_chunk() presumably are not ready to (or simply do
not want to) process the data immediately by using read_chunk() with
callback, but when they get ready to process the data, unlike
read_chunk callbacks, they do not get to learn how much they ought
to process---all they learn is the address of the beginning of the
chunk.  I do not see a way to write pair_chunk() users safely to
guarantee that they do not overrun at the tail end of the chunk they
are processing.
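
One possible way to address this concern, sketched under assumptions:
a callback with the same shape as the read callbacks in the patches
could record the chunk's length next to its start address, so that
later readers can bounds-check. struct chunk_span, record_chunk_span()
and span_read_entry() are hypothetical names and are not part of the
series; the chunk_read_fn typedef is re-declared locally only to keep
the example self-contained.

```c
#include <stddef.h>
#include <string.h>

/* Same shape as the read callbacks shown in the patches. */
typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
			     size_t chunk_size, void *data);

/* Hypothetical: a chunk's location together with its length. */
struct chunk_span {
	const unsigned char *start;
	size_t size;
};

/* A read callback that only remembers where the chunk is and how long. */
int record_chunk_span(const unsigned char *chunk_start,
		      size_t chunk_size, void *data)
{
	struct chunk_span *span = data;
	span->start = chunk_start;
	span->size = chunk_size;
	return 0;
}

/*
 * Copy fixed-width entry `pos` out of the chunk, refusing any read
 * that would run past the end of the chunk.
 */
int span_read_entry(const struct chunk_span *span, size_t pos,
		    size_t width, unsigned char *out)
{
	if (!span->start || !width || pos >= span->size / width)
		return -1;
	memcpy(out, span->start + pos * width, width);
	return 0;
}
```

Comparing pos against size / width, rather than computing pos * width
first, keeps the bounds check itself free of overflow.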

Thanks.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 15/17] midx: use 64-bit multiplication for chunk sizes
  2021-02-05 12:30       ` Derrick Stolee
@ 2021-02-05 19:42         ` Junio C Hamano
  0 siblings, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-02-05 19:42 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, l.s.r, szeder.dev,
	Chris Torek, Derrick Stolee, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> On 2/4/2021 7:00 PM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> 
>>> From: Derrick Stolee <dstolee@microsoft.com>
>>>
>>> When calculating the sizes of certain chunks, we should use 64-bit
>>> multiplication always. This allows us to properly predict the chunk
>>> sizes without risk of overflow.
>>>
>>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>>> ---
>>>  midx.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>> 
>> This one I find somewhat questionable for multiple reasons.
>> 
>>  * the fourth parameter of add_chunk() is of size_t, not uint64_t;
>>    shouldn't the multiplication be done in type size_t instead?
>
> This is probably appropriate because we will truncate to size_t if
> it is smaller than uint64_t.

In other words, if size_t turns out to be too small, doing
multiplication in uint64_t would not help at all and add_chunk() API
needs its parameter types updated [*].

    side note: I really wish that the language and the compiler
    helped us so that we didn't have to do this---after all, our
    function prototype says the result will be passed as a certain
    type, so it would be nice if the arithmetic to compute that
    result were automatically carried out in a way not to cause
    truncation.
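
Since the language does not do this automatically, the check has to be
spelled out. A minimal sketch of a multiplication carried out in
size_t that reports overflow instead of wrapping is shown here;
size_mult_checked() is a hypothetical name, similar in spirit to the
st_mult() helper in git-compat-util.h but returning an error code
rather than dying.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute a * b in size_t. Returns 0 and stores
 * the product on success, or -1 if the product would not fit.
 */
int size_mult_checked(size_t a, size_t b, size_t *result)
{
	if (b && a > SIZE_MAX / b)
		return -1;
	*result = a * b;
	return 0;
}
```

A caller would pass the entry count and the per-entry width, and only
hand the product on as a size_t argument when the helper returns 0.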

>>  * these multiplications were introduced in "midx: use chunk-format
>>    API in write_midx_internal()"; that step should use the
>>    arithmetic with cast (if necessary) from the start, no?
>
> I wanted to isolate these changes specifically so we could be
> careful about the multiplications and not be distracted by them
> when converting to the chunk-format API. The multiplications were
> "moved" by that patch, not "introduced".

Hmph, I somehow had an impression that they did not have truncation
issue in the original context, but perhaps I was wrong.  OK.
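
For reference, the kind of truncation being discussed can be
reproduced in isolation. The sketch below assumes a platform where
unsigned int is 32 bits wide; ENTRY_WIDTH and both function names are
made up for the illustration and do not come from midx.c.

```c
#include <stdint.h>

#define ENTRY_WIDTH 8u /* hypothetical: bytes per entry, an unsigned int */

/* Product evaluated in 32 bits: wraps once it reaches 2^32. */
uint64_t chunk_size_narrow(uint32_t nr)
{
	return nr * ENTRY_WIDTH;
}

/* One operand is widened first, so the product is evaluated in 64 bits. */
uint64_t chunk_size_wide(uint32_t nr)
{
	return (uint64_t)nr * ENTRY_WIDTH;
}
```

Under that assumption, 0x20000000 entries of 8 bytes give 0 from the
first function and 0x100000000 from the second. The return type does
not help the first version, because the wrap happens before the
result is converted.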

> I did a full inspection of all multiplications in midx.c and
> found a few more instances of possible overflow. Two are on the
> read side, but they require the object lookup chunk to have size
> 4 GB or larger. This is not _that_ far off from possibility! My
> multi-pack-index for the Windows repository is currently ~1.6 GB
> (in total, including the other chunks).

Thanks.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v2 15/17] midx: use 64-bit multiplication for chunk sizes
  2021-02-05 10:59       ` Chris Torek
@ 2021-02-05 20:41         ` Junio C Hamano
  2021-02-06 20:35           ` Chris Torek
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2021-02-05 20:41 UTC (permalink / raw)
  To: Chris Torek
  Cc: Derrick Stolee via GitGitGadget, Git List, Taylor Blau,
	René Scharfe, szeder.dev, Derrick Stolee, Derrick Stolee,
	Derrick Stolee

Chris Torek <chris.torek@gmail.com> writes:

> On Thu, Feb 4, 2021 at 4:00 PM Junio C Hamano <gitster@pobox.com> wrote:
>>  * the fourth parameter of add_chunk() is of size_t, not uint64_t;
>>    shouldn't the multiplication be done in type size_t instead?
>
> There are (still) systems with 32-bit size_t (but 64-bit
> off_t / file sizes), so ... probably not.  Is size_t ever more than
> 64 bits these days?

Sorry, you lost me.  I do not see how it would help to perform the
multiplication in uint64_t, when you suspect that size_t is too
small, if the final destination of the result of the multiplication
is a function argument of type size_t?


* Re: [PATCH v2 15/17] midx: use 64-bit multiplication for chunk sizes
  2021-02-05 20:41         ` Junio C Hamano
@ 2021-02-06 20:35           ` Chris Torek
  0 siblings, 0 replies; 120+ messages in thread
From: Chris Torek @ 2021-02-06 20:35 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, Git List, Taylor Blau,
	René Scharfe, szeder.dev, Derrick Stolee, Derrick Stolee,
	Derrick Stolee

On Fri, Feb 5, 2021 at 12:41 PM Junio C Hamano <gitster@pobox.com> wrote:
> Chris Torek <chris.torek@gmail.com> writes:
> > There are (still) systems with 32-bit size_t (but 64-bit
> > off_t / file sizes), so ... probably not.  Is size_t ever more than
> > 64 bits these days?
>
> Sorry, you lost me.  I do not see how it would help to perform the
> multiplication in uint64_t, when you suspect that size_t is too
> small, if the final destination of the result of the multiplication
> is a function argument of type size_t?

No, you and Derrick Stolee are right, I wasn't looking out far enough
here (to the actual function).

(I was wondering though if there are systems where the valid range
for size_t could exceed that for off_t.  Are there still systems
using 32-bit off_t?  Sometimes I think there are too many abstracted
types running around here -- how do we know which sizes are big
enough?  There is always uintmax_t, though, and for unsigned
types, ((T)-1) gets you the maximum possible value.)
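
The ((T)-1) idiom can be sketched as follows.  The macro and function
names are made up for illustration; the check itself only relies on
the conversion of -1 to an unsigned type yielding that type's maximum.

```c
#include <stddef.h>
#include <stdint.h>

/* Maximum value of an unsigned type T: -1 converts to 2^N - 1. */
#define UNSIGNED_MAX_OF(T) ((T)-1)

/* Return 1 if 'v' can be stored in a size_t without truncation. */
static int fits_in_size_t(uint64_t v)
{
	return v <= UNSIGNED_MAX_OF(size_t);
}
```

On a platform with a 32-bit size_t, fits_in_size_t() rejects any value
of 4 GB or more; with a 64-bit size_t it accepts every uint64_t.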

Chris


* Re: [PATCH v2 15/17] midx: use 64-bit multiplication for chunk sizes
  2021-02-05  0:00     ` Junio C Hamano
  2021-02-05 10:59       ` Chris Torek
  2021-02-05 12:30       ` Derrick Stolee
@ 2021-02-07 19:50       ` SZEDER Gábor
  2021-02-08  5:41         ` Junio C Hamano
  2 siblings, 1 reply; 120+ messages in thread
From: SZEDER Gábor @ 2021-02-07 19:50 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, me, l.s.r, Chris Torek,
	Derrick Stolee, Derrick Stolee, Derrick Stolee

On Thu, Feb 04, 2021 at 04:00:19PM -0800, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Derrick Stolee <dstolee@microsoft.com>
> >
> > When calculating the sizes of certain chunks, we should use 64-bit
> > multiplication always. This allows us to properly predict the chunk
> > sizes without risk of overflow.
> >
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> > ---
> >  midx.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> This one I find somewhat questionable for multiple reasons.
> 
>  * the fourth parameter of add_chunk() is of size_t, not uint64_t;
>    shouldn't the multiplication be done in type size_t instead?
> 
>  * these multiplications were introduced in "midx: use chunk-format
>    API in write_midx_internal()";

No, that patch also removes lines like: 

-       chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * the_hash_algo->rawsz;

-               chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
-                                          ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;

So those potentially problematic multiplications were already there
before this series, and in fact trace all the way back to the initial
midx patch series (commits 0d5b3a5ef7 (midx: write object ids in a 
chunk, 2018-07-12) and 662148c435 (midx: write object offsets,
2018-07-12)).

>    that step should use the
>    arithmetic with cast (if necessary) from the start, no?

As it fixes a long-standing issue, it should rather be a bugfix patch
at the beginning of the series.



* Re: [PATCH v3 12/17] chunk-format: create read chunk API
  2021-02-05 14:30     ` [PATCH v3 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
@ 2021-02-07 20:20       ` SZEDER Gábor
  2021-02-08 13:35         ` Derrick Stolee
  0 siblings, 1 reply; 120+ messages in thread
From: SZEDER Gábor @ 2021-02-07 20:20 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, l.s.r, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

On Fri, Feb 05, 2021 at 02:30:47PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> Add the capability to read the table of contents, then pair the chunks
> with necessary logic using read_chunk_fn pointers. Callers will be added
> in future changes, but the typical outline will be:
> 
>  1. initialize a 'struct chunkfile' with init_chunkfile(NULL).
>  2. call read_table_of_contents().

A reader should call read_table_of_contents(), noted.

>  3. for each chunk to parse,
>     a. call pair_chunk() to assign a pointer with the chunk position, or
>     b. call read_chunk() to run a callback on the chunk start and size.
>  4. call free_chunkfile() to clear the 'struct chunkfile' data.

How could a user of this API learn about all chunks present in the
chunkfile, including unrecognized chunks?

> We are re-using the anonymous 'struct chunkfile' data, as it is internal
> to the chunk-format API. This gives it essentially two modes: write and
> read. If the same struct instance was used for both reads and writes,
> then there would be failures.
> 
> Helped-by: Junio C Hamano <gitster@pobox.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

> diff --git a/chunk-format.h b/chunk-format.h
> index 9a1d770accec..0edcc57db4e7 100644
> --- a/chunk-format.h
> +++ b/chunk-format.h
> @@ -6,6 +6,19 @@
>  struct hashfile;
>  struct chunkfile;
>  
> +/*
> + * Initialize a 'struct chunkfile' for writing _or_ reading a file
> + * with the chunk format.
> + *
> + * If writing a file, supply a non-NULL 'struct hashfile *' that will
> + * be used to write.
> + *
> + * If reading a file, then supply the memory-mapped data to the
> + * pair_chunk() or read_chunk() methods, as appropriate.

And call read_table_of_contents() in between.

> + *
> + * DO NOT MIX THESE MODES. Use different 'struct chunkfile' instances
> + * for reading and writing.
> + */
>  struct chunkfile *init_chunkfile(struct hashfile *f);
>  void free_chunkfile(struct chunkfile *cf);
>  int get_num_chunks(struct chunkfile *cf);
> @@ -16,4 +29,37 @@ void add_chunk(struct chunkfile *cf,
>  	       chunk_write_fn fn);
>  int write_chunkfile(struct chunkfile *cf, void *data);
>  
> +int read_table_of_contents(struct chunkfile *cf,
> +			   const unsigned char *mfile,
> +			   size_t mfile_size,
> +			   uint64_t toc_offset,
> +			   int toc_length);
> +
> +#define CHUNK_NOT_FOUND (-2)
> +
> +/*
> + * Find 'chunk_id' in the given chunkfile and assign the
> + * given pointer to the position in the mmap'd file where
> + * that chunk begins.
> + *
> + * Returns CHUNK_NOT_FOUND if the chunk does not exist.
> + */
> +int pair_chunk(struct chunkfile *cf,
> +	       uint32_t chunk_id,
> +	       const unsigned char **p);
> +
> +typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
> +			     size_t chunk_size, void *data);
> +/*
> + * Find 'chunk_id' in the given chunkfile and call the
> + * given chunk_read_fn method with the information for
> + * that chunk.
> + *
> + * Returns CHUNK_NOT_FOUND if the chunk does not exist.
> + */
> +int read_chunk(struct chunkfile *cf,
> +	       uint32_t chunk_id,
> +	       chunk_read_fn fn,
> +	       void *data);
> +
>  #endif
> -- 
> gitgitgadget
> 


* Re: [PATCH v3 02/17] chunk-format: create chunk format write API
  2021-02-05 14:30     ` [PATCH v3 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
@ 2021-02-07 21:13       ` SZEDER Gábor
  2021-02-08 13:44         ` Derrick Stolee
  0 siblings, 1 reply; 120+ messages in thread
From: SZEDER Gábor @ 2021-02-07 21:13 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, l.s.r, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

> diff --git a/chunk-format.c b/chunk-format.c
> new file mode 100644
> index 000000000000..6e0f1900213e
> --- /dev/null
> +++ b/chunk-format.c
> @@ -0,0 +1,91 @@
> +#include "cache.h"
> +#include "chunk-format.h"
> +#include "csum-file.h"
> +#define CHUNK_LOOKUP_WIDTH 12

As this macro is defined in 'chunk-format.c' it's not part of the
chunkfile API.  However, at the end of this patch series
'commit-graph.c' still contains:

  #define GRAPH_CHUNKLOOKUP_WIDTH 12

and uses it in a couple of safety checks (that didn't become part of
the common chunkfile module; why?), while 'midx.c' contains:

  #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))

though it's not used anymore.

I think we should have only one such constant as part of the chunkfile
API; and preferably use the definition from 'midx.c' as it is more
informative than yet another magic number.

Furthermore, being called 'CHUNK_LOOKUP_WIDTH', I had to look up the
places where this constant is used to make sure that it indeed means
what I suspect it means.  Perhaps CHUNK_TOC_ENTRY_SIZE would be a more
descriptive name for this constant.
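
A minimal sketch of what such a shared definition could look like,
assuming the entry layout stays a 4-byte chunk id followed by an
8-byte offset.  toc_size() is a hypothetical helper used only to show
the constant in use; the "+ 1" accounts for the terminating entry.

```c
#include <stddef.h>
#include <stdint.h>

/* One table-of-contents entry: 4-byte chunk id + 8-byte offset. */
#define CHUNK_TOC_ENTRY_SIZE (sizeof(uint32_t) + sizeof(uint64_t))

/* Bytes used by the table of contents, including the terminating entry. */
static uint64_t toc_size(uint32_t chunks_nr)
{
	return ((uint64_t)chunks_nr + 1) * CHUNK_TOC_ENTRY_SIZE;
}
```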


On a somewhat related note: 'commit-graph.c' and 'midx.c' still
contain the constants MAX_NUM_CHUNKS and MIDX_MAX_CHUNKS,
respectively, but neither of them is used anymore.



* Re: [PATCH v2 15/17] midx: use 64-bit multiplication for chunk sizes
  2021-02-07 19:50       ` SZEDER Gábor
@ 2021-02-08  5:41         ` Junio C Hamano
  0 siblings, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-02-08  5:41 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Derrick Stolee via GitGitGadget, git, me, l.s.r, Chris Torek,
	Derrick Stolee, Derrick Stolee, Derrick Stolee

SZEDER Gábor <szeder.dev@gmail.com> writes:

> No, that patch also removes lines like: 
>
> -       chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * the_hash_algo->rawsz;
>
> -               chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
> -                                          ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;

OK.  In other words, the above was replaced in the same patch with

    add_chunk(...., U32 * U32);

where the called function expects the result of the multiplication
as size_t in its function prototype.  It is a bit sad that U32*U32
to compute the argument that is to be passed as U64 must be cast
as (uint64_t)U32*U32 by the caller X-<.

The original that the above replaced, shown in your quote, is:

    U64 = U64 + U32 * U32;

I also wish that the fact that it is added to U64 is sufficient not
to require the RHS to be written as U64 + (uint64_t) U32 * U32 (in
other words, the original that was removed was OK without cast).

Sad.
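
To make the difference concrete, here is a small sketch with
hypothetical function names.  It assumes unsigned int is 32 bits wide,
as on the usual platforms, so that the uncast product wraps modulo
2^32 before it is added to the 64-bit base.

```c
#include <stdint.h>

/* Product computed in 32 bits: wraps modulo 2^32 before the addition. */
static uint64_t offset_unwidened(uint64_t base, uint32_t nr, uint32_t width)
{
	return base + nr * width;
}

/* Casting one operand first makes the whole product 64-bit. */
static uint64_t offset_widened(uint64_t base, uint32_t nr, uint32_t width)
{
	return base + (uint64_t)nr * width;
}
```

With nr = 300000000 and width = 20, the exact product is 6000000000,
which does not fit in 32 bits, so only the second function returns the
intended offset.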



* Re: [PATCH v3 12/17] chunk-format: create read chunk API
  2021-02-07 20:20       ` SZEDER Gábor
@ 2021-02-08 13:35         ` Derrick Stolee
  0 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee @ 2021-02-08 13:35 UTC (permalink / raw)
  To: SZEDER Gábor, Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, l.s.r, Chris Torek, Derrick Stolee, Derrick Stolee

On 2/7/2021 3:20 PM, SZEDER Gábor wrote:
> On Fri, Feb 05, 2021 at 02:30:47PM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> Add the capability to read the table of contents, then pair the chunks
>> with necessary logic using read_chunk_fn pointers. Callers will be added
>> in future changes, but the typical outline will be:
>>
>>  1. initialize a 'struct chunkfile' with init_chunkfile(NULL).
>>  2. call read_table_of_contents().
> 
> A reader should call read_table_of_contents(), noted.
> 
>>  3. for each chunk to parse,
>>     a. call pair_chunk() to assign a pointer with the chunk position, or
>>     b. call read_chunk() to run a callback on the chunk start and size.
>>  4. call free_chunkfile() to clear the 'struct chunkfile' data.
> 
> How could a user of this API learn about all chunks present in the
> chunkfile, including unrecognized chunks?

That could certainly be added (when needed) without modifying the data
structures.
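
As a rough illustration of how such an addition might work, the sketch
below walks a table of contents directly and reports every chunk,
recognized or not.  Everything here is hypothetical and stands alone:
the function names, the callback type and the tally struct are not
part of the posted series, and the only assumption taken from it is
the layout of 12-byte entries (big-endian 4-byte id, 8-byte offset)
followed by one terminating entry.

```c
#include <stddef.h>
#include <stdint.h>

#define TOC_ENTRY_SIZE (sizeof(uint32_t) + sizeof(uint64_t))

typedef int (*each_chunk_fn)(uint32_t id, const unsigned char *start,
			     size_t size, void *data);

static uint32_t load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t load_be64(const unsigned char *p)
{
	return ((uint64_t)load_be32(p) << 32) | load_be32(p + 4);
}

/*
 * Visit every chunk listed in the table of contents.  Returns -1 on
 * malformed input, otherwise the first non-zero callback result or 0.
 */
static int for_each_chunk_sketch(const unsigned char *mfile, size_t mfile_size,
				 uint64_t toc_offset, int toc_length,
				 each_chunk_fn fn, void *data)
{
	int i;

	/* The table must hold toc_length entries plus the terminator. */
	if (toc_length < 0 || toc_offset > mfile_size ||
	    (mfile_size - toc_offset) / TOC_ENTRY_SIZE <= (uint64_t)toc_length)
		return -1;

	for (i = 0; i < toc_length; i++) {
		const unsigned char *e = mfile + toc_offset + i * TOC_ENTRY_SIZE;
		uint64_t start = load_be64(e + 4);
		/* A chunk ends where the next entry's chunk begins. */
		uint64_t end = load_be64(e + TOC_ENTRY_SIZE + 4);
		int ret;

		if (start > end || end > mfile_size)
			return -1;
		ret = fn(load_be32(e), mfile + start, (size_t)(end - start), data);
		if (ret)
			return ret;
	}
	return 0;
}

/* Example callback: count the chunks and add up their sizes. */
struct chunk_tally {
	int nr;
	size_t bytes;
};

static int tally_chunk(uint32_t id, const unsigned char *start,
		       size_t size, void *data)
{
	struct chunk_tally *t = data;

	(void)id;
	(void)start;
	t->nr++;
	t->bytes += size;
	return 0;
}
```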

>> +/*
>> + * Initialize a 'struct chunkfile' for writing _or_ reading a file
>> + * with the chunk format.
>> + *
>> + * If writing a file, supply a non-NULL 'struct hashfile *' that will
>> + * be used to write.
>> + *
>> + * If reading a file, then supply the memory-mapped data to the
>> + * pair_chunk() or read_chunk() methods, as appropriate.
> 
> And call read_table_of_contents() in between.

Yes, you are right.
 
Thanks,
-Stolee


* Re: [PATCH v3 02/17] chunk-format: create chunk format write API
  2021-02-07 21:13       ` SZEDER Gábor
@ 2021-02-08 13:44         ` Derrick Stolee
  2021-02-11 19:43           ` SZEDER Gábor
  0 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee @ 2021-02-08 13:44 UTC (permalink / raw)
  To: SZEDER Gábor, Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, l.s.r, Chris Torek, Derrick Stolee, Derrick Stolee

On 2/7/2021 4:13 PM, SZEDER Gábor wrote:
>> +#define CHUNK_LOOKUP_WIDTH 12
> 
> As this macro is defined in 'chunk-format.c' it's not part of the
> chunkfile API.  However, at the end of this patch series
> 'commit-graph.c' still contains:
> 
>   #define GRAPH_CHUNKLOOKUP_WIDTH 12
> 
> and uses it in a couple of safety checks (that didn't become part of
> the common chunkfile module; why?),

Chunk-based files don't have a minimum size unless we know the header
size and a minimum number of required chunks. I suppose that we could
add this in the future to further simplify consumers of the API.

> while 'midx.c' contains:
> 
>   #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
> 
> though it's not used anymore.
> 
> I think we should have only one such constant as part of the chunkfile
> API; and preferably use the definition from 'midx.c' as it is more
> informative than yet another magic number.
> 
> Furthermore, being called 'CHUNK_LOOKUP_WIDTH', I had to look up the
> places where this constant is used to make sure that it indeed means
> what I suspect it means.  Perhaps CHUNK_TOC_ENTRY_SIZE would be a more
> descriptive name for this constant.

More descriptive, for sure.

> On a somewhat related note: 'commit-graph.c' and 'midx.c' still
> contain the constants MAX_NUM_CHUNKS and MIDX_MAX_CHUNKS,
> respectively, but neither of them is used anymore.

Thanks. The following patch can be added on top of this series
to clean up these dangling macros.

Thanks,
-Stolee

--- >8 ---

From 839b880ccee65eac63e8b77b12fab6531acc55b0 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Mon, 8 Feb 2021 08:38:47 -0500
Subject: [PATCH] chunk-format: remove outdated macro constants
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The following macros were needed by midx.c and commit-graph.c to handle
their independent implementations of the chunk-based file format, but
now the chunk-format API makes them obsolete:

* MAX_NUM_CHUNKS
* MIDX_MAX_CHUNKS
* MIDX_CHUNKLOOKUP_WIDTH

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 1 -
 midx.c         | 2 --
 2 files changed, 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 32cf5091d2f..3b5a8767269 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -45,7 +45,6 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 9
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
diff --git a/midx.c b/midx.c
index 95648a1f368..5c7f2ed2333 100644
--- a/midx.c
+++ b/midx.c
@@ -22,14 +22,12 @@
 #define MIDX_HEADER_SIZE 12
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + the_hash_algo->rawsz)
 
-#define MIDX_MAX_CHUNKS 5
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
 #define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
 #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
 #define MIDX_CHUNKID_LARGEOFFSETS 0x4c4f4646 /* "LOFF" */
-#define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
 #define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
 #define MIDX_CHUNK_OFFSET_WIDTH (2 * sizeof(uint32_t))
 #define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
-- 
2.30.0.vfs.0.0.exp



* Re: [PATCH v2 12/17] chunk-format: create read chunk API
  2021-02-05 19:37         ` Junio C Hamano
@ 2021-02-08 22:26           ` Junio C Hamano
  2021-02-09  1:33             ` Derrick Stolee
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2021-02-08 22:26 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, l.s.r, szeder.dev,
	Chris Torek, Derrick Stolee, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

> The users of pair_chunk() presumably are not ready to (or simply do
> not want to) process the data immediately by using read_chunk() with
> callback, but when they get ready to process the data, unlike
> read_chunk callbacks, they do not get to learn how much they ought
> to process---all they learn is the address of the beginning of the
> chunk.  I do not see a way to write pair_chunk() users safely to
> guarantee that they do not overrun at the tail end of the chunk they
> are processing.

I've read through v3 and found it mostly done, but the above
question still stands.  I find it questionable why callers of
pair_chunk() can only learn where the chunk data begins, without
being able to learn how big the region of memory is.  IOW, why
can we get away without doing something like this?  The users
of pair_chunk() won't even know when they overrun the end of
the data they are given without something like this, no?

Thanks.

+struct memory_region {
+	const unsigned char *p;
+	size_t sz;
+};
+
 static int pair_chunk_fn(const unsigned char *chunk_start,
                          size_t chunk_size,
                          void *data)
 {
-        const unsigned char **p = data;
-        *p = chunk_start;
+        struct memory_region *x = data;
+        x->p = chunk_start;
+        x->sz = chunk_size;
         return 0;
 }



 int pair_chunk(struct chunkfile *cf,
                uint32_t chunk_id,
-                const unsigned char **p)
+                const unsigned char **p,
+                size_t *sz)
 {
+        int ret;
+        struct memory_region x;
-        return read_chunk(cf, chunk_id, pair_chunk_fn, &x);
+        ret = read_chunk(cf, chunk_id, pair_chunk_fn, &x);
+        *p = x.p;
+        *sz = x.sz;
+        return ret;
 }



* Re: [PATCH v2 12/17] chunk-format: create read chunk API
  2021-02-08 22:26           ` Junio C Hamano
@ 2021-02-09  1:33             ` Derrick Stolee
  2021-02-09 20:47               ` Junio C Hamano
  0 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee @ 2021-02-09  1:33 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, me, l.s.r, szeder.dev,
	Chris Torek, Derrick Stolee, Derrick Stolee

On 2/8/2021 5:26 PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
>> The users of pair_chunk() presumably are not ready to (or simply do
>> not want to) process the data immediately by using read_chunk() with
>> callback, but when they get ready to process the data, unlike
>> read_chunk callbacks, they do not get to learn how much they ought
>> to process---all they learn is the address of the beginning of the
>> chunk.  I do not see a way to write pair_chunk() users safely to
>> guarantee that they do not overrun at the tail end of the chunk they
>> are processing.
> 
> I've read through v3 and found it mostly done, but the above
> question still stands.  I find it questionable why callers of
> pair_chunk() can only learn where the chunk data begins, without
> being able to learn how big the region of memory is.  IOW, why
> can we get away without doing something like this?  The users
> of pair_chunk() won't even know when they overrun the end of
> the data they are given without something like this, no?

I guess that the point is that if a caller wants to perform
logic on the size, then they should use read_chunk() instead.
We have some chunks that check the size is correct upon read,
but most chunks do not do this (currently).

In future series, additional protections could be added, and
I would expect that to be done by converting callers of
pair_chunk() into callers of read_chunk() with appropriate
callback functions.
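
A sketch of what such a converted callback might look like, using the
chunk_read_fn signature from chunk-format.h.  The struct and function
names are hypothetical and the real reader structs in midx.c differ;
the expected size mirrors MIDX_CHUNK_FANOUT_SIZE.

```c
#include <stddef.h>
#include <stdint.h>

#define FANOUT_SIZE (sizeof(uint32_t) * 256)

/* Hypothetical reader state that the callback fills in. */
struct reader_sketch {
	const unsigned char *fanout;
};

/* Has the same shape as chunk_read_fn: start, size, caller data. */
static int read_fanout_sketch(const unsigned char *chunk_start,
			      size_t chunk_size, void *data)
{
	struct reader_sketch *r = data;

	if (chunk_size != FANOUT_SIZE)
		return -1; /* reject a chunk of unexpected size */
	r->fanout = chunk_start;
	return 0;
}
```

Such a callback would be handed to read_chunk() together with a
pointer to the reader state as the 'data' argument.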

Thanks,
-Stolee


* Re: [PATCH v2 12/17] chunk-format: create read chunk API
  2021-02-09  1:33             ` Derrick Stolee
@ 2021-02-09 20:47               ` Junio C Hamano
  0 siblings, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-02-09 20:47 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, l.s.r, szeder.dev,
	Chris Torek, Derrick Stolee, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> In future series, additional protections could be added, and
> I would expect that to be done by converting callers of
> pair_chunk() into callers of read_chunk() with appropriate
> callback functions.

I am perfectly OK with leaving the interface as-is, as apparently it
is sufficient for the current callers.  

It was just that it looked insufficient for future callers that do not
want to use the callback interface, i.e. learn the <ptr, size> pair
with a single API call and then handle the data in that region
itself, instead of preparing a callback function and calling
read_chunk() on it.  As C does not have closures, it gets quickly
cumbersome if such a caller wants to share pieces of information
with the callback function that is given to read_chunk(), but a
caller that learns <ptr,size> from pair_chunk() and then works on
the region of memory itself does not have to worry about that.
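
For illustration, a caller holding a <ptr,size> pair could guard each
access along these lines.  chunk_record() is a hypothetical helper,
not something in the series; it assumes the chunk is an array of
fixed-width records.

```c
#include <stddef.h>

/*
 * Return a pointer to fixed-width record number 'pos' inside a chunk,
 * or NULL if that record would extend past the end of the chunk.
 */
static const unsigned char *chunk_record(const unsigned char *chunk,
					 size_t chunk_size,
					 size_t width, size_t pos)
{
	/* pos < chunk_size / width implies (pos + 1) * width <= chunk_size */
	if (!width || pos >= chunk_size / width)
		return NULL;
	return chunk + pos * width;
}
```

Without the size, the caller has nothing to compare 'pos' against, and
an out-of-range lookup would read past the end of the chunk.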



* Re: [PATCH v3 02/17] chunk-format: create chunk format write API
  2021-02-08 13:44         ` Derrick Stolee
@ 2021-02-11 19:43           ` SZEDER Gábor
  0 siblings, 0 replies; 120+ messages in thread
From: SZEDER Gábor @ 2021-02-11 19:43 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, l.s.r,
	Chris Torek, Derrick Stolee, Derrick Stolee

On Mon, Feb 08, 2021 at 08:44:06AM -0500, Derrick Stolee wrote:
> On 2/7/2021 4:13 PM, SZEDER Gábor wrote:
> >> +#define CHUNK_LOOKUP_WIDTH 12
> > 
> > As this macro is defined in 'chunk-format.c' it's not part of the
> > chunkfile API.  However, at the end of this patch series
> > 'commit-graph.c' still contains:
> > 
> >   #define GRAPH_CHUNKLOOKUP_WIDTH 12
> > 
> > and uses it in a couple of safety checks (that didn't become part of
> > the common chunkfile module; why?),
> 
> Chunk-based files don't have a minimum size unless we know the header
> size and a minimum number of required chunks. I suppose that we could
> add this in the future to further simplify consumers of the API.
> 
> > while 'midx.c' contains:
> > 
> >   #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
> > 
> > though it's not used anymore.
> > 
> > I think we should have only one such constant as part of the chunkfile
> > API; and preferably use the definition from 'midx.c' as it is more
> > informative than yet another magic number.
> > 
> > Furthermore, being called 'CHUNK_LOOKUP_WIDTH', I had to look up the
> > places where this constant is used to make sure that it indeed means
> > what I suspect it means.  Perhaps CHUNK_TOC_ENTRY_SIZE would be a more
> > descriptive name for this constant.
> 
> More descriptive, for sure.
> 
> > On a somewhat related note: 'commit-graph.c' and 'midx.c' still
> > contain the constants MAX_NUM_CHUNKS and MIDX_MAX_CHUNKS,
> > respectively, but neither of them is used anymore.
> 
> Thanks. The following patch can be added on top of this series
> to clean up these dangling macros.

It would be better to squash this into the patches that removed the
last uses of each of those constants.

And it still leaves the magic number '12' duplicated in
'commit-graph.c' and 'chunk-format.c'.

> Thanks,
> -Stolee
> 
> --- >8 ---
> 
> From 839b880ccee65eac63e8b77b12fab6531acc55b0 Mon Sep 17 00:00:00 2001
> From: Derrick Stolee <dstolee@microsoft.com>
> Date: Mon, 8 Feb 2021 08:38:47 -0500
> Subject: [PATCH] chunk-format: remove outdated macro constants
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> The following macros were needed by midx.c and commit-graph.c to handle
> their independent implementations of the chunk-based file format, but
> now the chunk-format API makes them obsolete:
> 
> * MAX_NUM_CHUNKS
> * MIDX_MAX_CHUNKS
> * MIDX_CHUNKLOOKUP_WIDTH
> 
> Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 1 -
>  midx.c         | 2 --
>  2 files changed, 3 deletions(-)
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index 32cf5091d2f..3b5a8767269 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -45,7 +45,6 @@ void git_test_write_commit_graph_or_die(void)
>  #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
>  #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
>  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> -#define MAX_NUM_CHUNKS 9
>  
>  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>  
> diff --git a/midx.c b/midx.c
> index 95648a1f368..5c7f2ed2333 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -22,14 +22,12 @@
>  #define MIDX_HEADER_SIZE 12
>  #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + the_hash_algo->rawsz)
>  
> -#define MIDX_MAX_CHUNKS 5
>  #define MIDX_CHUNK_ALIGNMENT 4
>  #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
>  #define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
>  #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>  #define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
>  #define MIDX_CHUNKID_LARGEOFFSETS 0x4c4f4646 /* "LOFF" */
> -#define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
>  #define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
>  #define MIDX_CHUNK_OFFSET_WIDTH (2 * sizeof(uint32_t))
>  #define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
> -- 
> 2.30.0.vfs.0.0.exp
> 


* [PATCH v4 00/17] Refactor chunk-format into an API
  2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (16 preceding siblings ...)
  2021-02-05 14:30     ` [PATCH v3 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07     ` Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
                         ` (16 more replies)
  17 siblings, 17 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee

This is a restart on the topic previously submitted [1] but dropped because
ak/corrected-commit-date was still in progress. This version is based on
that branch.

[1]
https://lore.kernel.org/git/pull.804.git.1607012215.gitgitgadget@gmail.com/

This version also changes the approach to use a more dynamic interaction
with a struct chunkfile pointer. This idea is credited to Taylor Blau [2],
but I started again from scratch. I also go further to make struct chunkfile
anonymous to API consumers. It is defined only in chunk-format.c, which
should hopefully deter future users from interacting with that data
directly.

[2] https://lore.kernel.org/git/X8%2FI%2FRzXZksio+ri@nand.local/

This combined API is beneficial to reduce duplicated logic. Or rather, to
ensure that similar file formats have similar protections against bad data.
The multi-pack-index code did not have as many guards as the commit-graph
code did, but now they both share a common base that checks for things like
duplicate chunks or offsets outside the size of the file.

Here are some stats for the end-to-end change:

 * 570 insertions(+), 456 deletions(-).
 * commit-graph.c: 107 insertions(+), 192 deletions(-)
 * midx.c: 164 insertions(+), 260 deletions(-)

While there is an overall increase to the code size, the consumers do get
smaller. Boilerplate things like abstracting methods to match chunk_write_fn
and chunk_read_fn make up a lot of these insertions. The "interesting" code
gets a lot smaller and cleaner.


Updates in V4
=============

 * Out-of-date macros in commit-graph.c and midx.c are removed in their
   appropriate patches.
 * Documentation around the read API is improved.


Updates in V3
=============

 * API methods use better types and changed their order to match internal
   data more closely.

 * Use hashfile_total() instead of internal data values.

 * The implementation of pair_chunk() uses read_chunk().

 * init_chunkfile() has an in-code doc comment warning against using the
   same struct chunkfile for reads and writes.

 * More multiplications are correctly cast in midx.c.

 * The chunk-format technical docs are expanded.


Updates in V2
=============

 * The method pair_chunk() now automatically sets a pointer while
   read_chunk() uses the callback. This greatly reduces the code size.

 * Pointer casts are now implicit instead of explicit.

 * Extra care is taken to not overflow when verifying chunk sizes on write.

Thanks, -Stolee

Derrick Stolee (17):
  commit-graph: anonymize data in chunk_write_fn
  chunk-format: create chunk format write API
  commit-graph: use chunk-format write API
  midx: rename pack_info to write_midx_context
  midx: use context in write_midx_pack_names()
  midx: add entries to write_midx_context
  midx: add pack_perm to write_midx_context
  midx: add num_large_offsets to write_midx_context
  midx: return success/failure in chunk write methods
  midx: drop chunk progress during write
  midx: use chunk-format API in write_midx_internal()
  chunk-format: create read chunk API
  commit-graph: use chunk-format read API
  midx: use chunk-format read API
  midx: use 64-bit multiplication for chunk sizes
  chunk-format: restore duplicate chunk checks
  chunk-format: add technical docs

 Documentation/technical/chunk-format.txt      | 116 +++++
 .../technical/commit-graph-format.txt         |   3 +
 Documentation/technical/pack-format.txt       |   3 +
 Makefile                                      |   1 +
 chunk-format.c                                | 179 ++++++++
 chunk-format.h                                |  68 +++
 commit-graph.c                                | 305 +++++-------
 midx.c                                        | 433 +++++++-----------
 t/t5318-commit-graph.sh                       |   2 +-
 t/t5319-multi-pack-index.sh                   |   6 +-
 10 files changed, 652 insertions(+), 464 deletions(-)
 create mode 100644 Documentation/technical/chunk-format.txt
 create mode 100644 chunk-format.c
 create mode 100644 chunk-format.h


base-commit: 5a3b130cad0d5c770f766e3af6d32b41766374c0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-848%2Fderrickstolee%2Fchunk-format%2Frefactor-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-848/derrickstolee/chunk-format/refactor-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/848

Range-diff vs v3:

  1:  243dcec94368 =  1:  243dcec94368 commit-graph: anonymize data in chunk_write_fn
  2:  16c37d2370cf !  2:  da1fcc683576 chunk-format: create chunk format write API
     @@ chunk-format.c (new)
      +#include "cache.h"
      +#include "chunk-format.h"
      +#include "csum-file.h"
     -+#define CHUNK_LOOKUP_WIDTH 12
      +
      +/*
      + * When writing a chunk-based file format, collect the chunks in
     @@ chunk-format.c (new)
      +	uint64_t cur_offset = hashfile_total(cf->f);
      +
      +	/* Add the table of contents to the current offset */
     -+	cur_offset += (cf->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH;
     ++	cur_offset += (cf->chunks_nr + 1) * CHUNK_TOC_ENTRY_SIZE;
      +
      +	for (i = 0; i < cf->chunks_nr; i++) {
      +		hashwrite_be32(cf->f, cf->chunks[i].id);
     @@ chunk-format.h (new)
      +struct hashfile;
      +struct chunkfile;
      +
     ++#define CHUNK_TOC_ENTRY_SIZE (sizeof(uint32_t) + sizeof(uint64_t))
     ++
      +struct chunkfile *init_chunkfile(struct hashfile *f);
      +void free_chunkfile(struct chunkfile *cf);
      +int get_num_chunks(struct chunkfile *cf);
  3:  e549e24d79af !  3:  ddebf2389880 commit-graph: use chunk-format write API
     @@ commit-graph.c
       
       void git_test_write_commit_graph_or_die(void)
       {
     +@@ commit-graph.c: void git_test_write_commit_graph_or_die(void)
     + #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
     + #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
     + #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
     +-#define MAX_NUM_CHUNKS 9
     + 
     + #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
     + 
      @@ commit-graph.c: static int write_graph_chunk_base(struct hashfile *f,
       	return 0;
       }
  4:  66ff49ed9309 =  4:  8fdaae017dab midx: rename pack_info to write_midx_context
  5:  1d7484c0cffa =  5:  c18f3738f5d4 midx: use context in write_midx_pack_names()
  6:  ea0e7d40e537 =  6:  2bb2847aad6e midx: add entries to write_midx_context
  7:  b283a38fb775 =  7:  f5d44c012ad5 midx: add pack_perm to write_midx_context
  8:  e7064512ab7f =  8:  f5baadf276ff midx: add num_large_offsets to write_midx_context
  9:  7aa3242e15b7 =  9:  fd008449fc89 midx: return success/failure in chunk write methods
 10:  70f68c95e479 = 10:  60e77e451800 midx: drop chunk progress during write
 11:  787cd7f18d2e ! 11:  0b95aa7e152f midx: use chunk-format API in write_midx_internal()
     @@ midx.c
       
       #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
       #define MIDX_VERSION 1
     +@@
     + #define MIDX_HEADER_SIZE 12
     + #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + the_hash_algo->rawsz)
     + 
     +-#define MIDX_MAX_CHUNKS 5
     + #define MIDX_CHUNK_ALIGNMENT 4
     + #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
     + #define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
      @@ midx.c: static int write_midx_large_offsets(struct hashfile *f,
       static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
       			       struct string_list *packs_to_drop, unsigned flags)
 12:  366eb2afee83 ! 12:  3e0dbc45ce7f chunk-format: create read chunk API
     @@ chunk-format.c: int write_chunkfile(struct chunkfile *cf, void *data)
      +			return 1;
      +		}
      +
     -+		table_of_contents += CHUNK_LOOKUP_WIDTH;
     ++		table_of_contents += CHUNK_TOC_ENTRY_SIZE;
      +		next_chunk_offset = get_be64(table_of_contents + 4);
      +
      +		if (next_chunk_offset < chunk_offset ||
     @@ chunk-format.c: int write_chunkfile(struct chunkfile *cf, void *data)
      +}
      
       ## chunk-format.h ##
     -@@
     - struct hashfile;
     - struct chunkfile;
     +@@ chunk-format.h: struct chunkfile;
     + 
     + #define CHUNK_TOC_ENTRY_SIZE (sizeof(uint32_t) + sizeof(uint64_t))
       
      +/*
      + * Initialize a 'struct chunkfile' for writing _or_ reading a file
     @@ chunk-format.h
      + * If writing a file, supply a non-NULL 'struct hashfile *' that will
      + * be used to write.
      + *
     -+ * If reading a file, then supply the memory-mapped data to the
     ++ * If reading a file, use a NULL 'struct hashfile *' and then call
     ++ * read_table_of_contents(). Supply the memory-mapped data to the
      + * pair_chunk() or read_chunk() methods, as appropriate.
      + *
      + * DO NOT MIX THESE MODES. Use different 'struct chunkfile' instances
 13:  7838ad32e2e0 ! 13:  87f441fc5e15 commit-graph: use chunk-format read API
     @@ Commit message
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## commit-graph.c ##
     +@@ commit-graph.c: void git_test_write_commit_graph_or_die(void)
     + 
     + #define GRAPH_HEADER_SIZE 8
     + #define GRAPH_FANOUT_SIZE (4 * 256)
     +-#define GRAPH_CHUNKLOOKUP_WIDTH 12
     +-#define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
     ++#define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * CHUNK_TOC_ENTRY_SIZE \
     + 			+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
     + 
     + #define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)
      @@ commit-graph.c: static int verify_commit_graph_lite(struct commit_graph *g)
       	return 0;
       }
     @@ commit-graph.c: static int verify_commit_graph_lite(struct commit_graph *g)
       
       	if (!graph_map)
       		return NULL;
     +@@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repository *r,
     + 	graph->data_len = graph_size;
     + 
     + 	if (graph_size < GRAPH_HEADER_SIZE +
     +-			 (graph->num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH +
     ++			 (graph->num_chunks + 1) * CHUNK_TOC_ENTRY_SIZE +
     + 			 GRAPH_FANOUT_SIZE + the_hash_algo->rawsz) {
     + 		error(_("commit-graph file is too small to hold %u chunks"),
     + 		      graph->num_chunks);
      @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repository *r,
       		return NULL;
       	}
 14:  6bddd9e63b9b ! 14:  64010031db0e midx: use chunk-format read API
     @@ Commit message
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## midx.c ##
     +@@
     + #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
     + #define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
     + #define MIDX_CHUNKID_LARGEOFFSETS 0x4c4f4646 /* "LOFF" */
     +-#define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
     + #define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
     + #define MIDX_CHUNK_OFFSET_WIDTH (2 * sizeof(uint32_t))
     + #define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
      @@ midx.c: static char *get_midx_filename(const char *object_dir)
       	return xstrfmt("%s/pack/multi-pack-index", object_dir);
       }
 15:  3cd97f389f1f = 15:  16205d8f8c40 midx: use 64-bit multiplication for chunk sizes
 16:  b9a1bddf615f = 16:  0e58218be418 chunk-format: restore duplicate chunk checks
 17:  4c7d751f1e39 = 17:  84bf6506dc12 chunk-format: add technical docs

-- 
gitgitgadget

* [PATCH v4 01/17] commit-graph: anonymize data in chunk_write_fn
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07       ` Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
                         ` (15 subsequent siblings)
  16 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In preparation for creating an API around file formats using chunks and
tables of contents, prepare the commit-graph write code to use
prototypes that will match this new API.

Specifically, convert chunk_write_fn to take a "void *data" parameter
instead of the commit-graph-specific "struct write_commit_graph_context"
pointer.
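
This is the usual C pattern for generic callbacks: the function-pointer type
takes an opaque "void *data", and each callback converts it back to its own
context type on entry. A minimal standalone sketch of the pattern follows;
struct demo_file, struct demo_context, write_demo_chunk() and run_writer()
are hypothetical stand-ins and are not names from commit-graph.c:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for 'struct hashfile': it only counts bytes. */
struct demo_file {
	size_t total;
};

/* Generic callback type: the context is passed as an opaque pointer. */
typedef int (*demo_write_fn)(struct demo_file *f, void *data);

/* Hypothetical consumer-specific context. */
struct demo_context {
	size_t nr;
};

/* The callback recovers its typed context from 'data' on entry. */
static int write_demo_chunk(struct demo_file *f, void *data)
{
	struct demo_context *ctx = data; /* implicit conversion from void * */

	f->total += 4 * ctx->nr; /* pretend to write one 4-byte value per item */
	return 0;
}

/* A caller that knows nothing about 'struct demo_context'. */
static int run_writer(struct demo_file *f, demo_write_fn fn, void *data)
{
	assert(fn != NULL);
	return fn(f, data);
}
```

The caller only sees the generic prototype, so the same run_writer() could
drive callbacks that use a different context struct.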

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index f3bde2ad95a1..fae7d1b63931 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1040,8 +1040,9 @@ struct write_commit_graph_context {
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
-				    struct write_commit_graph_context *ctx)
+				    void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	int i, count = 0;
 	struct commit **list = ctx->commits.list;
 
@@ -1066,8 +1067,9 @@ static int write_graph_chunk_fanout(struct hashfile *f,
 }
 
 static int write_graph_chunk_oids(struct hashfile *f,
-				  struct write_commit_graph_context *ctx)
+				  void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	struct commit **list = ctx->commits.list;
 	int count;
 	for (count = 0; count < ctx->commits.nr; count++, list++) {
@@ -1085,8 +1087,9 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
 }
 
 static int write_graph_chunk_data(struct hashfile *f,
-				  struct write_commit_graph_context *ctx)
+				  void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	uint32_t num_extra_edges = 0;
@@ -1187,8 +1190,9 @@ static int write_graph_chunk_data(struct hashfile *f,
 }
 
 static int write_graph_chunk_generation_data(struct hashfile *f,
-					      struct write_commit_graph_context *ctx)
+					     void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	int i, num_generation_data_overflows = 0;
 
 	for (i = 0; i < ctx->commits.nr; i++) {
@@ -1208,8 +1212,9 @@ static int write_graph_chunk_generation_data(struct hashfile *f,
 }
 
 static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
-						       struct write_commit_graph_context *ctx)
+						      void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	int i;
 	for (i = 0; i < ctx->commits.nr; i++) {
 		struct commit *c = ctx->commits.list[i];
@@ -1226,8 +1231,9 @@ static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
 }
 
 static int write_graph_chunk_extra_edges(struct hashfile *f,
-					 struct write_commit_graph_context *ctx)
+					 void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	struct commit_list *parent;
@@ -1280,8 +1286,9 @@ static int write_graph_chunk_extra_edges(struct hashfile *f,
 }
 
 static int write_graph_chunk_bloom_indexes(struct hashfile *f,
-					   struct write_commit_graph_context *ctx)
+					   void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	uint32_t cur_pos = 0;
@@ -1315,8 +1322,9 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
 }
 
 static int write_graph_chunk_bloom_data(struct hashfile *f,
-					struct write_commit_graph_context *ctx)
+					void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 
@@ -1737,8 +1745,9 @@ static int write_graph_chunk_base_1(struct hashfile *f,
 }
 
 static int write_graph_chunk_base(struct hashfile *f,
-				  struct write_commit_graph_context *ctx)
+				    void *data)
 {
+	struct write_commit_graph_context *ctx = data;
 	int num = write_graph_chunk_base_1(f, ctx->new_base_graph);
 
 	if (num != ctx->num_commit_graphs_after - 1) {
@@ -1750,7 +1759,7 @@ static int write_graph_chunk_base(struct hashfile *f,
 }
 
 typedef int (*chunk_write_fn)(struct hashfile *f,
-			      struct write_commit_graph_context *ctx);
+			      void *data);
 
 struct chunk_info {
 	uint32_t id;
-- 
gitgitgadget


* [PATCH v4 02/17] chunk-format: create chunk format write API
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07       ` Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
                         ` (14 subsequent siblings)
  16 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In anticipation of combining the logic from the commit-graph and
multi-pack-index file formats, create a new chunk-format API. Use a
'struct chunkfile' pointer to keep track of data that has been
registered for writes. This struct is anonymous outside of
chunk-format.c to ensure no user attempts to interfere with the data.

The next change will use this API in commit-graph.c, but the general
approach is:

 1. initialize the chunkfile with init_chunkfile(f).
 2. add chunks in the intended writing order with add_chunk().
 3. write any header information to the hashfile f.
 4. write the chunkfile data using write_chunkfile().
 5. free the chunkfile struct using free_chunkfile().
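
The offsets that end up in the table of contents depend on this ordering:
the header is written before write_chunkfile() is called, so its length is
part of the starting offset. The arithmetic can be sketched on its own, with
no hashfile involved. This is only an illustration under that assumption;
DEMO_TOC_ENTRY_SIZE, struct demo_chunk and demo_toc_offsets() are
hypothetical names and are not part of chunk-format.c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each table-of-contents entry is a 4-byte id plus an 8-byte offset. */
#define DEMO_TOC_ENTRY_SIZE (sizeof(uint32_t) + sizeof(uint64_t))

/* Hypothetical chunk description: an id and the expected size in bytes. */
struct demo_chunk {
	uint32_t id;
	uint64_t size;
};

/*
 * Fill offsets[0..nr] the way the table of contents is laid out:
 * the first chunk starts right after the header and the table itself
 * (nr entries plus one terminating entry), and every later offset is
 * the previous offset plus the previous chunk's size. offsets[nr]
 * belongs to the terminating entry and marks the end of the last chunk.
 */
static void demo_toc_offsets(uint64_t header_size,
			     const struct demo_chunk *chunks, size_t nr,
			     uint64_t *offsets)
{
	size_t i;
	uint64_t cur = header_size + (uint64_t)(nr + 1) * DEMO_TOC_ENTRY_SIZE;

	assert(offsets != NULL);

	for (i = 0; i < nr; i++) {
		offsets[i] = cur;
		cur += chunks[i].size;
	}
	offsets[nr] = cur;
}
```

For example, with an 8-byte header and two chunks of 1024 and 40 bytes, the
table occupies 3 * 12 = 36 bytes, so the recorded offsets are 44, 1068 and
1108.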

Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile       |  1 +
 chunk-format.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++
 chunk-format.h | 21 ++++++++++++
 3 files changed, 112 insertions(+)
 create mode 100644 chunk-format.c
 create mode 100644 chunk-format.h

diff --git a/Makefile b/Makefile
index 7b64106930a6..50a7663841e9 100644
--- a/Makefile
+++ b/Makefile
@@ -854,6 +854,7 @@ LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o
 LIB_OBJS += chdir-notify.o
 LIB_OBJS += checkout.o
+LIB_OBJS += chunk-format.o
 LIB_OBJS += color.o
 LIB_OBJS += column.o
 LIB_OBJS += combine-diff.o
diff --git a/chunk-format.c b/chunk-format.c
new file mode 100644
index 000000000000..6c9b52b70c10
--- /dev/null
+++ b/chunk-format.c
@@ -0,0 +1,90 @@
+#include "cache.h"
+#include "chunk-format.h"
+#include "csum-file.h"
+
+/*
+ * When writing a chunk-based file format, collect the chunks in
+ * an array of chunk_info structs. The size stores the _expected_
+ * amount of data that will be written by write_fn.
+ */
+struct chunk_info {
+	uint32_t id;
+	uint64_t size;
+	chunk_write_fn write_fn;
+};
+
+struct chunkfile {
+	struct hashfile *f;
+
+	struct chunk_info *chunks;
+	size_t chunks_nr;
+	size_t chunks_alloc;
+};
+
+struct chunkfile *init_chunkfile(struct hashfile *f)
+{
+	struct chunkfile *cf = xcalloc(1, sizeof(*cf));
+	cf->f = f;
+	return cf;
+}
+
+void free_chunkfile(struct chunkfile *cf)
+{
+	if (!cf)
+		return;
+	free(cf->chunks);
+	free(cf);
+}
+
+int get_num_chunks(struct chunkfile *cf)
+{
+	return cf->chunks_nr;
+}
+
+void add_chunk(struct chunkfile *cf,
+	       uint32_t id,
+	       size_t size,
+	       chunk_write_fn fn)
+{
+	ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc);
+
+	cf->chunks[cf->chunks_nr].id = id;
+	cf->chunks[cf->chunks_nr].write_fn = fn;
+	cf->chunks[cf->chunks_nr].size = size;
+	cf->chunks_nr++;
+}
+
+int write_chunkfile(struct chunkfile *cf, void *data)
+{
+	int i;
+	uint64_t cur_offset = hashfile_total(cf->f);
+
+	/* Add the table of contents to the current offset */
+	cur_offset += (cf->chunks_nr + 1) * CHUNK_TOC_ENTRY_SIZE;
+
+	for (i = 0; i < cf->chunks_nr; i++) {
+		hashwrite_be32(cf->f, cf->chunks[i].id);
+		hashwrite_be64(cf->f, cur_offset);
+
+		cur_offset += cf->chunks[i].size;
+	}
+
+	/* Trailing entry marks the end of the chunks */
+	hashwrite_be32(cf->f, 0);
+	hashwrite_be64(cf->f, cur_offset);
+
+	for (i = 0; i < cf->chunks_nr; i++) {
+		off_t start_offset = hashfile_total(cf->f);
+		int result = cf->chunks[i].write_fn(cf->f, data);
+
+		if (result)
+			return result;
+
+		if (hashfile_total(cf->f) - start_offset != cf->chunks[i].size)
+			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
+			    cf->chunks[i].size, cf->chunks[i].id,
+			    hashfile_total(cf->f) - start_offset);
+	}
+
+	return 0;
+}
diff --git a/chunk-format.h b/chunk-format.h
new file mode 100644
index 000000000000..ce598b66d9f8
--- /dev/null
+++ b/chunk-format.h
@@ -0,0 +1,21 @@
+#ifndef CHUNK_FORMAT_H
+#define CHUNK_FORMAT_H
+
+#include "git-compat-util.h"
+
+struct hashfile;
+struct chunkfile;
+
+#define CHUNK_TOC_ENTRY_SIZE (sizeof(uint32_t) + sizeof(uint64_t))
+
+struct chunkfile *init_chunkfile(struct hashfile *f);
+void free_chunkfile(struct chunkfile *cf);
+int get_num_chunks(struct chunkfile *cf);
+typedef int (*chunk_write_fn)(struct hashfile *f, void *data);
+void add_chunk(struct chunkfile *cf,
+	       uint32_t id,
+	       size_t size,
+	       chunk_write_fn fn);
+int write_chunkfile(struct chunkfile *cf, void *data);
+
+#endif
-- 
gitgitgadget


* [PATCH v4 03/17] commit-graph: use chunk-format write API
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07       ` Derrick Stolee via GitGitGadget
  2021-02-24 16:52         ` SZEDER Gábor
  2021-02-18 14:07       ` [PATCH v4 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
                         ` (13 subsequent siblings)
  16 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The commit-graph write logic is ready to make use of the chunk-format
write API. Each chunk write method is already in the correct prototype.
We only need to use the 'struct chunkfile' pointer and the correct API
calls.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 119 +++++++++++++++----------------------------------
 1 file changed, 37 insertions(+), 82 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index fae7d1b63931..a889130cc849 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -19,6 +19,7 @@
 #include "shallow.h"
 #include "json-writer.h"
 #include "trace2.h"
+#include "chunk-format.h"
 
 void git_test_write_commit_graph_or_die(void)
 {
@@ -44,7 +45,6 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 9
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -1758,27 +1758,17 @@ static int write_graph_chunk_base(struct hashfile *f,
 	return 0;
 }
 
-typedef int (*chunk_write_fn)(struct hashfile *f,
-			      void *data);
-
-struct chunk_info {
-	uint32_t id;
-	uint64_t size;
-	chunk_write_fn write_fn;
-};
-
 static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
 	int fd;
 	struct hashfile *f;
 	struct lock_file lk = LOCK_INIT;
-	struct chunk_info chunks[MAX_NUM_CHUNKS + 1];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = 3;
-	uint64_t chunk_offset;
 	struct object_id file_hash;
+	struct chunkfile *cf;
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
@@ -1824,76 +1814,50 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	}
 
-	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
-	chunks[0].size = GRAPH_FANOUT_SIZE;
-	chunks[0].write_fn = write_graph_chunk_fanout;
-	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
-	chunks[1].size = hashsz * ctx->commits.nr;
-	chunks[1].write_fn = write_graph_chunk_oids;
-	chunks[2].id = GRAPH_CHUNKID_DATA;
-	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
-	chunks[2].write_fn = write_graph_chunk_data;
+	cf = init_chunkfile(f);
+
+	add_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, GRAPH_FANOUT_SIZE,
+		  write_graph_chunk_fanout);
+	add_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, hashsz * ctx->commits.nr,
+		  write_graph_chunk_oids);
+	add_chunk(cf, GRAPH_CHUNKID_DATA, (hashsz + 16) * ctx->commits.nr,
+		  write_graph_chunk_data);
 
 	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
 		ctx->write_generation_data = 0;
-	if (ctx->write_generation_data) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
-		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
-		chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
-		num_chunks++;
-	}
-	if (ctx->num_generation_data_overflows) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW;
-		chunks[num_chunks].size = sizeof(timestamp_t) * ctx->num_generation_data_overflows;
-		chunks[num_chunks].write_fn = write_graph_chunk_generation_data_overflow;
-		num_chunks++;
-	}
-	if (ctx->num_extra_edges) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
-		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
-		chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
-		num_chunks++;
-	}
+	if (ctx->write_generation_data)
+		add_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
+			  sizeof(uint32_t) * ctx->commits.nr,
+			  write_graph_chunk_generation_data);
+	if (ctx->num_generation_data_overflows)
+		add_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
+			  sizeof(timestamp_t) * ctx->num_generation_data_overflows,
+			  write_graph_chunk_generation_data_overflow);
+	if (ctx->num_extra_edges)
+		add_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES,
+			  4 * ctx->num_extra_edges,
+			  write_graph_chunk_extra_edges);
 	if (ctx->changed_paths) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
-		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
-		chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
-		num_chunks++;
-		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
-		chunks[num_chunks].size = sizeof(uint32_t) * 3
-					  + ctx->total_bloom_filter_data_size;
-		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
-		num_chunks++;
-	}
-	if (ctx->num_commit_graphs_after > 1) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
-		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
-		chunks[num_chunks].write_fn = write_graph_chunk_base;
-		num_chunks++;
-	}
-
-	chunks[num_chunks].id = 0;
-	chunks[num_chunks].size = 0;
+		add_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
+			  sizeof(uint32_t) * ctx->commits.nr,
+			  write_graph_chunk_bloom_indexes);
+		add_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
+			  sizeof(uint32_t) * 3
+				+ ctx->total_bloom_filter_data_size,
+			  write_graph_chunk_bloom_data);
+	}
+	if (ctx->num_commit_graphs_after > 1)
+		add_chunk(cf, GRAPH_CHUNKID_BASE,
+			  hashsz * (ctx->num_commit_graphs_after - 1),
+			  write_graph_chunk_base);
 
 	hashwrite_be32(f, GRAPH_SIGNATURE);
 
 	hashwrite_u8(f, GRAPH_VERSION);
 	hashwrite_u8(f, oid_version());
-	hashwrite_u8(f, num_chunks);
+	hashwrite_u8(f, get_num_chunks(cf));
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
 
-	chunk_offset = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
-	for (i = 0; i <= num_chunks; i++) {
-		uint32_t chunk_write[3];
-
-		chunk_write[0] = htonl(chunks[i].id);
-		chunk_write[1] = htonl(chunk_offset >> 32);
-		chunk_write[2] = htonl(chunk_offset & 0xffffffff);
-		hashwrite(f, chunk_write, 12);
-
-		chunk_offset += chunks[i].size;
-	}
-
 	if (ctx->report_progress) {
 		strbuf_addf(&progress_title,
 			    Q_("Writing out commit graph in %d pass",
@@ -1905,17 +1869,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			num_chunks * ctx->commits.nr);
 	}
 
-	for (i = 0; i < num_chunks; i++) {
-		uint64_t start_offset = f->total + f->offset;
-
-		if (chunks[i].write_fn(f, ctx))
-			return -1;
-
-		if (f->total + f->offset != start_offset + chunks[i].size)
-			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
-			    chunks[i].size, chunks[i].id,
-			    f->total + f->offset - start_offset);
-	}
+	write_chunkfile(cf, ctx);
 
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
@@ -1932,6 +1886,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 	close_commit_graph(ctx->r->objects);
 	finalize_hashfile(f, file_hash.hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
+	free_chunkfile(cf);
 
 	if (ctx->split) {
 		FILE *chainf = fdopen_lock_file(&lk, "w");
-- 
gitgitgadget


* [PATCH v4 04/17] midx: rename pack_info to write_midx_context
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                         ` (2 preceding siblings ...)
  2021-02-18 14:07       ` [PATCH v4 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07       ` Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
                         ` (12 subsequent siblings)
  16 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to streamline our chunk-based file formats, align some of
the code structure in write_midx_internal() to be similar to the
patterns in write_commit_graph_file().

Specifically, let's create a "struct write_midx_context" that can be
used as a data parameter to abstract function types.

This change only renames "struct pack_list" to "struct
write_midx_context" and the names of instances from "packs" to "ctx". In
future changes, we will expand the data inside "struct
write_midx_context" and align our chunk-writing method with the
chunk-format API.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 130 ++++++++++++++++++++++++++++-----------------------------
 1 file changed, 65 insertions(+), 65 deletions(-)

diff --git a/midx.c b/midx.c
index 79c282b070d2..561f65a63a5b 100644
--- a/midx.c
+++ b/midx.c
@@ -451,7 +451,7 @@ static int pack_info_compare(const void *_a, const void *_b)
 	return strcmp(a->pack_name, b->pack_name);
 }
 
-struct pack_list {
+struct write_midx_context {
 	struct pack_info *info;
 	uint32_t nr;
 	uint32_t alloc;
@@ -463,37 +463,37 @@ struct pack_list {
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 			     const char *file_name, void *data)
 {
-	struct pack_list *packs = (struct pack_list *)data;
+	struct write_midx_context *ctx = data;
 
 	if (ends_with(file_name, ".idx")) {
-		display_progress(packs->progress, ++packs->pack_paths_checked);
-		if (packs->m && midx_contains_pack(packs->m, file_name))
+		display_progress(ctx->progress, ++ctx->pack_paths_checked);
+		if (ctx->m && midx_contains_pack(ctx->m, file_name))
 			return;
 
-		ALLOC_GROW(packs->info, packs->nr + 1, packs->alloc);
+		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
 
-		packs->info[packs->nr].p = add_packed_git(full_path,
-							  full_path_len,
-							  0);
+		ctx->info[ctx->nr].p = add_packed_git(full_path,
+						      full_path_len,
+						      0);
 
-		if (!packs->info[packs->nr].p) {
+		if (!ctx->info[ctx->nr].p) {
 			warning(_("failed to add packfile '%s'"),
 				full_path);
 			return;
 		}
 
-		if (open_pack_index(packs->info[packs->nr].p)) {
+		if (open_pack_index(ctx->info[ctx->nr].p)) {
 			warning(_("failed to open pack-index '%s'"),
 				full_path);
-			close_pack(packs->info[packs->nr].p);
-			FREE_AND_NULL(packs->info[packs->nr].p);
+			close_pack(ctx->info[ctx->nr].p);
+			FREE_AND_NULL(ctx->info[ctx->nr].p);
 			return;
 		}
 
-		packs->info[packs->nr].pack_name = xstrdup(file_name);
-		packs->info[packs->nr].orig_pack_int_id = packs->nr;
-		packs->info[packs->nr].expired = 0;
-		packs->nr++;
+		ctx->info[ctx->nr].pack_name = xstrdup(file_name);
+		ctx->info[ctx->nr].orig_pack_int_id = ctx->nr;
+		ctx->info[ctx->nr].expired = 0;
+		ctx->nr++;
 	}
 }
 
@@ -801,7 +801,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint32_t i;
 	struct hashfile *f = NULL;
 	struct lock_file lk;
-	struct pack_list packs;
+	struct write_midx_context ctx = { 0 };
 	uint32_t *pack_perm = NULL;
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
@@ -820,40 +820,40 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 			  midx_name);
 
 	if (m)
-		packs.m = m;
+		ctx.m = m;
 	else
-		packs.m = load_multi_pack_index(object_dir, 1);
-
-	packs.nr = 0;
-	packs.alloc = packs.m ? packs.m->num_packs : 16;
-	packs.info = NULL;
-	ALLOC_ARRAY(packs.info, packs.alloc);
-
-	if (packs.m) {
-		for (i = 0; i < packs.m->num_packs; i++) {
-			ALLOC_GROW(packs.info, packs.nr + 1, packs.alloc);
-
-			packs.info[packs.nr].orig_pack_int_id = i;
-			packs.info[packs.nr].pack_name = xstrdup(packs.m->pack_names[i]);
-			packs.info[packs.nr].p = NULL;
-			packs.info[packs.nr].expired = 0;
-			packs.nr++;
+		ctx.m = load_multi_pack_index(object_dir, 1);
+
+	ctx.nr = 0;
+	ctx.alloc = ctx.m ? ctx.m->num_packs : 16;
+	ctx.info = NULL;
+	ALLOC_ARRAY(ctx.info, ctx.alloc);
+
+	if (ctx.m) {
+		for (i = 0; i < ctx.m->num_packs; i++) {
+			ALLOC_GROW(ctx.info, ctx.nr + 1, ctx.alloc);
+
+			ctx.info[ctx.nr].orig_pack_int_id = i;
+			ctx.info[ctx.nr].pack_name = xstrdup(ctx.m->pack_names[i]);
+			ctx.info[ctx.nr].p = NULL;
+			ctx.info[ctx.nr].expired = 0;
+			ctx.nr++;
 		}
 	}
 
-	packs.pack_paths_checked = 0;
+	ctx.pack_paths_checked = 0;
 	if (flags & MIDX_PROGRESS)
-		packs.progress = start_delayed_progress(_("Adding packfiles to multi-pack-index"), 0);
+		ctx.progress = start_delayed_progress(_("Adding packfiles to multi-pack-index"), 0);
 	else
-		packs.progress = NULL;
+		ctx.progress = NULL;
 
-	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &packs);
-	stop_progress(&packs.progress);
+	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
+	stop_progress(&ctx.progress);
 
-	if (packs.m && packs.nr == packs.m->num_packs && !packs_to_drop)
+	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop)
 		goto cleanup;
 
-	entries = get_sorted_entries(packs.m, packs.info, packs.nr, &nr_entries);
+	entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &nr_entries);
 
 	for (i = 0; i < nr_entries; i++) {
 		if (entries[i].offset > 0x7fffffff)
@@ -862,19 +862,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 			large_offsets_needed = 1;
 	}
 
-	QSORT(packs.info, packs.nr, pack_info_compare);
+	QSORT(ctx.info, ctx.nr, pack_info_compare);
 
 	if (packs_to_drop && packs_to_drop->nr) {
 		int drop_index = 0;
 		int missing_drops = 0;
 
-		for (i = 0; i < packs.nr && drop_index < packs_to_drop->nr; i++) {
-			int cmp = strcmp(packs.info[i].pack_name,
+		for (i = 0; i < ctx.nr && drop_index < packs_to_drop->nr; i++) {
+			int cmp = strcmp(ctx.info[i].pack_name,
 					 packs_to_drop->items[drop_index].string);
 
 			if (!cmp) {
 				drop_index++;
-				packs.info[i].expired = 1;
+				ctx.info[i].expired = 1;
 			} else if (cmp > 0) {
 				error(_("did not see pack-file %s to drop"),
 				      packs_to_drop->items[drop_index].string);
@@ -882,7 +882,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				missing_drops++;
 				i--;
 			} else {
-				packs.info[i].expired = 0;
+				ctx.info[i].expired = 0;
 			}
 		}
 
@@ -898,19 +898,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	 *
 	 * pack_perm[old_id] = new_id
 	 */
-	ALLOC_ARRAY(pack_perm, packs.nr);
-	for (i = 0; i < packs.nr; i++) {
-		if (packs.info[i].expired) {
+	ALLOC_ARRAY(pack_perm, ctx.nr);
+	for (i = 0; i < ctx.nr; i++) {
+		if (ctx.info[i].expired) {
 			dropped_packs++;
-			pack_perm[packs.info[i].orig_pack_int_id] = PACK_EXPIRED;
+			pack_perm[ctx.info[i].orig_pack_int_id] = PACK_EXPIRED;
 		} else {
-			pack_perm[packs.info[i].orig_pack_int_id] = i - dropped_packs;
+			pack_perm[ctx.info[i].orig_pack_int_id] = i - dropped_packs;
 		}
 	}
 
-	for (i = 0; i < packs.nr; i++) {
-		if (!packs.info[i].expired)
-			pack_name_concat_len += strlen(packs.info[i].pack_name) + 1;
+	for (i = 0; i < ctx.nr; i++) {
+		if (!ctx.info[i].expired)
+			pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1;
 	}
 
 	if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT)
@@ -921,19 +921,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	FREE_AND_NULL(midx_name);
 
-	if (packs.m)
-		close_midx(packs.m);
+	if (ctx.m)
+		close_midx(ctx.m);
 
 	cur_chunk = 0;
 	num_chunks = large_offsets_needed ? 5 : 4;
 
-	if (packs.nr - dropped_packs == 0) {
+	if (ctx.nr - dropped_packs == 0) {
 		error(_("no pack files to index."));
 		result = 1;
 		goto cleanup;
 	}
 
-	written = write_midx_header(f, num_chunks, packs.nr - dropped_packs);
+	written = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
 
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
 	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
@@ -990,7 +990,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, packs.info, packs.nr);
+				written += write_midx_pack_names(f, ctx.info, ctx.nr);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
@@ -1027,15 +1027,15 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	commit_lock_file(&lk);
 
 cleanup:
-	for (i = 0; i < packs.nr; i++) {
-		if (packs.info[i].p) {
-			close_pack(packs.info[i].p);
-			free(packs.info[i].p);
+	for (i = 0; i < ctx.nr; i++) {
+		if (ctx.info[i].p) {
+			close_pack(ctx.info[i].p);
+			free(ctx.info[i].p);
 		}
-		free(packs.info[i].pack_name);
+		free(ctx.info[i].pack_name);
 	}
 
-	free(packs.info);
+	free(ctx.info);
 	free(entries);
 	free(pack_perm);
 	free(midx_name);
-- 
gitgitgadget



* [PATCH v4 05/17] midx: use context in write_midx_pack_names()
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                         ` (3 preceding siblings ...)
  2021-02-18 14:07       ` [PATCH v4 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07       ` Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
                         ` (11 subsequent siblings)
  16 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format
API, start converting chunk writing methods to match chunk_write_fn. The
first case is to convert write_midx_pack_names() to take "void *data".
We already have the necessary data in "struct write_midx_context", so
this conversion is rather mechanical.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
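
A minimal sketch of the callback shape this conversion moves toward,
for readers unfamiliar with the "void *data" pattern. Everything
prefixed with sketch_ is hypothetical and is not git's API; in
particular, "struct sketch_file" only counts bytes where the real code
uses "struct hashfile". The return type is still size_t here, as in
this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for "struct hashfile": only counts bytes. */
struct sketch_file {
	uint64_t total;
};

/* Assumed callback shape: a file plus an opaque context pointer. */
typedef size_t (*sketch_write_fn)(struct sketch_file *f, void *data);

struct sketch_pack {
	const char *name;
	unsigned expired:1;
};

/* Plays the role of "struct write_midx_context" in this sketch. */
struct sketch_context {
	struct sketch_pack *info;
	uint32_t nr;
};

/* Recovers the typed context from "data", then skips expired packs. */
static size_t sketch_write_pack_names(struct sketch_file *f, void *data)
{
	struct sketch_context *ctx = data;
	size_t written = 0;
	uint32_t i;

	assert(ctx->info || !ctx->nr);

	for (i = 0; i < ctx->nr; i++) {
		size_t len;

		if (ctx->info[i].expired)
			continue;

		/* Each name is stored with its trailing NUL. */
		len = strlen(ctx->info[i].name) + 1;
		f->total += len;
		written += len;
	}

	return written;
}

/* A generic caller only needs the function pointer and the context. */
static size_t sketch_write_chunk(struct sketch_file *f,
				 sketch_write_fn fn, void *data)
{
	return fn(f, data);
}
```

The caller never inspects the context, so any writer that accepts the
same two arguments can be passed to sketch_write_chunk().
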
 midx.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/midx.c b/midx.c
index 561f65a63a5b..88452b044337 100644
--- a/midx.c
+++ b/midx.c
@@ -643,27 +643,26 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 	return deduplicated_entries;
 }
 
-static size_t write_midx_pack_names(struct hashfile *f,
-				    struct pack_info *info,
-				    uint32_t num_packs)
+static size_t write_midx_pack_names(struct hashfile *f, void *data)
 {
+	struct write_midx_context *ctx = data;
 	uint32_t i;
 	unsigned char padding[MIDX_CHUNK_ALIGNMENT];
 	size_t written = 0;
 
-	for (i = 0; i < num_packs; i++) {
+	for (i = 0; i < ctx->nr; i++) {
 		size_t writelen;
 
-		if (info[i].expired)
+		if (ctx->info[i].expired)
 			continue;
 
-		if (i && strcmp(info[i].pack_name, info[i - 1].pack_name) <= 0)
+		if (i && strcmp(ctx->info[i].pack_name, ctx->info[i - 1].pack_name) <= 0)
 			BUG("incorrect pack-file order: %s before %s",
-			    info[i - 1].pack_name,
-			    info[i].pack_name);
+			    ctx->info[i - 1].pack_name,
+			    ctx->info[i].pack_name);
 
-		writelen = strlen(info[i].pack_name) + 1;
-		hashwrite(f, info[i].pack_name, writelen);
+		writelen = strlen(ctx->info[i].pack_name) + 1;
+		hashwrite(f, ctx->info[i].pack_name, writelen);
 		written += writelen;
 	}
 
@@ -990,7 +989,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, ctx.info, ctx.nr);
+				written += write_midx_pack_names(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
-- 
gitgitgadget



* [PATCH v4 06/17] midx: add entries to write_midx_context
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                         ` (4 preceding siblings ...)
  2021-02-18 14:07       ` [PATCH v4 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07       ` Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
                         ` (10 subsequent siblings)
  16 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "struct pack_midx_entry *entries" list and its count
into the context.

Update write_midx_oid_fanout() and write_midx_oid_lookup() to take the
context directly, as these are easy conversions with this new data.

Only the callers of write_midx_object_offsets() and
write_midx_large_offsets() are updated here, since additional data is
needed in the context before those methods can match chunk_write_fn.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 49 ++++++++++++++++++++++++++-----------------------
 1 file changed, 26 insertions(+), 23 deletions(-)

diff --git a/midx.c b/midx.c
index 88452b044337..4520ef82b91b 100644
--- a/midx.c
+++ b/midx.c
@@ -458,6 +458,9 @@ struct write_midx_context {
 	struct multi_pack_index *m;
 	struct progress *progress;
 	unsigned pack_paths_checked;
+
+	struct pack_midx_entry *entries;
+	uint32_t entries_nr;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -678,11 +681,11 @@ static size_t write_midx_pack_names(struct hashfile *f, void *data)
 }
 
 static size_t write_midx_oid_fanout(struct hashfile *f,
-				    struct pack_midx_entry *objects,
-				    uint32_t nr_objects)
+				    void *data)
 {
-	struct pack_midx_entry *list = objects;
-	struct pack_midx_entry *last = objects + nr_objects;
+	struct write_midx_context *ctx = data;
+	struct pack_midx_entry *list = ctx->entries;
+	struct pack_midx_entry *last = ctx->entries + ctx->entries_nr;
 	uint32_t count = 0;
 	uint32_t i;
 
@@ -706,18 +709,19 @@ static size_t write_midx_oid_fanout(struct hashfile *f,
 	return MIDX_CHUNK_FANOUT_SIZE;
 }
 
-static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char hash_len,
-				    struct pack_midx_entry *objects,
-				    uint32_t nr_objects)
+static size_t write_midx_oid_lookup(struct hashfile *f,
+				    void *data)
 {
-	struct pack_midx_entry *list = objects;
+	struct write_midx_context *ctx = data;
+	unsigned char hash_len = the_hash_algo->rawsz;
+	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i;
 	size_t written = 0;
 
-	for (i = 0; i < nr_objects; i++) {
+	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
 
-		if (i < nr_objects - 1) {
+		if (i < ctx->entries_nr - 1) {
 			struct pack_midx_entry *next = list;
 			if (oidcmp(&obj->oid, &next->oid) >= 0)
 				BUG("OIDs not in order: %s >= %s",
@@ -805,8 +809,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
-	uint32_t nr_entries, num_large_offsets = 0;
-	struct pack_midx_entry *entries = NULL;
+	uint32_t num_large_offsets = 0;
 	struct progress *progress = NULL;
 	int large_offsets_needed = 0;
 	int pack_name_concat_len = 0;
@@ -852,12 +855,12 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop)
 		goto cleanup;
 
-	entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &nr_entries);
+	ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr);
 
-	for (i = 0; i < nr_entries; i++) {
-		if (entries[i].offset > 0x7fffffff)
+	for (i = 0; i < ctx.entries_nr; i++) {
+		if (ctx.entries[i].offset > 0x7fffffff)
 			num_large_offsets++;
-		if (entries[i].offset > 0xffffffff)
+		if (ctx.entries[i].offset > 0xffffffff)
 			large_offsets_needed = 1;
 	}
 
@@ -947,10 +950,10 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OBJECTOFFSETS;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * the_hash_algo->rawsz;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * the_hash_algo->rawsz;
 
 	cur_chunk++;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries * MIDX_CHUNK_OFFSET_WIDTH;
+	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
 	if (large_offsets_needed) {
 		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
 
@@ -993,19 +996,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
-				written += write_midx_oid_fanout(f, entries, nr_entries);
+				written += write_midx_oid_fanout(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDLOOKUP:
-				written += write_midx_oid_lookup(f, the_hash_algo->rawsz, entries, nr_entries);
+				written += write_midx_oid_lookup(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, entries, nr_entries);
+				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, ctx.entries, ctx.entries_nr);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
-				written += write_midx_large_offsets(f, num_large_offsets, entries, nr_entries);
+				written += write_midx_large_offsets(f, num_large_offsets, ctx.entries, ctx.entries_nr);
 				break;
 
 			default:
@@ -1035,7 +1038,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	}
 
 	free(ctx.info);
-	free(entries);
+	free(ctx.entries);
 	free(pack_perm);
 	free(midx_name);
 	return result;
-- 
gitgitgadget



* [PATCH v4 07/17] midx: add pack_perm to write_midx_context
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                         ` (5 preceding siblings ...)
  2021-02-18 14:07       ` [PATCH v4 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07       ` Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
                         ` (9 subsequent siblings)
  16 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "uint32_t *pack_perm" and large_offsets_needed bit
into the context.

Update write_midx_object_offsets() to match chunk_write_fn.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
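
For reference, a small standalone sketch of how the permutation
"pack_perm[old_id] = new_id" is filled in. The sketch_ names are
hypothetical, and the sentinel value is an assumption standing in for
PACK_EXPIRED. The input array is assumed to be sorted by pack name
already, as in write_midx_internal().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sentinel standing in for PACK_EXPIRED. */
#define SKETCH_PACK_EXPIRED UINT32_MAX

struct sketch_info {
	uint32_t orig_pack_int_id;
	unsigned expired:1;
};

/*
 * Fills perm[old_id] = new_id for each surviving pack and marks
 * expired packs with the sentinel. Returns the number of dropped packs.
 */
static uint32_t sketch_fill_pack_perm(const struct sketch_info *info,
				      uint32_t nr, uint32_t *perm)
{
	uint32_t i, dropped = 0;

	assert(perm);

	for (i = 0; i < nr; i++) {
		if (info[i].expired) {
			dropped++;
			perm[info[i].orig_pack_int_id] = SKETCH_PACK_EXPIRED;
		} else {
			/* New ids stay dense: subtract the packs dropped so far. */
			perm[info[i].orig_pack_int_id] = i - dropped;
		}
	}

	return dropped;
}
```
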
 midx.c | 40 +++++++++++++++++++++-------------------
 1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/midx.c b/midx.c
index 4520ef82b91b..cd994e333ecb 100644
--- a/midx.c
+++ b/midx.c
@@ -461,6 +461,9 @@ struct write_midx_context {
 
 	struct pack_midx_entry *entries;
 	uint32_t entries_nr;
+
+	uint32_t *pack_perm;
+	unsigned large_offsets_needed:1;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -736,27 +739,27 @@ static size_t write_midx_oid_lookup(struct hashfile *f,
 	return written;
 }
 
-static size_t write_midx_object_offsets(struct hashfile *f, int large_offset_needed,
-					uint32_t *perm,
-					struct pack_midx_entry *objects, uint32_t nr_objects)
+static size_t write_midx_object_offsets(struct hashfile *f,
+					void *data)
 {
-	struct pack_midx_entry *list = objects;
+	struct write_midx_context *ctx = data;
+	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i, nr_large_offset = 0;
 	size_t written = 0;
 
-	for (i = 0; i < nr_objects; i++) {
+	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
 
-		if (perm[obj->pack_int_id] == PACK_EXPIRED)
+		if (ctx->pack_perm[obj->pack_int_id] == PACK_EXPIRED)
 			BUG("object %s is in an expired pack with int-id %d",
 			    oid_to_hex(&obj->oid),
 			    obj->pack_int_id);
 
-		hashwrite_be32(f, perm[obj->pack_int_id]);
+		hashwrite_be32(f, ctx->pack_perm[obj->pack_int_id]);
 
-		if (large_offset_needed && obj->offset >> 31)
+		if (ctx->large_offsets_needed && obj->offset >> 31)
 			hashwrite_be32(f, MIDX_LARGE_OFFSET_NEEDED | nr_large_offset++);
-		else if (!large_offset_needed && obj->offset >> 32)
+		else if (!ctx->large_offsets_needed && obj->offset >> 32)
 			BUG("object %s requires a large offset (%"PRIx64") but the MIDX is not writing large offsets!",
 			    oid_to_hex(&obj->oid),
 			    obj->offset);
@@ -805,13 +808,11 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
-	uint32_t *pack_perm = NULL;
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 	uint32_t num_large_offsets = 0;
 	struct progress *progress = NULL;
-	int large_offsets_needed = 0;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
@@ -857,11 +858,12 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr);
 
+	ctx.large_offsets_needed = 0;
 	for (i = 0; i < ctx.entries_nr; i++) {
 		if (ctx.entries[i].offset > 0x7fffffff)
 			num_large_offsets++;
 		if (ctx.entries[i].offset > 0xffffffff)
-			large_offsets_needed = 1;
+			ctx.large_offsets_needed = 1;
 	}
 
 	QSORT(ctx.info, ctx.nr, pack_info_compare);
@@ -900,13 +902,13 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	 *
 	 * pack_perm[old_id] = new_id
 	 */
-	ALLOC_ARRAY(pack_perm, ctx.nr);
+	ALLOC_ARRAY(ctx.pack_perm, ctx.nr);
 	for (i = 0; i < ctx.nr; i++) {
 		if (ctx.info[i].expired) {
 			dropped_packs++;
-			pack_perm[ctx.info[i].orig_pack_int_id] = PACK_EXPIRED;
+			ctx.pack_perm[ctx.info[i].orig_pack_int_id] = PACK_EXPIRED;
 		} else {
-			pack_perm[ctx.info[i].orig_pack_int_id] = i - dropped_packs;
+			ctx.pack_perm[ctx.info[i].orig_pack_int_id] = i - dropped_packs;
 		}
 	}
 
@@ -927,7 +929,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		close_midx(ctx.m);
 
 	cur_chunk = 0;
-	num_chunks = large_offsets_needed ? 5 : 4;
+	num_chunks = ctx.large_offsets_needed ? 5 : 4;
 
 	if (ctx.nr - dropped_packs == 0) {
 		error(_("no pack files to index."));
@@ -954,7 +956,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	cur_chunk++;
 	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
-	if (large_offsets_needed) {
+	if (ctx.large_offsets_needed) {
 		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
 
 		cur_chunk++;
@@ -1004,7 +1006,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, large_offsets_needed, pack_perm, ctx.entries, ctx.entries_nr);
+				written += write_midx_object_offsets(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
@@ -1039,7 +1041,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 	free(ctx.info);
 	free(ctx.entries);
-	free(pack_perm);
+	free(ctx.pack_perm);
 	free(midx_name);
 	return result;
 }
-- 
gitgitgadget



* [PATCH v4 08/17] midx: add num_large_offsets to write_midx_context
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                         ` (6 preceding siblings ...)
  2021-02-18 14:07       ` [PATCH v4 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07       ` Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
                         ` (8 subsequent siblings)
  16 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "uint32_t num_large_offsets" into the context. With
this new data, write_midx_large_offsets() now matches the
chunk_write_fn type.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
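
A standalone sketch of how the two pieces of large-offset state relate,
based on the loops visible in this series. The sketch_ names are
hypothetical, and the flag value is an assumption standing in for
MIDX_LARGE_OFFSET_NEEDED.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value standing in for MIDX_LARGE_OFFSET_NEEDED. */
#define SKETCH_LARGE_OFFSET_NEEDED 0x80000000u

struct sketch_offsets {
	uint32_t num_large_offsets;
	unsigned large_offsets_needed:1;
};

/*
 * Counts offsets that do not fit in 31 bits, and records whether any
 * offset does not fit in 32 bits (only then is the extra chunk needed).
 */
static void sketch_scan_offsets(const uint64_t *offsets, uint32_t nr,
				struct sketch_offsets *out)
{
	uint32_t i;

	out->num_large_offsets = 0;
	out->large_offsets_needed = 0;
	for (i = 0; i < nr; i++) {
		if (offsets[i] > 0x7fffffff)
			out->num_large_offsets++;
		if (offsets[i] > 0xffffffff)
			out->large_offsets_needed = 1;
	}
}

/*
 * Returns the 32-bit value stored in the object-offsets chunk: either
 * the offset itself, or the flag bit combined with the next index into
 * the large-offsets chunk. The real code calls BUG() if an offset
 * needs more than 32 bits while large offsets are disabled; a
 * consistent scan rules that case out, so it is not repeated here.
 */
static uint32_t sketch_encode_offset(const struct sketch_offsets *s,
				     uint64_t offset, uint32_t *next_large)
{
	assert(next_large);

	if (s->large_offsets_needed && offset >> 31)
		return SKETCH_LARGE_OFFSET_NEEDED | (*next_large)++;
	return (uint32_t)offset;
}
```

Note that an offset with bit 31 set is stored as-is when no offset
exceeds 32 bits, which is why the count and the "needed" bit are
tracked separately.
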
 midx.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/midx.c b/midx.c
index cd994e333ecb..5be081f229ad 100644
--- a/midx.c
+++ b/midx.c
@@ -464,6 +464,7 @@ struct write_midx_context {
 
 	uint32_t *pack_perm;
 	unsigned large_offsets_needed:1;
+	uint32_t num_large_offsets;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -772,11 +773,14 @@ static size_t write_midx_object_offsets(struct hashfile *f,
 	return written;
 }
 
-static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_offset,
-				       struct pack_midx_entry *objects, uint32_t nr_objects)
+static size_t write_midx_large_offsets(struct hashfile *f,
+				       void *data)
 {
-	struct pack_midx_entry *list = objects, *end = objects + nr_objects;
+	struct write_midx_context *ctx = data;
+	struct pack_midx_entry *list = ctx->entries;
+	struct pack_midx_entry *end = ctx->entries + ctx->entries_nr;
 	size_t written = 0;
+	uint32_t nr_large_offset = ctx->num_large_offsets;
 
 	while (nr_large_offset) {
 		struct pack_midx_entry *obj;
@@ -811,7 +815,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint64_t written = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
-	uint32_t num_large_offsets = 0;
 	struct progress *progress = NULL;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
@@ -861,7 +864,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	ctx.large_offsets_needed = 0;
 	for (i = 0; i < ctx.entries_nr; i++) {
 		if (ctx.entries[i].offset > 0x7fffffff)
-			num_large_offsets++;
+			ctx.num_large_offsets++;
 		if (ctx.entries[i].offset > 0xffffffff)
 			ctx.large_offsets_needed = 1;
 	}
@@ -961,7 +964,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		cur_chunk++;
 		chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
-					   num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
+					   ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
 	}
 
 	chunk_ids[cur_chunk] = 0;
@@ -1010,7 +1013,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
-				written += write_midx_large_offsets(f, num_large_offsets, ctx.entries, ctx.entries_nr);
+				written += write_midx_large_offsets(f, &ctx);
 				break;
 
 			default:
-- 
gitgitgadget



* [PATCH v4 09/17] midx: return success/failure in chunk write methods
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                         ` (7 preceding siblings ...)
  2021-02-18 14:07       ` [PATCH v4 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07       ` Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
                         ` (7 subsequent siblings)
  16 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Historically, the chunk-writing methods in midx.c have returned the
amount of data written so the writer method could compare this with the
table of contents. This presents some interesting issues:

1. If a chunk writing method has a bug that miscalculates the written
   bytes, then we can satisfy the table of contents without actually
   writing the right amount of data to the hashfile. The commit-graph
   writing code checks the hashfile struct directly for a more robust
   verification.

2. There is no way for a chunk writing method to gracefully fail.
   Returning an int presents an opportunity to fail without a die().

3. The current pattern doesn't match chunk_write_fn type exactly, so we
   cannot share code with commit-graph.c.

For these reasons, convert the midx chunk writer methods to return an
'int'. Since none of them fail at the moment, they all return 0.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
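
A standalone sketch of the verification style described in the three
points above: writers return an int, and the caller compares the
file's own byte count with the table of contents instead of trusting a
sum of return values. All sketch_ names are hypothetical;
"struct sketch_file" stands in for "struct hashfile", and the sketch
returns -1 where the real code calls BUG().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for "struct hashfile": only counts bytes. */
struct sketch_file {
	uint64_t total;
};

typedef int (*sketch_write_fn)(struct sketch_file *f, void *data);

struct sketch_chunk {
	uint32_t id;
	uint64_t offset;	/* start offset promised by the table of contents */
	sketch_write_fn write_fn;
};

/* Example writer: emits "*data" bytes and reports success. */
static int sketch_write_n_bytes(struct sketch_file *f, void *data)
{
	size_t *len = data;

	f->total += *len;
	return 0;
}

/* Example writer that fails without calling die(). */
static int sketch_fail(struct sketch_file *f, void *data)
{
	(void)f;
	(void)data;
	return 1;
}

/*
 * Returns 0 on success, the writer's nonzero result if one fails, or
 * -1 if the bytes actually written disagree with the promised offsets.
 */
static int sketch_write_chunks(struct sketch_file *f,
			       const struct sketch_chunk *chunks, size_t nr,
			       uint64_t end_offset, void *data)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		int result;

		assert(chunks[i].write_fn);

		/* Check the file itself, not a running sum of return values. */
		if (f->total != chunks[i].offset)
			return -1;

		result = chunks[i].write_fn(f, data);
		if (result)
			return result;
	}

	return f->total == end_offset ? 0 : -1;
}
```

A writer that miscounts its own output can no longer mask the mismatch,
because the check reads the byte count from the file rather than from
the writer.
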
 midx.c | 63 +++++++++++++++++++++++++---------------------------------
 1 file changed, 27 insertions(+), 36 deletions(-)

diff --git a/midx.c b/midx.c
index 5be081f229ad..c92a6c47be01 100644
--- a/midx.c
+++ b/midx.c
@@ -650,7 +650,7 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m,
 	return deduplicated_entries;
 }
 
-static size_t write_midx_pack_names(struct hashfile *f, void *data)
+static int write_midx_pack_names(struct hashfile *f, void *data)
 {
 	struct write_midx_context *ctx = data;
 	uint32_t i;
@@ -678,14 +678,13 @@ static size_t write_midx_pack_names(struct hashfile *f, void *data)
 	if (i < MIDX_CHUNK_ALIGNMENT) {
 		memset(padding, 0, sizeof(padding));
 		hashwrite(f, padding, i);
-		written += i;
 	}
 
-	return written;
+	return 0;
 }
 
-static size_t write_midx_oid_fanout(struct hashfile *f,
-				    void *data)
+static int write_midx_oid_fanout(struct hashfile *f,
+				 void *data)
 {
 	struct write_midx_context *ctx = data;
 	struct pack_midx_entry *list = ctx->entries;
@@ -710,17 +709,16 @@ static size_t write_midx_oid_fanout(struct hashfile *f,
 		list = next;
 	}
 
-	return MIDX_CHUNK_FANOUT_SIZE;
+	return 0;
 }
 
-static size_t write_midx_oid_lookup(struct hashfile *f,
-				    void *data)
+static int write_midx_oid_lookup(struct hashfile *f,
+				 void *data)
 {
 	struct write_midx_context *ctx = data;
 	unsigned char hash_len = the_hash_algo->rawsz;
 	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i;
-	size_t written = 0;
 
 	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
@@ -734,19 +732,17 @@ static size_t write_midx_oid_lookup(struct hashfile *f,
 		}
 
 		hashwrite(f, obj->oid.hash, (int)hash_len);
-		written += hash_len;
 	}
 
-	return written;
+	return 0;
 }
 
-static size_t write_midx_object_offsets(struct hashfile *f,
-					void *data)
+static int write_midx_object_offsets(struct hashfile *f,
+				     void *data)
 {
 	struct write_midx_context *ctx = data;
 	struct pack_midx_entry *list = ctx->entries;
 	uint32_t i, nr_large_offset = 0;
-	size_t written = 0;
 
 	for (i = 0; i < ctx->entries_nr; i++) {
 		struct pack_midx_entry *obj = list++;
@@ -766,20 +762,17 @@ static size_t write_midx_object_offsets(struct hashfile *f,
 			    obj->offset);
 		else
 			hashwrite_be32(f, (uint32_t)obj->offset);
-
-		written += MIDX_CHUNK_OFFSET_WIDTH;
 	}
 
-	return written;
+	return 0;
 }
 
-static size_t write_midx_large_offsets(struct hashfile *f,
-				       void *data)
+static int write_midx_large_offsets(struct hashfile *f,
+				    void *data)
 {
 	struct write_midx_context *ctx = data;
 	struct pack_midx_entry *list = ctx->entries;
 	struct pack_midx_entry *end = ctx->entries + ctx->entries_nr;
-	size_t written = 0;
 	uint32_t nr_large_offset = ctx->num_large_offsets;
 
 	while (nr_large_offset) {
@@ -795,12 +788,12 @@ static size_t write_midx_large_offsets(struct hashfile *f,
 		if (!(offset >> 31))
 			continue;
 
-		written += hashwrite_be64(f, offset);
+		hashwrite_be64(f, offset);
 
 		nr_large_offset--;
 	}
 
-	return written;
+	return 0;
 }
 
 static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
@@ -812,7 +805,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
-	uint64_t written = 0;
+	uint64_t header_size = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 	struct progress *progress = NULL;
@@ -940,10 +933,10 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		goto cleanup;
 	}
 
-	written = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
+	header_size = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
 
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
-	chunk_offsets[cur_chunk] = written + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
+	chunk_offsets[cur_chunk] = header_size + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
 
 	cur_chunk++;
 	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
@@ -981,39 +974,37 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 
 		hashwrite_be32(f, chunk_ids[i]);
 		hashwrite_be64(f, chunk_offsets[i]);
-
-		written += MIDX_CHUNKLOOKUP_WIDTH;
 	}
 
 	if (flags & MIDX_PROGRESS)
 		progress = start_delayed_progress(_("Writing chunks to multi-pack-index"),
 					  num_chunks);
 	for (i = 0; i < num_chunks; i++) {
-		if (written != chunk_offsets[i])
+		if (f->total + f->offset != chunk_offsets[i])
 			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
 			    chunk_offsets[i],
-			    written,
+			    f->total + f->offset,
 			    chunk_ids[i]);
 
 		switch (chunk_ids[i]) {
 			case MIDX_CHUNKID_PACKNAMES:
-				written += write_midx_pack_names(f, &ctx);
+				write_midx_pack_names(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDFANOUT:
-				written += write_midx_oid_fanout(f, &ctx);
+				write_midx_oid_fanout(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OIDLOOKUP:
-				written += write_midx_oid_lookup(f, &ctx);
+				write_midx_oid_lookup(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_OBJECTOFFSETS:
-				written += write_midx_object_offsets(f, &ctx);
+				write_midx_object_offsets(f, &ctx);
 				break;
 
 			case MIDX_CHUNKID_LARGEOFFSETS:
-				written += write_midx_large_offsets(f, &ctx);
+				write_midx_large_offsets(f, &ctx);
 				break;
 
 			default:
@@ -1025,9 +1016,9 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	}
 	stop_progress(&progress);
 
-	if (written != chunk_offsets[num_chunks])
+	if (hashfile_total(f) != chunk_offsets[num_chunks])
 		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
-		    written,
+		    hashfile_total(f),
 		    chunk_offsets[num_chunks]);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
-- 
gitgitgadget



* [PATCH v4 10/17] midx: drop chunk progress during write
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                         ` (8 preceding siblings ...)
  2021-02-18 14:07       ` [PATCH v4 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07       ` Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
                         ` (6 subsequent siblings)
  16 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Most expensive operations in write_midx_internal() use the context
struct's progress member, and these indicate the progress of the
expensive operations within the chunk writing methods. However, there is
a competing progress struct that counts the progress over all chunks.
This is not very helpful compared to the others, so drop it.

This also reduces our barriers to combining the chunk writing code with
chunk-format.c.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/midx.c b/midx.c
index c92a6c47be01..4f4aa351e60e 100644
--- a/midx.c
+++ b/midx.c
@@ -808,7 +808,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	uint64_t header_size = 0;
 	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
 	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
-	struct progress *progress = NULL;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
@@ -976,9 +975,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 		hashwrite_be64(f, chunk_offsets[i]);
 	}
 
-	if (flags & MIDX_PROGRESS)
-		progress = start_delayed_progress(_("Writing chunks to multi-pack-index"),
-					  num_chunks);
 	for (i = 0; i < num_chunks; i++) {
 		if (f->total + f->offset != chunk_offsets[i])
 			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
@@ -1011,10 +1007,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 				BUG("trying to write unknown chunk id %"PRIx32,
 				    chunk_ids[i]);
 		}
-
-		display_progress(progress, i + 1);
 	}
-	stop_progress(&progress);
 
 	if (hashfile_total(f) != chunk_offsets[num_chunks])
 		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
-- 
gitgitgadget



* [PATCH v4 11/17] midx: use chunk-format API in write_midx_internal()
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                         ` (9 preceding siblings ...)
  2021-02-18 14:07       ` [PATCH v4 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07       ` Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
                         ` (5 subsequent siblings)
  16 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The chunk-format API allows writing the table of contents and all chunks
using the anonymous 'struct chunkfile' type. We only need to convert our
local chunk logic to this API for the multi-pack-index writes to share
that logic with the commit-graph file writes.
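
As a rough illustration of the bookkeeping that moves out of midx.c here, the sketch below registers chunks by id and size and derives each table-of-contents offset from those sizes, the way the removed chunk_offsets[] code did by hand. It is not the code in chunk-format.c: 'toy_chunkfile', toy_add_chunk() and toy_toc_offset() are hypothetical names, and the 12-byte entry width is assumed to match CHUNK_TOC_ENTRY_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_CHUNKS 8
#define TOY_TOC_ENTRY_SIZE 12 /* 4-byte id + 8-byte offset */

struct toy_chunk {
	uint32_t id;
	uint64_t size;
};

struct toy_chunkfile {
	struct toy_chunk chunks[TOY_MAX_CHUNKS];
	size_t nr;
};

/* Register a chunk; the caller only supplies an id and a size. */
static void toy_add_chunk(struct toy_chunkfile *cf, uint32_t id, uint64_t size)
{
	assert(cf->nr < TOY_MAX_CHUNKS);
	cf->chunks[cf->nr].id = id;
	cf->chunks[cf->nr].size = size;
	cf->nr++;
}

/*
 * Offset stored in table-of-contents entry 'i'. Entry cf->nr is the
 * terminating entry, whose offset marks the end of the last chunk.
 */
static uint64_t toy_toc_offset(const struct toy_chunkfile *cf,
			       uint64_t header_size, size_t i)
{
	/* The first chunk starts right after the header and the full TOC. */
	uint64_t offset = header_size + (cf->nr + 1) * TOY_TOC_ENTRY_SIZE;
	size_t j;

	assert(i <= cf->nr);
	for (j = 0; j < i; j++)
		offset += cf->chunks[j].size;
	return offset;
}
```

Because offsets are derived from the registered sizes, a caller in this sketch cannot produce a table of contents whose offsets decrease.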

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 106 +++++++++++----------------------------------------------
 1 file changed, 20 insertions(+), 86 deletions(-)

diff --git a/midx.c b/midx.c
index 4f4aa351e60e..d2fd9c10feea 100644
--- a/midx.c
+++ b/midx.c
@@ -11,6 +11,7 @@
 #include "trace2.h"
 #include "run-command.h"
 #include "repository.h"
+#include "chunk-format.h"
 
 #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
 #define MIDX_VERSION 1
@@ -21,7 +22,6 @@
 #define MIDX_HEADER_SIZE 12
 #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + the_hash_algo->rawsz)
 
-#define MIDX_MAX_CHUNKS 5
 #define MIDX_CHUNK_ALIGNMENT 4
 #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
 #define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -799,18 +799,15 @@ static int write_midx_large_offsets(struct hashfile *f,
 static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
 			       struct string_list *packs_to_drop, unsigned flags)
 {
-	unsigned char cur_chunk, num_chunks = 0;
 	char *midx_name;
 	uint32_t i;
 	struct hashfile *f = NULL;
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
-	uint64_t header_size = 0;
-	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
-	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
+	struct chunkfile *cf;
 
 	midx_name = get_midx_filename(object_dir);
 	if (safe_create_leading_directories(midx_name))
@@ -923,98 +920,35 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	if (ctx.m)
 		close_midx(ctx.m);
 
-	cur_chunk = 0;
-	num_chunks = ctx.large_offsets_needed ? 5 : 4;
-
 	if (ctx.nr - dropped_packs == 0) {
 		error(_("no pack files to index."));
 		result = 1;
 		goto cleanup;
 	}
 
-	header_size = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
-
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
-	chunk_offsets[cur_chunk] = header_size + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
-
-	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + pack_name_concat_len;
-
-	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + MIDX_CHUNK_FANOUT_SIZE;
-
-	cur_chunk++;
-	chunk_ids[cur_chunk] = MIDX_CHUNKID_OBJECTOFFSETS;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * the_hash_algo->rawsz;
-
-	cur_chunk++;
-	chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
-	if (ctx.large_offsets_needed) {
-		chunk_ids[cur_chunk] = MIDX_CHUNKID_LARGEOFFSETS;
-
-		cur_chunk++;
-		chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] +
-					   ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
-	}
-
-	chunk_ids[cur_chunk] = 0;
-
-	for (i = 0; i <= num_chunks; i++) {
-		if (i && chunk_offsets[i] < chunk_offsets[i - 1])
-			BUG("incorrect chunk offsets: %"PRIu64" before %"PRIu64,
-			    chunk_offsets[i - 1],
-			    chunk_offsets[i]);
-
-		if (chunk_offsets[i] % MIDX_CHUNK_ALIGNMENT)
-			BUG("chunk offset %"PRIu64" is not properly aligned",
-			    chunk_offsets[i]);
-
-		hashwrite_be32(f, chunk_ids[i]);
-		hashwrite_be64(f, chunk_offsets[i]);
-	}
-
-	for (i = 0; i < num_chunks; i++) {
-		if (f->total + f->offset != chunk_offsets[i])
-			BUG("incorrect chunk offset (%"PRIu64" != %"PRIu64") for chunk id %"PRIx32,
-			    chunk_offsets[i],
-			    f->total + f->offset,
-			    chunk_ids[i]);
+	cf = init_chunkfile(f);
 
-		switch (chunk_ids[i]) {
-			case MIDX_CHUNKID_PACKNAMES:
-				write_midx_pack_names(f, &ctx);
-				break;
+	add_chunk(cf, MIDX_CHUNKID_PACKNAMES, pack_name_concat_len,
+		  write_midx_pack_names);
+	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT, MIDX_CHUNK_FANOUT_SIZE,
+		  write_midx_oid_fanout);
+	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
+		  ctx.entries_nr * the_hash_algo->rawsz,
+		  write_midx_oid_lookup);
+	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
+		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH,
+		  write_midx_object_offsets);
 
-			case MIDX_CHUNKID_OIDFANOUT:
-				write_midx_oid_fanout(f, &ctx);
-				break;
-
-			case MIDX_CHUNKID_OIDLOOKUP:
-				write_midx_oid_lookup(f, &ctx);
-				break;
-
-			case MIDX_CHUNKID_OBJECTOFFSETS:
-				write_midx_object_offsets(f, &ctx);
-				break;
-
-			case MIDX_CHUNKID_LARGEOFFSETS:
-				write_midx_large_offsets(f, &ctx);
-				break;
-
-			default:
-				BUG("trying to write unknown chunk id %"PRIx32,
-				    chunk_ids[i]);
-		}
-	}
+	if (ctx.large_offsets_needed)
+		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
+			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH,
+			write_midx_large_offsets);
 
-	if (hashfile_total(f) != chunk_offsets[num_chunks])
-		BUG("incorrect final offset %"PRIu64" != %"PRIu64,
-		    hashfile_total(f),
-		    chunk_offsets[num_chunks]);
+	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
+	write_chunkfile(cf, &ctx);
 
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
+	free_chunkfile(cf);
 	commit_lock_file(&lk);
 
 cleanup:
-- 
gitgitgadget



* [PATCH v4 12/17] chunk-format: create read chunk API
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                         ` (10 preceding siblings ...)
  2021-02-18 14:07       ` [PATCH v4 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07       ` Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
                         ` (4 subsequent siblings)
  16 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add the capability to read the table of contents, then pair the chunks
with necessary logic using chunk_read_fn pointers. Callers will be added
in future changes, but the typical outline will be:

 1. initialize a 'struct chunkfile' with init_chunkfile(NULL).
 2. call read_table_of_contents().
 3. for each chunk to parse,
    a. call pair_chunk() to assign a pointer with the chunk position, or
    b. call read_chunk() to run a callback on the chunk start and size.
 4. call free_chunkfile() to clear the 'struct chunkfile' data.

We are re-using the anonymous 'struct chunkfile' data, as it is internal
to the chunk-format API. This gives it essentially two modes: write and
read. If the same struct instance were used for both reads and writes,
then there would be failures.
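
The read side of the outline above can be sketched as a lookup over big-endian table-of-contents entries, where a chunk's size is the distance to the next entry's offset. This is a simplified stand-in, not the implementation in this patch: toy_be32(), toy_be64() and toy_find_chunk() are hypothetical names, and the bounds check against the file size is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TOC_ENTRY_SIZE 12 /* 4-byte id + 8-byte offset, big-endian */
#define TOY_NOT_FOUND (-2)

static uint32_t toy_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t toy_be64(const unsigned char *p)
{
	return ((uint64_t)toy_be32(p) << 32) | toy_be32(p + 4);
}

/*
 * Look up 'want' among 'nr' table-of-contents entries starting at 'toc'.
 * The table must hold nr + 1 entries (the last one is the terminator).
 * On success, store the chunk's offset and size and return 0.
 */
static int toy_find_chunk(const unsigned char *toc, int nr, uint32_t want,
			  uint64_t *offset, uint64_t *size)
{
	int i;

	assert(toc && offset && size);
	for (i = 0; i < nr; i++) {
		const unsigned char *entry = toc + (size_t)i * TOY_TOC_ENTRY_SIZE;
		uint64_t start = toy_be64(entry + 4);
		uint64_t next = toy_be64(entry + TOY_TOC_ENTRY_SIZE + 4);

		if (toy_be32(entry) != want)
			continue;
		if (next < start)
			return -1; /* offsets must not decrease */
		*offset = start;
		*size = next - start;
		return 0;
	}
	return TOY_NOT_FOUND;
}
```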

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 chunk-format.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++
 chunk-format.h | 47 +++++++++++++++++++++++++++++
 2 files changed, 127 insertions(+)

diff --git a/chunk-format.c b/chunk-format.c
index 6c9b52b70c10..2c1fecf1c3e5 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -11,6 +11,8 @@ struct chunk_info {
 	uint32_t id;
 	uint64_t size;
 	chunk_write_fn write_fn;
+
+	const void *start;
 };
 
 struct chunkfile {
@@ -88,3 +90,81 @@ int write_chunkfile(struct chunkfile *cf, void *data)
 
 	return 0;
 }
+
+int read_table_of_contents(struct chunkfile *cf,
+			   const unsigned char *mfile,
+			   size_t mfile_size,
+			   uint64_t toc_offset,
+			   int toc_length)
+{
+	uint32_t chunk_id;
+	const unsigned char *table_of_contents = mfile + toc_offset;
+
+	ALLOC_GROW(cf->chunks, toc_length, cf->chunks_alloc);
+
+	while (toc_length--) {
+		uint64_t chunk_offset, next_chunk_offset;
+
+		chunk_id = get_be32(table_of_contents);
+		chunk_offset = get_be64(table_of_contents + 4);
+
+		if (!chunk_id) {
+			error(_("terminating chunk id appears earlier than expected"));
+			return 1;
+		}
+
+		table_of_contents += CHUNK_TOC_ENTRY_SIZE;
+		next_chunk_offset = get_be64(table_of_contents + 4);
+
+		if (next_chunk_offset < chunk_offset ||
+		    next_chunk_offset > mfile_size - the_hash_algo->rawsz) {
+			error(_("improper chunk offset(s) %"PRIx64" and %"PRIx64""),
+			      chunk_offset, next_chunk_offset);
+			return -1;
+		}
+
+		cf->chunks[cf->chunks_nr].id = chunk_id;
+		cf->chunks[cf->chunks_nr].start = mfile + chunk_offset;
+		cf->chunks[cf->chunks_nr].size = next_chunk_offset - chunk_offset;
+		cf->chunks_nr++;
+	}
+
+	chunk_id = get_be32(table_of_contents);
+	if (chunk_id) {
+		error(_("final chunk has non-zero id %"PRIx32""), chunk_id);
+		return -1;
+	}
+
+	return 0;
+}
+
+static int pair_chunk_fn(const unsigned char *chunk_start,
+			 size_t chunk_size,
+			 void *data)
+{
+	const unsigned char **p = data;
+	*p = chunk_start;
+	return 0;
+}
+
+int pair_chunk(struct chunkfile *cf,
+	       uint32_t chunk_id,
+	       const unsigned char **p)
+{
+	return read_chunk(cf, chunk_id, pair_chunk_fn, p);
+}
+
+int read_chunk(struct chunkfile *cf,
+	       uint32_t chunk_id,
+	       chunk_read_fn fn,
+	       void *data)
+{
+	int i;
+
+	for (i = 0; i < cf->chunks_nr; i++) {
+		if (cf->chunks[i].id == chunk_id)
+			return fn(cf->chunks[i].start, cf->chunks[i].size, data);
+	}
+
+	return CHUNK_NOT_FOUND;
+}
diff --git a/chunk-format.h b/chunk-format.h
index ce598b66d9f8..9ccbe0037792 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -8,6 +8,20 @@ struct chunkfile;
 
 #define CHUNK_TOC_ENTRY_SIZE (sizeof(uint32_t) + sizeof(uint64_t))
 
+/*
+ * Initialize a 'struct chunkfile' for writing _or_ reading a file
+ * with the chunk format.
+ *
+ * If writing a file, supply a non-NULL 'struct hashfile *' that will
+ * be used to write.
+ *
+ * If reading a file, use a NULL 'struct hashfile *' and then call
+ * read_table_of_contents(). Supply the memory-mapped data to the
+ * pair_chunk() or read_chunk() methods, as appropriate.
+ *
+ * DO NOT MIX THESE MODES. Use different 'struct chunkfile' instances
+ * for reading and writing.
+ */
 struct chunkfile *init_chunkfile(struct hashfile *f);
 void free_chunkfile(struct chunkfile *cf);
 int get_num_chunks(struct chunkfile *cf);
@@ -18,4 +32,37 @@ void add_chunk(struct chunkfile *cf,
 	       chunk_write_fn fn);
 int write_chunkfile(struct chunkfile *cf, void *data);
 
+int read_table_of_contents(struct chunkfile *cf,
+			   const unsigned char *mfile,
+			   size_t mfile_size,
+			   uint64_t toc_offset,
+			   int toc_length);
+
+#define CHUNK_NOT_FOUND (-2)
+
+/*
+ * Find 'chunk_id' in the given chunkfile and assign the
+ * given pointer to the position in the mmap'd file where
+ * that chunk begins.
+ *
+ * Returns CHUNK_NOT_FOUND if the chunk does not exist.
+ */
+int pair_chunk(struct chunkfile *cf,
+	       uint32_t chunk_id,
+	       const unsigned char **p);
+
+typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
+			     size_t chunk_size, void *data);
+/*
+ * Find 'chunk_id' in the given chunkfile and call the
+ * given chunk_read_fn method with the information for
+ * that chunk.
+ *
+ * Returns CHUNK_NOT_FOUND if the chunk does not exist.
+ */
+int read_chunk(struct chunkfile *cf,
+	       uint32_t chunk_id,
+	       chunk_read_fn fn,
+	       void *data);
+
 #endif
-- 
gitgitgadget



* [PATCH v4 13/17] commit-graph: use chunk-format read API
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                         ` (11 preceding siblings ...)
  2021-02-18 14:07       ` [PATCH v4 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07       ` Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 14/17] midx: " Derrick Stolee via GitGitGadget
                         ` (3 subsequent siblings)
  16 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Instead of parsing the table of contents directly, use the chunk-format
API methods read_table_of_contents() and pair_chunk(). While the current
implementation loses the duplicate-chunk detection, that will be added
in a future change.
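
To show the difference between the two kinds of callers, here is a small sketch of a "pair" callback that only records the chunk start and a "read" callback that also derives a count from the chunk size, in the style of graph_read_oid_lookup(). The 'toy_' names are hypothetical, and the divisibility check is an extra assumption that the patch itself does not make.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*toy_read_fn)(const unsigned char *chunk_start,
			   size_t chunk_size, void *data);

struct toy_graph {
	const unsigned char *oid_lookup;
	size_t hash_len;
	size_t num_commits;
};

/* "read" callback: record the start and derive a count from the size. */
static int toy_read_oid_lookup(const unsigned char *chunk_start,
			       size_t chunk_size, void *data)
{
	struct toy_graph *g = data;

	/* Extra sanity check for this sketch only. */
	if (!g->hash_len || chunk_size % g->hash_len)
		return -1;
	g->oid_lookup = chunk_start;
	g->num_commits = chunk_size / g->hash_len;
	return 0;
}

/* "pair" callback: only remember where the chunk starts. */
static int toy_pair_fn(const unsigned char *chunk_start,
		       size_t chunk_size, void *data)
{
	const unsigned char **p = data;

	(void)chunk_size;
	*p = chunk_start;
	return 0;
}

/* Stand-in for the lookup step: hand a located chunk to the callback. */
static int toy_run_callback(const unsigned char *chunk_start,
			    size_t chunk_size, toy_read_fn fn, void *data)
{
	assert(fn);
	return fn(chunk_start, chunk_size, data);
}
```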

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c          | 159 ++++++++++++++--------------------------
 t/t5318-commit-graph.sh |   2 +-
 2 files changed, 55 insertions(+), 106 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a889130cc849..76514a879e88 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -59,8 +59,7 @@ void git_test_write_commit_graph_or_die(void)
 
 #define GRAPH_HEADER_SIZE 8
 #define GRAPH_FANOUT_SIZE (4 * 256)
-#define GRAPH_CHUNKLOOKUP_WIDTH 12
-#define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
+#define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * CHUNK_TOC_ENTRY_SIZE \
 			+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
 
 #define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)
@@ -298,15 +297,43 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	return 0;
 }
 
+static int graph_read_oid_lookup(const unsigned char *chunk_start,
+				 size_t chunk_size, void *data)
+{
+	struct commit_graph *g = data;
+	g->chunk_oid_lookup = chunk_start;
+	g->num_commits = chunk_size / g->hash_len;
+	return 0;
+}
+
+static int graph_read_bloom_data(const unsigned char *chunk_start,
+				  size_t chunk_size, void *data)
+{
+	struct commit_graph *g = data;
+	uint32_t hash_version;
+	g->chunk_bloom_data = chunk_start;
+	hash_version = get_be32(chunk_start);
+
+	if (hash_version != 1)
+		return 0;
+
+	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
+	g->bloom_filter_settings->hash_version = hash_version;
+	g->bloom_filter_settings->num_hashes = get_be32(chunk_start + 4);
+	g->bloom_filter_settings->bits_per_entry = get_be32(chunk_start + 8);
+	g->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;
+
+	return 0;
+}
+
 struct commit_graph *parse_commit_graph(struct repository *r,
 					void *graph_map, size_t graph_size)
 {
-	const unsigned char *data, *chunk_lookup;
-	uint32_t i;
+	const unsigned char *data;
 	struct commit_graph *graph;
-	uint64_t next_chunk_offset;
 	uint32_t graph_signature;
 	unsigned char graph_version, hash_version;
+	struct chunkfile *cf = NULL;
 
 	if (!graph_map)
 		return NULL;
@@ -347,7 +374,7 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 	graph->data_len = graph_size;
 
 	if (graph_size < GRAPH_HEADER_SIZE +
-			 (graph->num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH +
+			 (graph->num_chunks + 1) * CHUNK_TOC_ENTRY_SIZE +
 			 GRAPH_FANOUT_SIZE + the_hash_algo->rawsz) {
 		error(_("commit-graph file is too small to hold %u chunks"),
 		      graph->num_chunks);
@@ -355,108 +382,28 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		return NULL;
 	}
 
-	chunk_lookup = data + 8;
-	next_chunk_offset = get_be64(chunk_lookup + 4);
-	for (i = 0; i < graph->num_chunks; i++) {
-		uint32_t chunk_id;
-		uint64_t chunk_offset = next_chunk_offset;
-		int chunk_repeated = 0;
-
-		chunk_id = get_be32(chunk_lookup + 0);
-
-		chunk_lookup += GRAPH_CHUNKLOOKUP_WIDTH;
-		next_chunk_offset = get_be64(chunk_lookup + 4);
-
-		if (chunk_offset > graph_size - the_hash_algo->rawsz) {
-			error(_("commit-graph improper chunk offset %08x%08x"), (uint32_t)(chunk_offset >> 32),
-			      (uint32_t)chunk_offset);
-			goto free_and_return;
-		}
-
-		switch (chunk_id) {
-		case GRAPH_CHUNKID_OIDFANOUT:
-			if (graph->chunk_oid_fanout)
-				chunk_repeated = 1;
-			else
-				graph->chunk_oid_fanout = (uint32_t*)(data + chunk_offset);
-			break;
-
-		case GRAPH_CHUNKID_OIDLOOKUP:
-			if (graph->chunk_oid_lookup)
-				chunk_repeated = 1;
-			else {
-				graph->chunk_oid_lookup = data + chunk_offset;
-				graph->num_commits = (next_chunk_offset - chunk_offset)
-						     / graph->hash_len;
-			}
-			break;
+	cf = init_chunkfile(NULL);
 
-		case GRAPH_CHUNKID_DATA:
-			if (graph->chunk_commit_data)
-				chunk_repeated = 1;
-			else
-				graph->chunk_commit_data = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_GENERATION_DATA:
-			if (graph->chunk_generation_data)
-				chunk_repeated = 1;
-			else
-				graph->chunk_generation_data = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
-			if (graph->chunk_generation_data_overflow)
-				chunk_repeated = 1;
-			else
-				graph->chunk_generation_data_overflow = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_EXTRAEDGES:
-			if (graph->chunk_extra_edges)
-				chunk_repeated = 1;
-			else
-				graph->chunk_extra_edges = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_BASE:
-			if (graph->chunk_base_graphs)
-				chunk_repeated = 1;
-			else
-				graph->chunk_base_graphs = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_BLOOMINDEXES:
-			if (graph->chunk_bloom_indexes)
-				chunk_repeated = 1;
-			else if (r->settings.commit_graph_read_changed_paths)
-				graph->chunk_bloom_indexes = data + chunk_offset;
-			break;
-
-		case GRAPH_CHUNKID_BLOOMDATA:
-			if (graph->chunk_bloom_data)
-				chunk_repeated = 1;
-			else if (r->settings.commit_graph_read_changed_paths) {
-				uint32_t hash_version;
-				graph->chunk_bloom_data = data + chunk_offset;
-				hash_version = get_be32(data + chunk_offset);
-
-				if (hash_version != 1)
-					break;
+	if (read_table_of_contents(cf, graph->data, graph_size,
+				   GRAPH_HEADER_SIZE, graph->num_chunks))
+		goto free_and_return;
 
-				graph->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
-				graph->bloom_filter_settings->hash_version = hash_version;
-				graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
-				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
-				graph->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;
-			}
-			break;
-		}
+	pair_chunk(cf, GRAPH_CHUNKID_OIDFANOUT,
+		   (const unsigned char **)&graph->chunk_oid_fanout);
+	read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);
+	pair_chunk(cf, GRAPH_CHUNKID_DATA, &graph->chunk_commit_data);
+	pair_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges);
+	pair_chunk(cf, GRAPH_CHUNKID_BASE, &graph->chunk_base_graphs);
+	pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
+		   &graph->chunk_generation_data);
+	pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
+		   &graph->chunk_generation_data_overflow);
 
-		if (chunk_repeated) {
-			error(_("commit-graph chunk id %08x appears multiple times"), chunk_id);
-			goto free_and_return;
-		}
+	if (r->settings.commit_graph_read_changed_paths) {
+		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
+			   &graph->chunk_bloom_indexes);
+		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
+			   graph_read_bloom_data, graph);
 	}
 
 	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
@@ -473,9 +420,11 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 	if (verify_commit_graph_lite(graph))
 		goto free_and_return;
 
+	free_chunkfile(cf);
 	return graph;
 
 free_and_return:
+	free_chunkfile(cf);
 	free(graph->bloom_filter_settings);
 	free(graph);
 	return NULL;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index fa27df579a57..c7da741284e5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -564,7 +564,7 @@ test_expect_success 'detect bad hash version' '
 
 test_expect_success 'detect low chunk count' '
 	corrupt_graph_and_verify $GRAPH_BYTE_CHUNK_COUNT "\01" \
-		"missing the .* chunk"
+		"final chunk has non-zero id"
 '
 
 test_expect_success 'detect missing OID fanout chunk' '
-- 
gitgitgadget



* [PATCH v4 14/17] midx: use chunk-format read API
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                         ` (12 preceding siblings ...)
  2021-02-18 14:07       ` [PATCH v4 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07       ` Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
                         ` (2 subsequent siblings)
  16 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Instead of parsing the table of contents directly, use the chunk-format
API methods read_table_of_contents() and pair_chunk(). In particular, we
can use the return value of pair_chunk() to generate an error when a
required chunk is missing.
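
A minimal sketch of that pattern, assuming a simplified in-memory table of contents: a required chunk turns a "not found" result into a failure, while an optional chunk simply leaves its pointer unset. 'toy_entry', toy_pair_chunk() and toy_load() are hypothetical names, and the sketch returns -1 where the patch calls die().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NOT_FOUND (-2)
#define TOY_ID_NAMES 0x504e414d /* "PNAM", treated as required */
#define TOY_ID_LARGE 0x4c4f4646 /* "LOFF", treated as optional */

struct toy_entry {
	uint32_t id;
	const unsigned char *start;
};

/* Point *p at the chunk with the given id, or report TOY_NOT_FOUND. */
static int toy_pair_chunk(const struct toy_entry *toc, size_t nr,
			  uint32_t id, const unsigned char **p)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (toc[i].id == id) {
			*p = toc[i].start;
			return 0;
		}
	}
	return TOY_NOT_FOUND;
}

/* Return 0 when every required chunk is present, -1 otherwise. */
static int toy_load(const struct toy_entry *toc, size_t nr,
		    const unsigned char **names,
		    const unsigned char **large)
{
	assert(names && large);
	if (toy_pair_chunk(toc, nr, TOY_ID_NAMES, names) == TOY_NOT_FOUND)
		return -1;

	/* Optional chunk: a miss leaves *large untouched. */
	toy_pair_chunk(toc, nr, TOY_ID_LARGE, large);
	return 0;
}
```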

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c                      | 73 +++++++++++++------------------------
 t/t5319-multi-pack-index.sh |  6 +--
 2 files changed, 29 insertions(+), 50 deletions(-)

diff --git a/midx.c b/midx.c
index d2fd9c10feea..d7ea0d1375fa 100644
--- a/midx.c
+++ b/midx.c
@@ -28,7 +28,6 @@
 #define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
 #define MIDX_CHUNKID_LARGEOFFSETS 0x4c4f4646 /* "LOFF" */
-#define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
 #define MIDX_CHUNK_FANOUT_SIZE (sizeof(uint32_t) * 256)
 #define MIDX_CHUNK_OFFSET_WIDTH (2 * sizeof(uint32_t))
 #define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
@@ -53,6 +52,19 @@ static char *get_midx_filename(const char *object_dir)
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
 
+static int midx_read_oid_fanout(const unsigned char *chunk_start,
+				size_t chunk_size, void *data)
+{
+	struct multi_pack_index *m = data;
+	m->chunk_oid_fanout = (uint32_t *)chunk_start;
+
+	if (chunk_size != 4 * 256) {
+		error(_("multi-pack-index OID fanout is of the wrong size"));
+		return 1;
+	}
+	return 0;
+}
+
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local)
 {
 	struct multi_pack_index *m = NULL;
@@ -64,6 +76,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 	char *midx_name = get_midx_filename(object_dir);
 	uint32_t i;
 	const char *cur_pack_name;
+	struct chunkfile *cf = NULL;
 
 	fd = git_open(midx_name);
 
@@ -113,58 +126,23 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 
 	m->num_packs = get_be32(m->data + MIDX_BYTE_NUM_PACKS);
 
-	for (i = 0; i < m->num_chunks; i++) {
-		uint32_t chunk_id = get_be32(m->data + MIDX_HEADER_SIZE +
-					     MIDX_CHUNKLOOKUP_WIDTH * i);
-		uint64_t chunk_offset = get_be64(m->data + MIDX_HEADER_SIZE + 4 +
-						 MIDX_CHUNKLOOKUP_WIDTH * i);
-
-		if (chunk_offset >= m->data_len)
-			die(_("invalid chunk offset (too large)"));
-
-		switch (chunk_id) {
-			case MIDX_CHUNKID_PACKNAMES:
-				m->chunk_pack_names = m->data + chunk_offset;
-				break;
-
-			case MIDX_CHUNKID_OIDFANOUT:
-				m->chunk_oid_fanout = (uint32_t *)(m->data + chunk_offset);
-				break;
-
-			case MIDX_CHUNKID_OIDLOOKUP:
-				m->chunk_oid_lookup = m->data + chunk_offset;
-				break;
-
-			case MIDX_CHUNKID_OBJECTOFFSETS:
-				m->chunk_object_offsets = m->data + chunk_offset;
-				break;
-
-			case MIDX_CHUNKID_LARGEOFFSETS:
-				m->chunk_large_offsets = m->data + chunk_offset;
-				break;
-
-			case 0:
-				die(_("terminating multi-pack-index chunk id appears earlier than expected"));
-				break;
-
-			default:
-				/*
-				 * Do nothing on unrecognized chunks, allowing future
-				 * extensions to add optional chunks.
-				 */
-				break;
-		}
-	}
+	cf = init_chunkfile(NULL);
 
-	if (!m->chunk_pack_names)
+	if (read_table_of_contents(cf, m->data, midx_size,
+				   MIDX_HEADER_SIZE, m->num_chunks))
+		goto cleanup_fail;
+
+	if (pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required pack-name chunk"));
-	if (!m->chunk_oid_fanout)
+	if (read_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required OID fanout chunk"));
-	if (!m->chunk_oid_lookup)
+	if (pair_chunk(cf, MIDX_CHUNKID_OIDLOOKUP, &m->chunk_oid_lookup) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required OID lookup chunk"));
-	if (!m->chunk_object_offsets)
+	if (pair_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, &m->chunk_object_offsets) == CHUNK_NOT_FOUND)
 		die(_("multi-pack-index missing required object offsets chunk"));
 
+	pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets);
+
 	m->num_objects = ntohl(m->chunk_oid_fanout[255]);
 
 	m->pack_names = xcalloc(m->num_packs, sizeof(*m->pack_names));
@@ -190,6 +168,7 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 cleanup_fail:
 	free(m);
 	free(midx_name);
+	free(cf);
 	if (midx_map)
 		munmap(midx_map, midx_size);
 	if (0 <= fd)
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 297de502a94f..ad4e878b65b8 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -314,12 +314,12 @@ test_expect_success 'verify bad OID version' '
 
 test_expect_success 'verify truncated chunk count' '
 	corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\01" $objdir \
-		"missing required"
+		"final chunk has non-zero id"
 '
 
 test_expect_success 'verify extended chunk count' '
 	corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\07" $objdir \
-		"terminating multi-pack-index chunk id appears earlier than expected"
+		"terminating chunk id appears earlier than expected"
 '
 
 test_expect_success 'verify missing required chunk' '
@@ -329,7 +329,7 @@ test_expect_success 'verify missing required chunk' '
 
 test_expect_success 'verify invalid chunk offset' '
 	corrupt_midx_and_verify $MIDX_BYTE_CHUNK_OFFSET "\01" $objdir \
-		"invalid chunk offset (too large)"
+		"improper chunk offset(s)"
 '
 
 test_expect_success 'verify packnames out of order' '
-- 
gitgitgadget



* [PATCH v4 15/17] midx: use 64-bit multiplication for chunk sizes
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                         ` (13 preceding siblings ...)
  2021-02-18 14:07       ` [PATCH v4 14/17] midx: " Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07       ` Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
  16 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When calculating the sizes of certain chunks, we should always use
64-bit multiplication. This allows us to properly predict the chunk
sizes without risk of overflow.

Other possible overflows were discovered by evaluating each
multiplication in midx.c and ensuring that at least one side of the
operator was of type size_t or off_t.
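
The effect is easy to reproduce in isolation. The sketch below multiplies a 32-bit entry count by an 8-byte entry width twice, once in 32 bits and once after widening one operand. It casts to uint64_t rather than the size_t and off_t used in this patch, because the width of those types depends on the platform; 'TOY_OFFSET_WIDTH' and the two helpers are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_OFFSET_WIDTH 8u /* two 32-bit values per entry */

/* 32-bit result: the product is reduced modulo 2^32. */
static uint64_t toy_size_32(uint32_t entries_nr)
{
	uint32_t product = entries_nr * TOY_OFFSET_WIDTH;

	return product;
}

/* Widen one operand first so the multiply happens in 64 bits. */
static uint64_t toy_size_64(uint32_t entries_nr)
{
	uint64_t size = (uint64_t)entries_nr * TOY_OFFSET_WIDTH;

	assert(size / TOY_OFFSET_WIDTH == entries_nr); /* no wrap-around */
	return size;
}
```

With 0x20000000 entries, the exact size is 2^32 bytes, so the 32-bit product is 0 while the widened product keeps the full value. For small counts both helpers agree.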

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 midx.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/midx.c b/midx.c
index d7ea0d1375fa..5c7f2ed23330 100644
--- a/midx.c
+++ b/midx.c
@@ -244,7 +244,7 @@ static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
 	const unsigned char *offset_data;
 	uint32_t offset32;
 
-	offset_data = m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH;
+	offset_data = m->chunk_object_offsets + (off_t)pos * MIDX_CHUNK_OFFSET_WIDTH;
 	offset32 = get_be32(offset_data + sizeof(uint32_t));
 
 	if (m->chunk_large_offsets && offset32 & MIDX_LARGE_OFFSET_NEEDED) {
@@ -260,7 +260,8 @@ static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
 
 static uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos)
 {
-	return get_be32(m->chunk_object_offsets + pos * MIDX_CHUNK_OFFSET_WIDTH);
+	return get_be32(m->chunk_object_offsets +
+			(off_t)pos * MIDX_CHUNK_OFFSET_WIDTH);
 }
 
 static int nth_midxed_pack_entry(struct repository *r,
@@ -912,15 +913,15 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	add_chunk(cf, MIDX_CHUNKID_OIDFANOUT, MIDX_CHUNK_FANOUT_SIZE,
 		  write_midx_oid_fanout);
 	add_chunk(cf, MIDX_CHUNKID_OIDLOOKUP,
-		  ctx.entries_nr * the_hash_algo->rawsz,
+		  (size_t)ctx.entries_nr * the_hash_algo->rawsz,
 		  write_midx_oid_lookup);
 	add_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS,
-		  ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH,
+		  (size_t)ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH,
 		  write_midx_object_offsets);
 
 	if (ctx.large_offsets_needed)
 		add_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS,
-			ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH,
+			(size_t)ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH,
 			write_midx_large_offsets);
 
 	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
-- 
gitgitgadget
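
The overflow this patch guards against can be sketched in isolation. The
following is a minimal illustration, not code from midx.c: `OFFSET_WIDTH`,
`chunk_size_narrow()` and `chunk_size_wide()` are hypothetical names, and
it assumes a platform where `int` is 32 bits, so that two `uint32_t`
operands are multiplied in 32-bit unsigned arithmetic. It uses `uint64_t`
where the patch uses `size_t` or `off_t` so the result does not depend on
the width of those types.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-entry width in bytes (stand-in for a chunk row width). */
#define OFFSET_WIDTH 8

/*
 * Both operands are 32 bits wide, so the product wraps modulo 2^32
 * before it is widened to the 64-bit return type.
 */
uint64_t chunk_size_narrow(uint32_t entries_nr)
{
	uint32_t width = OFFSET_WIDTH;
	return entries_nr * width;
}

/* Widening one operand first makes the multiplication happen in 64 bits. */
uint64_t chunk_size_wide(uint32_t entries_nr)
{
	return (uint64_t)entries_nr * OFFSET_WIDTH;
}
```

With 0x20000000 entries the true size is 2^32 bytes: the narrow version
wraps to zero, while the widened version keeps the full value.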



* [PATCH v4 16/17] chunk-format: restore duplicate chunk checks
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                         ` (14 preceding siblings ...)
  2021-02-18 14:07       ` [PATCH v4 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07       ` Derrick Stolee via GitGitGadget
  2021-02-18 14:07       ` [PATCH v4 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
  16 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Before refactoring into the chunk-format API, the commit-graph parsing
logic included checks for duplicate chunks. It is unlikely that we would
desire a chunk-based file format that allows duplicate chunk IDs in the
table of contents, so add duplicate checks into
read_table_of_contents().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 chunk-format.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/chunk-format.c b/chunk-format.c
index 2c1fecf1c3e5..da191e59a29d 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -97,6 +97,7 @@ int read_table_of_contents(struct chunkfile *cf,
 			   uint64_t toc_offset,
 			   int toc_length)
 {
+	int i;
 	uint32_t chunk_id;
 	const unsigned char *table_of_contents = mfile + toc_offset;
 
@@ -123,6 +124,14 @@ int read_table_of_contents(struct chunkfile *cf,
 			return -1;
 		}
 
+		for (i = 0; i < cf->chunks_nr; i++) {
+			if (cf->chunks[i].id == chunk_id) {
+				error(_("duplicate chunk ID %"PRIx32" found"),
+					chunk_id);
+				return -1;
+			}
+		}
+
 		cf->chunks[cf->chunks_nr].id = chunk_id;
 		cf->chunks[cf->chunks_nr].start = mfile + chunk_offset;
 		cf->chunks[cf->chunks_nr].size = next_chunk_offset - chunk_offset;
-- 
gitgitgadget
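
The check added above compares each new chunk ID against the IDs already
recorded. A standalone sketch of the same idea follows; the function name
`find_duplicate_chunk_id()` and the plain array of IDs are assumptions
made for illustration and are not part of chunk-format.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns the index of the first ID that repeats an earlier one,
 * or -1 if every ID in the list is unique. The scan is quadratic,
 * which is acceptable for the handful of chunks a file contains.
 */
int find_duplicate_chunk_id(const uint32_t *ids, size_t nr)
{
	size_t i, j;

	for (i = 0; i < nr; i++)
		for (j = 0; j < i; j++)
			if (ids[j] == ids[i])
				return (int)i;
	return -1;
}
```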



* [PATCH v4 17/17] chunk-format: add technical docs
  2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
                         ` (15 preceding siblings ...)
  2021-02-18 14:07       ` [PATCH v4 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
@ 2021-02-18 14:07       ` Derrick Stolee via GitGitGadget
  2021-02-18 21:47         ` Junio C Hamano
  16 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-18 14:07 UTC (permalink / raw)
  To: git
  Cc: me, gitster, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The chunk-based file format is now an API in the code, but we should
also take time to document it as a file format. Specifically, it matches
the CHUNK LOOKUP sections of the commit-graph and multi-pack-index
files, but there are some commonalities that should be grouped in this
document.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/chunk-format.txt      | 116 ++++++++++++++++++
 .../technical/commit-graph-format.txt         |   3 +
 Documentation/technical/pack-format.txt       |   3 +
 3 files changed, 122 insertions(+)
 create mode 100644 Documentation/technical/chunk-format.txt

diff --git a/Documentation/technical/chunk-format.txt b/Documentation/technical/chunk-format.txt
new file mode 100644
index 000000000000..593614fcedab
--- /dev/null
+++ b/Documentation/technical/chunk-format.txt
@@ -0,0 +1,116 @@
+Chunk-based file formats
+========================
+
+Some file formats in Git use a common concept of "chunks" to describe
+sections of the file. This allows structured access to a large file by
+scanning a small "table of contents" for the remaining data. This common
+format is used by the `commit-graph` and `multi-pack-index` files. See
+link:technical/pack-format.html[the `multi-pack-index` format] and
+link:technical/commit-graph-format.html[the `commit-graph` format] for
+how they use the chunks to describe structured data.
+
+A chunk-based file format begins with some header information custom to
+that format. That header should include enough information to identify
+the file type, format version, and number of chunks in the file. From this
+information, a reader can determine the start of the chunk-based region.
+
+The chunk-based region starts with a table of contents describing where
+each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
+where C is the number of chunks. Consider the following table:
+
+  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
+  |--------------------|------------------------|
+  | ID[0]              | OFFSET[0]              |
+  | ...                | ...                    |
+  | ID[C-1]            | OFFSET[C-1]            |
+  | 0x00000000         | OFFSET[C]              |
+
+Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
+Each integer is stored in network-byte order.
+
+The chunk identifier `ID[i]` is a label for the data stored within this
+file from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
+size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
+and `OFFSET[i]`. This requires that the chunk data appears contiguously
+in the same order as the table of contents.
+
+The final entry in the table of contents must have an ID of four zero
+bytes. This entry confirms that the table of contents is ending and
+provides the offset for the end of the chunk-based data.
+
+Note: The chunk-based format expects that the file contains _at least_ a
+trailing hash after `OFFSET[C]`.
+
+Functions for working with chunk-based file formats are declared in
+`chunk-format.h`. Using these methods provides extra checks that assist
+developers when creating new file formats.
+
+Writing chunk-based file formats
+--------------------------------
+
+To write a chunk-based file format, create a `struct chunkfile` by
+calling `init_chunkfile()` and pass a `struct hashfile` pointer. The
+caller is responsible for opening the `hashfile` and writing header
+information so the file format is identifiable before the chunk-based
+format begins.
+
+Then, call `add_chunk()` for each chunk that should be written. This
+populates the `chunkfile` with information about the order and size of
+each chunk to write. Provide a `chunk_write_fn` function pointer to
+perform the write of the chunk data upon request.
+
+Call `write_chunkfile()` to write the table of contents to the `hashfile`
+followed by each of the chunks. This will verify that each chunk wrote
+the expected amount of data so the table of contents is correct.
+
+Finally, call `free_chunkfile()` to clear the `struct chunkfile` data. The
+caller is responsible for finalizing the `hashfile` by writing the trailing
+hash and closing the file.
+
+Reading chunk-based file formats
+--------------------------------
+
+To read a chunk-based file format, the file must be opened as a
+memory-mapped region. The chunk-format API expects that the entire file
+is mapped as a contiguous memory region.
+
+Initialize a `struct chunkfile` pointer with `init_chunkfile(NULL)`.
+
+After reading the header information from the beginning of the file,
+including the chunk count, call `read_table_of_contents()` to populate
+the `struct chunkfile` with the list of chunks, their offsets, and their
+sizes.
+
+Extract the data information for each chunk using `pair_chunk()` or
+`read_chunk()`:
+
+* `pair_chunk()` assigns a given pointer with the location inside the
+  memory-mapped file corresponding to that chunk's offset. If the chunk
+  does not exist, then the pointer is not modified.
+
+* `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
+  with the appropriate initial pointer and size information. The function
+  is not called if the chunk does not exist. Use this method to read chunks
+  if you need to perform immediate parsing or if you need to execute logic
+  based on the size of the chunk.
+
+After calling these methods, call `free_chunkfile()` to clear the
+`struct chunkfile` data. This will not close the memory-mapped region.
+Callers are expected to own that data for the timeframe the pointers into
+the region are needed.
+
+Examples
+--------
+
+These file formats use the chunk-format API, and can be used as examples
+for future formats:
+
+* *commit-graph:* see `write_commit_graph_file()` and `parse_commit_graph()`
+  in `commit-graph.c` for how the chunk-format API is used to write and
+  parse the commit-graph file format documented in
+  link:technical/commit-graph-format.html[the commit-graph file format].
+
+* *multi-pack-index:* see `write_midx_internal()` and `load_multi_pack_index()`
+  in `midx.c` for how the chunk-format API is used to write and
+  parse the multi-pack-index file format documented in
+  link:technical/pack-format.html[the multi-pack-index file format].
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index b6658eff1882..87971c27dd73 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -61,6 +61,9 @@ CHUNK LOOKUP:
       the length using the next chunk position if necessary.) Each chunk
       ID appears at most once.
 
+  The CHUNK LOOKUP matches the table of contents from
+  link:technical/chunk-format.html[the chunk-based file format].
+
   The remaining data in the body is described one chunk at a time, and
   these chunks may be given in any order. Chunks are required unless
   otherwise specified.
diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index f96b2e605f34..2fb1e60d29ec 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -301,6 +301,9 @@ CHUNK LOOKUP:
 	    (Chunks are provided in file-order, so you can infer the length
 	    using the next chunk position if necessary.)
 
+	The CHUNK LOOKUP matches the table of contents from
+	link:technical/chunk-format.html[the chunk-based file format].
+
 	The remaining data in the body is described one chunk at a time, and
 	these chunks may be given in any order. Chunks are required unless
 	otherwise specified.
-- 
gitgitgadget
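
To make the table-of-contents layout from the new document concrete, here
is a small sketch of a parser for it. It is an illustration only and is
not read_table_of_contents(): `parse_toc()`, `struct toc_entry` and the
`load_be32()`/`load_be64()` helpers are hypothetical names, and it assumes
the caller has already located the table and knows the chunk count from
the format-specific header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_ROW_WIDTH 12 /* 4-byte ID followed by an 8-byte offset */

struct toc_entry {
	uint32_t id;
	uint64_t offset;
	uint64_t size;
};

/* Read a big-endian (network-byte order) 32-bit integer. */
static uint32_t load_be32(const unsigned char *p)
{
	uint32_t v = 0;
	int i;

	for (i = 0; i < 4; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Read a big-endian (network-byte order) 64-bit integer. */
static uint64_t load_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Parse nr chunk rows plus the terminating row. Each chunk's size is
 * the next row's offset minus its own. Returns 0 on success, or -1 if
 * an offset decreases or the terminating ID is not four zero bytes.
 */
int parse_toc(const unsigned char *toc, size_t nr, struct toc_entry *out)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		const unsigned char *row = toc + i * TOC_ROW_WIDTH;
		uint64_t start = load_be64(row + 4);
		uint64_t end = load_be64(row + TOC_ROW_WIDTH + 4);

		if (end < start)
			return -1;
		out[i].id = load_be32(row);
		out[i].offset = start;
		out[i].size = end - start;
	}
	if (load_be32(toc + nr * TOC_ROW_WIDTH) != 0)
		return -1;
	return 0;
}
```

A real reader would also verify that every offset lies inside the mapped
file, which this sketch leaves out.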


* Re: [PATCH v4 17/17] chunk-format: add technical docs
  2021-02-18 14:07       ` [PATCH v4 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
@ 2021-02-18 21:47         ` Junio C Hamano
  2021-02-19 12:42           ` Derrick Stolee
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2021-02-18 21:47 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee, Han-Wen Nienhuys

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +Chunk-based file formats
> +========================
> +
> +Some file formats in Git use a common concept of "chunks" to describe
> +sections of the file. This allows structured access to a large file by
> +scanning a small "table of contents" for the remaining data. This common
> +format is used by the `commit-graph` and `multi-pack-index` files. See
> +link:technical/pack-format.html[the `multi-pack-index` format] and
> +link:technical/commit-graph-format.html[the `commit-graph` format] for
> +how they use the chunks to describe structured data.

I've read the doc added here to the end; well written and easy to
understand.

I wonder how well, if at all, reftable files fit in this scheme; if
they don't, should the chunk file format API be updated to accommodate
them (or the other way around)?

> +Extract the data information for each chunk using `pair_chunk()` or
> +`read_chunk()`:
> +
> +* `pair_chunk()` assigns a given pointer with the location inside the
> +  memory-mapped file corresponding to that chunk's offset. If the chunk
> +  does not exist, then the pointer is not modified.

I think it is worth adding:

    The caller is expected to know where the returned chunk ends by
    some out-of-band means, as this function only gives the offset
    but not the size, unlike the read_chunk() function.

> +* `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
> +  with the appropriate initial pointer and size information. The function
> +  is not called if the chunk does not exist. Use this method to read chunks
> +  if you need to perform immediate parsing or if you need to execute logic
> +  based on the size of the chunk.
> +
> +After calling these methods, call `free_chunkfile()` to clear the
> +`struct chunkfile` data. This will not close the memory-mapped region.
> +Callers are expected to own that data for the timeframe the pointers into
> +the region are needed.

Thanks.


* Re: [PATCH v4 17/17] chunk-format: add technical docs
  2021-02-18 21:47         ` Junio C Hamano
@ 2021-02-19 12:42           ` Derrick Stolee
  0 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee @ 2021-02-19 12:42 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, l.s.r, szeder.dev, Chris Torek, Derrick Stolee,
	Derrick Stolee, Han-Wen Nienhuys

On 2/18/2021 4:47 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> +Chunk-based file formats
>> +========================
>> +
>> +Some file formats in Git use a common concept of "chunks" to describe
>> +sections of the file. This allows structured access to a large file by
>> +scanning a small "table of contents" for the remaining data. This common
>> +format is used by the `commit-graph` and `multi-pack-index` files. See
>> +link:technical/pack-format.html[the `multi-pack-index` format] and
>> +link:technical/commit-graph-format.html[the `commit-graph` format] for
>> +how they use the chunks to describe structured data.
> 
> I've read the doc added here to the end; well written and easy to
> understand.
> 
> I wonder how well, if at all, reftable files fit in this scheme; if
> they don't, should the chunk file format API be updated to accommodate
> them (or the other way around)?

I'm not sure that reftable can work with this format, especially with
its design to do most updates as append-only (IIUC). And to change the
format to work with the chunk format would violate the compatibility
with the JGit version. I would be interested if something like the
packed-refs file could use a minor update, but only if there is a
realistic benefit to using chunks over the current format.

The files that are on my radar for adopting a new file format using the
chunk-format API are:

* reachability bitmaps: using a similar approach to the commit-graph,
  we could avoid parsing the entire file before checking if a specific
  commit has a bitmap. (Requires a commit lookup chunk, a bitmap data
  chunk, and an offset chunk to connect them.)

* index v5: I'm trying to collect a bunch of information about how to
  update the index for better compression, and the chunk-based approach
  can provide some fixed-width columns that can vary in length depending
  on the required data (presenting the interesting behavior from v2 and v3,
  along with possible approaches previously presented as a potential v5).
  The paths could be presented as a chunk, giving the interesting options
  between v2/3 and v4 (prefix compression). I haven't even started the
  actual work here, but I've been thinking about it a lot. I'll have time
  next month to start prototyping.

Are there other interesting files that could use a new version here?
What other pain points are known to experts in the area?

>> +Extract the data information for each chunk using `pair_chunk()` or
>> +`read_chunk()`:
>> +
>> +* `pair_chunk()` assigns a given pointer with the location inside the
>> +  memory-mapped file corresponding to that chunk's offset. If the chunk
>> +  does not exist, then the pointer is not modified.
> 
> I think it is worth adding:
> 
>     The caller is expected to know where the returned chunk ends by
>     some out-of-band means, as this function only gives the offset
>     but not the size, unlike the read_chunk() function.

True. I suppose that could be more explicit, although it can be gleaned
from the omission of any size information.

Thanks,
-Stolee
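
The distinction discussed above, a pointer without a size versus a
callback that receives both, can be shown with a toy model. Everything
here is hypothetical: `struct toy_chunk`, `toy_pair_chunk()`,
`toy_read_chunk()`, `toy_record_size()` and their return values are
invented for illustration and do not reproduce the signatures declared
in chunk-format.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_chunk {
	uint32_t id;
	const unsigned char *start;
	size_t size;
};

typedef int (*toy_read_fn)(const unsigned char *start, size_t size,
			   void *data);

static const struct toy_chunk *toy_find(const struct toy_chunk *chunks,
					size_t nr, uint32_t id)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (chunks[i].id == id)
			return &chunks[i];
	return NULL;
}

/* Hands back only the start pointer; *p is untouched if the ID is absent. */
int toy_pair_chunk(const struct toy_chunk *chunks, size_t nr,
		   uint32_t id, const unsigned char **p)
{
	const struct toy_chunk *c = toy_find(chunks, nr, id);

	if (!c)
		return -1;
	*p = c->start;
	return 0;
}

/* Passes both pointer and size to fn; fn is not called if the ID is absent. */
int toy_read_chunk(const struct toy_chunk *chunks, size_t nr,
		   uint32_t id, toy_read_fn fn, void *data)
{
	const struct toy_chunk *c = toy_find(chunks, nr, id);

	if (!c)
		return -1;
	return fn(c->start, c->size, data);
}

/* Example callback: record the size so the caller can bounds-check later. */
int toy_record_size(const unsigned char *start, size_t size, void *data)
{
	(void)start;
	*(size_t *)data = size;
	return 0;
}
```

In this model a caller that only pairs a pointer has to learn the chunk
length some other way, which is the out-of-band knowledge mentioned above.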


* Re: [PATCH v4 03/17] commit-graph: use chunk-format write API
  2021-02-18 14:07       ` [PATCH v4 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
@ 2021-02-24 16:52         ` SZEDER Gábor
  2021-02-24 17:12           ` Taylor Blau
  0 siblings, 1 reply; 120+ messages in thread
From: SZEDER Gábor @ 2021-02-24 16:52 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, gitster, l.s.r, Chris Torek, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

On Thu, Feb 18, 2021 at 02:07:25PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The commit-graph write logic is ready to make use of the chunk-format
> write API. Each chunk write method is already in the correct prototype.
> We only need to use the 'struct chunkfile' pointer and the correct API
> calls.

This patch series messes up the "Writing out commit graph" progress
display, and starting at this commit I get:

  $ git commit-graph write --reachable
  Expanding reachable commits in commit graph: 837569, done.
  Writing out commit graph in 3 passes: 166% (4187845/2512707), done.

Note that 166%.

Before this commit I got:

  Expanding reachable commits in commit graph: 837569, done.
  Writing out commit graph in 5 passes: 100% (4187845/4187845), done.

Note the different number of passes.
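
The two outputs above are consistent with a progress total computed as
the number of chunks times the number of commits, with one tick per
commit per chunk written. The sketch below only reproduces that
arithmetic; `expected_total()` and `progress_percent()` are hypothetical
helpers, and the truncating percentage is an assumption about how the
meter rounds.

```c
#include <assert.h>
#include <stdint.h>

/* Total ticks announced up front: one per commit for each chunk. */
uint64_t expected_total(unsigned num_chunks, uint64_t commits_nr)
{
	return (uint64_t)num_chunks * commits_nr;
}

/* Integer percentage, truncated toward zero. */
unsigned progress_percent(uint64_t done, uint64_t total)
{
	return (unsigned)(done * 100 / total);
}
```

With 837569 commits, five chunks give a total of 4187845 and three give
2512707; reporting 4187845 ticks against the smaller total yields 166%.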



* Re: [PATCH v4 03/17] commit-graph: use chunk-format write API
  2021-02-24 16:52         ` SZEDER Gábor
@ 2021-02-24 17:12           ` Taylor Blau
  2021-02-24 17:52             ` Derrick Stolee
  0 siblings, 1 reply; 120+ messages in thread
From: Taylor Blau @ 2021-02-24 17:12 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Derrick Stolee via GitGitGadget, git, me, gitster, l.s.r,
	Chris Torek, Derrick Stolee, Derrick Stolee, Derrick Stolee

On Wed, Feb 24, 2021 at 05:52:42PM +0100, SZEDER Gábor wrote:
> On Thu, Feb 18, 2021 at 02:07:25PM +0000, Derrick Stolee via GitGitGadget wrote:
> > From: Derrick Stolee <dstolee@microsoft.com>
> >
> > The commit-graph write logic is ready to make use of the chunk-format
> > write API. Each chunk write method is already in the correct prototype.
> > We only need to use the 'struct chunkfile' pointer and the correct API
> > calls.
>
> This patch series messes up the "Writing out commit graph" progress
> display, and starting at this commit I get:

I can confirm. It looks like we never dropped the 'num_chunks' variable,
which should have happened in this patch.

Here's something to apply on top which fixes the issue. Thanks for
reporting.

--- >8 ---

Subject: [PATCH] commit-graph.c: display correct number of chunks when writing

When writing a commit-graph, a progress meter is shown which indicates
the number of pieces of data to write (one per commit in each chunk).

In 47410aa837 (commit-graph: use chunk-format write API, 2021-02-18),
the number of chunks became tracked by the new chunk-format API. But a
stray local variable was left behind from when write_commit_graph_file()
used to keep track of the same.

Since this was no longer updated after 47410aa837, the progress meter
appeared broken:

    $ git commit-graph write --reachable
    Expanding reachable commits in commit graph: 837569, done.
    Writing out commit graph in 3 passes: 166% (4187845/2512707), done.

Drop the local variable and rely instead on the chunk-format API to tell
us the correct number of chunks.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 78b993c367..6aa0c488f5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1791,7 +1791,6 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	struct lock_file lk = LOCK_INIT;
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
-	int num_chunks = 3;
 	struct object_id file_hash;
 	struct chunkfile *cf;

@@ -1887,11 +1886,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		strbuf_addf(&progress_title,
 			    Q_("Writing out commit graph in %d pass",
 			       "Writing out commit graph in %d passes",
-			       num_chunks),
-			    num_chunks);
+			       get_num_chunks(cf)),
+			    get_num_chunks(cf));
 		ctx->progress = start_delayed_progress(
 			progress_title.buf,
-			num_chunks * ctx->commits.nr);
+			get_num_chunks(cf) * ctx->commits.nr);
 	}

 	write_chunkfile(cf, ctx);
--
2.30.0.667.g81c0cbc6fd



* Re: [PATCH v4 03/17] commit-graph: use chunk-format write API
  2021-02-24 17:12           ` Taylor Blau
@ 2021-02-24 17:52             ` Derrick Stolee
  2021-02-24 19:44               ` Junio C Hamano
  0 siblings, 1 reply; 120+ messages in thread
From: Derrick Stolee @ 2021-02-24 17:52 UTC (permalink / raw)
  To: Taylor Blau, SZEDER Gábor
  Cc: Derrick Stolee via GitGitGadget, git, gitster, l.s.r,
	Chris Torek, Derrick Stolee, Derrick Stolee

On 2/24/2021 12:12 PM, Taylor Blau wrote:
> On Wed, Feb 24, 2021 at 05:52:42PM +0100, SZEDER Gábor wrote:
>> On Thu, Feb 18, 2021 at 02:07:25PM +0000, Derrick Stolee via GitGitGadget wrote:
>>> From: Derrick Stolee <dstolee@microsoft.com>
>>>
>>> The commit-graph write logic is ready to make use of the chunk-format
>>> write API. Each chunk write method is already in the correct prototype.
>>> We only need to use the 'struct chunkfile' pointer and the correct API
>>> calls.
>>
>> This patch series messes up the "Writing out commit graph" progress
>> display, and starting at this commit I get:

Thanks for the report and identifying the exact place that caused the
mistake.
 
> I can confirm. It looks like we never dropped the 'num_chunks' variable,
> which should have happened in this patch.

Yes, makes sense. Hard to see that 'num_chunks' wasn't used because it
_was_ being used, just not as intended.

> @@ -1887,11 +1886,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  		strbuf_addf(&progress_title,
>  			    Q_("Writing out commit graph in %d pass",
>  			       "Writing out commit graph in %d passes",
> -			       num_chunks),
> -			    num_chunks);
> +			       get_num_chunks(cf)),
> +			    get_num_chunks(cf));
>  		ctx->progress = start_delayed_progress(
>  			progress_title.buf,
> -			num_chunks * ctx->commits.nr);
> +			get_num_chunks(cf) * ctx->commits.nr);

This is obviously correct. Thanks for the quick patch!

-Stolee


* Re: [PATCH v4 03/17] commit-graph: use chunk-format write API
  2021-02-24 17:52             ` Derrick Stolee
@ 2021-02-24 19:44               ` Junio C Hamano
  0 siblings, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2021-02-24 19:44 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Taylor Blau, SZEDER Gábor, Derrick Stolee via GitGitGadget,
	git, l.s.r, Chris Torek, Derrick Stolee, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

>>> This patch series messes up the "Writing out commit graph" progress
>>> display, and starting at this commit I get:
>
> Thanks for the report and identifying the exact place that caused the
> mistake.
>  
>> I can confirm. It looks like we never dropped the 'num_chunks' variable,
>> which should have happened in this patch.
>
> Yes, makes sense. Hard to see that 'num_chunks' wasn't used because it
> _was_ being used, just not as intended.
> ...
>
> This is obviously correct. Thanks for the quick patch!

Thanks all for noticing and fixing before the series hit the master
branch.



end of thread, other threads:[~2021-02-24 19:44 UTC | newest]

Thread overview: 120+ messages
2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2021-01-27  1:53   ` Chris Torek
2021-01-27  2:36     ` Taylor Blau
2021-01-26 16:01 ` [PATCH 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
2021-01-27  2:42   ` Taylor Blau
2021-01-27 13:49     ` Derrick Stolee
2021-01-26 16:01 ` [PATCH 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
2021-01-27  2:47   ` Taylor Blau
2021-01-26 16:01 ` [PATCH 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2021-01-27  2:49   ` Taylor Blau
2021-01-26 16:01 ` [PATCH 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
2021-01-27  3:02   ` Taylor Blau
2021-01-26 16:01 ` [PATCH 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 14/17] midx: " Derrick Stolee via GitGitGadget
2021-01-27  3:06   ` Taylor Blau
2021-01-27 13:50     ` Derrick Stolee
2021-01-26 16:01 ` [PATCH 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2021-01-26 22:37 ` [PATCH 00/17] Refactor chunk-format into an API Junio C Hamano
2021-01-27  2:29 ` Taylor Blau
2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
2021-02-04 21:24     ` Junio C Hamano
2021-02-04 22:40       ` Junio C Hamano
2021-02-05 11:37       ` Derrick Stolee
2021-02-05 19:25         ` Junio C Hamano
2021-01-27 15:01   ` [PATCH v2 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
2021-02-04 22:59     ` Junio C Hamano
2021-02-05 11:42       ` Derrick Stolee
2021-01-27 15:01   ` [PATCH v2 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
2021-02-04 23:40     ` Junio C Hamano
2021-02-05 12:19       ` Derrick Stolee
2021-02-05 19:37         ` Junio C Hamano
2021-02-08 22:26           ` Junio C Hamano
2021-02-09  1:33             ` Derrick Stolee
2021-02-09 20:47               ` Junio C Hamano
2021-01-27 15:01   ` [PATCH v2 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 14/17] midx: " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2021-02-05  0:00     ` Junio C Hamano
2021-02-05 10:59       ` Chris Torek
2021-02-05 20:41         ` Junio C Hamano
2021-02-06 20:35           ` Chris Torek
2021-02-05 12:30       ` Derrick Stolee
2021-02-05 19:42         ` Junio C Hamano
2021-02-07 19:50       ` SZEDER Gábor
2021-02-08  5:41         ` Junio C Hamano
2021-01-27 15:01   ` [PATCH v2 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2021-02-05  0:05     ` Junio C Hamano
2021-02-05 12:31       ` Derrick Stolee
2021-01-27 15:01   ` [PATCH v2 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2021-02-05  0:15     ` Junio C Hamano
2021-01-27 16:03   ` [PATCH v2 00/17] Refactor chunk-format into an API Taylor Blau
2021-02-05  2:08   ` Junio C Hamano
2021-02-05  2:27     ` Derrick Stolee
2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
2021-02-07 21:13       ` SZEDER Gábor
2021-02-08 13:44         ` Derrick Stolee
2021-02-11 19:43           ` SZEDER Gábor
2021-02-05 14:30     ` [PATCH v3 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
2021-02-07 20:20       ` SZEDER Gábor
2021-02-08 13:35         ` Derrick Stolee
2021-02-05 14:30     ` [PATCH v3 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 14/17] midx: " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
2021-02-24 16:52         ` SZEDER Gábor
2021-02-24 17:12           ` Taylor Blau
2021-02-24 17:52             ` Derrick Stolee
2021-02-24 19:44               ` Junio C Hamano
2021-02-18 14:07       ` [PATCH v4 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 14/17] midx: " Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2021-02-18 21:47         ` Junio C Hamano
2021-02-19 12:42           ` Derrick Stolee
