All of lore.kernel.org
 help / color / mirror / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, szeder.dev@gmail.com,
	Derrick Stolee <derrickstolee@github.com>,
	Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH 10/15] midx: use chunk-format API in write_midx_internal()
Date: Tue, 8 Dec 2020 13:42:05 -0500	[thread overview]
Message-ID: <X8/I/RzXZksio+ri@nand.local> (raw)
In-Reply-To: <f2f78ee1054f64fe767335c4b0c77ba2fbec5fae.1607012215.git.gitgitgadget@gmail.com>

On Thu, Dec 03, 2020 at 04:16:49PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The chunk-format API allows automatically writing the table of contents
> for a chunk-based file format when using an array of "struct
> chunk_info"s. Update write_midx_internal() to use this strategy, which
> also simplifies the chunk writing loop. This loop will be replaced with
> a chunk-format API call in an upcoming change.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  midx.c | 96 +++++++++++++---------------------------------------------
>  1 file changed, 21 insertions(+), 75 deletions(-)
>
> diff --git a/midx.c b/midx.c
> index ce6d4339bd..0548266bea 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -11,6 +11,7 @@
>  #include "trace2.h"
>  #include "run-command.h"
>  #include "repository.h"
> +#include "chunk-format.h"
>
>  #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
>  #define MIDX_VERSION 1
> @@ -799,15 +800,14 @@ static int write_midx_large_offsets(struct hashfile *f,
>  static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
>  			       struct string_list *packs_to_drop, unsigned flags)
>  {
> -	unsigned char cur_chunk, num_chunks = 0;
> +	unsigned char num_chunks = 0;
>  	char *midx_name;
>  	uint32_t i;
>  	struct hashfile *f = NULL;
>  	struct lock_file lk;
>  	struct write_midx_context ctx = { 0 };
>  	uint64_t header_size = 0;
> -	uint32_t chunk_ids[MIDX_MAX_CHUNKS + 1];
> -	uint64_t chunk_offsets[MIDX_MAX_CHUNKS + 1];
> +	struct chunk_info chunks[MIDX_MAX_CHUNKS];
>  	int pack_name_concat_len = 0;
>  	int dropped_packs = 0;
>  	int result = 0;
> @@ -923,7 +923,6 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>  	if (ctx.m)
>  		close_midx(ctx.m);
>
> -	cur_chunk = 0;
>  	num_chunks = ctx.large_offsets_needed ? 5 : 4;
>
>  	if (ctx.nr - dropped_packs == 0) {
> @@ -934,85 +933,32 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
>
>  	header_size = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);
>
> -	chunk_ids[cur_chunk] = MIDX_CHUNKID_PACKNAMES;
> -	chunk_offsets[cur_chunk] = header_size + (num_chunks + 1) * MIDX_CHUNKLOOKUP_WIDTH;
> +	chunks[0].id = MIDX_CHUNKID_PACKNAMES;
> +	chunks[0].size = pack_name_concat_len;
> +	chunks[0].write_fn = write_midx_pack_names;

[...]

Hmm. The caller has to do quite a lot of work in order to feed chunks
into the new chunk-format code.

I wonder what it would look like if we introduced a new 'struct
chunkfile' type. It would know about the underlying hashfile that it's
writing to, as well as the chunks it's supposed to write. It would
likely support three operations: add a new chunk, write the TOC, and
write the contents (the latter two could probably be combined into one).

Below is a patch to do just that. It seems to work OK on t5319, but
fails some tests in t5318. I haven't looked into the commit-graph test
failures, but I'd be happy to do so if you think this is a direction
worth going in.

Before I show the patch, though, let's take a look at the stat:

 chunk-format.c | 61 ++++++++++++++++++++++++++++++++++++++----------
 chunk-format.h | 25 ++++++++++++--------
 commit-graph.c | 90 ++++++++++++++++++++++-------------------------------------------------
 midx.c         | 38 +++++++++++++-----------------
 4 files changed, 107 insertions(+), 107 deletions(-)

That's a little bit more code in the chunk-format code, which I can live
with, but the diff in commit-graph.c and midx.c is a net-negative, as I
would expect it to be. So, I'm happy that this seems to make things
easier on the callers.

OK, enough rambling. Here's the patch:

diff --git a/chunk-format.c b/chunk-format.c
index 771b6d98d0..14f2fe5c9a 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -1,26 +1,63 @@
 #include "git-compat-util.h"
 #include "chunk-format.h"
 #include "csum-file.h"
+#include "cache.h"
 #define CHUNK_LOOKUP_WIDTH 12

-void write_table_of_contents(struct hashfile *f,
-			     uint64_t cur_offset,
-			     struct chunk_info *chunks,
-			     int nr)
+void chunkfile_init(struct chunkfile *out, struct hashfile *f)
 {
-	int i;
+	if (!out)
+		BUG("cannot initialize null chunkfile");
+
+	memset(out, 0, sizeof(*out));
+	out->f = f;
+}
+
+void chunkfile_push_chunk(struct chunkfile *c,
+			  uint32_t id, uint64_t size, chunk_write_fn write_fn)
+{
+	ALLOC_GROW(c->chunks, c->chunks_nr + 1, c->chunks_alloc);
+	c->chunks[c->chunks_nr].id = id;
+	c->chunks[c->chunks_nr].size = size;
+	c->chunks[c->chunks_nr].write_fn = write_fn;
+	c->chunks_nr++;
+}
+
+void chunkfile_write_toc(struct chunkfile *c)
+{
+	size_t i;
+	off_t cur_offset = hashfile_total(c->f);

 	/* Add the table of contents to the current offset */
-	cur_offset += (nr + 1) * CHUNK_LOOKUP_WIDTH;
+	cur_offset += (c->chunks_nr + 1) * CHUNK_LOOKUP_WIDTH;

-	for (i = 0; i < nr; i++) {
-		hashwrite_be32(f, chunks[i].id);
-		hashwrite_be64(f, cur_offset);
+	for (i = 0; i < c->chunks_nr; i++) {
+		hashwrite_be32(c->f, c->chunks[i].id);
+		hashwrite_be64(c->f, cur_offset);

-		cur_offset += chunks[i].size;
+		cur_offset += c->chunks[i].size;
 	}

 	/* Trailing entry marks the end of the chunks */
-	hashwrite_be32(f, 0);
-	hashwrite_be64(f, cur_offset);
+	hashwrite_be32(c->f, 0);
+	hashwrite_be64(c->f, cur_offset);
+}
+
+void chunkfile_write_chunks(struct chunkfile *c, void *ctx)
+{
+	size_t i;
+
+	for (i = 0; i < c->chunks_nr; i++) {
+		off_t before, after;
+		struct chunk_info *chunk = &c->chunks[i];
+		if (!chunk->write_fn)
+			continue;
+
+		before = hashfile_total(c->f);
+		chunk->write_fn(c->f, ctx);
+		after = hashfile_total(c->f);
+
+		if (after - before != chunk->size)
+			BUG("unexpected chunk size");
+	}
 }
diff --git a/chunk-format.h b/chunk-format.h
index 4b9cbeb372..b0fb515425 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -19,18 +19,23 @@ struct chunk_info {
 	chunk_write_fn write_fn;
 };

+struct chunkfile {
+	struct hashfile *f;
+
+	struct chunk_info *chunks;
+	size_t chunks_nr;
+	size_t chunks_alloc;
+};
+
+void chunkfile_init(struct chunkfile *out, struct hashfile *f);
+void chunkfile_push_chunk(struct chunkfile *c,
+			  uint32_t id, uint64_t size, chunk_write_fn write_fn);
+
 /*
  * Write the chunk data into the supplied hashfile.
- *
- * * 'cur_offset' indicates the number of bytes written to the hashfile
- *   before the table of contents starts.
- *
- * * 'nr' is the number of chunks with non-zero IDs, so 'nr + 1'
- *   chunks are written in total.
  */
-void write_table_of_contents(struct hashfile *f,
-			     uint64_t cur_offset,
-			     struct chunk_info *chunks,
-			     int nr);
+void chunkfile_write_toc(struct chunkfile *c);
+
+void chunkfile_write_chunks(struct chunkfile *c, void *ctx);

 #endif
diff --git a/commit-graph.c b/commit-graph.c
index 5494fda1d3..abc0c9e46f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1702,11 +1702,9 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	uint32_t i;
 	int fd;
 	struct hashfile *f;
+	struct chunkfile cf;
 	struct lock_file lk = LOCK_INIT;
-	struct chunk_info chunks[MAX_NUM_CHUNKS + 1];
 	const unsigned hashsz = the_hash_algo->rawsz;
-	struct strbuf progress_title = STRBUF_INIT;
-	int num_chunks = 3;
 	struct object_id file_hash;

 	if (ctx->split) {
@@ -1753,76 +1751,42 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 	}

-	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
-	chunks[0].size = GRAPH_FANOUT_SIZE;
-	chunks[0].write_fn = write_graph_chunk_fanout;
-	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
-	chunks[1].size = hashsz * ctx->commits.nr;
-	chunks[1].write_fn = write_graph_chunk_oids;
-	chunks[2].id = GRAPH_CHUNKID_DATA;
-	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
-	chunks[2].write_fn = write_graph_chunk_data;
-	if (ctx->num_extra_edges) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
-		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
-		chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
-		num_chunks++;
-	}
+	chunkfile_init(&cf, f);
+
+	chunkfile_push_chunk(&cf, GRAPH_CHUNKID_OIDFANOUT,
+			     GRAPH_FANOUT_SIZE, write_graph_chunk_fanout);
+	chunkfile_push_chunk(&cf, GRAPH_CHUNKID_OIDLOOKUP,
+			     hashsz * ctx->commits.nr, write_graph_chunk_oids);
+	chunkfile_push_chunk(&cf, GRAPH_CHUNKID_DATA,
+			     (hashsz + 16) * ctx->commits.nr, write_graph_chunk_data);
+	if (ctx->num_extra_edges)
+		chunkfile_push_chunk(&cf, GRAPH_CHUNKID_EXTRAEDGES,
+				     4 * ctx->num_extra_edges,
+				     write_graph_chunk_extra_edges);
 	if (ctx->changed_paths) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
-		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
-		chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
-		num_chunks++;
-		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
-		chunks[num_chunks].size = sizeof(uint32_t) * 3
-					  + ctx->total_bloom_filter_data_size;
-		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
-		num_chunks++;
-	}
-	if (ctx->num_commit_graphs_after > 1) {
-		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
-		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
-		chunks[num_chunks].write_fn = write_graph_chunk_base;
-		num_chunks++;
+		chunkfile_push_chunk(&cf, GRAPH_CHUNKID_BLOOMINDEXES,
+				     sizeof(uint32_t) * ctx->commits.nr,
+				     write_graph_chunk_bloom_indexes);
+		chunkfile_push_chunk(&cf, GRAPH_CHUNKID_BLOOMDATA,
+				     sizeof(uint32_t) * 3 + ctx->total_bloom_filter_data_size,
+				     write_graph_chunk_bloom_data);
 	}
+	if (ctx->num_commit_graphs_after > 1)
+		chunkfile_push_chunk(&cf, GRAPH_CHUNKID_BASE,
+				     hashsz * (ctx->num_commit_graphs_after - 1),
+				     write_graph_chunk_base);

-	chunks[num_chunks].id = 0;
-	chunks[num_chunks].size = 0;
+	chunkfile_push_chunk(&cf, 0, 0, NULL);

 	hashwrite_be32(f, GRAPH_SIGNATURE);

 	hashwrite_u8(f, GRAPH_VERSION);
 	hashwrite_u8(f, oid_version());
-	hashwrite_u8(f, num_chunks);
+	hashwrite_u8(f, cf.chunks_nr - 1);
 	hashwrite_u8(f, ctx->num_commit_graphs_after - 1);

-	write_table_of_contents(f, /* cur_offset */ 8, chunks, num_chunks);
-
-	if (ctx->report_progress) {
-		strbuf_addf(&progress_title,
-			    Q_("Writing out commit graph in %d pass",
-			       "Writing out commit graph in %d passes",
-			       num_chunks),
-			    num_chunks);
-		ctx->progress = start_delayed_progress(
-			progress_title.buf,
-			num_chunks * ctx->commits.nr);
-	}
-
-	for (i = 0; i < num_chunks; i++) {
-		uint64_t start_offset = f->total + f->offset;
-
-		if (chunks[i].write_fn(f, ctx))
-			return -1;
-
-		if (f->total + f->offset != start_offset + chunks[i].size)
-			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
-			    chunks[i].size, chunks[i].id,
-			    f->total + f->offset - start_offset);
-	}
-
-	stop_progress(&ctx->progress);
-	strbuf_release(&progress_title);
+	chunkfile_write_toc(&cf);
+	chunkfile_write_chunks(&cf, ctx);

 	if (ctx->split && ctx->base_graph_name && ctx->num_commit_graphs_after > 1) {
 		char *new_base_hash = xstrdup(oid_to_hex(&ctx->new_base_graph->oid));
diff --git a/midx.c b/midx.c
index 0548266bea..6315ea7555 100644
--- a/midx.c
+++ b/midx.c
@@ -807,7 +807,7 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	struct lock_file lk;
 	struct write_midx_context ctx = { 0 };
 	uint64_t header_size = 0;
-	struct chunk_info chunks[MIDX_MAX_CHUNKS];
+	struct chunkfile cf;
 	int pack_name_concat_len = 0;
 	int dropped_packs = 0;
 	int result = 0;
@@ -933,33 +933,27 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *

 	header_size = write_midx_header(f, num_chunks, ctx.nr - dropped_packs);

-	chunks[0].id = MIDX_CHUNKID_PACKNAMES;
-	chunks[0].size = pack_name_concat_len;
-	chunks[0].write_fn = write_midx_pack_names;
+	chunkfile_init(&cf, f);

-	chunks[1].id = MIDX_CHUNKID_OIDFANOUT;
-	chunks[1].size = MIDX_CHUNK_FANOUT_SIZE;
-	chunks[1].write_fn = write_midx_oid_fanout;
-
-	chunks[2].id = MIDX_CHUNKID_OIDLOOKUP;
-	chunks[2].size = ctx.entries_nr * the_hash_algo->rawsz;
-	chunks[2].write_fn = write_midx_oid_lookup;
-
-	chunks[3].id = MIDX_CHUNKID_OBJECTOFFSETS;
-	chunks[3].size = ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH;
-	chunks[3].write_fn = write_midx_object_offsets;
+	chunkfile_push_chunk(&cf, MIDX_CHUNKID_PACKNAMES,
+			     pack_name_concat_len, write_midx_pack_names);
+	chunkfile_push_chunk(&cf, MIDX_CHUNKID_OIDFANOUT,
+			     MIDX_CHUNK_FANOUT_SIZE, write_midx_oid_fanout);
+	chunkfile_push_chunk(&cf, MIDX_CHUNKID_OIDLOOKUP,
+			     ctx.entries_nr * the_hash_algo->rawsz, write_midx_oid_lookup);
+	chunkfile_push_chunk(&cf, MIDX_CHUNKID_OBJECTOFFSETS,
+			     ctx.entries_nr * MIDX_CHUNK_OFFSET_WIDTH, write_midx_object_offsets);

 	if (ctx.large_offsets_needed) {
-		chunks[4].id = MIDX_CHUNKID_LARGEOFFSETS;
-		chunks[4].size = ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH;
-		chunks[4].write_fn = write_midx_large_offsets;
+		chunkfile_push_chunk(&cf, MIDX_CHUNKID_LARGEOFFSETS,
+				     ctx.num_large_offsets * MIDX_CHUNK_LARGE_OFFSET_WIDTH,
+				     write_midx_large_offsets);
 	}

-	write_table_of_contents(f, header_size, chunks, num_chunks);
-
-	for (i = 0; i < num_chunks; i++)
-		chunks[i].write_fn(f, &ctx);
+	chunkfile_write_toc(&cf);
+	chunkfile_write_chunks(&cf, &ctx);

+	/* maybe move this into chunkfile??? */
 	finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
 	commit_lock_file(&lk);


  reply	other threads:[~2020-12-08 20:44 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-03 16:16 [PATCH 00/15] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 01/15] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 02/15] chunk-format: add API for writing table of contents Derrick Stolee via GitGitGadget
2020-12-08 17:56   ` Taylor Blau
2020-12-03 16:16 ` [PATCH 03/15] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 04/15] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 05/15] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2020-12-03 21:42   ` Junio C Hamano
2020-12-04 13:39     ` Derrick Stolee
2020-12-08 18:00   ` Taylor Blau
2020-12-03 16:16 ` [PATCH 06/15] midx: add pack_perm " Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 07/15] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 08/15] midx: convert chunk write methods to return int Derrick Stolee via GitGitGadget
2020-12-03 21:50   ` Junio C Hamano
2020-12-04 13:40     ` Derrick Stolee
2020-12-03 16:16 ` [PATCH 09/15] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2020-12-03 16:16 ` [PATCH 10/15] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2020-12-08 18:42   ` Taylor Blau [this message]
2020-12-10 14:36     ` Derrick Stolee
2020-12-03 16:16 ` [PATCH 11/15] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2020-12-03 22:00   ` Junio C Hamano
2020-12-08 18:43     ` Taylor Blau
2020-12-03 16:16 ` [PATCH 12/15] chunk-format: create write_chunks() Derrick Stolee via GitGitGadget
2020-12-08 18:45   ` Taylor Blau
2020-12-03 16:16 ` [PATCH 13/15] chunk-format: create chunk reading API Derrick Stolee via GitGitGadget
2020-12-03 22:17   ` Junio C Hamano
2020-12-04 13:47     ` Derrick Stolee
2020-12-04 20:17       ` Junio C Hamano
2020-12-03 22:43   ` Junio C Hamano
2020-12-04 13:45     ` Derrick Stolee
2020-12-03 16:16 ` [PATCH 14/15] commit-graph: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2020-12-07 13:43   ` Derrick Stolee
2020-12-03 16:16 ` [PATCH 15/15] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2020-12-04 12:48 ` [PATCH 00/15] Refactor chunk-format into an API René Scharfe
2020-12-04 13:57   ` Derrick Stolee
2020-12-04 19:42   ` Junio C Hamano
2020-12-08 18:49   ` Taylor Blau
2020-12-09 17:13     ` René Scharfe
2020-12-10  0:50       ` Taylor Blau
2020-12-10 14:30         ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=X8/I/RzXZksio+ri@nand.local \
    --to=me@ttaylorr.com \
    --cc=derrickstolee@github.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.