All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/7] merge-ort: implement support for packing objects together
@ 2023-10-19 17:28 Taylor Blau
  2023-10-19 17:28 ` [PATCH v4 1/7] bulk-checkin: extract abstract `bulk_checkin_source` Taylor Blau
                   ` (10 more replies)
  0 siblings, 11 replies; 63+ messages in thread
From: Taylor Blau @ 2023-10-19 17:28 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt

(Rebased onto the current tip of 'master', which is 813d9a9188 (The
nineteenth batch, 2023-10-18) at the time of writing).

This series implements support for a new merge-tree option,
`--write-pack`, which causes any newly-written objects to be packed
together instead of being stored individually as loose.

I intentionally broke this off from the existing thread, since I
accidentally rerolled mine and Jonathan Tan's Bloom v2 series into it,
causing some confusion.

This is a new round that is significantly simplified thanks to
another very helpful suggestion[1] from Junio. By factoring out a common
"deflate object to pack" that takes an abstract bulk_checkin_source as a
parameter, all of the earlier refactorings can be dropped since we
retain only a single caller instead of multiple.

This resulted in a rather satisfying range-diff (included below, as
usual), and a similarly satisfying inter-diff:

    $ git diff --stat tb/ort-bulk-checkin.v3..
     bulk-checkin.c | 203 ++++++++++++++++---------------------------------
     1 file changed, 64 insertions(+), 139 deletions(-)

Beyond that, the changes since last time can be viewed in the range-diff
below. Thanks in advance for any review!

[1]: https://lore.kernel.org/git/xmqq34y7plj4.fsf@gitster.g/

Taylor Blau (7):
  bulk-checkin: extract abstract `bulk_checkin_source`
  bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types
  bulk-checkin: refactor deflate routine to accept a
    `bulk_checkin_source`
  bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source`
  bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
  bulk-checkin: introduce `index_tree_bulk_checkin_incore()`
  builtin/merge-tree.c: implement support for `--write-pack`

 Documentation/git-merge-tree.txt |   4 +
 builtin/merge-tree.c             |   5 +
 bulk-checkin.c                   | 161 ++++++++++++++++++++++++++-----
 bulk-checkin.h                   |   8 ++
 merge-ort.c                      |  42 ++++++--
 merge-recursive.h                |   1 +
 t/t4301-merge-tree-write-tree.sh |  93 ++++++++++++++++++
 7 files changed, 280 insertions(+), 34 deletions(-)

Range-diff against v3:
 1:  2dffa45183 <  -:  ---------- bulk-checkin: factor out `format_object_header_hash()`
 2:  7a10dc794a <  -:  ---------- bulk-checkin: factor out `prepare_checkpoint()`
 3:  20c32d2178 <  -:  ---------- bulk-checkin: factor out `truncate_checkpoint()`
 4:  893051d0b7 <  -:  ---------- bulk-checkin: factor out `finalize_checkpoint()`
 5:  da52ec8380 !  1:  97bb6e9f59 bulk-checkin: extract abstract `bulk_checkin_source`
    @@ bulk-checkin.c: static int stream_blob_to_pack(struct bulk_checkin_packfile *sta
      			if (*already_hashed_to < offset) {
      				size_t hsize = offset - *already_hashed_to;
     @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
    - 	git_hash_ctx ctx;
    + 	unsigned header_len;
      	struct hashfile_checkpoint checkpoint = {0};
      	struct pack_idx_entry *idx = NULL;
     +	struct bulk_checkin_source source = {
    @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *st
      	seekback = lseek(fd, 0, SEEK_CUR);
      	if (seekback == (off_t) -1)
     @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
    - 	while (1) {
    - 		prepare_checkpoint(state, &checkpoint, idx, flags);
    + 			crc32_begin(state->f);
    + 		}
      		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
     -					 fd, size, path, flags))
     +					 &source, flags))
      			break;
    - 		truncate_checkpoint(state, &checkpoint, idx);
    + 		/*
    + 		 * Writing this object to the current pack will make
    +@@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
    + 		hashfile_truncate(state->f, &checkpoint);
    + 		state->offset = checkpoint.offset;
    + 		flush_bulk_checkin_packfile(state);
     -		if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
     +		if (bulk_checkin_source_seek_to(&source, seekback) == (off_t)-1)
      			return error("cannot seek back");
      	}
    - 	finalize_checkpoint(state, &ctx, &checkpoint, idx, result_oid);
    + 	the_hash_algo->final_oid_fn(result_oid, &ctx);
 7:  04ec74e357 !  2:  9d633df339 bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types
    @@ bulk-checkin.c: static int stream_blob_to_pack(struct bulk_checkin_packfile *sta
      	s.avail_out = sizeof(obuf) - hdrlen;
      
     @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
    - 
    - 	while (1) {
    - 		prepare_checkpoint(state, &checkpoint, idx, flags);
    + 			idx->offset = state->offset;
    + 			crc32_begin(state->f);
    + 		}
     -		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
     -					 &source, flags))
     +		if (!stream_obj_to_pack(state, &ctx, &already_hashed_to,
     +					&source, OBJ_BLOB, flags))
      			break;
    - 		truncate_checkpoint(state, &checkpoint, idx);
    - 		if (bulk_checkin_source_seek_to(&source, seekback) == (off_t)-1)
    + 		/*
    + 		 * Writing this object to the current pack will make
 -:  ---------- >  3:  d5bbd7810e bulk-checkin: refactor deflate routine to accept a `bulk_checkin_source`
 6:  4e9bac5bc1 =  4:  e427fe6ad3 bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source`
 8:  8667b76365 !  5:  48095afe80 bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
    @@ Commit message
         objects individually as loose.
     
         Similar to the existing `index_blob_bulk_checkin()` function, the
    -    entrypoint delegates to `deflate_blob_to_pack_incore()`, which is
    -    responsible for formatting the pack header and then deflating the
    -    contents into the pack. The latter is accomplished by calling
    -    deflate_obj_contents_to_pack_incore(), which takes advantage of the
    -    earlier refactorings and is responsible for writing the object to the
    -    pack and handling any overage from pack.packSizeLimit.
    -
    -    The bulk of the new functionality is implemented in the function
    -    `stream_obj_to_pack()`, which can handle streaming objects from memory
    -    to the bulk-checkin pack as a result of the earlier refactoring.
    +    entrypoint delegates to `deflate_obj_to_pack_incore()`. That function in
    +    turn delegates to deflate_obj_to_pack(), which is responsible for
    +    formatting the pack header and then deflating the contents into the
    +    pack.
     
         Consistent with the rest of the bulk-checkin mechanism, there are no
         direct tests here. In future commits when we expose this new
    @@ Commit message
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## bulk-checkin.c ##
    -@@ bulk-checkin.c: static void finalize_checkpoint(struct bulk_checkin_packfile *state,
    - 	}
    +@@ bulk-checkin.c: static int deflate_obj_to_pack(struct bulk_checkin_packfile *state,
    + 	return 0;
      }
      
    -+static int deflate_obj_contents_to_pack_incore(struct bulk_checkin_packfile *state,
    -+					       git_hash_ctx *ctx,
    -+					       struct hashfile_checkpoint *checkpoint,
    -+					       struct object_id *result_oid,
    -+					       const void *buf, size_t size,
    -+					       enum object_type type,
    -+					       const char *path, unsigned flags)
    ++static int deflate_obj_to_pack_incore(struct bulk_checkin_packfile *state,
    ++				       struct object_id *result_oid,
    ++				       const void *buf, size_t size,
    ++				       const char *path, enum object_type type,
    ++				       unsigned flags)
     +{
    -+	struct pack_idx_entry *idx = NULL;
    -+	off_t already_hashed_to = 0;
     +	struct bulk_checkin_source source = {
     +		.type = SOURCE_INCORE,
     +		.buf = buf,
    @@ bulk-checkin.c: static void finalize_checkpoint(struct bulk_checkin_packfile *st
     +		.path = path,
     +	};
     +
    -+	/* Note: idx is non-NULL when we are writing */
    -+	if (flags & HASH_WRITE_OBJECT)
    -+		CALLOC_ARRAY(idx, 1);
    -+
    -+	while (1) {
    -+		prepare_checkpoint(state, checkpoint, idx, flags);
    -+
    -+		if (!stream_obj_to_pack(state, ctx, &already_hashed_to, &source,
    -+					type, flags))
    -+			break;
    -+		truncate_checkpoint(state, checkpoint, idx);
    -+		bulk_checkin_source_seek_to(&source, 0);
    -+	}
    -+
    -+	finalize_checkpoint(state, ctx, checkpoint, idx, result_oid);
    -+
    -+	return 0;
    -+}
    -+
    -+static int deflate_blob_to_pack_incore(struct bulk_checkin_packfile *state,
    -+				       struct object_id *result_oid,
    -+				       const void *buf, size_t size,
    -+				       const char *path, unsigned flags)
    -+{
    -+	git_hash_ctx ctx;
    -+	struct hashfile_checkpoint checkpoint = {0};
    -+
    -+	format_object_header_hash(the_hash_algo, &ctx, &checkpoint, OBJ_BLOB,
    -+				  size);
    -+
    -+	return deflate_obj_contents_to_pack_incore(state, &ctx, &checkpoint,
    -+						   result_oid, buf, size,
    -+						   OBJ_BLOB, path, flags);
    ++	return deflate_obj_to_pack(state, result_oid, &source, type, 0, flags);
     +}
     +
      static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
    @@ bulk-checkin.c: int index_blob_bulk_checkin(struct object_id *oid,
     +				   const void *buf, size_t size,
     +				   const char *path, unsigned flags)
     +{
    -+	int status = deflate_blob_to_pack_incore(&bulk_checkin_packfile, oid,
    -+						 buf, size, path, flags);
    ++	int status = deflate_obj_to_pack_incore(&bulk_checkin_packfile, oid,
    ++						buf, size, path, OBJ_BLOB,
    ++						flags);
     +	if (!odb_transaction_nesting)
     +		flush_bulk_checkin_packfile(&bulk_checkin_packfile);
     +	return status;
 9:  cba043ef14 !  6:  60568f9281 bulk-checkin: introduce `index_tree_bulk_checkin_incore()`
    @@ Commit message
         machinery will have enough to keep track of the converted object's hash
         in order to update the compatibility mapping.
     
    -    Within `deflate_tree_to_pack_incore()`, the changes should be limited
    -    to something like:
    +    Within some thin wrapper around `deflate_obj_to_pack_incore()` (perhaps
    +    `deflate_tree_to_pack_incore()`), the changes should be limited to
    +    something like:
     
             struct strbuf converted = STRBUF_INIT;
             if (the_repository->compat_hash_algo) {
    @@ Commit message
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## bulk-checkin.c ##
    -@@ bulk-checkin.c: static int deflate_blob_to_pack_incore(struct bulk_checkin_packfile *state,
    - 						   OBJ_BLOB, path, flags);
    - }
    - 
    -+static int deflate_tree_to_pack_incore(struct bulk_checkin_packfile *state,
    -+				       struct object_id *result_oid,
    -+				       const void *buf, size_t size,
    -+				       const char *path, unsigned flags)
    -+{
    -+	git_hash_ctx ctx;
    -+	struct hashfile_checkpoint checkpoint = {0};
    -+
    -+	format_object_header_hash(the_hash_algo, &ctx, &checkpoint, OBJ_TREE,
    -+				  size);
    -+
    -+	return deflate_obj_contents_to_pack_incore(state, &ctx, &checkpoint,
    -+						   result_oid, buf, size,
    -+						   OBJ_TREE, path, flags);
    -+}
    -+
    - static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
    - 				struct object_id *result_oid,
    - 				int fd, size_t size,
     @@ bulk-checkin.c: int index_blob_bulk_checkin_incore(struct object_id *oid,
      	return status;
      }
    @@ bulk-checkin.c: int index_blob_bulk_checkin_incore(struct object_id *oid,
     +				   const void *buf, size_t size,
     +				   const char *path, unsigned flags)
     +{
    -+	int status = deflate_tree_to_pack_incore(&bulk_checkin_packfile, oid,
    -+						 buf, size, path, flags);
    ++	int status = deflate_obj_to_pack_incore(&bulk_checkin_packfile, oid,
    ++						buf, size, path, OBJ_TREE,
    ++						flags);
     +	if (!odb_transaction_nesting)
     +		flush_bulk_checkin_packfile(&bulk_checkin_packfile);
     +	return status;
10:  ae70508037 =  7:  b9be9df122 builtin/merge-tree.c: implement support for `--write-pack`
-- 
2.42.0.405.g86fe3250c2

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v4 1/7] bulk-checkin: extract abstract `bulk_checkin_source`
  2023-10-19 17:28 [PATCH v4 0/7] merge-ort: implement support for packing objects together Taylor Blau
@ 2023-10-19 17:28 ` Taylor Blau
  2023-10-20  7:35   ` Jeff King
  2023-10-19 17:28 ` [PATCH v4 2/7] bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types Taylor Blau
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2023-10-19 17:28 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt

A future commit will want to implement a very similar routine as in
`stream_blob_to_pack()` with two notable changes:

  - Instead of streaming just OBJ_BLOBs, this new function may want to
    stream objects of arbitrary type.

  - Instead of streaming the object's contents from an open
    file-descriptor, this new function may want to "stream" its contents
    from memory.

To avoid duplicating a significant chunk of code between the existing
`stream_blob_to_pack()`, extract an abstract `bulk_checkin_source`. This
concept currently is a thin layer of `lseek()` and `read_in_full()`, but
will grow to understand how to perform analogous operations when writing
out an object's contents from memory.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bulk-checkin.c | 61 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 53 insertions(+), 8 deletions(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 6ce62999e5..c05d06e1e1 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -140,8 +140,41 @@ static int already_written(struct bulk_checkin_packfile *state, struct object_id
 	return 0;
 }
 
+struct bulk_checkin_source {
+	enum { SOURCE_FILE } type;
+
+	/* SOURCE_FILE fields */
+	int fd;
+
+	/* common fields */
+	size_t size;
+	const char *path;
+};
+
+static off_t bulk_checkin_source_seek_to(struct bulk_checkin_source *source,
+					 off_t offset)
+{
+	switch (source->type) {
+	case SOURCE_FILE:
+		return lseek(source->fd, offset, SEEK_SET);
+	default:
+		BUG("unknown bulk-checkin source: %d", source->type);
+	}
+}
+
+static ssize_t bulk_checkin_source_read(struct bulk_checkin_source *source,
+					void *buf, size_t nr)
+{
+	switch (source->type) {
+	case SOURCE_FILE:
+		return read_in_full(source->fd, buf, nr);
+	default:
+		BUG("unknown bulk-checkin source: %d", source->type);
+	}
+}
+
 /*
- * Read the contents from fd for size bytes, streaming it to the
+ * Read the contents from 'source' for 'size' bytes, streaming it to the
  * packfile in state while updating the hash in ctx. Signal a failure
  * by returning a negative value when the resulting pack would exceed
  * the pack size limit and this is not the first object in the pack,
@@ -157,7 +190,7 @@ static int already_written(struct bulk_checkin_packfile *state, struct object_id
  */
 static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
 			       git_hash_ctx *ctx, off_t *already_hashed_to,
-			       int fd, size_t size, const char *path,
+			       struct bulk_checkin_source *source,
 			       unsigned flags)
 {
 	git_zstream s;
@@ -167,22 +200,28 @@ static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
 	int status = Z_OK;
 	int write_object = (flags & HASH_WRITE_OBJECT);
 	off_t offset = 0;
+	size_t size = source->size;
 
 	git_deflate_init(&s, pack_compression_level);
 
-	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), OBJ_BLOB, size);
+	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), OBJ_BLOB,
+					      size);
 	s.next_out = obuf + hdrlen;
 	s.avail_out = sizeof(obuf) - hdrlen;
 
 	while (status != Z_STREAM_END) {
 		if (size && !s.avail_in) {
 			ssize_t rsize = size < sizeof(ibuf) ? size : sizeof(ibuf);
-			ssize_t read_result = read_in_full(fd, ibuf, rsize);
+			ssize_t read_result;
+
+			read_result = bulk_checkin_source_read(source, ibuf,
+							       rsize);
 			if (read_result < 0)
-				die_errno("failed to read from '%s'", path);
+				die_errno("failed to read from '%s'",
+					  source->path);
 			if (read_result != rsize)
 				die("failed to read %d bytes from '%s'",
-				    (int)rsize, path);
+				    (int)rsize, source->path);
 			offset += rsize;
 			if (*already_hashed_to < offset) {
 				size_t hsize = offset - *already_hashed_to;
@@ -258,6 +297,12 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	unsigned header_len;
 	struct hashfile_checkpoint checkpoint = {0};
 	struct pack_idx_entry *idx = NULL;
+	struct bulk_checkin_source source = {
+		.type = SOURCE_FILE,
+		.fd = fd,
+		.size = size,
+		.path = path,
+	};
 
 	seekback = lseek(fd, 0, SEEK_CUR);
 	if (seekback == (off_t) -1)
@@ -283,7 +328,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 			crc32_begin(state->f);
 		}
 		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
-					 fd, size, path, flags))
+					 &source, flags))
 			break;
 		/*
 		 * Writing this object to the current pack will make
@@ -295,7 +340,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 		hashfile_truncate(state->f, &checkpoint);
 		state->offset = checkpoint.offset;
 		flush_bulk_checkin_packfile(state);
-		if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
+		if (bulk_checkin_source_seek_to(&source, seekback) == (off_t)-1)
 			return error("cannot seek back");
 	}
 	the_hash_algo->final_oid_fn(result_oid, &ctx);
-- 
2.42.0.405.g86fe3250c2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v4 2/7] bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types
  2023-10-19 17:28 [PATCH v4 0/7] merge-ort: implement support for packing objects together Taylor Blau
  2023-10-19 17:28 ` [PATCH v4 1/7] bulk-checkin: extract abstract `bulk_checkin_source` Taylor Blau
@ 2023-10-19 17:28 ` Taylor Blau
  2023-10-19 17:28 ` [PATCH v4 3/7] bulk-checkin: refactor deflate routine to accept a `bulk_checkin_source` Taylor Blau
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2023-10-19 17:28 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt

The existing `stream_blob_to_pack()` function is named based on the fact
that it knows only how to stream blobs into a bulk-checkin pack.

But there is no longer anything in this function which prevents us from
writing objects of arbitrary types to the bulk-checkin pack. Prepare to
write OBJ_TREEs by removing this assumption, adding an `enum
object_type` parameter to this function's argument list, and renaming it
to `stream_obj_to_pack()` accordingly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bulk-checkin.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index c05d06e1e1..7e6b52112e 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -188,10 +188,10 @@ static ssize_t bulk_checkin_source_read(struct bulk_checkin_source *source,
  * status before calling us just in case we ask it to call us again
  * with a new pack.
  */
-static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
-			       git_hash_ctx *ctx, off_t *already_hashed_to,
-			       struct bulk_checkin_source *source,
-			       unsigned flags)
+static int stream_obj_to_pack(struct bulk_checkin_packfile *state,
+			      git_hash_ctx *ctx, off_t *already_hashed_to,
+			      struct bulk_checkin_source *source,
+			      enum object_type type, unsigned flags)
 {
 	git_zstream s;
 	unsigned char ibuf[16384];
@@ -204,8 +204,7 @@ static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
 
 	git_deflate_init(&s, pack_compression_level);
 
-	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), OBJ_BLOB,
-					      size);
+	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), type, size);
 	s.next_out = obuf + hdrlen;
 	s.avail_out = sizeof(obuf) - hdrlen;
 
@@ -327,8 +326,8 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 			idx->offset = state->offset;
 			crc32_begin(state->f);
 		}
-		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
-					 &source, flags))
+		if (!stream_obj_to_pack(state, &ctx, &already_hashed_to,
+					&source, OBJ_BLOB, flags))
 			break;
 		/*
 		 * Writing this object to the current pack will make
-- 
2.42.0.405.g86fe3250c2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v4 3/7] bulk-checkin: refactor deflate routine to accept a `bulk_checkin_source`
  2023-10-19 17:28 [PATCH v4 0/7] merge-ort: implement support for packing objects together Taylor Blau
  2023-10-19 17:28 ` [PATCH v4 1/7] bulk-checkin: extract abstract `bulk_checkin_source` Taylor Blau
  2023-10-19 17:28 ` [PATCH v4 2/7] bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types Taylor Blau
@ 2023-10-19 17:28 ` Taylor Blau
  2023-10-19 17:28 ` [PATCH v4 4/7] bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source` Taylor Blau
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2023-10-19 17:28 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt

Prepare for a future change where we will want to use a routine very
similar to the existing `deflate_blob_to_pack()` but over arbitrary
sources (i.e. either open file-descriptors, or a location in memory).

Extract out a common "deflate_obj_to_pack()" routine that acts on a
bulk_checkin_source, instead of a (int, size_t) pair. Then rewrite
`deflate_blob_to_pack()` in terms of it.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bulk-checkin.c | 52 ++++++++++++++++++++++++++++++--------------------
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 7e6b52112e..28bc8d5ab4 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -285,30 +285,23 @@ static void prepare_to_stream(struct bulk_checkin_packfile *state,
 		die_errno("unable to write pack header");
 }
 
-static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
-				struct object_id *result_oid,
-				int fd, size_t size,
-				const char *path, unsigned flags)
+
+static int deflate_obj_to_pack(struct bulk_checkin_packfile *state,
+			       struct object_id *result_oid,
+			       struct bulk_checkin_source *source,
+			       enum object_type type,
+			       off_t seekback,
+			       unsigned flags)
 {
-	off_t seekback, already_hashed_to;
+	off_t already_hashed_to = 0;
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
 	struct hashfile_checkpoint checkpoint = {0};
 	struct pack_idx_entry *idx = NULL;
-	struct bulk_checkin_source source = {
-		.type = SOURCE_FILE,
-		.fd = fd,
-		.size = size,
-		.path = path,
-	};
 
-	seekback = lseek(fd, 0, SEEK_CUR);
-	if (seekback == (off_t) -1)
-		return error("cannot find the current offset");
-
-	header_len = format_object_header((char *)obuf, sizeof(obuf),
-					  OBJ_BLOB, size);
+	header_len = format_object_header((char *)obuf, sizeof(obuf), type,
+					  source->size);
 	the_hash_algo->init_fn(&ctx);
 	the_hash_algo->update_fn(&ctx, obuf, header_len);
 	the_hash_algo->init_fn(&checkpoint.ctx);
@@ -317,8 +310,6 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	if ((flags & HASH_WRITE_OBJECT) != 0)
 		CALLOC_ARRAY(idx, 1);
 
-	already_hashed_to = 0;
-
 	while (1) {
 		prepare_to_stream(state, flags);
 		if (idx) {
@@ -327,7 +318,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 			crc32_begin(state->f);
 		}
 		if (!stream_obj_to_pack(state, &ctx, &already_hashed_to,
-					&source, OBJ_BLOB, flags))
+					source, type, flags))
 			break;
 		/*
 		 * Writing this object to the current pack will make
@@ -339,7 +330,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 		hashfile_truncate(state->f, &checkpoint);
 		state->offset = checkpoint.offset;
 		flush_bulk_checkin_packfile(state);
-		if (bulk_checkin_source_seek_to(&source, seekback) == (off_t)-1)
+		if (bulk_checkin_source_seek_to(source, seekback) == (off_t)-1)
 			return error("cannot seek back");
 	}
 	the_hash_algo->final_oid_fn(result_oid, &ctx);
@@ -361,6 +352,25 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	return 0;
 }
 
+static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
+				struct object_id *result_oid,
+				int fd, size_t size,
+				const char *path, unsigned flags)
+{
+	struct bulk_checkin_source source = {
+		.type = SOURCE_FILE,
+		.fd = fd,
+		.size = size,
+		.path = path,
+	};
+	off_t seekback = lseek(fd, 0, SEEK_CUR);
+	if (seekback == (off_t) -1)
+		return error("cannot find the current offset");
+
+	return deflate_obj_to_pack(state, result_oid, &source, OBJ_BLOB,
+				   seekback, flags);
+}
+
 void prepare_loose_object_bulk_checkin(void)
 {
 	/*
-- 
2.42.0.405.g86fe3250c2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v4 4/7] bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source`
  2023-10-19 17:28 [PATCH v4 0/7] merge-ort: implement support for packing objects together Taylor Blau
                   ` (2 preceding siblings ...)
  2023-10-19 17:28 ` [PATCH v4 3/7] bulk-checkin: refactor deflate routine to accept a `bulk_checkin_source` Taylor Blau
@ 2023-10-19 17:28 ` Taylor Blau
  2023-10-23  9:19   ` Patrick Steinhardt
  2023-10-19 17:28 ` [PATCH v4 5/7] bulk-checkin: introduce `index_blob_bulk_checkin_incore()` Taylor Blau
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2023-10-19 17:28 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt

Continue to prepare for streaming an object's contents directly from
memory by teaching `bulk_checkin_source` how to perform reads and seeks
based on an address in memory.

Unlike file descriptors, which manage their own offset internally, we
have to keep track of how many bytes we've read out of the buffer, and
make sure we don't read past the end of the buffer.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bulk-checkin.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 28bc8d5ab4..60361b3e2e 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -141,11 +141,15 @@ static int already_written(struct bulk_checkin_packfile *state, struct object_id
 }
 
 struct bulk_checkin_source {
-	enum { SOURCE_FILE } type;
+	enum { SOURCE_FILE, SOURCE_INCORE } type;
 
 	/* SOURCE_FILE fields */
 	int fd;
 
+	/* SOURCE_INCORE fields */
+	const void *buf;
+	size_t read;
+
 	/* common fields */
 	size_t size;
 	const char *path;
@@ -157,6 +161,11 @@ static off_t bulk_checkin_source_seek_to(struct bulk_checkin_source *source,
 	switch (source->type) {
 	case SOURCE_FILE:
 		return lseek(source->fd, offset, SEEK_SET);
+	case SOURCE_INCORE:
+		if (!(0 <= offset && offset < source->size))
+			return (off_t)-1;
+		source->read = offset;
+		return source->read;
 	default:
 		BUG("unknown bulk-checkin source: %d", source->type);
 	}
@@ -168,6 +177,13 @@ static ssize_t bulk_checkin_source_read(struct bulk_checkin_source *source,
 	switch (source->type) {
 	case SOURCE_FILE:
 		return read_in_full(source->fd, buf, nr);
+	case SOURCE_INCORE:
+		assert(source->read <= source->size);
+		if (nr > source->size - source->read)
+			nr = source->size - source->read;
+		memcpy(buf, (unsigned char *)source->buf + source->read, nr);
+		source->read += nr;
+		return nr;
 	default:
 		BUG("unknown bulk-checkin source: %d", source->type);
 	}
-- 
2.42.0.405.g86fe3250c2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v4 5/7] bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
  2023-10-19 17:28 [PATCH v4 0/7] merge-ort: implement support for packing objects together Taylor Blau
                   ` (3 preceding siblings ...)
  2023-10-19 17:28 ` [PATCH v4 4/7] bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source` Taylor Blau
@ 2023-10-19 17:28 ` Taylor Blau
  2023-10-19 17:28 ` [PATCH v4 6/7] bulk-checkin: introduce `index_tree_bulk_checkin_incore()` Taylor Blau
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2023-10-19 17:28 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt

Now that we have factored out many of the common routines necessary to
index a new object into a pack created by the bulk-checkin machinery, we
can introduce a variant of `index_blob_bulk_checkin()` that acts on
blobs whose contents we can fit in memory.

This will be useful in a couple of more commits in order to provide the
`merge-tree` builtin with a mechanism to create a new pack containing
any objects it created during the merge, instead of storing those
objects individually as loose.

Similar to the existing `index_blob_bulk_checkin()` function, the
entrypoint delegates to `deflate_obj_to_pack_incore()`. That function in
turn delegates to deflate_obj_to_pack(), which is responsible for
formatting the pack header and then deflating the contents into the
pack.

Consistent with the rest of the bulk-checkin mechanism, there are no
direct tests here. In future commits when we expose this new
functionality via the `merge-tree` builtin, we will test it indirectly
there.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bulk-checkin.c | 29 +++++++++++++++++++++++++++++
 bulk-checkin.h |  4 ++++
 2 files changed, 33 insertions(+)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 60361b3e2e..655a583b06 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -368,6 +368,23 @@ static int deflate_obj_to_pack(struct bulk_checkin_packfile *state,
 	return 0;
 }
 
+static int deflate_obj_to_pack_incore(struct bulk_checkin_packfile *state,
+				       struct object_id *result_oid,
+				       const void *buf, size_t size,
+				       const char *path, enum object_type type,
+				       unsigned flags)
+{
+	struct bulk_checkin_source source = {
+		.type = SOURCE_INCORE,
+		.buf = buf,
+		.size = size,
+		.read = 0,
+		.path = path,
+	};
+
+	return deflate_obj_to_pack(state, result_oid, &source, type, 0, flags);
+}
+
 static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 				struct object_id *result_oid,
 				int fd, size_t size,
@@ -431,6 +448,18 @@ int index_blob_bulk_checkin(struct object_id *oid,
 	return status;
 }
 
+int index_blob_bulk_checkin_incore(struct object_id *oid,
+				   const void *buf, size_t size,
+				   const char *path, unsigned flags)
+{
+	int status = deflate_obj_to_pack_incore(&bulk_checkin_packfile, oid,
+						buf, size, path, OBJ_BLOB,
+						flags);
+	if (!odb_transaction_nesting)
+		flush_bulk_checkin_packfile(&bulk_checkin_packfile);
+	return status;
+}
+
 void begin_odb_transaction(void)
 {
 	odb_transaction_nesting += 1;
diff --git a/bulk-checkin.h b/bulk-checkin.h
index aa7286a7b3..1b91daeaee 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -13,6 +13,10 @@ int index_blob_bulk_checkin(struct object_id *oid,
 			    int fd, size_t size,
 			    const char *path, unsigned flags);
 
+int index_blob_bulk_checkin_incore(struct object_id *oid,
+				   const void *buf, size_t size,
+				   const char *path, unsigned flags);
+
 /*
  * Tell the object database to optimize for adding
  * multiple objects. end_odb_transaction must be called
-- 
2.42.0.405.g86fe3250c2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v4 6/7] bulk-checkin: introduce `index_tree_bulk_checkin_incore()`
  2023-10-19 17:28 [PATCH v4 0/7] merge-ort: implement support for packing objects together Taylor Blau
                   ` (4 preceding siblings ...)
  2023-10-19 17:28 ` [PATCH v4 5/7] bulk-checkin: introduce `index_blob_bulk_checkin_incore()` Taylor Blau
@ 2023-10-19 17:28 ` Taylor Blau
  2023-10-19 17:29 ` [PATCH v4 7/7] builtin/merge-tree.c: implement support for `--write-pack` Taylor Blau
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2023-10-19 17:28 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt

The remaining missing piece in order to teach the `merge-tree` builtin
how to write the contents of a merge into a pack is a function to index
tree objects into a bulk-checkin pack.

This patch implements that missing piece, which is a thin wrapper around
all of the functionality introduced in previous commits.

If and when Git gains support for a "compatibility" hash algorithm, the
changes to support that here will be minimal. The bulk-checkin machinery
will need to convert the incoming tree to compute its length under the
compatibility hash, necessary to reconstruct its header. With that
information (and the converted contents of the tree), the bulk-checkin
machinery will have enough to keep track of the converted object's hash
in order to update the compatibility mapping.

Within some thin wrapper around `deflate_obj_to_pack_incore()` (perhaps
`deflate_tree_to_pack_incore()`), the changes should be limited to
something like:

    struct strbuf converted = STRBUF_INIT;
    if (the_repository->compat_hash_algo) {
      if (convert_object_file(&compat_obj,
                              the_repository->hash_algo,
                              the_repository->compat_hash_algo, ...) < 0)
        die(...);

      format_object_header_hash(the_repository->compat_hash_algo,
                                OBJ_TREE, size);
    }
    /* compute the converted tree's hash using the compat algorithm */
    strbuf_release(&converted);

, assuming related changes throughout the rest of the bulk-checkin
machinery necessary to update the hash of the converted object, which
are likewise minimal in size.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bulk-checkin.c | 12 ++++++++++++
 bulk-checkin.h |  4 ++++
 2 files changed, 16 insertions(+)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 655a583b06..c1faf75f5f 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -460,6 +460,18 @@ int index_blob_bulk_checkin_incore(struct object_id *oid,
 	return status;
 }
 
+int index_tree_bulk_checkin_incore(struct object_id *oid,
+				   const void *buf, size_t size,
+				   const char *path, unsigned flags)
+{
+	int status = deflate_obj_to_pack_incore(&bulk_checkin_packfile, oid,
+						buf, size, path, OBJ_TREE,
+						flags);
+	if (!odb_transaction_nesting)
+		flush_bulk_checkin_packfile(&bulk_checkin_packfile);
+	return status;
+}
+
 void begin_odb_transaction(void)
 {
 	odb_transaction_nesting += 1;
diff --git a/bulk-checkin.h b/bulk-checkin.h
index 1b91daeaee..89786b3954 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -17,6 +17,10 @@ int index_blob_bulk_checkin_incore(struct object_id *oid,
 				   const void *buf, size_t size,
 				   const char *path, unsigned flags);
 
+int index_tree_bulk_checkin_incore(struct object_id *oid,
+				   const void *buf, size_t size,
+				   const char *path, unsigned flags);
+
 /*
  * Tell the object database to optimize for adding
  * multiple objects. end_odb_transaction must be called
-- 
2.42.0.405.g86fe3250c2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v4 7/7] builtin/merge-tree.c: implement support for `--write-pack`
  2023-10-19 17:28 [PATCH v4 0/7] merge-ort: implement support for packing objects together Taylor Blau
                   ` (5 preceding siblings ...)
  2023-10-19 17:28 ` [PATCH v4 6/7] bulk-checkin: introduce `index_tree_bulk_checkin_incore()` Taylor Blau
@ 2023-10-19 17:29 ` Taylor Blau
  2023-10-19 21:47 ` [PATCH v4 0/7] merge-ort: implement support for packing objects together Junio C Hamano
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2023-10-19 17:29 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt

When using merge-tree often within a repository[^1], it is possible to
generate a relatively large number of loose objects, which can result in
degraded performance, and inode exhaustion in extreme cases.

Building on the functionality introduced in previous commits, the
bulk-checkin machinery now has support to write arbitrary blob and tree
objects which are small enough to be held in-core. We can use this to
write any blob/tree objects generated by ORT into a separate pack
instead of writing them out individually as loose.

This functionality is gated behind a new `--write-pack` option to
`merge-tree` that works with the (non-deprecated) `--write-tree` mode.

The implementation is relatively straightforward. There are two spots
within the ORT mechanism where we call `write_object_file()`, one for
content differences within blobs, and another to assemble any new trees
necessary to construct the merge. In each of those locations,
conditionally replace calls to `write_object_file()` with
`index_blob_bulk_checkin_incore()` or `index_tree_bulk_checkin_incore()`
depending on which kind of object we are writing.

The only remaining task is to begin and end the transaction necessary to
initialize the bulk-checkin machinery, and move any new pack(s) it
created into the main object store.

[^1]: Such is the case at GitHub, where we run presumptive "test merges"
  on open pull requests to see whether or not we can light up the merge
  button green depending on whether or not the presumptive merge was
  conflicted.

  This is done in response to a number of user-initiated events,
  including viewing an open pull request whose last test merge is stale
  with respect to the current base and tip of the pull request. As a
  result, merge-tree can be run very frequently on large, active
  repositories.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-merge-tree.txt |  4 ++
 builtin/merge-tree.c             |  5 ++
 merge-ort.c                      | 42 +++++++++++----
 merge-recursive.h                |  1 +
 t/t4301-merge-tree-write-tree.sh | 93 ++++++++++++++++++++++++++++++++
 5 files changed, 136 insertions(+), 9 deletions(-)

diff --git a/Documentation/git-merge-tree.txt b/Documentation/git-merge-tree.txt
index ffc4fbf7e8..9d37609ef1 100644
--- a/Documentation/git-merge-tree.txt
+++ b/Documentation/git-merge-tree.txt
@@ -69,6 +69,10 @@ OPTIONS
 	specify a merge-base for the merge, and specifying multiple bases is
 	currently not supported. This option is incompatible with `--stdin`.
 
+--write-pack::
+	Write any new objects into a separate packfile instead of as
+	individual loose objects.
+
 [[OUTPUT]]
 OUTPUT
 ------
diff --git a/builtin/merge-tree.c b/builtin/merge-tree.c
index 0de42aecf4..672ebd4c54 100644
--- a/builtin/merge-tree.c
+++ b/builtin/merge-tree.c
@@ -18,6 +18,7 @@
 #include "quote.h"
 #include "tree.h"
 #include "config.h"
+#include "bulk-checkin.h"
 
 static int line_termination = '\n';
 
@@ -414,6 +415,7 @@ struct merge_tree_options {
 	int show_messages;
 	int name_only;
 	int use_stdin;
+	int write_pack;
 };
 
 static int real_merge(struct merge_tree_options *o,
@@ -440,6 +442,7 @@ static int real_merge(struct merge_tree_options *o,
 	init_merge_options(&opt, the_repository);
 
 	opt.show_rename_progress = 0;
+	opt.write_pack = o->write_pack;
 
 	opt.branch1 = branch1;
 	opt.branch2 = branch2;
@@ -548,6 +551,8 @@ int cmd_merge_tree(int argc, const char **argv, const char *prefix)
 			   &merge_base,
 			   N_("commit"),
 			   N_("specify a merge-base for the merge")),
+		OPT_BOOL(0, "write-pack", &o.write_pack,
+			 N_("write new objects to a pack instead of as loose")),
 		OPT_END()
 	};
 
diff --git a/merge-ort.c b/merge-ort.c
index 3653725661..523577d71e 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -48,6 +48,7 @@
 #include "tree.h"
 #include "unpack-trees.h"
 #include "xdiff-interface.h"
+#include "bulk-checkin.h"
 
 /*
  * We have many arrays of size 3.  Whenever we have such an array, the
@@ -2108,10 +2109,19 @@ static int handle_content_merge(struct merge_options *opt,
 		if ((merge_status < 0) || !result_buf.ptr)
 			ret = error(_("failed to execute internal merge"));
 
-		if (!ret &&
-		    write_object_file(result_buf.ptr, result_buf.size,
-				      OBJ_BLOB, &result->oid))
-			ret = error(_("unable to add %s to database"), path);
+		if (!ret) {
+			ret = opt->write_pack
+				? index_blob_bulk_checkin_incore(&result->oid,
+								 result_buf.ptr,
+								 result_buf.size,
+								 path, 1)
+				: write_object_file(result_buf.ptr,
+						    result_buf.size,
+						    OBJ_BLOB, &result->oid);
+			if (ret)
+				ret = error(_("unable to add %s to database"),
+					    path);
+		}
 
 		free(result_buf.ptr);
 		if (ret)
@@ -3597,7 +3607,8 @@ static int tree_entry_order(const void *a_, const void *b_)
 				 b->string, strlen(b->string), bmi->result.mode);
 }
 
-static int write_tree(struct object_id *result_oid,
+static int write_tree(struct merge_options *opt,
+		      struct object_id *result_oid,
 		      struct string_list *versions,
 		      unsigned int offset,
 		      size_t hash_size)
@@ -3631,8 +3642,14 @@ static int write_tree(struct object_id *result_oid,
 	}
 
 	/* Write this object file out, and record in result_oid */
-	if (write_object_file(buf.buf, buf.len, OBJ_TREE, result_oid))
+	ret = opt->write_pack
+		? index_tree_bulk_checkin_incore(result_oid,
+						 buf.buf, buf.len, "", 1)
+		: write_object_file(buf.buf, buf.len, OBJ_TREE, result_oid);
+
+	if (ret)
 		ret = -1;
+
 	strbuf_release(&buf);
 	return ret;
 }
@@ -3797,8 +3814,8 @@ static int write_completed_directory(struct merge_options *opt,
 		 */
 		dir_info->is_null = 0;
 		dir_info->result.mode = S_IFDIR;
-		if (write_tree(&dir_info->result.oid, &info->versions, offset,
-			       opt->repo->hash_algo->rawsz) < 0)
+		if (write_tree(opt, &dir_info->result.oid, &info->versions,
+			       offset, opt->repo->hash_algo->rawsz) < 0)
 			ret = -1;
 	}
 
@@ -4332,9 +4349,13 @@ static int process_entries(struct merge_options *opt,
 		fflush(stdout);
 		BUG("dir_metadata accounting completely off; shouldn't happen");
 	}
-	if (write_tree(result_oid, &dir_metadata.versions, 0,
+	if (write_tree(opt, result_oid, &dir_metadata.versions, 0,
 		       opt->repo->hash_algo->rawsz) < 0)
 		ret = -1;
+
+	if (opt->write_pack)
+		end_odb_transaction();
+
 cleanup:
 	string_list_clear(&plist, 0);
 	string_list_clear(&dir_metadata.versions, 0);
@@ -4878,6 +4899,9 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 	 */
 	strmap_init(&opt->priv->conflicts);
 
+	if (opt->write_pack)
+		begin_odb_transaction();
+
 	trace2_region_leave("merge", "allocate/init", opt->repo);
 }
 
diff --git a/merge-recursive.h b/merge-recursive.h
index b88000e3c2..156e160876 100644
--- a/merge-recursive.h
+++ b/merge-recursive.h
@@ -48,6 +48,7 @@ struct merge_options {
 	unsigned renormalize : 1;
 	unsigned record_conflict_msgs_as_headers : 1;
 	const char *msg_header_prefix;
+	unsigned write_pack : 1;
 
 	/* internal fields used by the implementation */
 	struct merge_options_internal *priv;
diff --git a/t/t4301-merge-tree-write-tree.sh b/t/t4301-merge-tree-write-tree.sh
index 250f721795..2d81ff4de5 100755
--- a/t/t4301-merge-tree-write-tree.sh
+++ b/t/t4301-merge-tree-write-tree.sh
@@ -922,4 +922,97 @@ test_expect_success 'check the input format when --stdin is passed' '
 	test_cmp expect actual
 '
 
+packdir=".git/objects/pack"
+
+test_expect_success 'merge-tree can pack its result with --write-pack' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+
+	# base has lines [3, 4, 5]
+	#   - side adds to the beginning, resulting in [1, 2, 3, 4, 5]
+	#   - other adds to the end, resulting in [3, 4, 5, 6, 7]
+	#
+	# merging the two should result in a new blob object containing
+	# [1, 2, 3, 4, 5, 6, 7], along with a new tree.
+	test_commit -C repo base file "$(test_seq 3 5)" &&
+	git -C repo branch -M main &&
+	git -C repo checkout -b side main &&
+	test_commit -C repo side file "$(test_seq 1 5)" &&
+	git -C repo checkout -b other main &&
+	test_commit -C repo other file "$(test_seq 3 7)" &&
+
+	find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
+	tree="$(git -C repo merge-tree --write-pack \
+		refs/tags/side refs/tags/other)" &&
+	blob="$(git -C repo rev-parse $tree:file)" &&
+	find repo/$packdir -type f -name "pack-*.idx" >packs.after &&
+
+	test_must_be_empty packs.before &&
+	test_line_count = 1 packs.after &&
+
+	git show-index <$(cat packs.after) >objects &&
+	test_line_count = 2 objects &&
+	grep "^[1-9][0-9]* $tree" objects &&
+	grep "^[1-9][0-9]* $blob" objects
+'
+
+test_expect_success 'merge-tree can write multiple packs with --write-pack' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		git config pack.packSizeLimit 512 &&
+
+		test_seq 512 >f &&
+
+		# "f" contains roughly ~2,000 bytes.
+		#
+		# Each side ("foo" and "bar") adds a small amount of data at the
+		# beginning and end of "base", respectively.
+		git add f &&
+		test_tick &&
+		git commit -m base &&
+		git branch -M main &&
+
+		git checkout -b foo main &&
+		{
+			echo foo && cat f
+		} >f.tmp &&
+		mv f.tmp f &&
+		git add f &&
+		test_tick &&
+		git commit -m foo &&
+
+		git checkout -b bar main &&
+		echo bar >>f &&
+		git add f &&
+		test_tick &&
+		git commit -m bar &&
+
+		find $packdir -type f -name "pack-*.idx" >packs.before &&
+		# Merging either side should result in a new object which is
+		# larger than 1M, thus the result should be split into two
+		# separate packs.
+		tree="$(git merge-tree --write-pack \
+			refs/heads/foo refs/heads/bar)" &&
+		blob="$(git rev-parse $tree:f)" &&
+		find $packdir -type f -name "pack-*.idx" >packs.after &&
+
+		test_must_be_empty packs.before &&
+		test_line_count = 2 packs.after &&
+		for idx in $(cat packs.after)
+		do
+			git show-index <$idx || return 1
+		done >objects &&
+
+		# The resulting set of packs should contain one copy of both
+		# objects, each in a separate pack.
+		test_line_count = 2 objects &&
+		grep "^[1-9][0-9]* $tree" objects &&
+		grep "^[1-9][0-9]* $blob" objects
+
+	)
+'
+
 test_done
-- 
2.42.0.405.g86fe3250c2

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 0/7] merge-ort: implement support for packing objects together
  2023-10-19 17:28 [PATCH v4 0/7] merge-ort: implement support for packing objects together Taylor Blau
                   ` (6 preceding siblings ...)
  2023-10-19 17:29 ` [PATCH v4 7/7] builtin/merge-tree.c: implement support for `--write-pack` Taylor Blau
@ 2023-10-19 21:47 ` Junio C Hamano
  2023-10-20  7:29 ` Jeff King
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 63+ messages in thread
From: Junio C Hamano @ 2023-10-19 21:47 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Elijah Newren, Eric W. Biederman, Jeff King, Patrick Steinhardt

Taylor Blau <me@ttaylorr.com> writes:

> By factoring out a common
> "deflate object to pack" that takes an abstract bulk_checkin_source as a
> parameter, all of the earlier refactorings can be dropped since we
> retain only a single caller instead of multiple.

Ah, that is how we managed to lose the preparatory split of new
helper functions.  The names given to these functions that represent
steps of checkpointing were well thought out, and they served their
documentation purposes very well, but if they all have a single
caller, then that is fine ;-).

I _think_ I carefully followed the code/data flow of the new code to
see that we do identical things when streaming from a file to a pack
(in other words, this does not introduce regression), and nothing
objectionable stood out.  The order of presentation was a good way
to let readers to do so, and the actual "now we have everything
working as we want, let's teach the machinery to also read from an
in-core buffer" step being very small gives us extra confidence with
the result.  Very nicely done.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 0/7] merge-ort: implement support for packing objects together
  2023-10-19 17:28 [PATCH v4 0/7] merge-ort: implement support for packing objects together Taylor Blau
                   ` (7 preceding siblings ...)
  2023-10-19 21:47 ` [PATCH v4 0/7] merge-ort: implement support for packing objects together Junio C Hamano
@ 2023-10-20  7:29 ` Jeff King
  2023-10-20 16:53   ` Junio C Hamano
  2023-10-23  9:19 ` Patrick Steinhardt
  2023-10-23 22:44 ` [PATCH v5 0/5] " Taylor Blau
  10 siblings, 1 reply; 63+ messages in thread
From: Jeff King @ 2023-10-20  7:29 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Elijah Newren, Eric W. Biederman, Junio C Hamano,
	Patrick Steinhardt

On Thu, Oct 19, 2023 at 01:28:38PM -0400, Taylor Blau wrote:

> This is a new round that is significantly simplified thanks to
> another very helpful suggestion[1] from Junio. By factoring out a common
> "deflate object to pack" that takes an abstract bulk_checkin_source as a
> parameter, all of the earlier refactorings can be dropped since we
> retain only a single caller instead of multiple.

FWIW, I gave this a read-over and did not see anything wrong (on the
contrary, I think the new abstractions make it quite easy to follow).

Two thoughts that occurred to me while reading it (but neither
actionable for your series):

  - Having not worked with the bulk-checkin code much before, I thought
    it only put in one blob per pack, and thus you'd have to teach it to
    handle multiple objects, too. But fortunately I was wrong, and it
    handles this case already. (I mention this mainly because we had
    talked about it off-list a few weeks ago, and some of my confusion
    may have seeped into my comments then).

  - I did briefly wonder if we need this SOURCE abstraction at all. The
    file source assumes we can seek() and read(), so it must be a
    regular file. We could probably just mmap() it and treat it as
    in-core, too (this is what index_fd() and index_path() do under the
    hood!). But it would probably be a disservice to any platforms that
    do not support mmap, as the fallback is to read into a full buffer
    (and the whole point of the bulk checkin code is to avoid that).

-Peff

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 1/7] bulk-checkin: extract abstract `bulk_checkin_source`
  2023-10-19 17:28 ` [PATCH v4 1/7] bulk-checkin: extract abstract `bulk_checkin_source` Taylor Blau
@ 2023-10-20  7:35   ` Jeff King
  2023-10-20 16:55     ` Junio C Hamano
  0 siblings, 1 reply; 63+ messages in thread
From: Jeff King @ 2023-10-20  7:35 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Elijah Newren, Eric W. Biederman, Junio C Hamano,
	Patrick Steinhardt

On Thu, Oct 19, 2023 at 01:28:42PM -0400, Taylor Blau wrote:

> +struct bulk_checkin_source {
> +	enum { SOURCE_FILE } type;
> +
> +	/* SOURCE_FILE fields */
> +	int fd;
> +
> +	/* common fields */
> +	size_t size;
> +	const char *path;
> +};

This is a pretty minor nit, but we may find that "SOURCE_FILE" is not
sufficiently name-spaced. Enum tags are in the global namespace, so
the compiler will barf if there are any conflicts.

It might be OK here, since this is local to a single C file (so we at
least are not hurting other code), but we may be in trouble if code in a
header file is less careful. There is already a near-miss here with
GREP_SOURCE_FILE, but fortunately grep.h is indeed careful. :)

(I notice that ref-filter.c similarly uses SOURCE_* for an internal
enum).

-Peff

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 0/7] merge-ort: implement support for packing objects together
  2023-10-20  7:29 ` Jeff King
@ 2023-10-20 16:53   ` Junio C Hamano
  0 siblings, 0 replies; 63+ messages in thread
From: Junio C Hamano @ 2023-10-20 16:53 UTC (permalink / raw)
  To: Jeff King
  Cc: Taylor Blau, git, Elijah Newren, Eric W. Biederman, Patrick Steinhardt

Jeff King <peff@peff.net> writes:

>   - Having not worked with the bulk-checkin code much before, I thought
>     it only put in one blob per pack, and thus you'd have to teach it to
>     handle multiple objects, too. But fortunately I was wrong, and it
>     handles this case already. (I mention this mainly because we had
>     talked about it off-list a few weeks ago, and some of my confusion
>     may have seeped into my comments then).

You probably had it mixed up with a real limitation, i.e., "unlike
fast-import that considers the immediate previous object could be a
good delta base, this code does not even attempt any deltification"
;-)

>   - I did briefly wonder if we need this SOURCE abstraction at all. The
>     file source assumes we can seek() and read(), so it must be a
>     regular file. We could probably just mmap() it and treat it as
>     in-core, too (this is what index_fd() and index_path() do under the
>     hood!).

Ahhhh, I've never thought of that one.  I actually was hoping that
we can add an abstraction layer that can be reused elsewhere and
everywhere (not limited to the bulk-checkin codepath for its source
material), which might help folks who want to refactor Git enough to
allow them plug $CORP specific goodies like virtual filesystem layer
;-).

>     But it would probably be a disservice to any platforms that
>     do not support mmap, as the fallback is to read into a full buffer
>     (and the whole point of the bulk checkin code is to avoid that).

Yes.  The result was a pleasant read.

Thanks, both.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 1/7] bulk-checkin: extract abstract `bulk_checkin_source`
  2023-10-20  7:35   ` Jeff King
@ 2023-10-20 16:55     ` Junio C Hamano
  0 siblings, 0 replies; 63+ messages in thread
From: Junio C Hamano @ 2023-10-20 16:55 UTC (permalink / raw)
  To: Jeff King
  Cc: Taylor Blau, git, Elijah Newren, Eric W. Biederman, Patrick Steinhardt

Jeff King <peff@peff.net> writes:

> On Thu, Oct 19, 2023 at 01:28:42PM -0400, Taylor Blau wrote:
>
>> +struct bulk_checkin_source {
>> +	enum { SOURCE_FILE } type;
>> +
>> +	/* SOURCE_FILE fields */
>> +	int fd;
>> +
>> +	/* common fields */
>> +	size_t size;
>> +	const char *path;
>> +};
>
> This is a pretty minor nit, but we may find that "SOURCE_FILE" is not
> sufficiently name-spaced. Enum tags are in the global namespace, so
> the compiler will barf if there are any conflicts.

A very good point.  This was one of the reasons why I suggested
avoiding the switch() based on a new enum altogether and use a
vtable based approach instead.  But it may do while this one is
private to the file and never exposed to other subsystems.

> It might be OK here, since this is local to a single C file (so we at
> least are not hurting other code), but we may be in trouble if code in a
> header file is less careful. There is already a near-miss here with
> GREP_SOURCE_FILE, but fortunately grep.h is indeed careful. :)
>
> (I notice that ref-filter.c similarly uses SOURCE_* for an internal
> enum).

Thanks.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 4/7] bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source`
  2023-10-19 17:28 ` [PATCH v4 4/7] bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source` Taylor Blau
@ 2023-10-23  9:19   ` Patrick Steinhardt
  2023-10-23 18:58     ` Jeff King
  0 siblings, 1 reply; 63+ messages in thread
From: Patrick Steinhardt @ 2023-10-23  9:19 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 2588 bytes --]

On Thu, Oct 19, 2023 at 01:28:51PM -0400, Taylor Blau wrote:
> Continue to prepare for streaming an object's contents directly from
> memory by teaching `bulk_checkin_source` how to perform reads and seeks
> based on an address in memory.
> 
> Unlike file descriptors, which manage their own offset internally, we
> have to keep track of how many bytes we've read out of the buffer, and
> make sure we don't read past the end of the buffer.
> 
> Suggested-by: Junio C Hamano <gitster@pobox.com>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  bulk-checkin.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/bulk-checkin.c b/bulk-checkin.c
> index 28bc8d5ab4..60361b3e2e 100644
> --- a/bulk-checkin.c
> +++ b/bulk-checkin.c
> @@ -141,11 +141,15 @@ static int already_written(struct bulk_checkin_packfile *state, struct object_id
>  }
>  
>  struct bulk_checkin_source {
> -	enum { SOURCE_FILE } type;
> +	enum { SOURCE_FILE, SOURCE_INCORE } type;
>  
>  	/* SOURCE_FILE fields */
>  	int fd;
>  
> +	/* SOURCE_INCORE fields */
> +	const void *buf;
> +	size_t read;
> +
>  	/* common fields */
>  	size_t size;
>  	const char *path;
> @@ -157,6 +161,11 @@ static off_t bulk_checkin_source_seek_to(struct bulk_checkin_source *source,
>  	switch (source->type) {
>  	case SOURCE_FILE:
>  		return lseek(source->fd, offset, SEEK_SET);
> +	case SOURCE_INCORE:
> +		if (!(0 <= offset && offset < source->size))
> +			return (off_t)-1;
> +		source->read = offset;
> +		return source->read;
>  	default:
>  		BUG("unknown bulk-checkin source: %d", source->type);
>  	}
> @@ -168,6 +177,13 @@ static ssize_t bulk_checkin_source_read(struct bulk_checkin_source *source,
>  	switch (source->type) {
>  	case SOURCE_FILE:
>  		return read_in_full(source->fd, buf, nr);
> +	case SOURCE_INCORE:
> +		assert(source->read <= source->size);

Is there any guideline around when to use `assert()` vs `BUG()`? I think
that this assertion here is quite critical, because when it does not
hold we can end up performing out-of-bounds reads and writes. But as
asserts are typically missing in non-debug builds, this safeguard would
not do anything for our end users, right?

Patrick

> +		if (nr > source->size - source->read)
> +			nr = source->size - source->read;
> +		memcpy(buf, (unsigned char *)source->buf + source->read, nr);
> +		source->read += nr;
> +		return nr;
>  	default:
>  		BUG("unknown bulk-checkin source: %d", source->type);
>  	}
> -- 
> 2.42.0.405.g86fe3250c2
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 0/7] merge-ort: implement support for packing objects together
  2023-10-19 17:28 [PATCH v4 0/7] merge-ort: implement support for packing objects together Taylor Blau
                   ` (8 preceding siblings ...)
  2023-10-20  7:29 ` Jeff King
@ 2023-10-23  9:19 ` Patrick Steinhardt
  2023-10-23 22:44 ` [PATCH v5 0/5] " Taylor Blau
  10 siblings, 0 replies; 63+ messages in thread
From: Patrick Steinhardt @ 2023-10-23  9:19 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 13253 bytes --]

On Thu, Oct 19, 2023 at 01:28:38PM -0400, Taylor Blau wrote:
> (Rebased onto the current tip of 'master', which is 813d9a9188 (The
> nineteenth batch, 2023-10-18) at the time of writing).
> 
> This series implements support for a new merge-tree option,
> `--write-pack`, which causes any newly-written objects to be packed
> together instead of being stored individually as loose.
> 
> I intentionally broke this off from the existing thread, since I
> accidentally rerolled mine and Jonathan Tan's Bloom v2 series into it,
> causing some confusion.
> 
> This is a new round that is significantly simplified thanks to
> another very helpful suggestion[1] from Junio. By factoring out a common
> "deflate object to pack" that takes an abstract bulk_checkin_source as a
> parameter, all of the earlier refactorings can be dropped since we
> retain only a single caller instead of multiple.
> 
> This resulted in a rather satisfying range-diff (included below, as
> usual), and a similarly satisfying inter-diff:
> 
>     $ git diff --stat tb/ort-bulk-checkin.v3..
>      bulk-checkin.c | 203 ++++++++++++++++---------------------------------
>      1 file changed, 64 insertions(+), 139 deletions(-)
> 
> Beyond that, the changes since last time can be viewed in the range-diff
> below. Thanks in advance for any review!
> 
> [1]: https://lore.kernel.org/git/xmqq34y7plj4.fsf@gitster.g/

A single question regarding an `assert()` from my side. Other than that
the series looks good to me, thanks.

Patrick

> Taylor Blau (7):
>   bulk-checkin: extract abstract `bulk_checkin_source`
>   bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types
>   bulk-checkin: refactor deflate routine to accept a
>     `bulk_checkin_source`
>   bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source`
>   bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
>   bulk-checkin: introduce `index_tree_bulk_checkin_incore()`
>   builtin/merge-tree.c: implement support for `--write-pack`
> 
>  Documentation/git-merge-tree.txt |   4 +
>  builtin/merge-tree.c             |   5 +
>  bulk-checkin.c                   | 161 ++++++++++++++++++++++++++-----
>  bulk-checkin.h                   |   8 ++
>  merge-ort.c                      |  42 ++++++--
>  merge-recursive.h                |   1 +
>  t/t4301-merge-tree-write-tree.sh |  93 ++++++++++++++++++
>  7 files changed, 280 insertions(+), 34 deletions(-)
> 
> Range-diff against v3:
>  1:  2dffa45183 <  -:  ---------- bulk-checkin: factor out `format_object_header_hash()`
>  2:  7a10dc794a <  -:  ---------- bulk-checkin: factor out `prepare_checkpoint()`
>  3:  20c32d2178 <  -:  ---------- bulk-checkin: factor out `truncate_checkpoint()`
>  4:  893051d0b7 <  -:  ---------- bulk-checkin: factor out `finalize_checkpoint()`
>  5:  da52ec8380 !  1:  97bb6e9f59 bulk-checkin: extract abstract `bulk_checkin_source`
>     @@ bulk-checkin.c: static int stream_blob_to_pack(struct bulk_checkin_packfile *sta
>       			if (*already_hashed_to < offset) {
>       				size_t hsize = offset - *already_hashed_to;
>      @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>     - 	git_hash_ctx ctx;
>     + 	unsigned header_len;
>       	struct hashfile_checkpoint checkpoint = {0};
>       	struct pack_idx_entry *idx = NULL;
>      +	struct bulk_checkin_source source = {
>     @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *st
>       	seekback = lseek(fd, 0, SEEK_CUR);
>       	if (seekback == (off_t) -1)
>      @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>     - 	while (1) {
>     - 		prepare_checkpoint(state, &checkpoint, idx, flags);
>     + 			crc32_begin(state->f);
>     + 		}
>       		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
>      -					 fd, size, path, flags))
>      +					 &source, flags))
>       			break;
>     - 		truncate_checkpoint(state, &checkpoint, idx);
>     + 		/*
>     + 		 * Writing this object to the current pack will make
>     +@@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>     + 		hashfile_truncate(state->f, &checkpoint);
>     + 		state->offset = checkpoint.offset;
>     + 		flush_bulk_checkin_packfile(state);
>      -		if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
>      +		if (bulk_checkin_source_seek_to(&source, seekback) == (off_t)-1)
>       			return error("cannot seek back");
>       	}
>     - 	finalize_checkpoint(state, &ctx, &checkpoint, idx, result_oid);
>     + 	the_hash_algo->final_oid_fn(result_oid, &ctx);
>  7:  04ec74e357 !  2:  9d633df339 bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types
>     @@ bulk-checkin.c: static int stream_blob_to_pack(struct bulk_checkin_packfile *sta
>       	s.avail_out = sizeof(obuf) - hdrlen;
>       
>      @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>     - 
>     - 	while (1) {
>     - 		prepare_checkpoint(state, &checkpoint, idx, flags);
>     + 			idx->offset = state->offset;
>     + 			crc32_begin(state->f);
>     + 		}
>      -		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
>      -					 &source, flags))
>      +		if (!stream_obj_to_pack(state, &ctx, &already_hashed_to,
>      +					&source, OBJ_BLOB, flags))
>       			break;
>     - 		truncate_checkpoint(state, &checkpoint, idx);
>     - 		if (bulk_checkin_source_seek_to(&source, seekback) == (off_t)-1)
>     + 		/*
>     + 		 * Writing this object to the current pack will make
>  -:  ---------- >  3:  d5bbd7810e bulk-checkin: refactor deflate routine to accept a `bulk_checkin_source`
>  6:  4e9bac5bc1 =  4:  e427fe6ad3 bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source`
>  8:  8667b76365 !  5:  48095afe80 bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
>     @@ Commit message
>          objects individually as loose.
>      
>          Similar to the existing `index_blob_bulk_checkin()` function, the
>     -    entrypoint delegates to `deflate_blob_to_pack_incore()`, which is
>     -    responsible for formatting the pack header and then deflating the
>     -    contents into the pack. The latter is accomplished by calling
>     -    deflate_obj_contents_to_pack_incore(), which takes advantage of the
>     -    earlier refactorings and is responsible for writing the object to the
>     -    pack and handling any overage from pack.packSizeLimit.
>     -
>     -    The bulk of the new functionality is implemented in the function
>     -    `stream_obj_to_pack()`, which can handle streaming objects from memory
>     -    to the bulk-checkin pack as a result of the earlier refactoring.
>     +    entrypoint delegates to `deflate_obj_to_pack_incore()`. That function in
>     +    turn delegates to deflate_obj_to_pack(), which is responsible for
>     +    formatting the pack header and then deflating the contents into the
>     +    pack.
>      
>          Consistent with the rest of the bulk-checkin mechanism, there are no
>          direct tests here. In future commits when we expose this new
>     @@ Commit message
>          Signed-off-by: Taylor Blau <me@ttaylorr.com>
>      
>       ## bulk-checkin.c ##
>     -@@ bulk-checkin.c: static void finalize_checkpoint(struct bulk_checkin_packfile *state,
>     - 	}
>     +@@ bulk-checkin.c: static int deflate_obj_to_pack(struct bulk_checkin_packfile *state,
>     + 	return 0;
>       }
>       
>     -+static int deflate_obj_contents_to_pack_incore(struct bulk_checkin_packfile *state,
>     -+					       git_hash_ctx *ctx,
>     -+					       struct hashfile_checkpoint *checkpoint,
>     -+					       struct object_id *result_oid,
>     -+					       const void *buf, size_t size,
>     -+					       enum object_type type,
>     -+					       const char *path, unsigned flags)
>     ++static int deflate_obj_to_pack_incore(struct bulk_checkin_packfile *state,
>     ++				       struct object_id *result_oid,
>     ++				       const void *buf, size_t size,
>     ++				       const char *path, enum object_type type,
>     ++				       unsigned flags)
>      +{
>     -+	struct pack_idx_entry *idx = NULL;
>     -+	off_t already_hashed_to = 0;
>      +	struct bulk_checkin_source source = {
>      +		.type = SOURCE_INCORE,
>      +		.buf = buf,
>     @@ bulk-checkin.c: static void finalize_checkpoint(struct bulk_checkin_packfile *st
>      +		.path = path,
>      +	};
>      +
>     -+	/* Note: idx is non-NULL when we are writing */
>     -+	if (flags & HASH_WRITE_OBJECT)
>     -+		CALLOC_ARRAY(idx, 1);
>     -+
>     -+	while (1) {
>     -+		prepare_checkpoint(state, checkpoint, idx, flags);
>     -+
>     -+		if (!stream_obj_to_pack(state, ctx, &already_hashed_to, &source,
>     -+					type, flags))
>     -+			break;
>     -+		truncate_checkpoint(state, checkpoint, idx);
>     -+		bulk_checkin_source_seek_to(&source, 0);
>     -+	}
>     -+
>     -+	finalize_checkpoint(state, ctx, checkpoint, idx, result_oid);
>     -+
>     -+	return 0;
>     -+}
>     -+
>     -+static int deflate_blob_to_pack_incore(struct bulk_checkin_packfile *state,
>     -+				       struct object_id *result_oid,
>     -+				       const void *buf, size_t size,
>     -+				       const char *path, unsigned flags)
>     -+{
>     -+	git_hash_ctx ctx;
>     -+	struct hashfile_checkpoint checkpoint = {0};
>     -+
>     -+	format_object_header_hash(the_hash_algo, &ctx, &checkpoint, OBJ_BLOB,
>     -+				  size);
>     -+
>     -+	return deflate_obj_contents_to_pack_incore(state, &ctx, &checkpoint,
>     -+						   result_oid, buf, size,
>     -+						   OBJ_BLOB, path, flags);
>     ++	return deflate_obj_to_pack(state, result_oid, &source, type, 0, flags);
>      +}
>      +
>       static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>     @@ bulk-checkin.c: int index_blob_bulk_checkin(struct object_id *oid,
>      +				   const void *buf, size_t size,
>      +				   const char *path, unsigned flags)
>      +{
>     -+	int status = deflate_blob_to_pack_incore(&bulk_checkin_packfile, oid,
>     -+						 buf, size, path, flags);
>     ++	int status = deflate_obj_to_pack_incore(&bulk_checkin_packfile, oid,
>     ++						buf, size, path, OBJ_BLOB,
>     ++						flags);
>      +	if (!odb_transaction_nesting)
>      +		flush_bulk_checkin_packfile(&bulk_checkin_packfile);
>      +	return status;
>  9:  cba043ef14 !  6:  60568f9281 bulk-checkin: introduce `index_tree_bulk_checkin_incore()`
>     @@ Commit message
>          machinery will have enough to keep track of the converted object's hash
>          in order to update the compatibility mapping.
>      
>     -    Within `deflate_tree_to_pack_incore()`, the changes should be limited
>     -    to something like:
>     +    Within some thin wrapper around `deflate_obj_to_pack_incore()` (perhaps
>     +    `deflate_tree_to_pack_incore()`), the changes should be limited to
>     +    something like:
>      
>              struct strbuf converted = STRBUF_INIT;
>              if (the_repository->compat_hash_algo) {
>     @@ Commit message
>          Signed-off-by: Taylor Blau <me@ttaylorr.com>
>      
>       ## bulk-checkin.c ##
>     -@@ bulk-checkin.c: static int deflate_blob_to_pack_incore(struct bulk_checkin_packfile *state,
>     - 						   OBJ_BLOB, path, flags);
>     - }
>     - 
>     -+static int deflate_tree_to_pack_incore(struct bulk_checkin_packfile *state,
>     -+				       struct object_id *result_oid,
>     -+				       const void *buf, size_t size,
>     -+				       const char *path, unsigned flags)
>     -+{
>     -+	git_hash_ctx ctx;
>     -+	struct hashfile_checkpoint checkpoint = {0};
>     -+
>     -+	format_object_header_hash(the_hash_algo, &ctx, &checkpoint, OBJ_TREE,
>     -+				  size);
>     -+
>     -+	return deflate_obj_contents_to_pack_incore(state, &ctx, &checkpoint,
>     -+						   result_oid, buf, size,
>     -+						   OBJ_TREE, path, flags);
>     -+}
>     -+
>     - static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>     - 				struct object_id *result_oid,
>     - 				int fd, size_t size,
>      @@ bulk-checkin.c: int index_blob_bulk_checkin_incore(struct object_id *oid,
>       	return status;
>       }
>     @@ bulk-checkin.c: int index_blob_bulk_checkin_incore(struct object_id *oid,
>      +				   const void *buf, size_t size,
>      +				   const char *path, unsigned flags)
>      +{
>     -+	int status = deflate_tree_to_pack_incore(&bulk_checkin_packfile, oid,
>     -+						 buf, size, path, flags);
>     ++	int status = deflate_obj_to_pack_incore(&bulk_checkin_packfile, oid,
>     ++						buf, size, path, OBJ_TREE,
>     ++						flags);
>      +	if (!odb_transaction_nesting)
>      +		flush_bulk_checkin_packfile(&bulk_checkin_packfile);
>      +	return status;
> 10:  ae70508037 =  7:  b9be9df122 builtin/merge-tree.c: implement support for `--write-pack`
> -- 
> 2.42.0.405.g86fe3250c2

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 4/7] bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source`
  2023-10-23  9:19   ` Patrick Steinhardt
@ 2023-10-23 18:58     ` Jeff King
  2023-10-24  6:34       ` Patrick Steinhardt
  0 siblings, 1 reply; 63+ messages in thread
From: Jeff King @ 2023-10-23 18:58 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Taylor Blau, git, Elijah Newren, Eric W. Biederman, Junio C Hamano

On Mon, Oct 23, 2023 at 11:19:13AM +0200, Patrick Steinhardt wrote:

> > +	case SOURCE_INCORE:
> > +		assert(source->read <= source->size);
> 
> Is there any guideline around when to use `assert()` vs `BUG()`? I think
> that this assertion here is quite critical, because when it does not
> hold we can end up performing out-of-bounds reads and writes. But as
> asserts are typically missing in non-debug builds, this safeguard would
> not do anything for our end users, right?

I don't think we have a written guideline. My philosophy is: always use
BUG(), because you will never be surprised that the assertion was not
compiled in (and I think compiling without assertions is almost
certainly premature optimization, especially given the way we tend to
use them).

-Peff

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v5 0/5] merge-ort: implement support for packing objects together
  2023-10-19 17:28 [PATCH v4 0/7] merge-ort: implement support for packing objects together Taylor Blau
                   ` (9 preceding siblings ...)
  2023-10-23  9:19 ` Patrick Steinhardt
@ 2023-10-23 22:44 ` Taylor Blau
  2023-10-23 22:44   ` [PATCH v5 1/5] bulk-checkin: extract abstract `bulk_checkin_source` Taylor Blau
                     ` (6 more replies)
  10 siblings, 7 replies; 63+ messages in thread
From: Taylor Blau @ 2023-10-23 22:44 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt

(Rebased onto the current tip of 'master', which is ceadf0f3cf (The
twentieth batch, 2023-10-20)).

This series implements support for a new merge-tree option,
`--write-pack`, which causes any newly-written objects to be packed
together instead of being stored individually as loose.

This is a minor follow-up that could be taken instead of v4 (though the
changes between these two most recent rounds are stylistic and a matter
of subjective opinion).

This moves us the bulk_checkin_source structure introduced in response
to Junio's suggestion during the last round further in the OOP
direction. Instead of switching on the enum type of the source, have
function pointers for read() and seek() respectively.

The functionality at the end is the same, but this avoids some of the
namespacing issues that Peff pointed out while looking at v4. But I
think that this approach ended up being less heavy-weight than I had
originally imagined, so I think that this version is a worthwhile
improvement over v4.

Beyond that, the changes since last time can be viewed in the range-diff
below. Thanks in advance for any review!

[1]: https://lore.kernel.org/git/xmqq34y7plj4.fsf@gitster.g/

Taylor Blau (5):
  bulk-checkin: extract abstract `bulk_checkin_source`
  bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types
  bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
  bulk-checkin: introduce `index_tree_bulk_checkin_incore()`
  builtin/merge-tree.c: implement support for `--write-pack`

 Documentation/git-merge-tree.txt |   4 +
 builtin/merge-tree.c             |   5 +
 bulk-checkin.c                   | 197 +++++++++++++++++++++++++++----
 bulk-checkin.h                   |   8 ++
 merge-ort.c                      |  42 +++++--
 merge-recursive.h                |   1 +
 t/t4301-merge-tree-write-tree.sh |  93 +++++++++++++++
 7 files changed, 316 insertions(+), 34 deletions(-)

Range-diff against v4:
1:  97bb6e9f59 ! 1:  696aa027e4 bulk-checkin: extract abstract `bulk_checkin_source`
    @@ bulk-checkin.c: static int already_written(struct bulk_checkin_packfile *state,
      }
      
     +struct bulk_checkin_source {
    -+	enum { SOURCE_FILE } type;
    ++	off_t (*read)(struct bulk_checkin_source *, void *, size_t);
    ++	off_t (*seek)(struct bulk_checkin_source *, off_t);
     +
    -+	/* SOURCE_FILE fields */
    -+	int fd;
    ++	union {
    ++		struct {
    ++			int fd;
    ++		} from_fd;
    ++	} data;
     +
    -+	/* common fields */
     +	size_t size;
     +	const char *path;
     +};
     +
    -+static off_t bulk_checkin_source_seek_to(struct bulk_checkin_source *source,
    -+					 off_t offset)
    ++static off_t bulk_checkin_source_read_from_fd(struct bulk_checkin_source *source,
    ++					      void *buf, size_t nr)
     +{
    -+	switch (source->type) {
    -+	case SOURCE_FILE:
    -+		return lseek(source->fd, offset, SEEK_SET);
    -+	default:
    -+		BUG("unknown bulk-checkin source: %d", source->type);
    -+	}
    ++	return read_in_full(source->data.from_fd.fd, buf, nr);
     +}
     +
    -+static ssize_t bulk_checkin_source_read(struct bulk_checkin_source *source,
    -+					void *buf, size_t nr)
    ++static off_t bulk_checkin_source_seek_from_fd(struct bulk_checkin_source *source,
    ++					      off_t offset)
     +{
    -+	switch (source->type) {
    -+	case SOURCE_FILE:
    -+		return read_in_full(source->fd, buf, nr);
    -+	default:
    -+		BUG("unknown bulk-checkin source: %d", source->type);
    -+	}
    ++	return lseek(source->data.from_fd.fd, offset, SEEK_SET);
    ++}
    ++
    ++static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
    ++					     int fd, size_t size,
    ++					     const char *path)
    ++{
    ++	memset(source, 0, sizeof(struct bulk_checkin_source));
    ++
    ++	source->read = bulk_checkin_source_read_from_fd;
    ++	source->seek = bulk_checkin_source_seek_from_fd;
    ++
    ++	source->data.from_fd.fd = fd;
    ++
    ++	source->size = size;
    ++	source->path = path;
     +}
     +
      /*
    @@ bulk-checkin.c: static int stream_blob_to_pack(struct bulk_checkin_packfile *sta
     -			ssize_t read_result = read_in_full(fd, ibuf, rsize);
     +			ssize_t read_result;
     +
    -+			read_result = bulk_checkin_source_read(source, ibuf,
    -+							       rsize);
    ++			read_result = source->read(source, ibuf, rsize);
      			if (read_result < 0)
     -				die_errno("failed to read from '%s'", path);
     +				die_errno("failed to read from '%s'",
    @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *st
      	unsigned header_len;
      	struct hashfile_checkpoint checkpoint = {0};
      	struct pack_idx_entry *idx = NULL;
    -+	struct bulk_checkin_source source = {
    -+		.type = SOURCE_FILE,
    -+		.fd = fd,
    -+		.size = size,
    -+		.path = path,
    -+	};
    ++	struct bulk_checkin_source source;
    ++
    ++	init_bulk_checkin_source_from_fd(&source, fd, size, path);
      
      	seekback = lseek(fd, 0, SEEK_CUR);
      	if (seekback == (off_t) -1)
    @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *st
      		state->offset = checkpoint.offset;
      		flush_bulk_checkin_packfile(state);
     -		if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
    -+		if (bulk_checkin_source_seek_to(&source, seekback) == (off_t)-1)
    ++		if (source.seek(&source, seekback) == (off_t)-1)
      			return error("cannot seek back");
      	}
      	the_hash_algo->final_oid_fn(result_oid, &ctx);
2:  9d633df339 < -:  ---------- bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types
3:  d5bbd7810e ! 2:  596bd028a7 bulk-checkin: refactor deflate routine to accept a `bulk_checkin_source`
    @@ Metadata
     Author: Taylor Blau <me@ttaylorr.com>
     
      ## Commit message ##
    -    bulk-checkin: refactor deflate routine to accept a `bulk_checkin_source`
    +    bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types
     
    -    Prepare for a future change where we will want to use a routine very
    -    similar to the existing `deflate_blob_to_pack()` but over arbitrary
    -    sources (i.e. either open file-descriptors, or a location in memory).
    +    The existing `stream_blob_to_pack()` function is named based on the fact
    +    that it knows only how to stream blobs into a bulk-checkin pack.
     
    -    Extract out a common "deflate_obj_to_pack()" routine that acts on a
    -    bulk_checkin_source, instead of a (int, size_t) pair. Then rewrite
    -    `deflate_blob_to_pack()` in terms of it.
    +    But there is no longer anything in this function which prevents us from
    +    writing objects of arbitrary types to the bulk-checkin pack. Prepare to
    +    write OBJ_TREEs by removing this assumption, adding an `enum
    +    object_type` parameter to this function's argument list, and renaming it
    +    to `stream_obj_to_pack()` accordingly.
     
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## bulk-checkin.c ##
    +@@ bulk-checkin.c: static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
    +  * status before calling us just in case we ask it to call us again
    +  * with a new pack.
    +  */
    +-static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
    +-			       git_hash_ctx *ctx, off_t *already_hashed_to,
    +-			       struct bulk_checkin_source *source,
    +-			       unsigned flags)
    ++static int stream_obj_to_pack(struct bulk_checkin_packfile *state,
    ++			      git_hash_ctx *ctx, off_t *already_hashed_to,
    ++			      struct bulk_checkin_source *source,
    ++			      enum object_type type, unsigned flags)
    + {
    + 	git_zstream s;
    + 	unsigned char ibuf[16384];
    +@@ bulk-checkin.c: static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
    + 
    + 	git_deflate_init(&s, pack_compression_level);
    + 
    +-	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), OBJ_BLOB,
    +-					      size);
    ++	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), type, size);
    + 	s.next_out = obuf + hdrlen;
    + 	s.avail_out = sizeof(obuf) - hdrlen;
    + 
     @@ bulk-checkin.c: static void prepare_to_stream(struct bulk_checkin_packfile *state,
      		die_errno("unable to write pack header");
      }
    @@ bulk-checkin.c: static void prepare_to_stream(struct bulk_checkin_packfile *stat
      	unsigned header_len;
      	struct hashfile_checkpoint checkpoint = {0};
      	struct pack_idx_entry *idx = NULL;
    --	struct bulk_checkin_source source = {
    --		.type = SOURCE_FILE,
    --		.fd = fd,
    --		.size = size,
    --		.path = path,
    --	};
    +-	struct bulk_checkin_source source;
      
    +-	init_bulk_checkin_source_from_fd(&source, fd, size, path);
    +-
     -	seekback = lseek(fd, 0, SEEK_CUR);
     -	if (seekback == (off_t) -1)
     -		return error("cannot find the current offset");
    @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *st
      		prepare_to_stream(state, flags);
      		if (idx) {
     @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
    + 			idx->offset = state->offset;
      			crc32_begin(state->f);
      		}
    - 		if (!stream_obj_to_pack(state, &ctx, &already_hashed_to,
    --					&source, OBJ_BLOB, flags))
    +-		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
    +-					 &source, flags))
    ++		if (!stream_obj_to_pack(state, &ctx, &already_hashed_to,
     +					source, type, flags))
      			break;
      		/*
    @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *st
      		hashfile_truncate(state->f, &checkpoint);
      		state->offset = checkpoint.offset;
      		flush_bulk_checkin_packfile(state);
    --		if (bulk_checkin_source_seek_to(&source, seekback) == (off_t)-1)
    -+		if (bulk_checkin_source_seek_to(source, seekback) == (off_t)-1)
    +-		if (source.seek(&source, seekback) == (off_t)-1)
    ++		if (source->seek(source, seekback) == (off_t)-1)
      			return error("cannot seek back");
      	}
      	the_hash_algo->final_oid_fn(result_oid, &ctx);
    @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *st
     +				int fd, size_t size,
     +				const char *path, unsigned flags)
     +{
    -+	struct bulk_checkin_source source = {
    -+		.type = SOURCE_FILE,
    -+		.fd = fd,
    -+		.size = size,
    -+		.path = path,
    -+	};
    -+	off_t seekback = lseek(fd, 0, SEEK_CUR);
    ++	struct bulk_checkin_source source;
    ++	off_t seekback;
    ++
    ++	init_bulk_checkin_source_from_fd(&source, fd, size, path);
    ++
    ++	seekback = lseek(fd, 0, SEEK_CUR);
     +	if (seekback == (off_t) -1)
     +		return error("cannot find the current offset");
     +
4:  e427fe6ad3 < -:  ---------- bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source`
5:  48095afe80 ! 3:  d8cf8e4395 bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
    @@ Metadata
      ## Commit message ##
         bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
     
    -    Now that we have factored out many of the common routines necessary to
    -    index a new object into a pack created by the bulk-checkin machinery, we
    -    can introduce a variant of `index_blob_bulk_checkin()` that acts on
    -    blobs whose contents we can fit in memory.
    +    Introduce `index_blob_bulk_checkin_incore()` which allows streaming
    +    arbitrary blob contents from memory into the bulk-checkin pack.
    +
    +    In order to support streaming from a location in memory, we must
    +    implement a new kind of bulk_checkin_source that does just that. These
    +    implementation in spread out across:
    +
    +      - init_bulk_checkin_source_incore()
    +      - bulk_checkin_source_read_incore()
    +      - bulk_checkin_source_seek_incore()
    +
    +    Note that, unlike file descriptors, which manage their own offset
    +    internally, we have to keep track of how many bytes we've read out of
    +    the buffer, and make sure we don't read past the end of the buffer.
     
         This will be useful in a couple of more commits in order to provide the
         `merge-tree` builtin with a mechanism to create a new pack containing
    @@ Commit message
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## bulk-checkin.c ##
    +@@ bulk-checkin.c: struct bulk_checkin_source {
    + 		struct {
    + 			int fd;
    + 		} from_fd;
    ++		struct {
    ++			const void *buf;
    ++			size_t nr_read;
    ++		} incore;
    + 	} data;
    + 
    + 	size_t size;
    +@@ bulk-checkin.c: static off_t bulk_checkin_source_seek_from_fd(struct bulk_checkin_source *source
    + 	return lseek(source->data.from_fd.fd, offset, SEEK_SET);
    + }
    + 
    ++static off_t bulk_checkin_source_read_incore(struct bulk_checkin_source *source,
    ++					     void *buf, size_t nr)
    ++{
    ++	const unsigned char *src = source->data.incore.buf;
    ++
    ++	if (source->data.incore.nr_read > source->size)
    ++		BUG("read beyond bulk-checkin source buffer end "
    ++		    "(%"PRIuMAX" > %"PRIuMAX")",
    ++		    (uintmax_t)source->data.incore.nr_read,
    ++		    (uintmax_t)source->size);
    ++
    ++	if (nr > source->size - source->data.incore.nr_read)
    ++		nr = source->size - source->data.incore.nr_read;
    ++
    ++	src += source->data.incore.nr_read;
    ++
    ++	memcpy(buf, src, nr);
    ++	source->data.incore.nr_read += nr;
    ++	return nr;
    ++}
    ++
    ++static off_t bulk_checkin_source_seek_incore(struct bulk_checkin_source *source,
    ++					     off_t offset)
    ++{
    ++	if (!(0 <= offset && offset < source->size))
    ++		return (off_t)-1;
    ++	source->data.incore.nr_read = offset;
    ++	return source->data.incore.nr_read;
    ++}
    ++
    + static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
    + 					     int fd, size_t size,
    + 					     const char *path)
    +@@ bulk-checkin.c: static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
    + 	source->path = path;
    + }
    + 
    ++static void init_bulk_checkin_source_incore(struct bulk_checkin_source *source,
    ++					    const void *buf, size_t size,
    ++					    const char *path)
    ++{
    ++	memset(source, 0, sizeof(struct bulk_checkin_source));
    ++
    ++	source->read = bulk_checkin_source_read_incore;
    ++	source->seek = bulk_checkin_source_seek_incore;
    ++
    ++	source->data.incore.buf = buf;
    ++	source->data.incore.nr_read = 0;
    ++
    ++	source->size = size;
    ++	source->path = path;
    ++}
    ++
    + /*
    +  * Read the contents from 'source' for 'size' bytes, streaming it to the
    +  * packfile in state while updating the hash in ctx. Signal a failure
     @@ bulk-checkin.c: static int deflate_obj_to_pack(struct bulk_checkin_packfile *state,
      	return 0;
      }
    @@ bulk-checkin.c: static int deflate_obj_to_pack(struct bulk_checkin_packfile *sta
     +				       const char *path, enum object_type type,
     +				       unsigned flags)
     +{
    -+	struct bulk_checkin_source source = {
    -+		.type = SOURCE_INCORE,
    -+		.buf = buf,
    -+		.size = size,
    -+		.read = 0,
    -+		.path = path,
    -+	};
    ++	struct bulk_checkin_source source;
    ++
    ++	init_bulk_checkin_source_incore(&source, buf, size, path);
     +
     +	return deflate_obj_to_pack(state, result_oid, &source, type, 0, flags);
     +}
6:  60568f9281 = 4:  2670192802 bulk-checkin: introduce `index_tree_bulk_checkin_incore()`
7:  b9be9df122 ! 5:  3595db76a5 builtin/merge-tree.c: implement support for `--write-pack`
    @@ Documentation/git-merge-tree.txt: OPTIONS
     
      ## builtin/merge-tree.c ##
     @@
    - #include "quote.h"
      #include "tree.h"
      #include "config.h"
    + #include "strvec.h"
     +#include "bulk-checkin.h"
      
      static int line_termination = '\n';
      
     @@ builtin/merge-tree.c: struct merge_tree_options {
    - 	int show_messages;
      	int name_only;
      	int use_stdin;
    + 	struct merge_options merge_options;
     +	int write_pack;
      };
      
      static int real_merge(struct merge_tree_options *o,
     @@ builtin/merge-tree.c: static int real_merge(struct merge_tree_options *o,
    - 	init_merge_options(&opt, the_repository);
    + 				 _("not something we can merge"));
      
      	opt.show_rename_progress = 0;
     +	opt.write_pack = o->write_pack;
    @@ builtin/merge-tree.c: static int real_merge(struct merge_tree_options *o,
      	opt.branch1 = branch1;
      	opt.branch2 = branch2;
     @@ builtin/merge-tree.c: int cmd_merge_tree(int argc, const char **argv, const char *prefix)
    - 			   &merge_base,
    - 			   N_("commit"),
      			   N_("specify a merge-base for the merge")),
    + 		OPT_STRVEC('X', "strategy-option", &xopts, N_("option=value"),
    + 			N_("option for selected merge strategy")),
     +		OPT_BOOL(0, "write-pack", &o.write_pack,
     +			 N_("write new objects to a pack instead of as loose")),
      		OPT_END()

base-commit: ceadf0f3cf51550166a387ec8508bb55e7883057
-- 
2.42.0.425.g963d08ddb3.dirty

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH v5 1/5] bulk-checkin: extract abstract `bulk_checkin_source`
  2023-10-23 22:44 ` [PATCH v5 0/5] " Taylor Blau
@ 2023-10-23 22:44   ` Taylor Blau
  2023-10-25  7:37     ` Jeff King
  2023-10-23 22:44   ` [PATCH v5 2/5] bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types Taylor Blau
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2023-10-23 22:44 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt

A future commit will want to implement a very similar routine as in
`stream_blob_to_pack()` with two notable changes:

  - Instead of streaming just OBJ_BLOBs, this new function may want to
    stream objects of arbitrary type.

  - Instead of streaming the object's contents from an open
    file-descriptor, this new function may want to "stream" its contents
    from memory.

To avoid duplicating a significant chunk of code between the existing
`stream_blob_to_pack()`, extract an abstract `bulk_checkin_source`. This
concept currently is a thin layer of `lseek()` and `read_in_full()`, but
will grow to understand how to perform analogous operations when writing
out an object's contents from memory.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bulk-checkin.c | 65 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 57 insertions(+), 8 deletions(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 6ce62999e5..174a6c24e4 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -140,8 +140,49 @@ static int already_written(struct bulk_checkin_packfile *state, struct object_id
 	return 0;
 }
 
+struct bulk_checkin_source {
+	off_t (*read)(struct bulk_checkin_source *, void *, size_t);
+	off_t (*seek)(struct bulk_checkin_source *, off_t);
+
+	union {
+		struct {
+			int fd;
+		} from_fd;
+	} data;
+
+	size_t size;
+	const char *path;
+};
+
+static off_t bulk_checkin_source_read_from_fd(struct bulk_checkin_source *source,
+					      void *buf, size_t nr)
+{
+	return read_in_full(source->data.from_fd.fd, buf, nr);
+}
+
+static off_t bulk_checkin_source_seek_from_fd(struct bulk_checkin_source *source,
+					      off_t offset)
+{
+	return lseek(source->data.from_fd.fd, offset, SEEK_SET);
+}
+
+static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
+					     int fd, size_t size,
+					     const char *path)
+{
+	memset(source, 0, sizeof(struct bulk_checkin_source));
+
+	source->read = bulk_checkin_source_read_from_fd;
+	source->seek = bulk_checkin_source_seek_from_fd;
+
+	source->data.from_fd.fd = fd;
+
+	source->size = size;
+	source->path = path;
+}
+
 /*
- * Read the contents from fd for size bytes, streaming it to the
+ * Read the contents from 'source' for 'size' bytes, streaming it to the
  * packfile in state while updating the hash in ctx. Signal a failure
  * by returning a negative value when the resulting pack would exceed
  * the pack size limit and this is not the first object in the pack,
@@ -157,7 +198,7 @@ static int already_written(struct bulk_checkin_packfile *state, struct object_id
  */
 static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
 			       git_hash_ctx *ctx, off_t *already_hashed_to,
-			       int fd, size_t size, const char *path,
+			       struct bulk_checkin_source *source,
 			       unsigned flags)
 {
 	git_zstream s;
@@ -167,22 +208,27 @@ static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
 	int status = Z_OK;
 	int write_object = (flags & HASH_WRITE_OBJECT);
 	off_t offset = 0;
+	size_t size = source->size;
 
 	git_deflate_init(&s, pack_compression_level);
 
-	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), OBJ_BLOB, size);
+	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), OBJ_BLOB,
+					      size);
 	s.next_out = obuf + hdrlen;
 	s.avail_out = sizeof(obuf) - hdrlen;
 
 	while (status != Z_STREAM_END) {
 		if (size && !s.avail_in) {
 			ssize_t rsize = size < sizeof(ibuf) ? size : sizeof(ibuf);
-			ssize_t read_result = read_in_full(fd, ibuf, rsize);
+			ssize_t read_result;
+
+			read_result = source->read(source, ibuf, rsize);
 			if (read_result < 0)
-				die_errno("failed to read from '%s'", path);
+				die_errno("failed to read from '%s'",
+					  source->path);
 			if (read_result != rsize)
 				die("failed to read %d bytes from '%s'",
-				    (int)rsize, path);
+				    (int)rsize, source->path);
 			offset += rsize;
 			if (*already_hashed_to < offset) {
 				size_t hsize = offset - *already_hashed_to;
@@ -258,6 +304,9 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	unsigned header_len;
 	struct hashfile_checkpoint checkpoint = {0};
 	struct pack_idx_entry *idx = NULL;
+	struct bulk_checkin_source source;
+
+	init_bulk_checkin_source_from_fd(&source, fd, size, path);
 
 	seekback = lseek(fd, 0, SEEK_CUR);
 	if (seekback == (off_t) -1)
@@ -283,7 +332,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 			crc32_begin(state->f);
 		}
 		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
-					 fd, size, path, flags))
+					 &source, flags))
 			break;
 		/*
 		 * Writing this object to the current pack will make
@@ -295,7 +344,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 		hashfile_truncate(state->f, &checkpoint);
 		state->offset = checkpoint.offset;
 		flush_bulk_checkin_packfile(state);
-		if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
+		if (source.seek(&source, seekback) == (off_t)-1)
 			return error("cannot seek back");
 	}
 	the_hash_algo->final_oid_fn(result_oid, &ctx);
-- 
2.42.0.425.g963d08ddb3.dirty


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v5 2/5] bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types
  2023-10-23 22:44 ` [PATCH v5 0/5] " Taylor Blau
  2023-10-23 22:44   ` [PATCH v5 1/5] bulk-checkin: extract abstract `bulk_checkin_source` Taylor Blau
@ 2023-10-23 22:44   ` Taylor Blau
  2023-10-23 22:45   ` [PATCH v5 3/5] bulk-checkin: introduce `index_blob_bulk_checkin_incore()` Taylor Blau
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2023-10-23 22:44 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt

The existing `stream_blob_to_pack()` function is named based on the fact
that it knows only how to stream blobs into a bulk-checkin pack.

But there is no longer anything in this function which prevents us from
writing objects of arbitrary types to the bulk-checkin pack. Prepare to
write OBJ_TREEs by removing this assumption, adding an `enum
object_type` parameter to this function's argument list, and renaming it
to `stream_obj_to_pack()` accordingly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bulk-checkin.c | 61 +++++++++++++++++++++++++++++---------------------
 1 file changed, 36 insertions(+), 25 deletions(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 174a6c24e4..79776e679e 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -196,10 +196,10 @@ static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
  * status before calling us just in case we ask it to call us again
  * with a new pack.
  */
-static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
-			       git_hash_ctx *ctx, off_t *already_hashed_to,
-			       struct bulk_checkin_source *source,
-			       unsigned flags)
+static int stream_obj_to_pack(struct bulk_checkin_packfile *state,
+			      git_hash_ctx *ctx, off_t *already_hashed_to,
+			      struct bulk_checkin_source *source,
+			      enum object_type type, unsigned flags)
 {
 	git_zstream s;
 	unsigned char ibuf[16384];
@@ -212,8 +212,7 @@ static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
 
 	git_deflate_init(&s, pack_compression_level);
 
-	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), OBJ_BLOB,
-					      size);
+	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), type, size);
 	s.next_out = obuf + hdrlen;
 	s.avail_out = sizeof(obuf) - hdrlen;
 
@@ -293,27 +292,23 @@ static void prepare_to_stream(struct bulk_checkin_packfile *state,
 		die_errno("unable to write pack header");
 }
 
-static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
-				struct object_id *result_oid,
-				int fd, size_t size,
-				const char *path, unsigned flags)
+
+static int deflate_obj_to_pack(struct bulk_checkin_packfile *state,
+			       struct object_id *result_oid,
+			       struct bulk_checkin_source *source,
+			       enum object_type type,
+			       off_t seekback,
+			       unsigned flags)
 {
-	off_t seekback, already_hashed_to;
+	off_t already_hashed_to = 0;
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
 	struct hashfile_checkpoint checkpoint = {0};
 	struct pack_idx_entry *idx = NULL;
-	struct bulk_checkin_source source;
 
-	init_bulk_checkin_source_from_fd(&source, fd, size, path);
-
-	seekback = lseek(fd, 0, SEEK_CUR);
-	if (seekback == (off_t) -1)
-		return error("cannot find the current offset");
-
-	header_len = format_object_header((char *)obuf, sizeof(obuf),
-					  OBJ_BLOB, size);
+	header_len = format_object_header((char *)obuf, sizeof(obuf), type,
+					  source->size);
 	the_hash_algo->init_fn(&ctx);
 	the_hash_algo->update_fn(&ctx, obuf, header_len);
 	the_hash_algo->init_fn(&checkpoint.ctx);
@@ -322,8 +317,6 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	if ((flags & HASH_WRITE_OBJECT) != 0)
 		CALLOC_ARRAY(idx, 1);
 
-	already_hashed_to = 0;
-
 	while (1) {
 		prepare_to_stream(state, flags);
 		if (idx) {
@@ -331,8 +324,8 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 			idx->offset = state->offset;
 			crc32_begin(state->f);
 		}
-		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
-					 &source, flags))
+		if (!stream_obj_to_pack(state, &ctx, &already_hashed_to,
+					source, type, flags))
 			break;
 		/*
 		 * Writing this object to the current pack will make
@@ -344,7 +337,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 		hashfile_truncate(state->f, &checkpoint);
 		state->offset = checkpoint.offset;
 		flush_bulk_checkin_packfile(state);
-		if (source.seek(&source, seekback) == (off_t)-1)
+		if (source->seek(source, seekback) == (off_t)-1)
 			return error("cannot seek back");
 	}
 	the_hash_algo->final_oid_fn(result_oid, &ctx);
@@ -366,6 +359,24 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	return 0;
 }
 
+static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
+				struct object_id *result_oid,
+				int fd, size_t size,
+				const char *path, unsigned flags)
+{
+	struct bulk_checkin_source source;
+	off_t seekback;
+
+	init_bulk_checkin_source_from_fd(&source, fd, size, path);
+
+	seekback = lseek(fd, 0, SEEK_CUR);
+	if (seekback == (off_t) -1)
+		return error("cannot find the current offset");
+
+	return deflate_obj_to_pack(state, result_oid, &source, OBJ_BLOB,
+				   seekback, flags);
+}
+
 void prepare_loose_object_bulk_checkin(void)
 {
 	/*
-- 
2.42.0.425.g963d08ddb3.dirty


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v5 3/5] bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
  2023-10-23 22:44 ` [PATCH v5 0/5] " Taylor Blau
  2023-10-23 22:44   ` [PATCH v5 1/5] bulk-checkin: extract abstract `bulk_checkin_source` Taylor Blau
  2023-10-23 22:44   ` [PATCH v5 2/5] bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types Taylor Blau
@ 2023-10-23 22:45   ` Taylor Blau
  2023-10-25  7:58     ` Patrick Steinhardt
  2023-10-23 22:45   ` [PATCH v5 4/5] bulk-checkin: introduce `index_tree_bulk_checkin_incore()` Taylor Blau
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2023-10-23 22:45 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt

Introduce `index_blob_bulk_checkin_incore()` which allows streaming
arbitrary blob contents from memory into the bulk-checkin pack.

In order to support streaming from a location in memory, we must
implement a new kind of bulk_checkin_source that does just that. These
implementation in spread out across:

  - init_bulk_checkin_source_incore()
  - bulk_checkin_source_read_incore()
  - bulk_checkin_source_seek_incore()

Note that, unlike file descriptors, which manage their own offset
internally, we have to keep track of how many bytes we've read out of
the buffer, and make sure we don't read past the end of the buffer.

This will be useful in a couple of more commits in order to provide the
`merge-tree` builtin with a mechanism to create a new pack containing
any objects it created during the merge, instead of storing those
objects individually as loose.

Similar to the existing `index_blob_bulk_checkin()` function, the
entrypoint delegates to `deflate_obj_to_pack_incore()`. That function in
turn delegates to deflate_obj_to_pack(), which is responsible for
formatting the pack header and then deflating the contents into the
pack.

Consistent with the rest of the bulk-checkin mechanism, there are no
direct tests here. In future commits when we expose this new
functionality via the `merge-tree` builtin, we will test it indirectly
there.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bulk-checkin.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++
 bulk-checkin.h |  4 +++
 2 files changed, 79 insertions(+)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 79776e679e..b728210bc7 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -148,6 +148,10 @@ struct bulk_checkin_source {
 		struct {
 			int fd;
 		} from_fd;
+		struct {
+			const void *buf;
+			size_t nr_read;
+		} incore;
 	} data;
 
 	size_t size;
@@ -166,6 +170,36 @@ static off_t bulk_checkin_source_seek_from_fd(struct bulk_checkin_source *source
 	return lseek(source->data.from_fd.fd, offset, SEEK_SET);
 }
 
+static off_t bulk_checkin_source_read_incore(struct bulk_checkin_source *source,
+					     void *buf, size_t nr)
+{
+	const unsigned char *src = source->data.incore.buf;
+
+	if (source->data.incore.nr_read > source->size)
+		BUG("read beyond bulk-checkin source buffer end "
+		    "(%"PRIuMAX" > %"PRIuMAX")",
+		    (uintmax_t)source->data.incore.nr_read,
+		    (uintmax_t)source->size);
+
+	if (nr > source->size - source->data.incore.nr_read)
+		nr = source->size - source->data.incore.nr_read;
+
+	src += source->data.incore.nr_read;
+
+	memcpy(buf, src, nr);
+	source->data.incore.nr_read += nr;
+	return nr;
+}
+
+static off_t bulk_checkin_source_seek_incore(struct bulk_checkin_source *source,
+					     off_t offset)
+{
+	if (!(0 <= offset && offset < source->size))
+		return (off_t)-1;
+	source->data.incore.nr_read = offset;
+	return source->data.incore.nr_read;
+}
+
 static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
 					     int fd, size_t size,
 					     const char *path)
@@ -181,6 +215,22 @@ static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
 	source->path = path;
 }
 
+static void init_bulk_checkin_source_incore(struct bulk_checkin_source *source,
+					    const void *buf, size_t size,
+					    const char *path)
+{
+	memset(source, 0, sizeof(struct bulk_checkin_source));
+
+	source->read = bulk_checkin_source_read_incore;
+	source->seek = bulk_checkin_source_seek_incore;
+
+	source->data.incore.buf = buf;
+	source->data.incore.nr_read = 0;
+
+	source->size = size;
+	source->path = path;
+}
+
 /*
  * Read the contents from 'source' for 'size' bytes, streaming it to the
  * packfile in state while updating the hash in ctx. Signal a failure
@@ -359,6 +409,19 @@ static int deflate_obj_to_pack(struct bulk_checkin_packfile *state,
 	return 0;
 }
 
+static int deflate_obj_to_pack_incore(struct bulk_checkin_packfile *state,
+				       struct object_id *result_oid,
+				       const void *buf, size_t size,
+				       const char *path, enum object_type type,
+				       unsigned flags)
+{
+	struct bulk_checkin_source source;
+
+	init_bulk_checkin_source_incore(&source, buf, size, path);
+
+	return deflate_obj_to_pack(state, result_oid, &source, type, 0, flags);
+}
+
 static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 				struct object_id *result_oid,
 				int fd, size_t size,
@@ -421,6 +484,18 @@ int index_blob_bulk_checkin(struct object_id *oid,
 	return status;
 }
 
+int index_blob_bulk_checkin_incore(struct object_id *oid,
+				   const void *buf, size_t size,
+				   const char *path, unsigned flags)
+{
+	int status = deflate_obj_to_pack_incore(&bulk_checkin_packfile, oid,
+						buf, size, path, OBJ_BLOB,
+						flags);
+	if (!odb_transaction_nesting)
+		flush_bulk_checkin_packfile(&bulk_checkin_packfile);
+	return status;
+}
+
 void begin_odb_transaction(void)
 {
 	odb_transaction_nesting += 1;
diff --git a/bulk-checkin.h b/bulk-checkin.h
index aa7286a7b3..1b91daeaee 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -13,6 +13,10 @@ int index_blob_bulk_checkin(struct object_id *oid,
 			    int fd, size_t size,
 			    const char *path, unsigned flags);
 
+int index_blob_bulk_checkin_incore(struct object_id *oid,
+				   const void *buf, size_t size,
+				   const char *path, unsigned flags);
+
 /*
  * Tell the object database to optimize for adding
  * multiple objects. end_odb_transaction must be called
-- 
2.42.0.425.g963d08ddb3.dirty


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v5 4/5] bulk-checkin: introduce `index_tree_bulk_checkin_incore()`
  2023-10-23 22:44 ` [PATCH v5 0/5] " Taylor Blau
                     ` (2 preceding siblings ...)
  2023-10-23 22:45   ` [PATCH v5 3/5] bulk-checkin: introduce `index_blob_bulk_checkin_incore()` Taylor Blau
@ 2023-10-23 22:45   ` Taylor Blau
  2023-10-23 22:45   ` [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack` Taylor Blau
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2023-10-23 22:45 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt

The remaining missing piece in order to teach the `merge-tree` builtin
how to write the contents of a merge into a pack is a function to index
tree objects into a bulk-checkin pack.

This patch implements that missing piece, which is a thin wrapper around
all of the functionality introduced in previous commits.

If and when Git gains support for a "compatibility" hash algorithm, the
changes to support that here will be minimal. The bulk-checkin machinery
will need to convert the incoming tree to compute its length under the
compatibility hash, necessary to reconstruct its header. With that
information (and the converted contents of the tree), the bulk-checkin
machinery will have enough to keep track of the converted object's hash
in order to update the compatibility mapping.

Within some thin wrapper around `deflate_obj_to_pack_incore()` (perhaps
`deflate_tree_to_pack_incore()`), the changes should be limited to
something like:

    struct strbuf converted = STRBUF_INIT;
    if (the_repository->compat_hash_algo) {
      if (convert_object_file(&compat_obj,
                              the_repository->hash_algo,
                              the_repository->compat_hash_algo, ...) < 0)
        die(...);

      format_object_header_hash(the_repository->compat_hash_algo,
                                OBJ_TREE, size);
    }
    /* compute the converted tree's hash using the compat algorithm */
    strbuf_release(&converted);

, assuming related changes throughout the rest of the bulk-checkin
machinery necessary to update the hash of the converted object, which
are likewise minimal in size.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bulk-checkin.c | 12 ++++++++++++
 bulk-checkin.h |  4 ++++
 2 files changed, 16 insertions(+)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index b728210bc7..bd6151ba3c 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -496,6 +496,18 @@ int index_blob_bulk_checkin_incore(struct object_id *oid,
 	return status;
 }
 
+int index_tree_bulk_checkin_incore(struct object_id *oid,
+				   const void *buf, size_t size,
+				   const char *path, unsigned flags)
+{
+	int status = deflate_obj_to_pack_incore(&bulk_checkin_packfile, oid,
+						buf, size, path, OBJ_TREE,
+						flags);
+	if (!odb_transaction_nesting)
+		flush_bulk_checkin_packfile(&bulk_checkin_packfile);
+	return status;
+}
+
 void begin_odb_transaction(void)
 {
 	odb_transaction_nesting += 1;
diff --git a/bulk-checkin.h b/bulk-checkin.h
index 1b91daeaee..89786b3954 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -17,6 +17,10 @@ int index_blob_bulk_checkin_incore(struct object_id *oid,
 				   const void *buf, size_t size,
 				   const char *path, unsigned flags);
 
+int index_tree_bulk_checkin_incore(struct object_id *oid,
+				   const void *buf, size_t size,
+				   const char *path, unsigned flags);
+
 /*
  * Tell the object database to optimize for adding
  * multiple objects. end_odb_transaction must be called
-- 
2.42.0.425.g963d08ddb3.dirty


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
  2023-10-23 22:44 ` [PATCH v5 0/5] " Taylor Blau
                     ` (3 preceding siblings ...)
  2023-10-23 22:45   ` [PATCH v5 4/5] bulk-checkin: introduce `index_tree_bulk_checkin_incore()` Taylor Blau
@ 2023-10-23 22:45   ` Taylor Blau
  2023-10-25  7:58     ` Patrick Steinhardt
  2023-11-10 23:51     ` Elijah Newren
  2023-10-23 23:31   ` [PATCH v5 0/5] merge-ort: implement support for packing objects together Junio C Hamano
  2023-10-25  7:58   ` [PATCH v5 0/5] merge-ort: implement support for packing objects together Patrick Steinhardt
  6 siblings, 2 replies; 63+ messages in thread
From: Taylor Blau @ 2023-10-23 22:45 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano,
	Patrick Steinhardt

When using merge-tree often within a repository[^1], it is possible to
generate a relatively large number of loose objects, which can result in
degraded performance, and inode exhaustion in extreme cases.

Building on the functionality introduced in previous commits, the
bulk-checkin machinery now has support to write arbitrary blob and tree
objects which are small enough to be held in-core. We can use this to
write any blob/tree objects generated by ORT into a separate pack
instead of writing them out individually as loose.

This functionality is gated behind a new `--write-pack` option to
`merge-tree` that works with the (non-deprecated) `--write-tree` mode.

The implementation is relatively straightforward. There are two spots
within the ORT mechanism where we call `write_object_file()`, one for
content differences within blobs, and another to assemble any new trees
necessary to construct the merge. In each of those locations,
conditionally replace calls to `write_object_file()` with
`index_blob_bulk_checkin_incore()` or `index_tree_bulk_checkin_incore()`
depending on which kind of object we are writing.

The only remaining task is to begin and end the transaction necessary to
initialize the bulk-checkin machinery, and move any new pack(s) it
created into the main object store.

[^1]: Such is the case at GitHub, where we run presumptive "test merges"
  on open pull requests to see whether or not we can light up the merge
  button green depending on whether or not the presumptive merge was
  conflicted.

  This is done in response to a number of user-initiated events,
  including viewing an open pull request whose last test merge is stale
  with respect to the current base and tip of the pull request. As a
  result, merge-tree can be run very frequently on large, active
  repositories.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-merge-tree.txt |  4 ++
 builtin/merge-tree.c             |  5 ++
 merge-ort.c                      | 42 +++++++++++----
 merge-recursive.h                |  1 +
 t/t4301-merge-tree-write-tree.sh | 93 ++++++++++++++++++++++++++++++++
 5 files changed, 136 insertions(+), 9 deletions(-)

diff --git a/Documentation/git-merge-tree.txt b/Documentation/git-merge-tree.txt
index ffc4fbf7e8..9d37609ef1 100644
--- a/Documentation/git-merge-tree.txt
+++ b/Documentation/git-merge-tree.txt
@@ -69,6 +69,10 @@ OPTIONS
 	specify a merge-base for the merge, and specifying multiple bases is
 	currently not supported. This option is incompatible with `--stdin`.
 
+--write-pack::
+	Write any new objects into a separate packfile instead of as
+	individual loose objects.
+
 [[OUTPUT]]
 OUTPUT
 ------
diff --git a/builtin/merge-tree.c b/builtin/merge-tree.c
index a35e0452d6..218442ac9b 100644
--- a/builtin/merge-tree.c
+++ b/builtin/merge-tree.c
@@ -19,6 +19,7 @@
 #include "tree.h"
 #include "config.h"
 #include "strvec.h"
+#include "bulk-checkin.h"
 
 static int line_termination = '\n';
 
@@ -416,6 +417,7 @@ struct merge_tree_options {
 	int name_only;
 	int use_stdin;
 	struct merge_options merge_options;
+	int write_pack;
 };
 
 static int real_merge(struct merge_tree_options *o,
@@ -441,6 +443,7 @@ static int real_merge(struct merge_tree_options *o,
 				 _("not something we can merge"));
 
 	opt.show_rename_progress = 0;
+	opt.write_pack = o->write_pack;
 
 	opt.branch1 = branch1;
 	opt.branch2 = branch2;
@@ -553,6 +556,8 @@ int cmd_merge_tree(int argc, const char **argv, const char *prefix)
 			   N_("specify a merge-base for the merge")),
 		OPT_STRVEC('X', "strategy-option", &xopts, N_("option=value"),
 			N_("option for selected merge strategy")),
+		OPT_BOOL(0, "write-pack", &o.write_pack,
+			 N_("write new objects to a pack instead of as loose")),
 		OPT_END()
 	};
 
diff --git a/merge-ort.c b/merge-ort.c
index 3653725661..523577d71e 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -48,6 +48,7 @@
 #include "tree.h"
 #include "unpack-trees.h"
 #include "xdiff-interface.h"
+#include "bulk-checkin.h"
 
 /*
  * We have many arrays of size 3.  Whenever we have such an array, the
@@ -2108,10 +2109,19 @@ static int handle_content_merge(struct merge_options *opt,
 		if ((merge_status < 0) || !result_buf.ptr)
 			ret = error(_("failed to execute internal merge"));
 
-		if (!ret &&
-		    write_object_file(result_buf.ptr, result_buf.size,
-				      OBJ_BLOB, &result->oid))
-			ret = error(_("unable to add %s to database"), path);
+		if (!ret) {
+			ret = opt->write_pack
+				? index_blob_bulk_checkin_incore(&result->oid,
+								 result_buf.ptr,
+								 result_buf.size,
+								 path, 1)
+				: write_object_file(result_buf.ptr,
+						    result_buf.size,
+						    OBJ_BLOB, &result->oid);
+			if (ret)
+				ret = error(_("unable to add %s to database"),
+					    path);
+		}
 
 		free(result_buf.ptr);
 		if (ret)
@@ -3597,7 +3607,8 @@ static int tree_entry_order(const void *a_, const void *b_)
 				 b->string, strlen(b->string), bmi->result.mode);
 }
 
-static int write_tree(struct object_id *result_oid,
+static int write_tree(struct merge_options *opt,
+		      struct object_id *result_oid,
 		      struct string_list *versions,
 		      unsigned int offset,
 		      size_t hash_size)
@@ -3631,8 +3642,14 @@ static int write_tree(struct object_id *result_oid,
 	}
 
 	/* Write this object file out, and record in result_oid */
-	if (write_object_file(buf.buf, buf.len, OBJ_TREE, result_oid))
+	ret = opt->write_pack
+		? index_tree_bulk_checkin_incore(result_oid,
+						 buf.buf, buf.len, "", 1)
+		: write_object_file(buf.buf, buf.len, OBJ_TREE, result_oid);
+
+	if (ret)
 		ret = -1;
+
 	strbuf_release(&buf);
 	return ret;
 }
@@ -3797,8 +3814,8 @@ static int write_completed_directory(struct merge_options *opt,
 		 */
 		dir_info->is_null = 0;
 		dir_info->result.mode = S_IFDIR;
-		if (write_tree(&dir_info->result.oid, &info->versions, offset,
-			       opt->repo->hash_algo->rawsz) < 0)
+		if (write_tree(opt, &dir_info->result.oid, &info->versions,
+			       offset, opt->repo->hash_algo->rawsz) < 0)
 			ret = -1;
 	}
 
@@ -4332,9 +4349,13 @@ static int process_entries(struct merge_options *opt,
 		fflush(stdout);
 		BUG("dir_metadata accounting completely off; shouldn't happen");
 	}
-	if (write_tree(result_oid, &dir_metadata.versions, 0,
+	if (write_tree(opt, result_oid, &dir_metadata.versions, 0,
 		       opt->repo->hash_algo->rawsz) < 0)
 		ret = -1;
+
+	if (opt->write_pack)
+		end_odb_transaction();
+
 cleanup:
 	string_list_clear(&plist, 0);
 	string_list_clear(&dir_metadata.versions, 0);
@@ -4878,6 +4899,9 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 	 */
 	strmap_init(&opt->priv->conflicts);
 
+	if (opt->write_pack)
+		begin_odb_transaction();
+
 	trace2_region_leave("merge", "allocate/init", opt->repo);
 }
 
diff --git a/merge-recursive.h b/merge-recursive.h
index 3d3b3e3c29..5c5ff380a8 100644
--- a/merge-recursive.h
+++ b/merge-recursive.h
@@ -48,6 +48,7 @@ struct merge_options {
 	unsigned renormalize : 1;
 	unsigned record_conflict_msgs_as_headers : 1;
 	const char *msg_header_prefix;
+	unsigned write_pack : 1;
 
 	/* internal fields used by the implementation */
 	struct merge_options_internal *priv;
diff --git a/t/t4301-merge-tree-write-tree.sh b/t/t4301-merge-tree-write-tree.sh
index b2c8a43fce..d2a8634523 100755
--- a/t/t4301-merge-tree-write-tree.sh
+++ b/t/t4301-merge-tree-write-tree.sh
@@ -945,4 +945,97 @@ test_expect_success 'check the input format when --stdin is passed' '
 	test_cmp expect actual
 '
 
+packdir=".git/objects/pack"
+
+test_expect_success 'merge-tree can pack its result with --write-pack' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+
+	# base has lines [3, 4, 5]
+	#   - side adds to the beginning, resulting in [1, 2, 3, 4, 5]
+	#   - other adds to the end, resulting in [3, 4, 5, 6, 7]
+	#
+	# merging the two should result in a new blob object containing
+	# [1, 2, 3, 4, 5, 6, 7], along with a new tree.
+	test_commit -C repo base file "$(test_seq 3 5)" &&
+	git -C repo branch -M main &&
+	git -C repo checkout -b side main &&
+	test_commit -C repo side file "$(test_seq 1 5)" &&
+	git -C repo checkout -b other main &&
+	test_commit -C repo other file "$(test_seq 3 7)" &&
+
+	find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
+	tree="$(git -C repo merge-tree --write-pack \
+		refs/tags/side refs/tags/other)" &&
+	blob="$(git -C repo rev-parse $tree:file)" &&
+	find repo/$packdir -type f -name "pack-*.idx" >packs.after &&
+
+	test_must_be_empty packs.before &&
+	test_line_count = 1 packs.after &&
+
+	git show-index <$(cat packs.after) >objects &&
+	test_line_count = 2 objects &&
+	grep "^[1-9][0-9]* $tree" objects &&
+	grep "^[1-9][0-9]* $blob" objects
+'
+
+test_expect_success 'merge-tree can write multiple packs with --write-pack' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		git config pack.packSizeLimit 512 &&
+
+		test_seq 512 >f &&
+
+		# "f" contains roughly ~2,000 bytes.
+		#
+		# Each side ("foo" and "bar") adds a small amount of data at the
+		# beginning and end of "base", respectively.
+		git add f &&
+		test_tick &&
+		git commit -m base &&
+		git branch -M main &&
+
+		git checkout -b foo main &&
+		{
+			echo foo && cat f
+		} >f.tmp &&
+		mv f.tmp f &&
+		git add f &&
+		test_tick &&
+		git commit -m foo &&
+
+		git checkout -b bar main &&
+		echo bar >>f &&
+		git add f &&
+		test_tick &&
+		git commit -m bar &&
+
+		find $packdir -type f -name "pack-*.idx" >packs.before &&
+		# Merging either side should result in a new object which is
+		# larger than 1M, thus the result should be split into two
+		# separate packs.
+		tree="$(git merge-tree --write-pack \
+			refs/heads/foo refs/heads/bar)" &&
+		blob="$(git rev-parse $tree:f)" &&
+		find $packdir -type f -name "pack-*.idx" >packs.after &&
+
+		test_must_be_empty packs.before &&
+		test_line_count = 2 packs.after &&
+		for idx in $(cat packs.after)
+		do
+			git show-index <$idx || return 1
+		done >objects &&
+
+		# The resulting set of packs should contain one copy of both
+		# objects, each in a separate pack.
+		test_line_count = 2 objects &&
+		grep "^[1-9][0-9]* $tree" objects &&
+		grep "^[1-9][0-9]* $blob" objects
+
+	)
+'
+
 test_done
-- 
2.42.0.425.g963d08ddb3.dirty

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 0/5] merge-ort: implement support for packing objects together
  2023-10-23 22:44 ` [PATCH v5 0/5] " Taylor Blau
                     ` (4 preceding siblings ...)
  2023-10-23 22:45   ` [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack` Taylor Blau
@ 2023-10-23 23:31   ` Junio C Hamano
  2023-11-06 15:46     ` Johannes Schindelin
  2023-10-25  7:58   ` [PATCH v5 0/5] merge-ort: implement support for packing objects together Patrick Steinhardt
  6 siblings, 1 reply; 63+ messages in thread
From: Junio C Hamano @ 2023-10-23 23:31 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Elijah Newren, Eric W. Biederman, Jeff King, Patrick Steinhardt

Taylor Blau <me@ttaylorr.com> writes:

> But I
> think that this approach ended up being less heavy-weight than I had
> originally imagined, so I think that this version is a worthwhile
> improvement over v4.

;-).

This version is a good place to stop, a bit short of going full OO.
Nicely done.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 4/7] bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source`
  2023-10-23 18:58     ` Jeff King
@ 2023-10-24  6:34       ` Patrick Steinhardt
  2023-10-24 17:08         ` Junio C Hamano
  0 siblings, 1 reply; 63+ messages in thread
From: Patrick Steinhardt @ 2023-10-24  6:34 UTC (permalink / raw)
  To: Jeff King
  Cc: Taylor Blau, git, Elijah Newren, Eric W. Biederman, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 1213 bytes --]

On Mon, Oct 23, 2023 at 02:58:42PM -0400, Jeff King wrote:
> On Mon, Oct 23, 2023 at 11:19:13AM +0200, Patrick Steinhardt wrote:
> 
> > > +	case SOURCE_INCORE:
> > > +		assert(source->read <= source->size);
> > 
> > Is there any guideline around when to use `assert()` vs `BUG()`? I think
> > that this assertion here is quite critical, because when it does not
> > hold we can end up performing out-of-bounds reads and writes. But as
> > asserts are typically missing in non-debug builds, this safeguard would
> > not do anything for our end users, right?
> 
> I don't think we have a written guideline. My philosophy is: always use
> BUG(), because you will never be surprised that the assertion was not
> compiled in (and I think compiling without assertions is almost
> certainly premature optimization, especially given the way we tend to
> use them).
> 
> -Peff

I'm inclined to agree with your philosophy. Makes me wonder whether we
should write a Coccinelle rule to catch this. But a quick-and-dirty grep
in our codebase shows that such a rule would cause quite a lot of churn:

$ git grep BUG\( | wc -l
677
$ git grep assert\( | wc -l
549

Probably not worth it.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v4 4/7] bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source`
  2023-10-24  6:34       ` Patrick Steinhardt
@ 2023-10-24 17:08         ` Junio C Hamano
  0 siblings, 0 replies; 63+ messages in thread
From: Junio C Hamano @ 2023-10-24 17:08 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Jeff King, Taylor Blau, git, Elijah Newren, Eric W. Biederman

Patrick Steinhardt <ps@pks.im> writes:

> I'm inclined to agree with your philosophy. Makes me wonder whether we
> should write a Coccinelle rule to catch this. But a quick-and-dirty grep
> in our codebase shows that such a rule would cause quite a lot of churn:
>
> $ git grep BUG\( | wc -l
> 677
> $ git grep assert\( | wc -l
> 549
>
> Probably not worth it.

Yeah, we can stick to our usual "do not add new instances, fix them
while touching near-by code" pattern for this one, I would say.

Thanks.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 1/5] bulk-checkin: extract abstract `bulk_checkin_source`
  2023-10-23 22:44   ` [PATCH v5 1/5] bulk-checkin: extract abstract `bulk_checkin_source` Taylor Blau
@ 2023-10-25  7:37     ` Jeff King
  2023-10-25 15:39       ` Taylor Blau
  2023-10-27 23:12       ` Junio C Hamano
  0 siblings, 2 replies; 63+ messages in thread
From: Jeff King @ 2023-10-25  7:37 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Elijah Newren, Eric W. Biederman, Junio C Hamano,
	Patrick Steinhardt

On Mon, Oct 23, 2023 at 06:44:56PM -0400, Taylor Blau wrote:

> +struct bulk_checkin_source {
> +	off_t (*read)(struct bulk_checkin_source *, void *, size_t);
> +	off_t (*seek)(struct bulk_checkin_source *, off_t);
> +
> +	union {
> +		struct {
> +			int fd;
> +		} from_fd;
> +	} data;
> +
> +	size_t size;
> +	const char *path;
> +};

The virtual functions combined with the union are a funny mix of
object-oriented and procedural code. The bulk_checkin_source has
totally virtualized functions, but knows about all of the ancillary data
each set of virtualized functions might want. ;)

I think the more pure OO version would embed the parent, and have each
concrete type define its own struct type, like:

  struct bulk_checkin_source_fd {
	struct bulk_checkin_source src;
	int fd;
  };

That works great if the code which constructs it knows which concrete
type it wants, and can just do:

  struct bulk_checkin_source_fd src;
  init_bulk_checkin_source_from_fd(&src, ...);

If even the construction is somewhat virtualized, then you are stuck
with heap constructors like:

  struct bulk_checkin_source *bulk_checkin_source_from_fd(...);

Not too bad, but you have to remember to free now.

Alternatively, I think some of our other OO code just leaves room for
a type-specific void pointer, like:

  struct bulk_checkin_source {
	...the usual stuff...

	void *magic_type_data;
  };

and then the init_bulk_checkin_source_from_fd() function allocates its
own heap struct for the magic_type_data field and sticks the int in
there.

That said, both of those are a lot more annoying to use in C (more
boilerplate, more casting, and more opportunities to get something
wrong, including leaks). So I don't mind this in-between state. It is a
funny layering violating from an OO standpoint, but it's not like we
expect an unbounded set of concrete types to "inherit" from the source
struct.

-Peff

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 3/5] bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
  2023-10-23 22:45   ` [PATCH v5 3/5] bulk-checkin: introduce `index_blob_bulk_checkin_incore()` Taylor Blau
@ 2023-10-25  7:58     ` Patrick Steinhardt
  2023-10-25 15:44       ` Taylor Blau
  0 siblings, 1 reply; 63+ messages in thread
From: Patrick Steinhardt @ 2023-10-25  7:58 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 6245 bytes --]

On Mon, Oct 23, 2023 at 06:45:01PM -0400, Taylor Blau wrote:
> Introduce `index_blob_bulk_checkin_incore()` which allows streaming
> arbitrary blob contents from memory into the bulk-checkin pack.
> 
> In order to support streaming from a location in memory, we must
> implement a new kind of bulk_checkin_source that does just that. These
> implementation in spread out across:

Nit: the commit message is a bit off here. Probably not worth a reroll
though.

>   - init_bulk_checkin_source_incore()
>   - bulk_checkin_source_read_incore()
>   - bulk_checkin_source_seek_incore()
> 
> Note that, unlike file descriptors, which manage their own offset
> internally, we have to keep track of how many bytes we've read out of
> the buffer, and make sure we don't read past the end of the buffer.
> 
> This will be useful in a couple of more commits in order to provide the
> `merge-tree` builtin with a mechanism to create a new pack containing
> any objects it created during the merge, instead of storing those
> objects individually as loose.
> 
> Similar to the existing `index_blob_bulk_checkin()` function, the
> entrypoint delegates to `deflate_obj_to_pack_incore()`. That function in
> turn delegates to deflate_obj_to_pack(), which is responsible for
> formatting the pack header and then deflating the contents into the
> pack.
> 
> Consistent with the rest of the bulk-checkin mechanism, there are no
> direct tests here. In future commits when we expose this new
> functionality via the `merge-tree` builtin, we will test it indirectly
> there.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  bulk-checkin.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  bulk-checkin.h |  4 +++
>  2 files changed, 79 insertions(+)
> 
> diff --git a/bulk-checkin.c b/bulk-checkin.c
> index 79776e679e..b728210bc7 100644
> --- a/bulk-checkin.c
> +++ b/bulk-checkin.c
> @@ -148,6 +148,10 @@ struct bulk_checkin_source {
>  		struct {
>  			int fd;
>  		} from_fd;
> +		struct {
> +			const void *buf;
> +			size_t nr_read;
> +		} incore;
>  	} data;
>  
>  	size_t size;
> @@ -166,6 +170,36 @@ static off_t bulk_checkin_source_seek_from_fd(struct bulk_checkin_source *source
>  	return lseek(source->data.from_fd.fd, offset, SEEK_SET);
>  }
>  
> +static off_t bulk_checkin_source_read_incore(struct bulk_checkin_source *source,
> +					     void *buf, size_t nr)
> +{
> +	const unsigned char *src = source->data.incore.buf;
> +
> +	if (source->data.incore.nr_read > source->size)
> +		BUG("read beyond bulk-checkin source buffer end "
> +		    "(%"PRIuMAX" > %"PRIuMAX")",
> +		    (uintmax_t)source->data.incore.nr_read,
> +		    (uintmax_t)source->size);
> +
> +	if (nr > source->size - source->data.incore.nr_read)
> +		nr = source->size - source->data.incore.nr_read;
> +
> +	src += source->data.incore.nr_read;
> +
> +	memcpy(buf, src, nr);
> +	source->data.incore.nr_read += nr;
> +	return nr;
> +}
> +
> +static off_t bulk_checkin_source_seek_incore(struct bulk_checkin_source *source,
> +					     off_t offset)
> +{
> +	if (!(0 <= offset && offset < source->size))
> +		return (off_t)-1;

At the risk of showing my own ignorance, but why is the cast here
necessary?

Patrick

> +	source->data.incore.nr_read = offset;
> +	return source->data.incore.nr_read;
> +}
> +
>  static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
>  					     int fd, size_t size,
>  					     const char *path)
> @@ -181,6 +215,22 @@ static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
>  	source->path = path;
>  }
>  
> +static void init_bulk_checkin_source_incore(struct bulk_checkin_source *source,
> +					    const void *buf, size_t size,
> +					    const char *path)
> +{
> +	memset(source, 0, sizeof(struct bulk_checkin_source));
> +
> +	source->read = bulk_checkin_source_read_incore;
> +	source->seek = bulk_checkin_source_seek_incore;
> +
> +	source->data.incore.buf = buf;
> +	source->data.incore.nr_read = 0;
> +
> +	source->size = size;
> +	source->path = path;
> +}
> +
>  /*
>   * Read the contents from 'source' for 'size' bytes, streaming it to the
>   * packfile in state while updating the hash in ctx. Signal a failure
> @@ -359,6 +409,19 @@ static int deflate_obj_to_pack(struct bulk_checkin_packfile *state,
>  	return 0;
>  }
>  
> +static int deflate_obj_to_pack_incore(struct bulk_checkin_packfile *state,
> +				       struct object_id *result_oid,
> +				       const void *buf, size_t size,
> +				       const char *path, enum object_type type,
> +				       unsigned flags)
> +{
> +	struct bulk_checkin_source source;
> +
> +	init_bulk_checkin_source_incore(&source, buf, size, path);
> +
> +	return deflate_obj_to_pack(state, result_oid, &source, type, 0, flags);
> +}
> +
>  static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>  				struct object_id *result_oid,
>  				int fd, size_t size,
> @@ -421,6 +484,18 @@ int index_blob_bulk_checkin(struct object_id *oid,
>  	return status;
>  }
>  
> +int index_blob_bulk_checkin_incore(struct object_id *oid,
> +				   const void *buf, size_t size,
> +				   const char *path, unsigned flags)
> +{
> +	int status = deflate_obj_to_pack_incore(&bulk_checkin_packfile, oid,
> +						buf, size, path, OBJ_BLOB,
> +						flags);
> +	if (!odb_transaction_nesting)
> +		flush_bulk_checkin_packfile(&bulk_checkin_packfile);
> +	return status;
> +}
> +
>  void begin_odb_transaction(void)
>  {
>  	odb_transaction_nesting += 1;
> diff --git a/bulk-checkin.h b/bulk-checkin.h
> index aa7286a7b3..1b91daeaee 100644
> --- a/bulk-checkin.h
> +++ b/bulk-checkin.h
> @@ -13,6 +13,10 @@ int index_blob_bulk_checkin(struct object_id *oid,
>  			    int fd, size_t size,
>  			    const char *path, unsigned flags);
>  
> +int index_blob_bulk_checkin_incore(struct object_id *oid,
> +				   const void *buf, size_t size,
> +				   const char *path, unsigned flags);
> +
>  /*
>   * Tell the object database to optimize for adding
>   * multiple objects. end_odb_transaction must be called
> -- 
> 2.42.0.425.g963d08ddb3.dirty
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
  2023-10-23 22:45   ` [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack` Taylor Blau
@ 2023-10-25  7:58     ` Patrick Steinhardt
  2023-10-25 15:46       ` Taylor Blau
  2023-11-10 23:51     ` Elijah Newren
  1 sibling, 1 reply; 63+ messages in thread
From: Patrick Steinhardt @ 2023-10-25  7:58 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 11501 bytes --]

On Mon, Oct 23, 2023 at 06:45:06PM -0400, Taylor Blau wrote:
> When using merge-tree often within a repository[^1], it is possible to
> generate a relatively large number of loose objects, which can result in
> degraded performance, and inode exhaustion in extreme cases.
> 
> Building on the functionality introduced in previous commits, the
> bulk-checkin machinery now has support to write arbitrary blob and tree
> objects which are small enough to be held in-core. We can use this to
> write any blob/tree objects generated by ORT into a separate pack
> instead of writing them out individually as loose.
> 
> This functionality is gated behind a new `--write-pack` option to
> `merge-tree` that works with the (non-deprecated) `--write-tree` mode.
> 
> The implementation is relatively straightforward. There are two spots
> within the ORT mechanism where we call `write_object_file()`, one for
> content differences within blobs, and another to assemble any new trees
> necessary to construct the merge. In each of those locations,
> conditionally replace calls to `write_object_file()` with
> `index_blob_bulk_checkin_incore()` or `index_tree_bulk_checkin_incore()`
> depending on which kind of object we are writing.
> 
> The only remaining task is to begin and end the transaction necessary to
> initialize the bulk-checkin machinery, and move any new pack(s) it
> created into the main object store.
> 
> [^1]: Such is the case at GitHub, where we run presumptive "test merges"
>   on open pull requests to see whether or not we can light up the merge
>   button green depending on whether or not the presumptive merge was
>   conflicted.
> 
>   This is done in response to a number of user-initiated events,
>   including viewing an open pull request whose last test merge is stale
>   with respect to the current base and tip of the pull request. As a
>   result, merge-tree can be run very frequently on large, active
>   repositories.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  Documentation/git-merge-tree.txt |  4 ++
>  builtin/merge-tree.c             |  5 ++
>  merge-ort.c                      | 42 +++++++++++----
>  merge-recursive.h                |  1 +
>  t/t4301-merge-tree-write-tree.sh | 93 ++++++++++++++++++++++++++++++++
>  5 files changed, 136 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/git-merge-tree.txt b/Documentation/git-merge-tree.txt
> index ffc4fbf7e8..9d37609ef1 100644
> --- a/Documentation/git-merge-tree.txt
> +++ b/Documentation/git-merge-tree.txt
> @@ -69,6 +69,10 @@ OPTIONS
>  	specify a merge-base for the merge, and specifying multiple bases is
>  	currently not supported. This option is incompatible with `--stdin`.
>  
> +--write-pack::
> +	Write any new objects into a separate packfile instead of as
> +	individual loose objects.
> +
>  [[OUTPUT]]
>  OUTPUT
>  ------
> diff --git a/builtin/merge-tree.c b/builtin/merge-tree.c
> index a35e0452d6..218442ac9b 100644
> --- a/builtin/merge-tree.c
> +++ b/builtin/merge-tree.c
> @@ -19,6 +19,7 @@
>  #include "tree.h"
>  #include "config.h"
>  #include "strvec.h"
> +#include "bulk-checkin.h"
>  
>  static int line_termination = '\n';
>  
> @@ -416,6 +417,7 @@ struct merge_tree_options {
>  	int name_only;
>  	int use_stdin;
>  	struct merge_options merge_options;
> +	int write_pack;
>  };
>  
>  static int real_merge(struct merge_tree_options *o,
> @@ -441,6 +443,7 @@ static int real_merge(struct merge_tree_options *o,
>  				 _("not something we can merge"));
>  
>  	opt.show_rename_progress = 0;
> +	opt.write_pack = o->write_pack;
>  
>  	opt.branch1 = branch1;
>  	opt.branch2 = branch2;
> @@ -553,6 +556,8 @@ int cmd_merge_tree(int argc, const char **argv, const char *prefix)
>  			   N_("specify a merge-base for the merge")),
>  		OPT_STRVEC('X', "strategy-option", &xopts, N_("option=value"),
>  			N_("option for selected merge strategy")),
> +		OPT_BOOL(0, "write-pack", &o.write_pack,
> +			 N_("write new objects to a pack instead of as loose")),
>  		OPT_END()
>  	};
>  
> diff --git a/merge-ort.c b/merge-ort.c
> index 3653725661..523577d71e 100644
> --- a/merge-ort.c
> +++ b/merge-ort.c
> @@ -48,6 +48,7 @@
>  #include "tree.h"
>  #include "unpack-trees.h"
>  #include "xdiff-interface.h"
> +#include "bulk-checkin.h"
>  
>  /*
>   * We have many arrays of size 3.  Whenever we have such an array, the
> @@ -2108,10 +2109,19 @@ static int handle_content_merge(struct merge_options *opt,
>  		if ((merge_status < 0) || !result_buf.ptr)
>  			ret = error(_("failed to execute internal merge"));
>  
> -		if (!ret &&
> -		    write_object_file(result_buf.ptr, result_buf.size,
> -				      OBJ_BLOB, &result->oid))
> -			ret = error(_("unable to add %s to database"), path);
> +		if (!ret) {
> +			ret = opt->write_pack
> +				? index_blob_bulk_checkin_incore(&result->oid,
> +								 result_buf.ptr,
> +								 result_buf.size,
> +								 path, 1)
> +				: write_object_file(result_buf.ptr,
> +						    result_buf.size,
> +						    OBJ_BLOB, &result->oid);
> +			if (ret)
> +				ret = error(_("unable to add %s to database"),
> +					    path);
> +		}
>  
>  		free(result_buf.ptr);
>  		if (ret)
> @@ -3597,7 +3607,8 @@ static int tree_entry_order(const void *a_, const void *b_)
>  				 b->string, strlen(b->string), bmi->result.mode);
>  }
>  
> -static int write_tree(struct object_id *result_oid,
> +static int write_tree(struct merge_options *opt,
> +		      struct object_id *result_oid,
>  		      struct string_list *versions,
>  		      unsigned int offset,
>  		      size_t hash_size)
> @@ -3631,8 +3642,14 @@ static int write_tree(struct object_id *result_oid,
>  	}
>  
>  	/* Write this object file out, and record in result_oid */
> -	if (write_object_file(buf.buf, buf.len, OBJ_TREE, result_oid))
> +	ret = opt->write_pack
> +		? index_tree_bulk_checkin_incore(result_oid,
> +						 buf.buf, buf.len, "", 1)
> +		: write_object_file(buf.buf, buf.len, OBJ_TREE, result_oid);
> +
> +	if (ret)
>  		ret = -1;
> +
>  	strbuf_release(&buf);
>  	return ret;
>  }
> @@ -3797,8 +3814,8 @@ static int write_completed_directory(struct merge_options *opt,
>  		 */
>  		dir_info->is_null = 0;
>  		dir_info->result.mode = S_IFDIR;
> -		if (write_tree(&dir_info->result.oid, &info->versions, offset,
> -			       opt->repo->hash_algo->rawsz) < 0)
> +		if (write_tree(opt, &dir_info->result.oid, &info->versions,
> +			       offset, opt->repo->hash_algo->rawsz) < 0)
>  			ret = -1;
>  	}
>  
> @@ -4332,9 +4349,13 @@ static int process_entries(struct merge_options *opt,
>  		fflush(stdout);
>  		BUG("dir_metadata accounting completely off; shouldn't happen");
>  	}
> -	if (write_tree(result_oid, &dir_metadata.versions, 0,
> +	if (write_tree(opt, result_oid, &dir_metadata.versions, 0,
>  		       opt->repo->hash_algo->rawsz) < 0)
>  		ret = -1;
> +
> +	if (opt->write_pack)
> +		end_odb_transaction();
> +
>  cleanup:
>  	string_list_clear(&plist, 0);
>  	string_list_clear(&dir_metadata.versions, 0);
> @@ -4878,6 +4899,9 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
>  	 */
>  	strmap_init(&opt->priv->conflicts);
>  
> +	if (opt->write_pack)
> +		begin_odb_transaction();
> +
>  	trace2_region_leave("merge", "allocate/init", opt->repo);
>  }
>  
> diff --git a/merge-recursive.h b/merge-recursive.h
> index 3d3b3e3c29..5c5ff380a8 100644
> --- a/merge-recursive.h
> +++ b/merge-recursive.h
> @@ -48,6 +48,7 @@ struct merge_options {
>  	unsigned renormalize : 1;
>  	unsigned record_conflict_msgs_as_headers : 1;
>  	const char *msg_header_prefix;
> +	unsigned write_pack : 1;
>  
>  	/* internal fields used by the implementation */
>  	struct merge_options_internal *priv;
> diff --git a/t/t4301-merge-tree-write-tree.sh b/t/t4301-merge-tree-write-tree.sh
> index b2c8a43fce..d2a8634523 100755
> --- a/t/t4301-merge-tree-write-tree.sh
> +++ b/t/t4301-merge-tree-write-tree.sh
> @@ -945,4 +945,97 @@ test_expect_success 'check the input format when --stdin is passed' '
>  	test_cmp expect actual
>  '
>  
> +packdir=".git/objects/pack"
> +
> +test_expect_success 'merge-tree can pack its result with --write-pack' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +
> +	# base has lines [3, 4, 5]
> +	#   - side adds to the beginning, resulting in [1, 2, 3, 4, 5]
> +	#   - other adds to the end, resulting in [3, 4, 5, 6, 7]
> +	#
> +	# merging the two should result in a new blob object containing
> +	# [1, 2, 3, 4, 5, 6, 7], along with a new tree.
> +	test_commit -C repo base file "$(test_seq 3 5)" &&
> +	git -C repo branch -M main &&
> +	git -C repo checkout -b side main &&
> +	test_commit -C repo side file "$(test_seq 1 5)" &&
> +	git -C repo checkout -b other main &&
> +	test_commit -C repo other file "$(test_seq 3 7)" &&
> +
> +	find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
> +	tree="$(git -C repo merge-tree --write-pack \
> +		refs/tags/side refs/tags/other)" &&
> +	blob="$(git -C repo rev-parse $tree:file)" &&
> +	find repo/$packdir -type f -name "pack-*.idx" >packs.after &&

While we do assert that we write a new packfile, we don't assert whether
parts of the written object may have been written as loose objects. Do
we want to tighten the checks to verify that?

Patrick

> +	test_must_be_empty packs.before &&
> +	test_line_count = 1 packs.after &&
> +
> +	git show-index <$(cat packs.after) >objects &&
> +	test_line_count = 2 objects &&
> +	grep "^[1-9][0-9]* $tree" objects &&
> +	grep "^[1-9][0-9]* $blob" objects
> +'
> +
> +test_expect_success 'merge-tree can write multiple packs with --write-pack' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		git config pack.packSizeLimit 512 &&
> +
> +		test_seq 512 >f &&
> +
> +		# "f" contains roughly ~2,000 bytes.
> +		#
> +		# Each side ("foo" and "bar") adds a small amount of data at the
> +		# beginning and end of "base", respectively.
> +		git add f &&
> +		test_tick &&
> +		git commit -m base &&
> +		git branch -M main &&
> +
> +		git checkout -b foo main &&
> +		{
> +			echo foo && cat f
> +		} >f.tmp &&
> +		mv f.tmp f &&
> +		git add f &&
> +		test_tick &&
> +		git commit -m foo &&
> +
> +		git checkout -b bar main &&
> +		echo bar >>f &&
> +		git add f &&
> +		test_tick &&
> +		git commit -m bar &&
> +
> +		find $packdir -type f -name "pack-*.idx" >packs.before &&
> +		# Merging either side should result in a new object which is
> +		# larger than 1M, thus the result should be split into two
> +		# separate packs.
> +		tree="$(git merge-tree --write-pack \
> +			refs/heads/foo refs/heads/bar)" &&
> +		blob="$(git rev-parse $tree:f)" &&
> +		find $packdir -type f -name "pack-*.idx" >packs.after &&
> +
> +		test_must_be_empty packs.before &&
> +		test_line_count = 2 packs.after &&
> +		for idx in $(cat packs.after)
> +		do
> +			git show-index <$idx || return 1
> +		done >objects &&
> +
> +		# The resulting set of packs should contain one copy of both
> +		# objects, each in a separate pack.
> +		test_line_count = 2 objects &&
> +		grep "^[1-9][0-9]* $tree" objects &&
> +		grep "^[1-9][0-9]* $blob" objects
> +
> +	)
> +'
> +
>  test_done
> -- 
> 2.42.0.425.g963d08ddb3.dirty

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 0/5] merge-ort: implement support for packing objects together
  2023-10-23 22:44 ` [PATCH v5 0/5] " Taylor Blau
                     ` (5 preceding siblings ...)
  2023-10-23 23:31   ` [PATCH v5 0/5] merge-ort: implement support for packing objects together Junio C Hamano
@ 2023-10-25  7:58   ` Patrick Steinhardt
  6 siblings, 0 replies; 63+ messages in thread
From: Patrick Steinhardt @ 2023-10-25  7:58 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 18764 bytes --]

On Mon, Oct 23, 2023 at 06:44:49PM -0400, Taylor Blau wrote:
> (Rebased onto the current tip of 'master', which is ceadf0f3cf (The
> twentieth batch, 2023-10-20)).
> 
> This series implements support for a new merge-tree option,
> `--write-pack`, which causes any newly-written objects to be packed
> together instead of being stored individually as loose.
> 
> This is a minor follow-up that could be taken instead of v4 (though the
> changes between these two most recent rounds are stylistic and a matter
> of subjective opinion).
> 
> This moves us the bulk_checkin_source structure introduced in response
> to Junio's suggestion during the last round further in the OOP
> direction. Instead of switching on the enum type of the source, have
> function pointers for read() and seek() respectively.
> 
> The functionality at the end is the same, but this avoids some of the
> namespacing issues that Peff pointed out while looking at v4. But I
> think that this approach ended up being less heavy-weight than I had
> originally imagined, so I think that this version is a worthwhile
> improvement over v4.
> 
> Beyond that, the changes since last time can be viewed in the range-diff
> below. Thanks in advance for any review!

Overall this version looks good to me. I've only got two smallish nits
and one question regarding the tests.

Thanks!

Patrick

> [1]: https://lore.kernel.org/git/xmqq34y7plj4.fsf@gitster.g/
> 
> Taylor Blau (5):
>   bulk-checkin: extract abstract `bulk_checkin_source`
>   bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types
>   bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
>   bulk-checkin: introduce `index_tree_bulk_checkin_incore()`
>   builtin/merge-tree.c: implement support for `--write-pack`
> 
>  Documentation/git-merge-tree.txt |   4 +
>  builtin/merge-tree.c             |   5 +
>  bulk-checkin.c                   | 197 +++++++++++++++++++++++++++----
>  bulk-checkin.h                   |   8 ++
>  merge-ort.c                      |  42 +++++--
>  merge-recursive.h                |   1 +
>  t/t4301-merge-tree-write-tree.sh |  93 +++++++++++++++
>  7 files changed, 316 insertions(+), 34 deletions(-)
> 
> Range-diff against v4:
> 1:  97bb6e9f59 ! 1:  696aa027e4 bulk-checkin: extract abstract `bulk_checkin_source`
>     @@ bulk-checkin.c: static int already_written(struct bulk_checkin_packfile *state,
>       }
>       
>      +struct bulk_checkin_source {
>     -+	enum { SOURCE_FILE } type;
>     ++	off_t (*read)(struct bulk_checkin_source *, void *, size_t);
>     ++	off_t (*seek)(struct bulk_checkin_source *, off_t);
>      +
>     -+	/* SOURCE_FILE fields */
>     -+	int fd;
>     ++	union {
>     ++		struct {
>     ++			int fd;
>     ++		} from_fd;
>     ++	} data;
>      +
>     -+	/* common fields */
>      +	size_t size;
>      +	const char *path;
>      +};
>      +
>     -+static off_t bulk_checkin_source_seek_to(struct bulk_checkin_source *source,
>     -+					 off_t offset)
>     ++static off_t bulk_checkin_source_read_from_fd(struct bulk_checkin_source *source,
>     ++					      void *buf, size_t nr)
>      +{
>     -+	switch (source->type) {
>     -+	case SOURCE_FILE:
>     -+		return lseek(source->fd, offset, SEEK_SET);
>     -+	default:
>     -+		BUG("unknown bulk-checkin source: %d", source->type);
>     -+	}
>     ++	return read_in_full(source->data.from_fd.fd, buf, nr);
>      +}
>      +
>     -+static ssize_t bulk_checkin_source_read(struct bulk_checkin_source *source,
>     -+					void *buf, size_t nr)
>     ++static off_t bulk_checkin_source_seek_from_fd(struct bulk_checkin_source *source,
>     ++					      off_t offset)
>      +{
>     -+	switch (source->type) {
>     -+	case SOURCE_FILE:
>     -+		return read_in_full(source->fd, buf, nr);
>     -+	default:
>     -+		BUG("unknown bulk-checkin source: %d", source->type);
>     -+	}
>     ++	return lseek(source->data.from_fd.fd, offset, SEEK_SET);
>     ++}
>     ++
>     ++static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
>     ++					     int fd, size_t size,
>     ++					     const char *path)
>     ++{
>     ++	memset(source, 0, sizeof(struct bulk_checkin_source));
>     ++
>     ++	source->read = bulk_checkin_source_read_from_fd;
>     ++	source->seek = bulk_checkin_source_seek_from_fd;
>     ++
>     ++	source->data.from_fd.fd = fd;
>     ++
>     ++	source->size = size;
>     ++	source->path = path;
>      +}
>      +
>       /*
>     @@ bulk-checkin.c: static int stream_blob_to_pack(struct bulk_checkin_packfile *sta
>      -			ssize_t read_result = read_in_full(fd, ibuf, rsize);
>      +			ssize_t read_result;
>      +
>     -+			read_result = bulk_checkin_source_read(source, ibuf,
>     -+							       rsize);
>     ++			read_result = source->read(source, ibuf, rsize);
>       			if (read_result < 0)
>      -				die_errno("failed to read from '%s'", path);
>      +				die_errno("failed to read from '%s'",
>     @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *st
>       	unsigned header_len;
>       	struct hashfile_checkpoint checkpoint = {0};
>       	struct pack_idx_entry *idx = NULL;
>     -+	struct bulk_checkin_source source = {
>     -+		.type = SOURCE_FILE,
>     -+		.fd = fd,
>     -+		.size = size,
>     -+		.path = path,
>     -+	};
>     ++	struct bulk_checkin_source source;
>     ++
>     ++	init_bulk_checkin_source_from_fd(&source, fd, size, path);
>       
>       	seekback = lseek(fd, 0, SEEK_CUR);
>       	if (seekback == (off_t) -1)
>     @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *st
>       		state->offset = checkpoint.offset;
>       		flush_bulk_checkin_packfile(state);
>      -		if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
>     -+		if (bulk_checkin_source_seek_to(&source, seekback) == (off_t)-1)
>     ++		if (source.seek(&source, seekback) == (off_t)-1)
>       			return error("cannot seek back");
>       	}
>       	the_hash_algo->final_oid_fn(result_oid, &ctx);
> 2:  9d633df339 < -:  ---------- bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types
> 3:  d5bbd7810e ! 2:  596bd028a7 bulk-checkin: refactor deflate routine to accept a `bulk_checkin_source`
>     @@ Metadata
>      Author: Taylor Blau <me@ttaylorr.com>
>      
>       ## Commit message ##
>     -    bulk-checkin: refactor deflate routine to accept a `bulk_checkin_source`
>     +    bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types
>      
>     -    Prepare for a future change where we will want to use a routine very
>     -    similar to the existing `deflate_blob_to_pack()` but over arbitrary
>     -    sources (i.e. either open file-descriptors, or a location in memory).
>     +    The existing `stream_blob_to_pack()` function is named based on the fact
>     +    that it knows only how to stream blobs into a bulk-checkin pack.
>      
>     -    Extract out a common "deflate_obj_to_pack()" routine that acts on a
>     -    bulk_checkin_source, instead of a (int, size_t) pair. Then rewrite
>     -    `deflate_blob_to_pack()` in terms of it.
>     +    But there is no longer anything in this function which prevents us from
>     +    writing objects of arbitrary types to the bulk-checkin pack. Prepare to
>     +    write OBJ_TREEs by removing this assumption, adding an `enum
>     +    object_type` parameter to this function's argument list, and renaming it
>     +    to `stream_obj_to_pack()` accordingly.
>      
>          Signed-off-by: Taylor Blau <me@ttaylorr.com>
>      
>       ## bulk-checkin.c ##
>     +@@ bulk-checkin.c: static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
>     +  * status before calling us just in case we ask it to call us again
>     +  * with a new pack.
>     +  */
>     +-static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
>     +-			       git_hash_ctx *ctx, off_t *already_hashed_to,
>     +-			       struct bulk_checkin_source *source,
>     +-			       unsigned flags)
>     ++static int stream_obj_to_pack(struct bulk_checkin_packfile *state,
>     ++			      git_hash_ctx *ctx, off_t *already_hashed_to,
>     ++			      struct bulk_checkin_source *source,
>     ++			      enum object_type type, unsigned flags)
>     + {
>     + 	git_zstream s;
>     + 	unsigned char ibuf[16384];
>     +@@ bulk-checkin.c: static int stream_blob_to_pack(struct bulk_checkin_packfile *state,
>     + 
>     + 	git_deflate_init(&s, pack_compression_level);
>     + 
>     +-	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), OBJ_BLOB,
>     +-					      size);
>     ++	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), type, size);
>     + 	s.next_out = obuf + hdrlen;
>     + 	s.avail_out = sizeof(obuf) - hdrlen;
>     + 
>      @@ bulk-checkin.c: static void prepare_to_stream(struct bulk_checkin_packfile *state,
>       		die_errno("unable to write pack header");
>       }
>     @@ bulk-checkin.c: static void prepare_to_stream(struct bulk_checkin_packfile *stat
>       	unsigned header_len;
>       	struct hashfile_checkpoint checkpoint = {0};
>       	struct pack_idx_entry *idx = NULL;
>     --	struct bulk_checkin_source source = {
>     --		.type = SOURCE_FILE,
>     --		.fd = fd,
>     --		.size = size,
>     --		.path = path,
>     --	};
>     +-	struct bulk_checkin_source source;
>       
>     +-	init_bulk_checkin_source_from_fd(&source, fd, size, path);
>     +-
>      -	seekback = lseek(fd, 0, SEEK_CUR);
>      -	if (seekback == (off_t) -1)
>      -		return error("cannot find the current offset");
>     @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *st
>       		prepare_to_stream(state, flags);
>       		if (idx) {
>      @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>     + 			idx->offset = state->offset;
>       			crc32_begin(state->f);
>       		}
>     - 		if (!stream_obj_to_pack(state, &ctx, &already_hashed_to,
>     --					&source, OBJ_BLOB, flags))
>     +-		if (!stream_blob_to_pack(state, &ctx, &already_hashed_to,
>     +-					 &source, flags))
>     ++		if (!stream_obj_to_pack(state, &ctx, &already_hashed_to,
>      +					source, type, flags))
>       			break;
>       		/*
>     @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *st
>       		hashfile_truncate(state->f, &checkpoint);
>       		state->offset = checkpoint.offset;
>       		flush_bulk_checkin_packfile(state);
>     --		if (bulk_checkin_source_seek_to(&source, seekback) == (off_t)-1)
>     -+		if (bulk_checkin_source_seek_to(source, seekback) == (off_t)-1)
>     +-		if (source.seek(&source, seekback) == (off_t)-1)
>     ++		if (source->seek(source, seekback) == (off_t)-1)
>       			return error("cannot seek back");
>       	}
>       	the_hash_algo->final_oid_fn(result_oid, &ctx);
>     @@ bulk-checkin.c: static int deflate_blob_to_pack(struct bulk_checkin_packfile *st
>      +				int fd, size_t size,
>      +				const char *path, unsigned flags)
>      +{
>     -+	struct bulk_checkin_source source = {
>     -+		.type = SOURCE_FILE,
>     -+		.fd = fd,
>     -+		.size = size,
>     -+		.path = path,
>     -+	};
>     -+	off_t seekback = lseek(fd, 0, SEEK_CUR);
>     ++	struct bulk_checkin_source source;
>     ++	off_t seekback;
>     ++
>     ++	init_bulk_checkin_source_from_fd(&source, fd, size, path);
>     ++
>     ++	seekback = lseek(fd, 0, SEEK_CUR);
>      +	if (seekback == (off_t) -1)
>      +		return error("cannot find the current offset");
>      +
> 4:  e427fe6ad3 < -:  ---------- bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source`
> 5:  48095afe80 ! 3:  d8cf8e4395 bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
>     @@ Metadata
>       ## Commit message ##
>          bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
>      
>     -    Now that we have factored out many of the common routines necessary to
>     -    index a new object into a pack created by the bulk-checkin machinery, we
>     -    can introduce a variant of `index_blob_bulk_checkin()` that acts on
>     -    blobs whose contents we can fit in memory.
>     +    Introduce `index_blob_bulk_checkin_incore()` which allows streaming
>     +    arbitrary blob contents from memory into the bulk-checkin pack.
>     +
>     +    In order to support streaming from a location in memory, we must
>     +    implement a new kind of bulk_checkin_source that does just that. These
>     +    implementation in spread out across:
>     +
>     +      - init_bulk_checkin_source_incore()
>     +      - bulk_checkin_source_read_incore()
>     +      - bulk_checkin_source_seek_incore()
>     +
>     +    Note that, unlike file descriptors, which manage their own offset
>     +    internally, we have to keep track of how many bytes we've read out of
>     +    the buffer, and make sure we don't read past the end of the buffer.
>      
>          This will be useful in a couple of more commits in order to provide the
>          `merge-tree` builtin with a mechanism to create a new pack containing
>     @@ Commit message
>          Signed-off-by: Taylor Blau <me@ttaylorr.com>
>      
>       ## bulk-checkin.c ##
>     +@@ bulk-checkin.c: struct bulk_checkin_source {
>     + 		struct {
>     + 			int fd;
>     + 		} from_fd;
>     ++		struct {
>     ++			const void *buf;
>     ++			size_t nr_read;
>     ++		} incore;
>     + 	} data;
>     + 
>     + 	size_t size;
>     +@@ bulk-checkin.c: static off_t bulk_checkin_source_seek_from_fd(struct bulk_checkin_source *source
>     + 	return lseek(source->data.from_fd.fd, offset, SEEK_SET);
>     + }
>     + 
>     ++static off_t bulk_checkin_source_read_incore(struct bulk_checkin_source *source,
>     ++					     void *buf, size_t nr)
>     ++{
>     ++	const unsigned char *src = source->data.incore.buf;
>     ++
>     ++	if (source->data.incore.nr_read > source->size)
>     ++		BUG("read beyond bulk-checkin source buffer end "
>     ++		    "(%"PRIuMAX" > %"PRIuMAX")",
>     ++		    (uintmax_t)source->data.incore.nr_read,
>     ++		    (uintmax_t)source->size);
>     ++
>     ++	if (nr > source->size - source->data.incore.nr_read)
>     ++		nr = source->size - source->data.incore.nr_read;
>     ++
>     ++	src += source->data.incore.nr_read;
>     ++
>     ++	memcpy(buf, src, nr);
>     ++	source->data.incore.nr_read += nr;
>     ++	return nr;
>     ++}
>     ++
>     ++static off_t bulk_checkin_source_seek_incore(struct bulk_checkin_source *source,
>     ++					     off_t offset)
>     ++{
>     ++	if (!(0 <= offset && offset < source->size))
>     ++		return (off_t)-1;
>     ++	source->data.incore.nr_read = offset;
>     ++	return source->data.incore.nr_read;
>     ++}
>     ++
>     + static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
>     + 					     int fd, size_t size,
>     + 					     const char *path)
>     +@@ bulk-checkin.c: static void init_bulk_checkin_source_from_fd(struct bulk_checkin_source *source,
>     + 	source->path = path;
>     + }
>     + 
>     ++static void init_bulk_checkin_source_incore(struct bulk_checkin_source *source,
>     ++					    const void *buf, size_t size,
>     ++					    const char *path)
>     ++{
>     ++	memset(source, 0, sizeof(struct bulk_checkin_source));
>     ++
>     ++	source->read = bulk_checkin_source_read_incore;
>     ++	source->seek = bulk_checkin_source_seek_incore;
>     ++
>     ++	source->data.incore.buf = buf;
>     ++	source->data.incore.nr_read = 0;
>     ++
>     ++	source->size = size;
>     ++	source->path = path;
>     ++}
>     ++
>     + /*
>     +  * Read the contents from 'source' for 'size' bytes, streaming it to the
>     +  * packfile in state while updating the hash in ctx. Signal a failure
>      @@ bulk-checkin.c: static int deflate_obj_to_pack(struct bulk_checkin_packfile *state,
>       	return 0;
>       }
>     @@ bulk-checkin.c: static int deflate_obj_to_pack(struct bulk_checkin_packfile *sta
>      +				       const char *path, enum object_type type,
>      +				       unsigned flags)
>      +{
>     -+	struct bulk_checkin_source source = {
>     -+		.type = SOURCE_INCORE,
>     -+		.buf = buf,
>     -+		.size = size,
>     -+		.read = 0,
>     -+		.path = path,
>     -+	};
>     ++	struct bulk_checkin_source source;
>     ++
>     ++	init_bulk_checkin_source_incore(&source, buf, size, path);
>      +
>      +	return deflate_obj_to_pack(state, result_oid, &source, type, 0, flags);
>      +}
> 6:  60568f9281 = 4:  2670192802 bulk-checkin: introduce `index_tree_bulk_checkin_incore()`
> 7:  b9be9df122 ! 5:  3595db76a5 builtin/merge-tree.c: implement support for `--write-pack`
>     @@ Documentation/git-merge-tree.txt: OPTIONS
>      
>       ## builtin/merge-tree.c ##
>      @@
>     - #include "quote.h"
>       #include "tree.h"
>       #include "config.h"
>     + #include "strvec.h"
>      +#include "bulk-checkin.h"
>       
>       static int line_termination = '\n';
>       
>      @@ builtin/merge-tree.c: struct merge_tree_options {
>     - 	int show_messages;
>       	int name_only;
>       	int use_stdin;
>     + 	struct merge_options merge_options;
>      +	int write_pack;
>       };
>       
>       static int real_merge(struct merge_tree_options *o,
>      @@ builtin/merge-tree.c: static int real_merge(struct merge_tree_options *o,
>     - 	init_merge_options(&opt, the_repository);
>     + 				 _("not something we can merge"));
>       
>       	opt.show_rename_progress = 0;
>      +	opt.write_pack = o->write_pack;
>     @@ builtin/merge-tree.c: static int real_merge(struct merge_tree_options *o,
>       	opt.branch1 = branch1;
>       	opt.branch2 = branch2;
>      @@ builtin/merge-tree.c: int cmd_merge_tree(int argc, const char **argv, const char *prefix)
>     - 			   &merge_base,
>     - 			   N_("commit"),
>       			   N_("specify a merge-base for the merge")),
>     + 		OPT_STRVEC('X', "strategy-option", &xopts, N_("option=value"),
>     + 			N_("option for selected merge strategy")),
>      +		OPT_BOOL(0, "write-pack", &o.write_pack,
>      +			 N_("write new objects to a pack instead of as loose")),
>       		OPT_END()
> 
> base-commit: ceadf0f3cf51550166a387ec8508bb55e7883057
> -- 
> 2.42.0.425.g963d08ddb3.dirty

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 1/5] bulk-checkin: extract abstract `bulk_checkin_source`
  2023-10-25  7:37     ` Jeff King
@ 2023-10-25 15:39       ` Taylor Blau
  2023-10-27 23:12       ` Junio C Hamano
  1 sibling, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2023-10-25 15:39 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Elijah Newren, Eric W. Biederman, Junio C Hamano,
	Patrick Steinhardt

On Wed, Oct 25, 2023 at 03:37:36AM -0400, Jeff King wrote:
> I don't mind this in-between state. It is a funny layering violating
> from an OO standpoint, but it's not like we expect an unbounded set of
> concrete types to "inherit" from the source struct.

Yeah, this was exactly my thinking when writing up the changes for this
round. Since all of the "sub-classes" are local to the bulk-checkin.o
compilation unit, I don't have grave concerns about one implementation
peering into the details of another's.

Gotta stop somewhere ;-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 3/5] bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
  2023-10-25  7:58     ` Patrick Steinhardt
@ 2023-10-25 15:44       ` Taylor Blau
  2023-10-25 17:21         ` Eric Sunshine
  0 siblings, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2023-10-25 15:44 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano

On Wed, Oct 25, 2023 at 09:58:06AM +0200, Patrick Steinhardt wrote:
> On Mon, Oct 23, 2023 at 06:45:01PM -0400, Taylor Blau wrote:
> > Introduce `index_blob_bulk_checkin_incore()` which allows streaming
> > arbitrary blob contents from memory into the bulk-checkin pack.
> >
> > In order to support streaming from a location in memory, we must
> > implement a new kind of bulk_checkin_source that does just that. These
> > implementation in spread out across:
>
> Nit: the commit message is a bit off here. Probably not worth a reroll
> though.

Your eyes are definitely mine, because I'm not seeing where the commit
message is off! But hopefully since you already don't think it's worth a
reroll, and I'm not even sure what the issue is that we can just leave
it ;-).

> > +static off_t bulk_checkin_source_seek_incore(struct bulk_checkin_source *source,
> > +					     off_t offset)
> > +{
> > +	if (!(0 <= offset && offset < source->size))
> > +		return (off_t)-1;
>
> At the risk of showing my own ignorance, but why is the cast here
> necessary?

It's not necessary, see this godbolt example that shows two functions
doing the same thing (one with the explicit cast, one without):

  https://godbolt.org/z/f737EcGfG

But there are a couple of other spots within the bulk-checkin code
(specifically within the deflate_blob_to_pack() routine) that explicitly
cast -1 to an off_t, so I was more trying to imitate the local style.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
  2023-10-25  7:58     ` Patrick Steinhardt
@ 2023-10-25 15:46       ` Taylor Blau
  0 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2023-10-25 15:46 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Elijah Newren, Eric W. Biederman, Jeff King, Junio C Hamano

On Wed, Oct 25, 2023 at 09:58:11AM +0200, Patrick Steinhardt wrote:
> > +test_expect_success 'merge-tree can pack its result with --write-pack' '
> > +	test_when_finished "rm -rf repo" &&
> > +	git init repo &&
> > +
> > +	# base has lines [3, 4, 5]
> > +	#   - side adds to the beginning, resulting in [1, 2, 3, 4, 5]
> > +	#   - other adds to the end, resulting in [3, 4, 5, 6, 7]
> > +	#
> > +	# merging the two should result in a new blob object containing
> > +	# [1, 2, 3, 4, 5, 6, 7], along with a new tree.
> > +	test_commit -C repo base file "$(test_seq 3 5)" &&
> > +	git -C repo branch -M main &&
> > +	git -C repo checkout -b side main &&
> > +	test_commit -C repo side file "$(test_seq 1 5)" &&
> > +	git -C repo checkout -b other main &&
> > +	test_commit -C repo other file "$(test_seq 3 7)" &&
> > +
> > +	find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
> > +	tree="$(git -C repo merge-tree --write-pack \
> > +		refs/tags/side refs/tags/other)" &&
> > +	blob="$(git -C repo rev-parse $tree:file)" &&
> > +	find repo/$packdir -type f -name "pack-*.idx" >packs.after &&
>
> While we do assert that we write a new packfile, we don't assert whether
> parts of the written object may have been written as loose objects. Do
> we want to tighten the checks to verify that?

We could, but the tests in t1050 already verify this, so I'm not sure
that we would be significantly hardening our test coverage here. If you
feel strongly about it, though, I'm happy to change it up.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 3/5] bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
  2023-10-25 15:44       ` Taylor Blau
@ 2023-10-25 17:21         ` Eric Sunshine
  2023-10-26  8:16           ` Patrick Steinhardt
  2023-11-11  0:17           ` Elijah Newren
  0 siblings, 2 replies; 63+ messages in thread
From: Eric Sunshine @ 2023-10-25 17:21 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Patrick Steinhardt, git, Elijah Newren, Eric W. Biederman,
	Jeff King, Junio C Hamano

On Wed, Oct 25, 2023 at 11:44 AM Taylor Blau <me@ttaylorr.com> wrote:
> On Wed, Oct 25, 2023 at 09:58:06AM +0200, Patrick Steinhardt wrote:
> > On Mon, Oct 23, 2023 at 06:45:01PM -0400, Taylor Blau wrote:
> > > In order to support streaming from a location in memory, we must
> > > implement a new kind of bulk_checkin_source that does just that. These
> > > implementation in spread out across:
> >
> > Nit: the commit message is a bit off here. Probably not worth a reroll
> > though.
>
> Your eyes are definitely mine, because I'm not seeing where the commit
> message is off! But hopefully since you already don't think it's worth a
> reroll, and I'm not even sure what the issue is that we can just leave
> it ;-).

Perhaps:

    s/implementation in/implementations are/

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 3/5] bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
  2023-10-25 17:21         ` Eric Sunshine
@ 2023-10-26  8:16           ` Patrick Steinhardt
  2023-11-11  0:17           ` Elijah Newren
  1 sibling, 0 replies; 63+ messages in thread
From: Patrick Steinhardt @ 2023-10-26  8:16 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Taylor Blau, git, Elijah Newren, Eric W. Biederman, Jeff King,
	Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 1122 bytes --]

On Wed, Oct 25, 2023 at 01:21:50PM -0400, Eric Sunshine wrote:
> On Wed, Oct 25, 2023 at 11:44 AM Taylor Blau <me@ttaylorr.com> wrote:
> > On Wed, Oct 25, 2023 at 09:58:06AM +0200, Patrick Steinhardt wrote:
> > > On Mon, Oct 23, 2023 at 06:45:01PM -0400, Taylor Blau wrote:
> > > > In order to support streaming from a location in memory, we must
> > > > implement a new kind of bulk_checkin_source that does just that. These
> > > > implementation in spread out across:
> > >
> > > Nit: the commit message is a bit off here. Probably not worth a reroll
> > > though.
> >
> > Your eyes are definitely mine, because I'm not seeing where the commit
> > message is off! But hopefully since you already don't think it's worth a
> > reroll, and I'm not even sure what the issue is that we can just leave
> > it ;-).
> 
> Perhaps:
> 
>     s/implementation in/implementations are/

Yeah, that's what I referred to. Sorry for not pointing it out
explicitly :)

In any case, as stated before: I don't think any of my comments warrant
a reroll, and overall this patch series looks good to me.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 1/5] bulk-checkin: extract abstract `bulk_checkin_source`
  2023-10-25  7:37     ` Jeff King
  2023-10-25 15:39       ` Taylor Blau
@ 2023-10-27 23:12       ` Junio C Hamano
  1 sibling, 0 replies; 63+ messages in thread
From: Junio C Hamano @ 2023-10-27 23:12 UTC (permalink / raw)
  To: Jeff King
  Cc: Taylor Blau, git, Elijah Newren, Eric W. Biederman, Patrick Steinhardt

Jeff King <peff@peff.net> writes:

> Alternatively, I think some of our other OO code just leaves room for
> a type-specific void pointer, like:
>
>   struct bulk_checkin_source {
> 	...the usual stuff...
>
> 	void *magic_type_data;
>   };
>
> and then the init_bulk_checkin_source_from_fd() function allocates its
> own heap struct for the magic_type_data field and sticks the int in
> there.

Yup.  All the pros-and-cons makes sense.  I earlier said I found
this a good place to stop, exactly because the full OO with 

    struct specific_subclass {
	struct vtbl *vtbl;
	struct base_class common_data_specific_to_instance;
	... other instance specific data members here ...;
    }

would require us to add too many supporting infrastructure struct
types; with only a few subclasses in use, it is a bit too much to
justify.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 0/5] merge-ort: implement support for packing objects together
  2023-10-23 23:31   ` [PATCH v5 0/5] merge-ort: implement support for packing objects together Junio C Hamano
@ 2023-11-06 15:46     ` Johannes Schindelin
  2023-11-06 23:19       ` Junio C Hamano
                         ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Johannes Schindelin @ 2023-11-06 15:46 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, git, Elijah Newren, Eric W. Biederman, Jeff King,
	Patrick Steinhardt

Hi,

On Mon, 23 Oct 2023, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
>
> > But I think that this approach ended up being less heavy-weight than I
> > had originally imagined, so I think that this version is a worthwhile
> > improvement over v4.
>
> ;-).
>
> This version is a good place to stop, a bit short of going full OO.
> Nicely done.

I wonder whether a more generic approach would be more desirable, an
approach that would work for `git replay`, too, for example (where
streaming objects does not work because they need to be made available
immediately because subsequent `merge_incore_nonrecursive()` might expect
the created objects to be present)?

What I have in mind is more along Elijah's suggestion at the Contributor
Summit to use the `tmp_objdir*()` machinery. But instead of discarding the
temporary object database, the contained objects would be repacked and the
`.pack`, (maybe `.rev`) and the `.idx` file would then be moved (in that
order) before discarding the temporary object database.

This would probably need to be implemented as a new
`tmp_objdir_pack_and_migrate()` function that basically spawns
`pack-objects` and feeds it the list of generated objects, writing
directly into the non-temporary object directory, then discarding the
`tmp_objdir`.

This approach would not only more easily extend to other commands, but it
would also be less intrusive (a `tmp_objdir_*()` call to begin the
transaction before the objects are written, another one to commit the
transaction after the objects are written), and it would potentially allow
for more efficient packs to be generated.

Ciao,
Johannes

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 0/5] merge-ort: implement support for packing objects together
  2023-11-06 15:46     ` Johannes Schindelin
@ 2023-11-06 23:19       ` Junio C Hamano
  2023-11-07  3:42       ` Jeff King
  2023-11-07 15:58       ` Taylor Blau
  2 siblings, 0 replies; 63+ messages in thread
From: Junio C Hamano @ 2023-11-06 23:19 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Taylor Blau, git, Elijah Newren, Eric W. Biederman, Jeff King,
	Patrick Steinhardt

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> What I have in mind is more along Elijah's suggestion at the Contributor
> Summit to use the `tmp_objdir*()` machinery. But instead of discarding the
> temporary object database, the contained objects would be repacked and the
> `.pack`, (maybe `.rev`) and the `.idx` file would then be moved (in that
> order) before discarding the temporary object database.

That may be more involved but does indeed sound like an approach
more generally applicable.  Back when the bulk-checkin machinery was
invented, I envisioned that we would be adding annotations to
various codepaths so that object creation machinery can say "now we
are plugged, in anticipation for creating many objects at once" and
"now the flood of new object creation is done, time to wrap up" for
that kind of optimization.  

The callsites to {begin,end}_odb_transaction() functions haven't
grown beyond the original "add" and "update-index" (because the user
can add the entire working tree worth of files to the object
database), "unpack-objects" (because a fetch can bring in many
objects), and "cache-tree" (because a tree creation can cascade up
to create many objects), but I agree "merge" and "replay" are prime
candidates to benefit from the optimization of the same kind (so is
"fast-import").  They are about creating many objects at once, and
give us an opportunity for such an optimization.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 0/5] merge-ort: implement support for packing objects together
  2023-11-06 15:46     ` Johannes Schindelin
  2023-11-06 23:19       ` Junio C Hamano
@ 2023-11-07  3:42       ` Jeff King
  2023-11-07 15:58       ` Taylor Blau
  2 siblings, 0 replies; 63+ messages in thread
From: Jeff King @ 2023-11-07  3:42 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Taylor Blau, git, Elijah Newren,
	Eric W. Biederman, Patrick Steinhardt

On Mon, Nov 06, 2023 at 04:46:32PM +0100, Johannes Schindelin wrote:

> I wonder whether a more generic approach would be more desirable, an
> approach that would work for `git replay`, too, for example (where
> streaming objects does not work because they need to be made available
> immediately because subsequent `merge_incore_nonrecursive()` might expect
> the created objects to be present)?
> 
> What I have in mind is more along Elijah's suggestion at the Contributor
> Summit to use the `tmp_objdir*()` machinery. But instead of discarding the
> temporary object database, the contained objects would be repacked and the
> `.pack`, (maybe `.rev`) and the `.idx` file would then be moved (in that
> order) before discarding the temporary object database.
> 
> This would probably need to be implemented as a new
> `tmp_objdir_pack_and_migrate()` function that basically spawns
> `pack-objects` and feeds it the list of generated objects, writing
> directly into the non-temporary object directory, then discarding the
> `tmp_objdir`.

Perhaps I'm missing some piece of the puzzle, but I'm not sure what
you're trying to accomplish with that approach.

If the goal is to increase performance by avoiding the loose object
writes, then we haven't really helped much. We're still writing them,
and then writing them again for the repack.

If the goal is just to end up with a single nice pack for the long term,
then why do we need to use tmp_objdir at all? That point of that API is
to avoid letting other simultaneous processes see the intermediate state
before we're committed to keeping the objects around. That makes sense
for receiving a fetch or push, since we want to do some quality checks
on the objects before agreeing to keep them. But does it make sense for
a merge? Sure, in some workflows (like GitHub's test merges) we might
end up throwing away the merge result if it's not clean. But there is no
real downside to other processes seeing those objects. They can be
cleaned up at the next pruning repack.

I guess if your scenario requirements include "and we are never allowed
to run a pruning repack", then that could make sense. And I know that
has been a historical issue for GitHub. But I'm not sure it's
necessarily a good driver for an upstream feature.

As an alternative, though, I wonder if you need to have access to the
objects outside of the merge process. If not, then rather than an
alternate object store, what if that single process wrote to a streaming
pack _and_ used its running in-core index of the objects to allow access
via the usual object-retrieval. Then you'd get a single, clean pack as
the outcome _and_ you'd get the performance boost over just "write loose
objects, repack, and prune".

-Peff

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 0/5] merge-ort: implement support for packing objects together
  2023-11-06 15:46     ` Johannes Schindelin
  2023-11-06 23:19       ` Junio C Hamano
  2023-11-07  3:42       ` Jeff King
@ 2023-11-07 15:58       ` Taylor Blau
  2023-11-07 18:22         ` [RFC PATCH 0/3] replay: implement support for writing new objects to a pack Taylor Blau
  2 siblings, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2023-11-07 15:58 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, git, Elijah Newren, Eric W. Biederman, Jeff King,
	Patrick Steinhardt

On Mon, Nov 06, 2023 at 04:46:32PM +0100, Johannes Schindelin wrote:
> Hi,
>
> On Mon, 23 Oct 2023, Junio C Hamano wrote:
>
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> > > But I think that this approach ended up being less heavy-weight than I
> > > had originally imagined, so I think that this version is a worthwhile
> > > improvement over v4.
> >
> > ;-).
> >
> > This version is a good place to stop, a bit short of going full OO.
> > Nicely done.
>
> I wonder whether a more generic approach would be more desirable, an
> approach that would work for `git replay`, too, for example (where
> streaming objects does not work because they need to be made available
> immediately because subsequent `merge_incore_nonrecursive()` might expect
> the created objects to be present)?

The goal of this series is to bound the number of inodes consumed by a
single merge-tree invocation down from arbitrarily many (in the case of
storing each new object loose) to a constant (by storing everything in a
single pack).

I'd think that we would want a similar approach for 'replay', but as
you note we have some additional requirements, too:

  - each replayed commit is computed in a single step, which will result
    in a new pack
  - we must be able to see objects from previous steps

I think one feasible approach here for replay is to combine the two
ideas and have a separate objdir that stores N packs (one for each step
of the replay), but then repacks them down into a single pack before
migrating back to the main object store.

That would ensure that we have some isolation between replay-created
objects and the rest of the repository in the intermediate state. Even
though we'd have as many packs as there are commits, we'd consume far
fewer inodes in the process, since each commit can introduce arbitrarily
many new objects, each requiring at least a single inode (potentially
more with sharding).

We'd have to be mindful of having a large number of packs, but I think
that this should mostly be a non-issue, since we'd only be living with N
packs for the lifetime of the replay command (before repacking them down
to a single pack and migrating them back to the main object store).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [RFC PATCH 0/3] replay: implement support for writing new objects to a pack
  2023-11-07 15:58       ` Taylor Blau
@ 2023-11-07 18:22         ` Taylor Blau
  2023-11-07 18:22           ` [RFC PATCH 1/3] merge-ort.c: finalize ODB transactions after each step Taylor Blau
                             ` (4 more replies)
  0 siblings, 5 replies; 63+ messages in thread
From: Taylor Blau @ 2023-11-07 18:22 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Patrick Steinhardt, Elijah Newren, Junio C Hamano,
	Johannes Schindelin

(Based on a combination of Christian's cc/git-replay and my
tb/merge-tree-write-pack branches).

This RFC demonstrates extending the new `--write-pack` option that
`merge-tree` recently learned to the `replay` builtin as well.

The approach is as follows:

  - write a pack out after each step in the replay operation, so that
    subsequent steps may see any new object(s) created during previous
    steps

  - combine those packs into one before migrating them back into the
    main object store

This is accomplished with a combination of the bulk-checkin and
tmp-objdir APIs, with some minor modifications made to when we flush out
and finalize bulk-checkin transactions.

The benefit to this approach is that we bound the number of inodes
required per replayed commit to a constant (by default, 3: one for the
.pack, one for the .idx, and another for the .rev file), instead of
having each operation take an unbounded number of inodes proportional to
the number of new objects created during that step. We also only migrate
a single pack back to the main object store.

In other words, this makes the maximum number of inodes required by
'replay' grow proportional to the number of commits being replayed,
instead of the number of new *objects* created as a result of the replay
operation.

Taylor Blau (3):
  merge-ort.c: finalize ODB transactions after each step
  tmp-objdir: introduce `tmp_objdir_repack()`
  builtin/replay.c: introduce `--write-pack`

 Documentation/git-replay.txt |  4 ++++
 builtin/replay.c             | 18 ++++++++++++++++++
 merge-ort.c                  |  5 ++++-
 t/t3650-replay-basics.sh     | 37 ++++++++++++++++++++++++++++++++++++
 tmp-objdir.c                 | 13 +++++++++++++
 tmp-objdir.h                 |  6 ++++++
 6 files changed, 82 insertions(+), 1 deletion(-)

-- 
2.42.0.446.g0b9ef90488

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [RFC PATCH 1/3] merge-ort.c: finalize ODB transactions after each step
  2023-11-07 18:22         ` [RFC PATCH 0/3] replay: implement support for writing new objects to a pack Taylor Blau
@ 2023-11-07 18:22           ` Taylor Blau
  2023-11-11  3:45             ` Elijah Newren
  2023-11-07 18:22           ` [RFC PATCH 2/3] tmp-objdir: introduce `tmp_objdir_repack()` Taylor Blau
                             ` (3 subsequent siblings)
  4 siblings, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2023-11-07 18:22 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Patrick Steinhardt, Elijah Newren, Junio C Hamano,
	Johannes Schindelin

In a previous commit, the ORT merge backend learned how to use the
bulk-checkin mechanism to emit a single pack containing any new objects
created during the merge. This functionality was implemented by setting
up a new ODB transaction, and finalizing it at the end of the merge via
`process_entries()`.

In a future commit, we will extend this functionality to the new `git
replay` command, which needs to see objects from previous steps in order
to replay each commit.

As a step towards implementing this, teach the ORT backend to flush the
ODB transaction at the end of each step in `process_entries()`, and then
finalize the result with `end_odb_transaction()` when calling
`merge_finalize()`.

For normal `merge-tree --write-pack` invocations, this produces no
functional change: the pack is written out at the end of
`process_entries()`, and then the `end_odb_transaction()` call is a
noop.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 merge-ort.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index 523577d71e..7b352451cc 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -4354,7 +4354,7 @@ static int process_entries(struct merge_options *opt,
 		ret = -1;
 
 	if (opt->write_pack)
-		end_odb_transaction();
+		flush_odb_transaction();
 
 cleanup:
 	string_list_clear(&plist, 0);
@@ -4726,6 +4726,9 @@ void merge_switch_to_result(struct merge_options *opt,
 void merge_finalize(struct merge_options *opt,
 		    struct merge_result *result)
 {
+	if (opt->write_pack)
+		end_odb_transaction();
+
 	if (opt->renormalize)
 		git_attr_set_direction(GIT_ATTR_CHECKIN);
 	assert(opt->priv == NULL);
-- 
2.42.0.446.g0b9ef90488


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [RFC PATCH 2/3] tmp-objdir: introduce `tmp_objdir_repack()`
  2023-11-07 18:22         ` [RFC PATCH 0/3] replay: implement support for writing new objects to a pack Taylor Blau
  2023-11-07 18:22           ` [RFC PATCH 1/3] merge-ort.c: finalize ODB transactions after each step Taylor Blau
@ 2023-11-07 18:22           ` Taylor Blau
  2023-11-08  7:05             ` Patrick Steinhardt
  2023-11-07 18:23           ` [RFC PATCH 3/3] builtin/replay.c: introduce `--write-pack` Taylor Blau
                             ` (2 subsequent siblings)
  4 siblings, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2023-11-07 18:22 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Patrick Steinhardt, Elijah Newren, Junio C Hamano,
	Johannes Schindelin

In the following commit, we will teach `git replay` how to write a pack
containing the set of new objects created as a result of the `replay`
operation.

Since `replay` needs to be able to see the object(s) written
from previous steps in order to replay each commit, the ODB transaction
may have multiple pending packs. Migrating multiple packs back into the
main object store has a couple of downsides:

  - It is error-prone to do so: each pack must be migrated in the
    correct order (with the ".idx" file staged last), and the set of
    packs themselves must be moved over in the correct order to avoid
    racy behavior.

  - It is a (potentially significant) performance degradation to migrate
    a large number of packs back into the main object store.

Introduce a new function that combines the set of all packs in the
temporary object store to produce a single pack which is the logical
concatenation of all packs created during that level of the ODB
transaction.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 tmp-objdir.c | 13 +++++++++++++
 tmp-objdir.h |  6 ++++++
 2 files changed, 19 insertions(+)

diff --git a/tmp-objdir.c b/tmp-objdir.c
index 5f9074ad1c..ef53180b47 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -12,6 +12,7 @@
 #include "strvec.h"
 #include "quote.h"
 #include "object-store-ll.h"
+#include "run-command.h"
 
 struct tmp_objdir {
 	struct strbuf path;
@@ -277,6 +278,18 @@ int tmp_objdir_migrate(struct tmp_objdir *t)
 	return ret;
 }
 
+int tmp_objdir_repack(struct tmp_objdir *t)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+
+	cmd.git_cmd = 1;
+
+	strvec_pushl(&cmd.args, "repack", "-a", "-d", "-k", "-l", NULL);
+	strvec_pushv(&cmd.env, tmp_objdir_env(t));
+
+	return run_command(&cmd);
+}
+
 const char **tmp_objdir_env(const struct tmp_objdir *t)
 {
 	if (!t)
diff --git a/tmp-objdir.h b/tmp-objdir.h
index 237d96b660..d00e3b3e27 100644
--- a/tmp-objdir.h
+++ b/tmp-objdir.h
@@ -36,6 +36,12 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix);
  */
 const char **tmp_objdir_env(const struct tmp_objdir *);
 
+/*
+ * Combines all packs in the tmp_objdir into a single pack before migrating.
+ * Removes original pack(s) after installing the combined pack into place.
+ */
+int tmp_objdir_repack(struct tmp_objdir *);
+
 /*
  * Finalize a temporary object directory by migrating its objects into the main
  * object database, removing the temporary directory, and freeing any
-- 
2.42.0.446.g0b9ef90488


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [RFC PATCH 3/3] builtin/replay.c: introduce `--write-pack`
  2023-11-07 18:22         ` [RFC PATCH 0/3] replay: implement support for writing new objects to a pack Taylor Blau
  2023-11-07 18:22           ` [RFC PATCH 1/3] merge-ort.c: finalize ODB transactions after each step Taylor Blau
  2023-11-07 18:22           ` [RFC PATCH 2/3] tmp-objdir: introduce `tmp_objdir_repack()` Taylor Blau
@ 2023-11-07 18:23           ` Taylor Blau
  2023-11-11  3:42           ` [RFC PATCH 0/3] replay: implement support for writing new objects to a pack Elijah Newren
  2023-11-11  4:04           ` Elijah Newren
  4 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2023-11-07 18:23 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Patrick Steinhardt, Elijah Newren, Junio C Hamano,
	Johannes Schindelin

Now that the prerequisites are in place, we can implement a
`--write-pack` option for `git replay`, corresponding to the existing
one in `git merge-tree`.

The changes are mostly limited to:

  - introducing a new option in the builtin
  - replacing the main object store with the temporary one
  - then repacking and migrating the temporary object store back into
    the main object store after the replay has completed

Along with tests and documentation to ensure that the new behavior
matches our expectations.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-replay.txt |  4 ++++
 builtin/replay.c             | 18 ++++++++++++++++++
 t/t3650-replay-basics.sh     | 37 ++++++++++++++++++++++++++++++++++++
 3 files changed, 59 insertions(+)

diff --git a/Documentation/git-replay.txt b/Documentation/git-replay.txt
index e7551aec54..f424c1a676 100644
--- a/Documentation/git-replay.txt
+++ b/Documentation/git-replay.txt
@@ -42,6 +42,10 @@ When `--advance` is specified, the update-ref command(s) in the output
 will update the branch passed as an argument to `--advance` to point at
 the new commits (in other words, this mimics a cherry-pick operation).
 
+--write-pack::
+	Write any new objects into a separate packfile instead of as
+	individual loose objects.
+
 <revision-range>::
 	Range of commits to replay. More than one <revision-range> can
 	be passed, but in `--advance <branch>` mode, they should have
diff --git a/builtin/replay.c b/builtin/replay.c
index c3d53ff0cd..72b7b7f43a 100644
--- a/builtin/replay.c
+++ b/builtin/replay.c
@@ -17,6 +17,7 @@
 #include "strmap.h"
 #include <oidset.h>
 #include <tree.h>
+#include "tmp-objdir.h"
 
 static const char *short_commit_name(struct commit *commit)
 {
@@ -272,6 +273,7 @@ int cmd_replay(int argc, const char **argv, const char *prefix)
 	struct commit *onto = NULL;
 	const char *onto_name = NULL;
 	int contained = 0;
+	int write_pack = 0;
 
 	struct rev_info revs;
 	struct commit *last_commit = NULL;
@@ -279,6 +281,7 @@ int cmd_replay(int argc, const char **argv, const char *prefix)
 	struct merge_options merge_opt;
 	struct merge_result result;
 	struct strset *update_refs = NULL;
+	struct tmp_objdir *tmp_objdir = NULL;
 	kh_oid_map_t *replayed_commits;
 	int i, ret = 0;
 
@@ -296,6 +299,8 @@ int cmd_replay(int argc, const char **argv, const char *prefix)
 			   N_("replay onto given commit")),
 		OPT_BOOL(0, "contained", &contained,
 			 N_("advance all branches contained in revision-range")),
+		OPT_BOOL(0, "write-pack", &write_pack,
+			 N_("write new objects to a pack instead of as loose")),
 		OPT_END()
 	};
 
@@ -352,8 +357,15 @@ int cmd_replay(int argc, const char **argv, const char *prefix)
 	init_merge_options(&merge_opt, the_repository);
 	memset(&result, 0, sizeof(result));
 	merge_opt.show_rename_progress = 0;
+	merge_opt.write_pack = write_pack;
 	last_commit = onto;
 	replayed_commits = kh_init_oid_map();
+
+	if (merge_opt.write_pack) {
+		tmp_objdir = tmp_objdir_create("replay");
+		tmp_objdir_replace_primary_odb(tmp_objdir, 0);
+	}
+
 	while ((commit = get_revision(&revs))) {
 		const struct name_decoration *decoration;
 		khint_t pos;
@@ -417,5 +429,11 @@ int cmd_replay(int argc, const char **argv, const char *prefix)
 	/* Return */
 	if (ret < 0)
 		exit(128);
+	if (ret && tmp_objdir) {
+		if (tmp_objdir_repack(tmp_objdir) < 0)
+			ret = 0;
+		else if (tmp_objdir_migrate(tmp_objdir) < 0)
+			ret = 0;
+	}
 	return ret ? 0 : 1;
 }
diff --git a/t/t3650-replay-basics.sh b/t/t3650-replay-basics.sh
index 389670262e..e7048748c2 100755
--- a/t/t3650-replay-basics.sh
+++ b/t/t3650-replay-basics.sh
@@ -67,6 +67,43 @@ test_expect_success 'using replay to rebase two branches, one on top of other' '
 	test_cmp expect result
 '
 
+packdir=.git/objects/pack
+
+test_expect_success 'using replay to rebase two branches, with --write-pack' '
+	# Remove the results of the previous rebase, ensuring that they
+	# are pruned from the object store.
+	git gc --prune=now &&
+	test_must_fail git cat-file -t "$(cut -d " " -f 3 expect)" &&
+
+	# Create an extra packfile to ensure that the tmp-objdir repack
+	# takes place outside of the main object store.
+	git checkout --detach &&
+	test_commit unreachable &&
+	git repack -d &&
+	git checkout main &&
+
+	find $packdir -type f -name "*.idx" | sort >packs.before &&
+	git replay --write-pack --onto main topic1..topic2 >result &&
+	find $packdir -type f -name "*.idx" | sort >packs.after &&
+
+	comm -13 packs.before packs.after >packs.new &&
+
+	# Ensure that we got a single new pack.
+	test_line_count = 1 result &&
+	test_line_count = 1 packs.new &&
+
+	# ... and that the rest of the results match our expeectations.
+	git log --format=%s $(cut -f 3 -d " " result) >actual &&
+	test_write_lines E D M L B A >expect &&
+	test_cmp expect actual &&
+
+	printf "update refs/heads/topic2 " >expect &&
+	printf "%s " $(cut -f 3 -d " " result) >>expect &&
+	git rev-parse topic2 >>expect &&
+
+	test_cmp expect result
+'
+
 test_expect_success 'using replay on bare repo to rebase two branches, one on top of other' '
 	git -C bare replay --onto main topic1..topic2 >result-bare &&
 	test_cmp expect result-bare
-- 
2.42.0.446.g0b9ef90488

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 2/3] tmp-objdir: introduce `tmp_objdir_repack()`
  2023-11-07 18:22           ` [RFC PATCH 2/3] tmp-objdir: introduce `tmp_objdir_repack()` Taylor Blau
@ 2023-11-08  7:05             ` Patrick Steinhardt
  2023-11-09 19:26               ` Taylor Blau
  0 siblings, 1 reply; 63+ messages in thread
From: Patrick Steinhardt @ 2023-11-08  7:05 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Jeff King, Elijah Newren, Junio C Hamano, Johannes Schindelin

[-- Attachment #1: Type: text/plain, Size: 3155 bytes --]

On Tue, Nov 07, 2023 at 01:22:58PM -0500, Taylor Blau wrote:
> In the following commit, we will teach `git replay` how to write a pack
> containing the set of new objects created as a result of the `replay`
> operation.
> 
> Since `replay` needs to be able to see the object(s) written
> from previous steps in order to replay each commit, the ODB transaction
> may have multiple pending packs. Migrating multiple packs back into the
> main object store has a couple of downsides:
> 
>   - It is error-prone to do so: each pack must be migrated in the
>     correct order (with the ".idx" file staged last), and the set of
>     packs themselves must be moved over in the correct order to avoid
>     racy behavior.
> 
>   - It is a (potentially significant) performance degradation to migrate
>     a large number of packs back into the main object store.
> 
> Introduce a new function that combines the set of all packs in the
> temporary object store to produce a single pack which is the logical
> concatenation of all packs created during that level of the ODB
> transaction.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  tmp-objdir.c | 13 +++++++++++++
>  tmp-objdir.h |  6 ++++++
>  2 files changed, 19 insertions(+)
> 
> diff --git a/tmp-objdir.c b/tmp-objdir.c
> index 5f9074ad1c..ef53180b47 100644
> --- a/tmp-objdir.c
> +++ b/tmp-objdir.c
> @@ -12,6 +12,7 @@
>  #include "strvec.h"
>  #include "quote.h"
>  #include "object-store-ll.h"
> +#include "run-command.h"
>  
>  struct tmp_objdir {
>  	struct strbuf path;
> @@ -277,6 +278,18 @@ int tmp_objdir_migrate(struct tmp_objdir *t)
>  	return ret;
>  }
>  
> +int tmp_objdir_repack(struct tmp_objdir *t)
> +{
> +	struct child_process cmd = CHILD_PROCESS_INIT;
> +
> +	cmd.git_cmd = 1;
> +
> +	strvec_pushl(&cmd.args, "repack", "-a", "-d", "-k", "-l", NULL);
> +	strvec_pushv(&cmd.env, tmp_objdir_env(t));

I wonder what performance of this repack would be like in a large
repository with many refs. Ideally, I would expect that the repacking
performance should scale with the number of objects we have written into
the temporary object directory. But in practice, the repack will need to
compute reachability and thus also scales with the size of the repo
itself, doesn't it?

Patrick

> +	return run_command(&cmd);
> +}
> +
>  const char **tmp_objdir_env(const struct tmp_objdir *t)
>  {
>  	if (!t)
> diff --git a/tmp-objdir.h b/tmp-objdir.h
> index 237d96b660..d00e3b3e27 100644
> --- a/tmp-objdir.h
> +++ b/tmp-objdir.h
> @@ -36,6 +36,12 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix);
>   */
>  const char **tmp_objdir_env(const struct tmp_objdir *);
>  
> +/*
> + * Combines all packs in the tmp_objdir into a single pack before migrating.
> + * Removes original pack(s) after installing the combined pack into place.
> + */
> +int tmp_objdir_repack(struct tmp_objdir *);
> +
>  /*
>   * Finalize a temporary object directory by migrating its objects into the main
>   * object database, removing the temporary directory, and freeing any
> -- 
> 2.42.0.446.g0b9ef90488
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 2/3] tmp-objdir: introduce `tmp_objdir_repack()`
  2023-11-08  7:05             ` Patrick Steinhardt
@ 2023-11-09 19:26               ` Taylor Blau
  0 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2023-11-09 19:26 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Jeff King, Elijah Newren, Junio C Hamano, Johannes Schindelin

On Wed, Nov 08, 2023 at 08:05:46AM +0100, Patrick Steinhardt wrote:
> > @@ -277,6 +278,18 @@ int tmp_objdir_migrate(struct tmp_objdir *t)
> >  	return ret;
> >  }
> >
> > +int tmp_objdir_repack(struct tmp_objdir *t)
> > +{
> > +	struct child_process cmd = CHILD_PROCESS_INIT;
> > +
> > +	cmd.git_cmd = 1;
> > +
> > +	strvec_pushl(&cmd.args, "repack", "-a", "-d", "-k", "-l", NULL);
> > +	strvec_pushv(&cmd.env, tmp_objdir_env(t));
>
> I wonder what performance of this repack would be like in a large
> repository with many refs. Ideally, I would expect that the repacking
> performance should scale with the number of objects we have written into
> the temporary object directory. But in practice, the repack will need to
> compute reachability and thus also scales with the size of the repo
> itself, doesn't it?

Good question. We definitely do not want to be doing an all-into-one
repack as a consequence of running 'git replay' in a large repository
with lots of refs, objects, or both.

But since we push the result of calling `tmp_objdir_env(t)` into the
environment of the child process, we are only repacking the objects in
the temporary directory, not the main object store.

I have a test that verifies this is the case by making sure that in a
repository with some arbitrary set of pre-existing packs, that only one
pack is added to that set after running 'replay', and that the
pre-existing packs remain in place.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
  2023-10-23 22:45   ` [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack` Taylor Blau
  2023-10-25  7:58     ` Patrick Steinhardt
@ 2023-11-10 23:51     ` Elijah Newren
  2023-11-11  0:27       ` Junio C Hamano
                         ` (2 more replies)
  1 sibling, 3 replies; 63+ messages in thread
From: Elijah Newren @ 2023-11-10 23:51 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Eric W. Biederman, Jeff King, Junio C Hamano, Patrick Steinhardt

Hi,

Sorry for taking so long to find some time to review.  And sorry for
the bad news, but...

On Mon, Oct 23, 2023 at 3:45 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> When using merge-tree often within a repository[^1], it is possible to
> generate a relatively large number of loose objects, which can result in
> degraded performance, and inode exhaustion in extreme cases.
>
> Building on the functionality introduced in previous commits, the
> bulk-checkin machinery now has support to write arbitrary blob and tree
> objects which are small enough to be held in-core. We can use this to
> write any blob/tree objects generated by ORT into a separate pack
> instead of writing them out individually as loose.
>
> This functionality is gated behind a new `--write-pack` option to
> `merge-tree` that works with the (non-deprecated) `--write-tree` mode.
>
> The implementation is relatively straightforward. There are two spots
> within the ORT mechanism where we call `write_object_file()`, one for
> content differences within blobs, and another to assemble any new trees
> necessary to construct the merge. In each of those locations,
> conditionally replace calls to `write_object_file()` with
> `index_blob_bulk_checkin_incore()` or `index_tree_bulk_checkin_incore()`
> depending on which kind of object we are writing.
>
> The only remaining task is to begin and end the transaction necessary to
> initialize the bulk-checkin machinery, and move any new pack(s) it
> created into the main object store.

I believe the above is built on an assumption that any objects written
will not need to be read until after the merge is completed.  And we
might have a nesting issue too...

> @@ -2108,10 +2109,19 @@ static int handle_content_merge(struct merge_options *opt,
>                 if ((merge_status < 0) || !result_buf.ptr)
>                         ret = error(_("failed to execute internal merge"));
>
> -               if (!ret &&
> -                   write_object_file(result_buf.ptr, result_buf.size,
> -                                     OBJ_BLOB, &result->oid))
> -                       ret = error(_("unable to add %s to database"), path);
> +               if (!ret) {
> +                       ret = opt->write_pack
> +                               ? index_blob_bulk_checkin_incore(&result->oid,
> +                                                                result_buf.ptr,
> +                                                                result_buf.size,
> +                                                                path, 1)
> +                               : write_object_file(result_buf.ptr,
> +                                                   result_buf.size,
> +                                                   OBJ_BLOB, &result->oid);
> +                       if (ret)
> +                               ret = error(_("unable to add %s to database"),
> +                                           path);
> +               }
>
>                 free(result_buf.ptr);
>                 if (ret)

This is unsafe; the object may need to be read later within the same
merge.  Perhaps the simplest example related to your testcase is
modifying the middle section of that testcase (I'll highlight where
when I comment on the testcase) to read:

    test_commit -C repo base file "$(test_seq 3 5)" &&
    git -C repo branch -M main &&
    git -C repo checkout -b side main &&
    test_commit -C repo side-file file "$(test_seq 1 5)" &&
    test_commit -C repo side-file2 file2 "$(test_seq 1 6)" &&
    git -C repo checkout -b other main &&
    test_commit -C repo other-file file "$(test_seq 3 7)" &&
    git -C repo mv file file2 &&
    git -C repo commit -m other-file2 &&

    find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
    git -C repo merge-tree --write-pack side other &&

In words, what I'm doing here is having both sides modify "file" (the
same as you did) but also having one side rename "file"->"file2".  The
side that doesn't rename "file" also introduces a new "file2".  ort
needs to merge the three versions of "file" to get a new blob object,
and then merge that with the content from the brand new "file2".  More
complicated cases are possible, of course.  Anyway, with the modified
testcase above, merge-tree will fail with:

    fatal: unable to read blob object 06e567b11dfdafeaf7d3edcc89864149383aeab6

I think (untested) that you could fix this by creating two packs
instead of just one.  In particular, add a call to
flush_odb_transcation() after the "redo_after_renames" block in
merge_ort_nonrecursive_internal().  (It's probably easier to see that
you could place the flush_odb_transaction() call inside
detect_and_process_renames() just after the process_renames() call,
but when redo_after_renames is true you'd end up with three packs
instead of just two).

What happens with the odb transaction stuff if no new objects are
written before the call to flush_odb_transaction?  Will that be a
problem?

(Since any tree written will not be re-read within the same merge, the
other write_object_file() call you changed does not have the same
problem as the one here.)

>@@ -4332,9 +4349,13 @@ static int process_entries(struct merge_options *opt,
>                fflush(stdout);
>                BUG("dir_metadata accounting completely off; shouldn't happen");
>        }
>-       if (write_tree(result_oid, &dir_metadata.versions, 0,
>+       if (write_tree(opt, result_oid, &dir_metadata.versions, 0,
>                       opt->repo->hash_algo->rawsz) < 0)
>                ret = -1;
>
> +
> +       if (opt->write_pack)
> +               end_odb_transaction();
> +

Is the end_odb_transaction() here going to fail with an "Unbalanced
ODB transaction nesting" when faced with a recursive merge?

Perhaps flushing here, and then calling end_odb_transaction() in
merge_finalize(), much as you do in your replay-and-write-pack series,
should actually be moved to this commit and included here?

This does mean that for a recursive merge, that you'll get up to 2*N
packfiles, where N is the depth of the recursive merge.

> +       test_commit -C repo base file "$(test_seq 3 5)" &&
> +       git -C repo branch -M main &&
> +       git -C repo checkout -b side main &&
> +       test_commit -C repo side file "$(test_seq 1 5)" &&
> +       git -C repo checkout -b other main &&
> +       test_commit -C repo other file "$(test_seq 3 7)" &&
> +
> +       find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
> +       tree="$(git -C repo merge-tree --write-pack \
> +               refs/tags/side refs/tags/other)" &&

These were the lines from your testcase that I replaced to show the bug.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 3/5] bulk-checkin: introduce `index_blob_bulk_checkin_incore()`
  2023-10-25 17:21         ` Eric Sunshine
  2023-10-26  8:16           ` Patrick Steinhardt
@ 2023-11-11  0:17           ` Elijah Newren
  1 sibling, 0 replies; 63+ messages in thread
From: Elijah Newren @ 2023-11-11  0:17 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Taylor Blau, Patrick Steinhardt, git, Eric W. Biederman,
	Jeff King, Junio C Hamano

On Wed, Oct 25, 2023 at 10:22 AM Eric Sunshine <sunshine@sunshineco.com> wrote:
>
> On Wed, Oct 25, 2023 at 11:44 AM Taylor Blau <me@ttaylorr.com> wrote:
> > On Wed, Oct 25, 2023 at 09:58:06AM +0200, Patrick Steinhardt wrote:
> > > On Mon, Oct 23, 2023 at 06:45:01PM -0400, Taylor Blau wrote:
> > > > In order to support streaming from a location in memory, we must
> > > > implement a new kind of bulk_checkin_source that does just that. These
> > > > implementation in spread out across:
> > >
>
> Perhaps:
>
>     s/implementation in/implementations are/

Was just about to comment with the same suggestion; I had a hard time
parsing the commit message as well because of this.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
  2023-11-10 23:51     ` Elijah Newren
@ 2023-11-11  0:27       ` Junio C Hamano
  2023-11-11  1:34         ` Taylor Blau
  2023-11-11  1:24       ` Taylor Blau
  2023-11-13 22:02       ` Jeff King
  2 siblings, 1 reply; 63+ messages in thread
From: Junio C Hamano @ 2023-11-11  0:27 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Taylor Blau, git, Eric W. Biederman, Jeff King, Patrick Steinhardt

Elijah Newren <newren@gmail.com> writes:

> I believe the above is built on an assumption that any objects written
> will not need to be read until after the merge is completed.  And we
> might have a nesting issue too...
> ...
> This is unsafe; the object may need to be read later within the same
> merge.

Thanks for a good analysis.  I agree.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
  2023-11-10 23:51     ` Elijah Newren
  2023-11-11  0:27       ` Junio C Hamano
@ 2023-11-11  1:24       ` Taylor Blau
  2023-11-13 22:05         ` Jeff King
  2023-11-13 22:02       ` Jeff King
  2 siblings, 1 reply; 63+ messages in thread
From: Taylor Blau @ 2023-11-11  1:24 UTC (permalink / raw)
  To: Elijah Newren
  Cc: git, Eric W. Biederman, Jeff King, Junio C Hamano, Patrick Steinhardt

On Fri, Nov 10, 2023 at 03:51:18PM -0800, Elijah Newren wrote:
> > @@ -2108,10 +2109,19 @@ static int handle_content_merge(struct merge_options *opt,
> >                 if ((merge_status < 0) || !result_buf.ptr)
> >                         ret = error(_("failed to execute internal merge"));
> >
> > -               if (!ret &&
> > -                   write_object_file(result_buf.ptr, result_buf.size,
> > -                                     OBJ_BLOB, &result->oid))
> > -                       ret = error(_("unable to add %s to database"), path);
> > +               if (!ret) {
> > +                       ret = opt->write_pack
> > +                               ? index_blob_bulk_checkin_incore(&result->oid,
> > +                                                                result_buf.ptr,
> > +                                                                result_buf.size,
> > +                                                                path, 1)
> > +                               : write_object_file(result_buf.ptr,
> > +                                                   result_buf.size,
> > +                                                   OBJ_BLOB, &result->oid);
> > +                       if (ret)
> > +                               ret = error(_("unable to add %s to database"),
> > +                                           path);
> > +               }
> >
> >                 free(result_buf.ptr);
> >                 if (ret)
>
> This is unsafe; the object may need to be read later within the same
> merge.  Perhaps the simplest example related to your testcase is
> modifying the middle section of that testcase (I'll highlight where
> when I comment on the testcase) to read:
>
>     test_commit -C repo base file "$(test_seq 3 5)" &&
>     git -C repo branch -M main &&
>     git -C repo checkout -b side main &&
>     test_commit -C repo side-file file "$(test_seq 1 5)" &&
>     test_commit -C repo side-file2 file2 "$(test_seq 1 6)" &&
>     git -C repo checkout -b other main &&
>     test_commit -C repo other-file file "$(test_seq 3 7)" &&
>     git -C repo mv file file2 &&
>     git -C repo commit -m other-file2 &&
>
>     find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
>     git -C repo merge-tree --write-pack side other &&
>
> In words, what I'm doing here is having both sides modify "file" (the
> same as you did) but also having one side rename "file"->"file2".  The
> side that doesn't rename "file" also introduces a new "file2".  ort
> needs to merge the three versions of "file" to get a new blob object,
> and then merge that with the content from the brand new "file2".  More
> complicated cases are possible, of course.  Anyway, with the modified
> testcase above, merge-tree will fail with:
>
>     fatal: unable to read blob object 06e567b11dfdafeaf7d3edcc89864149383aeab6
>
> I think (untested) that you could fix this by creating two packs
> instead of just one.  In particular, add a call to
> flush_odb_transcation() after the "redo_after_renames" block in
> merge_ort_nonrecursive_internal().  (It's probably easier to see that
> you could place the flush_odb_transaction() call inside
> detect_and_process_renames() just after the process_renames() call,
> but when redo_after_renames is true you'd end up with three packs
> instead of just two).

Great analysis, thanks for catching this error. I tested your approach,
and indeed a flush_odb_transaction() call after the process_renames()
call in detect_and_process_renames() does do the trick.

> What happens with the odb transaction stuff if no new objects are
> written before the call to flush_odb_transaction?  Will that be a
> problem?

I think that the bulk-checkin code is flexible enough to understand that
we shouldn't do anything when there aren't any objects to pack.

> (Since any tree written will not be re-read within the same merge, the
> other write_object_file() call you changed does not have the same
> problem as the one here.)
>
> >@@ -4332,9 +4349,13 @@ static int process_entries(struct merge_options *opt,
> >                fflush(stdout);
> >                BUG("dir_metadata accounting completely off; shouldn't happen");
> >        }
> >-       if (write_tree(result_oid, &dir_metadata.versions, 0,
> >+       if (write_tree(opt, result_oid, &dir_metadata.versions, 0,
> >                       opt->repo->hash_algo->rawsz) < 0)
> >                ret = -1;
> >
> > +
> > +       if (opt->write_pack)
> > +               end_odb_transaction();
> > +
>
> Is the end_odb_transaction() here going to fail with an "Unbalanced
> ODB transaction nesting" when faced with a recursive merge?

I think so, and we should have a test-case demonstrating that. In the
remaining three patches that I posted to extend this approach to 'git
replay', I moved this call elsewhere in such a way that I think

> Perhaps flushing here, and then calling end_odb_transaction() in
> merge_finalize(), much as you do in your replay-and-write-pack series,
> should actually be moved to this commit and included here?

Yep, exactly.

> This does mean that for a recursive merge, that you'll get up to 2*N
> packfiles, where N is the depth of the recursive merge.

We definitely want to avoid that ;-). I think there are a couple of
potential directions forward here, but the most promising one I think is
due to Johannes who suggests that we write loose objects into a
temporary directory with a replace_tmp_objdir() call, and then repack
that side directory before migrating a single pack back into the main
object store.

Thanks,
eaylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
  2023-11-11  0:27       ` Junio C Hamano
@ 2023-11-11  1:34         ` Taylor Blau
  0 siblings, 0 replies; 63+ messages in thread
From: Taylor Blau @ 2023-11-11  1:34 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren, git, Eric W. Biederman, Jeff King, Patrick Steinhardt

On Sat, Nov 11, 2023 at 09:27:42AM +0900, Junio C Hamano wrote:
> Elijah Newren <newren@gmail.com> writes:
>
> > I believe the above is built on an assumption that any objects written
> > will not need to be read until after the merge is completed.  And we
> > might have a nesting issue too...
> > ...
> > This is unsafe; the object may need to be read later within the same
> > merge.
>
> Thanks for a good analysis.  I agree.

Ditto. I responded to Elijah more in-depth elsewhere in the thread, but
I think for your purposes it is OK to discard this series.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 0/3] replay: implement support for writing new objects to a pack
  2023-11-07 18:22         ` [RFC PATCH 0/3] replay: implement support for writing new objects to a pack Taylor Blau
                             ` (2 preceding siblings ...)
  2023-11-07 18:23           ` [RFC PATCH 3/3] builtin/replay.c: introduce `--write-pack` Taylor Blau
@ 2023-11-11  3:42           ` Elijah Newren
  2023-11-11  4:04           ` Elijah Newren
  4 siblings, 0 replies; 63+ messages in thread
From: Elijah Newren @ 2023-11-11  3:42 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Jeff King, Patrick Steinhardt, Junio C Hamano, Johannes Schindelin

On Tue, Nov 7, 2023 at 10:22 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> (Based on a combination of Christian's cc/git-replay and my
> tb/merge-tree-write-pack branches).
>
> This RFC demonstrates extending the new `--write-pack` option that
> `merge-tree` recently learned to the `replay` builtin as well.
>
> The approach is as follows:
>
>   - write a pack out after each step in the replay operation, so that
>     subsequent steps may see any new object(s) created during previous
>     steps
>
>   - combine those packs into one before migrating them back into the
>     main object store
>
> This is accomplished with a combination of the bulk-checkin and
> tmp-objdir APIs, with some minor modifications made to when we flush out
> and finalize bulk-checkin transactions.
>
> The benefit to this approach is that we bound the number of inodes
> required per replayed commit to a constant (by default, 3: one for the
> .pack, one for the .idx, and another for the .rev file), instead of
> having each operation take an unbounded number of inodes proportional to
> the number of new objects created during that step. We also only migrate
> a single pack back to the main object store.

Isn't it actually 4?  Since you only put blobs and trees into the bulk
checkin packfiles, the commit object will still be loose.

> In other words, this makes the maximum number of inodes required by
> 'replay' grow proportional to the number of commits being replayed,
> instead of the number of new *objects* created as a result of the replay
> operation.

As per comments on the other series, we actually need 2 packs per
commit (when replaying non-merge commits, we don't have to worry about
recursive merges, so 2*depth is just 2), so this would be 7 inodes per
commit (3 files per pack * 2 packs + 1 loose commit object).

I was curious what the comparative number of loose objects might be if
we didn't use the bulk checking, so:
$ cd linux.git
$ git rev-list --no-merges --count HEAD
1141436
$ time git log --oneline --no-merges --name-only | wc -l
3781628
$ python -c 'print(3781628/1141436)'
3.3130442705504293

So, if repositories are similar to linux, that's a bit over 3 files
modified per commit.  It's a bit harder to count trees, but let's take
a wild guess and say that each file is 7 directories (because I'm too
lazy to do real research and 7 happens to give me nice round numbers
later), so that's up to 7*3 tree objects.  So 3 file objects + 21 tree
objects + 1 commit object = 25 loose objects.

So, your scheme certainly seems to reduce number of inodes, but does
it introduce a different scaling issue?  git-gc repacks when there are
>= 50 packs, or when there are >= 6700 loose objects, suggesting that
if we exceed those numbers, we might start seeing performance suffer.
We would hit 50 packs with a mere 25 commits being replayed, and
wouldn't expect to get to 6700 loose objects until we replayed about
268 commits (6700/25).  Does this mean we are risking worse
performance degradation with this scheme than with just using loose
objects until the end of the operation?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 1/3] merge-ort.c: finalize ODB transactions after each step
  2023-11-07 18:22           ` [RFC PATCH 1/3] merge-ort.c: finalize ODB transactions after each step Taylor Blau
@ 2023-11-11  3:45             ` Elijah Newren
  0 siblings, 0 replies; 63+ messages in thread
From: Elijah Newren @ 2023-11-11  3:45 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Jeff King, Patrick Steinhardt, Junio C Hamano, Johannes Schindelin

On Tue, Nov 7, 2023 at 10:22 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> In a previous commit, the ORT merge backend learned how to use the
> bulk-checkin mechanism to emit a single pack containing any new objects
> created during the merge. This functionality was implemented by setting
> up a new ODB transaction, and finalizing it at the end of the merge via
> `process_entries()`.
>
> In a future commit, we will extend this functionality to the new `git
> replay` command, which needs to see objects from previous steps in order
> to replay each commit.
>
> As a step towards implementing this, teach the ORT backend to flush the
> ODB transaction at the end of each step in `process_entries()`, and then
> finalize the result with `end_odb_transaction()` when calling
> `merge_finalize()`.

process_entries() contains a for loop inside it, so "end of each step
in `process_entries()`" sounds like you are flushing after entry, i.e.
several times per commit being replayed.

Perhaps "at the end of `process_entries()` (thus creating one pack per
commit)" ?


(Of course, the fact that we need this change earlier, in the other
series, kinda makes this point moot.  Just thought I'd mention it
anyway in case it comes back up in your restructuring.)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [RFC PATCH 0/3] replay: implement support for writing new objects to a pack
  2023-11-07 18:22         ` [RFC PATCH 0/3] replay: implement support for writing new objects to a pack Taylor Blau
                             ` (3 preceding siblings ...)
  2023-11-11  3:42           ` [RFC PATCH 0/3] replay: implement support for writing new objects to a pack Elijah Newren
@ 2023-11-11  4:04           ` Elijah Newren
  4 siblings, 0 replies; 63+ messages in thread
From: Elijah Newren @ 2023-11-11  4:04 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Jeff King, Patrick Steinhardt, Junio C Hamano, Johannes Schindelin

On Tue, Nov 7, 2023 at 10:22 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> (Based on a combination of Christian's cc/git-replay and my
> tb/merge-tree-write-pack branches).
>
> This RFC demonstrates extending the new `--write-pack` option that
> `merge-tree` recently learned to the `replay` builtin as well.
>
> The approach is as follows:
>
>   - write a pack out after each step in the replay operation, so that
>     subsequent steps may see any new object(s) created during previous
>     steps
>
>   - combine those packs into one before migrating them back into the
>     main object store
>
> This is accomplished with a combination of the bulk-checkin and
> tmp-objdir APIs, with some minor modifications made to when we flush out
> and finalize bulk-checkin transactions.

Just thought I'd note that I finished reading all of this series as
well as the five-commit series it's based on.  Just wanted to note
that any commits I didn't comment on from either series looked good to
me.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
  2023-11-10 23:51     ` Elijah Newren
  2023-11-11  0:27       ` Junio C Hamano
  2023-11-11  1:24       ` Taylor Blau
@ 2023-11-13 22:02       ` Jeff King
  2023-11-13 22:34         ` Taylor Blau
  2 siblings, 1 reply; 63+ messages in thread
From: Jeff King @ 2023-11-13 22:02 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Taylor Blau, git, Eric W. Biederman, Junio C Hamano, Patrick Steinhardt

On Fri, Nov 10, 2023 at 03:51:18PM -0800, Elijah Newren wrote:

> This is unsafe; the object may need to be read later within the same
> merge. [...]
>
> I think (untested) that you could fix this by creating two packs
> instead of just one.  In particular, add a call to
> flush_odb_transcation() after the "redo_after_renames" block in
> merge_ort_nonrecursive_internal().

It might not be too hard to just let in-process callers access the
objects we've written. A quick and dirty patch is below, which works
with the test you suggested (the test still fails because it finds a
conflict, but it gets past the "woah, I can't find that sha1" part).

I don't know if that is sufficient, though. Would other spawned
processes (hooks, external merge drivers, and so on) need to be able to
see these objects, too?

The patch teaches the packfile code about the special bulk checkin pack.
It might be cleaner to invert it, and just have the bulk checkin code
register a magic packed_git (it would need to fake the .idx somehow).

diff --git a/bulk-checkin.c b/bulk-checkin.c
index bd6151ba3c..566fc36e68 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -30,6 +30,8 @@ static struct bulk_checkin_packfile {
 	struct pack_idx_entry **written;
 	uint32_t alloc_written;
 	uint32_t nr_written;
+
+	struct packed_git *fake_packed_git;
 } bulk_checkin_packfile;
 
 static void finish_tmp_packfile(struct strbuf *basename,
@@ -82,6 +84,7 @@ static void flush_bulk_checkin_packfile(struct bulk_checkin_packfile *state)
 
 clear_exit:
 	free(state->written);
+	free(state->fake_packed_git);
 	memset(state, 0, sizeof(*state));
 
 	strbuf_release(&packname);
@@ -530,3 +533,37 @@ void end_odb_transaction(void)
 
 	flush_odb_transaction();
 }
+
+static struct packed_git *fake_packed_git(struct bulk_checkin_packfile *state)
+{
+	struct packed_git *p = state->fake_packed_git;
+	if (!p) {
+		FLEX_ALLOC_STR(p, pack_name, "/fake/in-progress.pack");
+		state->fake_packed_git = p;
+		p->pack_fd = state->f->fd;
+		p->do_not_close = 1;
+	}
+
+	hashflush(state->f);
+	p->pack_size = state->f->total; /* maybe add 20 to simulate trailer? */
+
+	return p;
+}
+
+int bulk_checkin_pack_entry(const struct object_id *oid, struct pack_entry *e)
+{
+	size_t i;
+	/*
+	 * this really ought to have some more efficient data structure for
+	 * lookup; ditto for the existing already_written()
+	 */
+	for (i = 0; i < bulk_checkin_packfile.nr_written; i++) {
+		struct pack_idx_entry *p = bulk_checkin_packfile.written[i];
+		if (oideq(&p->oid, oid)) {
+			e->p = fake_packed_git(&bulk_checkin_packfile);
+			e->offset = p->offset;
+			return 0;
+		}
+	}
+	return -1;
+}
diff --git a/bulk-checkin.h b/bulk-checkin.h
index 89786b3954..153fe87c06 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -44,4 +44,7 @@ void flush_odb_transaction(void);
  */
 void end_odb_transaction(void);
 
+struct pack_entry;
+int bulk_checkin_pack_entry(const struct object_id *oid, struct pack_entry *e);
+
 #endif
diff --git a/packfile.c b/packfile.c
index 9cc0a2e37a..05194b1d9b 100644
--- a/packfile.c
+++ b/packfile.c
@@ -23,6 +23,7 @@
 #include "commit-graph.h"
 #include "pack-revindex.h"
 #include "promisor-remote.h"
+#include "bulk-checkin.h"
 
 char *odb_pack_name(struct strbuf *buf,
 		    const unsigned char *hash,
@@ -2045,6 +2046,9 @@ int find_pack_entry(struct repository *r, const struct object_id *oid, struct pa
 	struct list_head *pos;
 	struct multi_pack_index *m;
 
+	if (!bulk_checkin_pack_entry(oid, e))
+		return 1;
+
 	prepare_packed_git(r);
 	if (!r->objects->packed_git && !r->objects->multi_pack_index)
 		return 0;

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
  2023-11-11  1:24       ` Taylor Blau
@ 2023-11-13 22:05         ` Jeff King
  2023-11-14  1:40           ` Junio C Hamano
  2023-11-14  3:08           ` Elijah Newren
  0 siblings, 2 replies; 63+ messages in thread
From: Jeff King @ 2023-11-13 22:05 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Elijah Newren, git, Eric W. Biederman, Junio C Hamano,
	Patrick Steinhardt

On Fri, Nov 10, 2023 at 08:24:44PM -0500, Taylor Blau wrote:

> > This does mean that for a recursive merge, that you'll get up to 2*N
> > packfiles, where N is the depth of the recursive merge.
> 
> We definitely want to avoid that ;-). I think there are a couple of
> potential directions forward here, but the most promising one I think is
> due to Johannes who suggests that we write loose objects into a
> temporary directory with a replace_tmp_objdir() call, and then repack
> that side directory before migrating a single pack back into the main
> object store.

I posted an alternative in response to Elijah; the general idea being to
allow the usual object-lookup code to access the in-progress pack. That
would keep us limited to a single pack.

It _might_ be a terrible idea. E.g., if you write a non-bulk object that
references a bulk one, then that non-bulk one is broken from the
perspective of other processes (until the bulk checkin is flushed). But
I think we'd always be writing to one or the other here, never
interleaving?

-Peff

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
  2023-11-13 22:02       ` Jeff King
@ 2023-11-13 22:34         ` Taylor Blau
  2023-11-14  2:50           ` Elijah Newren
  2023-11-14 22:04           ` Jeff King
  0 siblings, 2 replies; 63+ messages in thread
From: Taylor Blau @ 2023-11-13 22:34 UTC (permalink / raw)
  To: Jeff King
  Cc: Elijah Newren, git, Eric W. Biederman, Junio C Hamano,
	Patrick Steinhardt

On Mon, Nov 13, 2023 at 05:02:54PM -0500, Jeff King wrote:
> On Fri, Nov 10, 2023 at 03:51:18PM -0800, Elijah Newren wrote:
>
> > This is unsafe; the object may need to be read later within the same
> > merge. [...]
> >
> > I think (untested) that you could fix this by creating two packs
> > instead of just one.  In particular, add a call to
> > flush_odb_transcation() after the "redo_after_renames" block in
> > merge_ort_nonrecursive_internal().
>
> It might not be too hard to just let in-process callers access the
> objects we've written. A quick and dirty patch is below, which works
> with the test you suggested (the test still fails because it finds a
> conflict, but it gets past the "woah, I can't find that sha1" part).

That's a very slick idea, and I think that this series has some legs to
stand on now as a result.

There are a few of interesting conclusions we can draw here:

  1. This solves the recursive merge problem that Elijah mentioned
     earlier where we could generate up to 2^N packs (where N is the
     maximum depth of the recursive merge).

  2. This also solves the case where the merge-ort code needs to read an
     object that it wrote earlier on in the same process without having
     to flush out intermediate packs. So in the best case we need only a
     single pack (instead of two).

  3. This also solves the 'replay' issue, *and* allows us to avoid the
     tmp-objdir thing there entirely, since we can simulate object reads
     in the bulk-checkin pack.

So I think that this is a direction worth pursuing.

In terms of making those lookups faster, I think that what you'd want is
an oidmap. The overhead is slightly unfortunate, but I think that any
other solution which requires keeping the written array in sorted order
would in practice be more expensive as you have to constantly reallocate
and copy portions of the array around to maintain its invariant.

> I don't know if that is sufficient, though. Would other spawned
> processes (hooks, external merge drivers, and so on) need to be able to
> see these objects, too?

Interesting point. In theory those processes could ask about newly
created objects, and if they were spawned before the bulk-checkin pack
was flushed, those lookups would fail.

But that requires that, e.g. for hooks, that we know a-priori the object
ID of some newly-written objects. If you wanted to make those lookups
succeed, I think there are a couple of options:

  - teach child-processes about the bulk-checkin pack, and let them
    perform the same fake lookup

  - flush (but do not close) the bulk-checkin transaction

In any event, I think that this is a sufficiently rare and niche case
that we'd be OK to declare that you should not expect the above
scenarios to work when using `--write-pack`. If someone does ask for
that feature in the future, we could implement it relatively painlessly
using one of the above options.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
  2023-11-13 22:05         ` Jeff King
@ 2023-11-14  1:40           ` Junio C Hamano
  2023-11-14  2:54             ` Elijah Newren
  2023-11-14 21:55             ` Jeff King
  2023-11-14  3:08           ` Elijah Newren
  1 sibling, 2 replies; 63+ messages in thread
From: Junio C Hamano @ 2023-11-14  1:40 UTC (permalink / raw)
  To: Jeff King
  Cc: Taylor Blau, Elijah Newren, git, Eric W. Biederman, Patrick Steinhardt

Jeff King <peff@peff.net> writes:

> I posted an alternative in response to Elijah; the general idea being to
> allow the usual object-lookup code to access the in-progress pack. That
> would keep us limited to a single pack.

If such a mechanism is done in a generic way, would we be able to
simplify fast-import a lot, I wonder?  IIRC, it had quite a lot of
code to remember what it has written to its output to work around
the exact issue your alternative tries to solve.  In fact, maybe we
could make fast-import a thin wrapper around the bulk checkin
infrastructure?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
  2023-11-13 22:34         ` Taylor Blau
@ 2023-11-14  2:50           ` Elijah Newren
  2023-11-14 21:53             ` Jeff King
  2023-11-14 22:04           ` Jeff King
  1 sibling, 1 reply; 63+ messages in thread
From: Elijah Newren @ 2023-11-14  2:50 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Jeff King, git, Eric W. Biederman, Junio C Hamano, Patrick Steinhardt

On Mon, Nov 13, 2023 at 2:34 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Mon, Nov 13, 2023 at 05:02:54PM -0500, Jeff King wrote:
> > On Fri, Nov 10, 2023 at 03:51:18PM -0800, Elijah Newren wrote:
> >
> > > This is unsafe; the object may need to be read later within the same
> > > merge. [...]
> > >
> > > I think (untested) that you could fix this by creating two packs
> > > instead of just one.  In particular, add a call to
> > > flush_odb_transcation() after the "redo_after_renames" block in
> > > merge_ort_nonrecursive_internal().
> >
> > It might not be too hard to just let in-process callers access the
> > objects we've written. A quick and dirty patch is below, which works
> > with the test you suggested (the test still fails because it finds a
> > conflict, but it gets past the "woah, I can't find that sha1" part).
>
> That's a very slick idea, and I think that this series has some legs to
> stand on now as a result.
>
> There are a few of interesting conclusions we can draw here:
>
>   1. This solves the recursive merge problem that Elijah mentioned
>      earlier where we could generate up to 2^N packs (where N is the
>      maximum depth of the recursive merge).
>
>   2. This also solves the case where the merge-ort code needs to read an
>      object that it wrote earlier on in the same process without having
>      to flush out intermediate packs. So in the best case we need only a
>      single pack (instead of two).
>
>   3. This also solves the 'replay' issue, *and* allows us to avoid the
>      tmp-objdir thing there entirely, since we can simulate object reads
>      in the bulk-checkin pack.
>
> So I think that this is a direction worth pursuing.

Agreed; this looks great.  It's basically bringing fast-import-like
functionality -- writing objects to a single new packfile while making
previous objects accessible to subsequent ones -- to additional
callers.

> In terms of making those lookups faster, I think that what you'd want is
> an oidmap. The overhead is slightly unfortunate, but I think that any
> other solution which requires keeping the written array in sorted order
> would in practice be more expensive as you have to constantly reallocate
> and copy portions of the array around to maintain its invariant.

When comparing the overhead of an oidmap to a bunch of inodes,
however, it seems relatively cheap.  :-)

> > I don't know if that is sufficient, though. Would other spawned
> > processes (hooks, external merge drivers, and so on) need to be able to
> > see these objects, too?
>
> Interesting point. In theory those processes could ask about newly
> created objects, and if they were spawned before the bulk-checkin pack
> was flushed, those lookups would fail.

One of the big design differences that I was pushing really hard with
git-replay was performance and things that came with it -- no
worktree, no per-commit hooks (which are nearly ruled out by no
worktree, but it's still worth calling out separately), etc.  A
post-operation hook could be fine, but would also not get to assume a
worktree.

merge-tree is the same as far as hooks; I'd rather just not have them,
but if we did, they'd be a post-operation hook.

In both cases, that makes hooks not much of a sticking point.

External merge drivers, however...

> But that requires that, e.g. for hooks, that we know a-priori the object
> ID of some newly-written objects. If you wanted to make those lookups
> succeed, I think there are a couple of options:
>
>   - teach child-processes about the bulk-checkin pack, and let them
>     perform the same fake lookup
>
>   - flush (but do not close) the bulk-checkin transaction
>
> In any event, I think that this is a sufficiently rare and niche case
> that we'd be OK to declare that you should not expect the above
> scenarios to work when using `--write-pack`. If someone does ask for
> that feature in the future, we could implement it relatively painlessly
> using one of the above options.

Seems reasonable to me.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
  2023-11-14  1:40           ` Junio C Hamano
@ 2023-11-14  2:54             ` Elijah Newren
  2023-11-14 21:55             ` Jeff King
  1 sibling, 0 replies; 63+ messages in thread
From: Elijah Newren @ 2023-11-14  2:54 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Taylor Blau, git, Eric W. Biederman, Patrick Steinhardt

On Mon, Nov 13, 2023 at 5:41 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Jeff King <peff@peff.net> writes:
>
> > I posted an alternative in response to Elijah; the general idea being to
> > allow the usual object-lookup code to access the in-progress pack. That
> > would keep us limited to a single pack.
>
> If such a mechanism is done in a generic way, would we be able to
> simplify fast-import a lot, I wonder?  IIRC, it had quite a lot of
> code to remember what it has written to its output to work around
> the exact issue your alternative tries to solve.  In fact, maybe we
> could make fast-import a thin wrapper around the bulk checkin
> infrastructure?

fast-import also attempts to delta objects against previous ones as it
writes them to the pack.  That's one thing still lacking from this
solution, but aside from that, it also sounds to me like the bulk
checkin infrastructure is getting closer to becoming a replacement for
much of the fast-import code.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
  2023-11-13 22:05         ` Jeff King
  2023-11-14  1:40           ` Junio C Hamano
@ 2023-11-14  3:08           ` Elijah Newren
  1 sibling, 0 replies; 63+ messages in thread
From: Elijah Newren @ 2023-11-14  3:08 UTC (permalink / raw)
  To: Jeff King
  Cc: Taylor Blau, git, Eric W. Biederman, Junio C Hamano, Patrick Steinhardt

On Mon, Nov 13, 2023 at 2:05 PM Jeff King <peff@peff.net> wrote:
>
> On Fri, Nov 10, 2023 at 08:24:44PM -0500, Taylor Blau wrote:
>
> > > This does mean that for a recursive merge, that you'll get up to 2*N
> > > packfiles, where N is the depth of the recursive merge.
> >
> > We definitely want to avoid that ;-). I think there are a couple of
> > potential directions forward here, but the most promising one I think is
> > due to Johannes who suggests that we write loose objects into a
> > temporary directory with a replace_tmp_objdir() call, and then repack
> > that side directory before migrating a single pack back into the main
> > object store.
>
> I posted an alternative in response to Elijah; the general idea being to
> allow the usual object-lookup code to access the in-progress pack. That
> would keep us limited to a single pack.
>
> It _might_ be a terrible idea. E.g., if you write a non-bulk object that
> references a bulk one, then that non-bulk one is broken from the
> perspective of other processes (until the bulk checkin is flushed). But
> I think we'd always be writing to one or the other here, never
> interleaving?

I think it sounds like a really cool idea, personally.  I also don't
see why any current uses would result in interleaving.

fast-import's created pack has the same kind of restriction...and has
even grown some extensions to work around the need for early access.
In particular, it has a `checkpoint` feature for flushing the current
pack and starting a new one, and also gained the `cat-blob` command
for when folks really wanted to access an object without writing out
the whole packfile.  So, _if_ the need for early access arises, we
have some prior art to look to for ways to provide it.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
  2023-11-14  2:50           ` Elijah Newren
@ 2023-11-14 21:53             ` Jeff King
  0 siblings, 0 replies; 63+ messages in thread
From: Jeff King @ 2023-11-14 21:53 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Taylor Blau, git, Eric W. Biederman, Junio C Hamano, Patrick Steinhardt

On Mon, Nov 13, 2023 at 06:50:08PM -0800, Elijah Newren wrote:

> merge-tree is the same as far as hooks; I'd rather just not have them,
> but if we did, they'd be a post-operation hook.
> 
> In both cases, that makes hooks not much of a sticking point.
> 
> External merge drivers, however...

I suspect external merge drivers are OK, in the sense that they should
not be looking up arbitrary objects anyway. We should hand the driver
what it needs via tempfiles, etc.

That said, I'm also OK with the notion that "--write-pack" is optional,
and if some features don't work with it, that's OK. It's nice if we can
flag that explicitly (rather than having things break in weird and
subtle ways), but I think we might not even know about an implicit
assumption until somebody runs across it in practice.

-Peff

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
  2023-11-14  1:40           ` Junio C Hamano
  2023-11-14  2:54             ` Elijah Newren
@ 2023-11-14 21:55             ` Jeff King
  1 sibling, 0 replies; 63+ messages in thread
From: Jeff King @ 2023-11-14 21:55 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, Elijah Newren, git, Eric W. Biederman, Patrick Steinhardt

On Tue, Nov 14, 2023 at 10:40:58AM +0900, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I posted an alternative in response to Elijah; the general idea being to
> > allow the usual object-lookup code to access the in-progress pack. That
> > would keep us limited to a single pack.
> 
> If such a mechanism is done in a generic way, would we be able to
> simplify fast-import a lot, I wonder?  IIRC, it had quite a lot of
> code to remember what it has written to its output to work around
> the exact issue your alternative tries to solve.  In fact, maybe we
> could make fast-import a thin wrapper around the bulk checkin
> infrastructure?

I suspect that the implementation could be shared with fast-import. I'm
not sure it would save all that much code, though. There's a lot going
on in fast-import besides keeping track of which objects we wrote into a
pack. ;)

The bigger issue, though, is that fast-import does generate some deltas
and the bulk checkin code does not. And I'm not sure how the bulk
checkin interface would expose that API (you need the caller to say
"...and I suspect this object might be a good delta base").

-Peff

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack`
  2023-11-13 22:34         ` Taylor Blau
  2023-11-14  2:50           ` Elijah Newren
@ 2023-11-14 22:04           ` Jeff King
  1 sibling, 0 replies; 63+ messages in thread
From: Jeff King @ 2023-11-14 22:04 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Elijah Newren, git, Eric W. Biederman, Junio C Hamano,
	Patrick Steinhardt

On Mon, Nov 13, 2023 at 05:34:42PM -0500, Taylor Blau wrote:

> > It might not be too hard to just let in-process callers access the
> > objects we've written. A quick and dirty patch is below, which works
> > with the test you suggested (the test still fails because it finds a
> > conflict, but it gets past the "woah, I can't find that sha1" part).
> 
> That's a very slick idea, and I think that this series has some legs to
> stand on now as a result.

I'm glad people seem to like it. ;)

I was mostly throwing it out there as an illustration, and I don't plan
on polishing it up further. But in case somebody else wants to work on
it, here are random extra thoughts on the topic:

  - rather than teach packfile.c about bulk checkin, I think it might be
    cleaner to let packed_git structs have a function pointer for
    accessing the index (and if NULL, naturally we'd open the .idx file
    in the usual way). And then bulk checkin could just register the
    "fake" packed_git

  - the flip side, though, is: would it be weird for other parts of the
    code to ever see the fake bulk packed_git in the list? It doesn't
    represent a real packfile the way the other ones do. So maybe it is
    better to keep it separate

  - I suspect this scheme may violate some bounds-checking of the
    packfiles. For example, we haven't written the hashfile trailer yet in
    the still-open packfile. But I think use_pack() assumes we always
    have an extra the_hash_algo->rawsz bytes of padding at the end. I'm
    not sure if that would ever cause us to accidentally read past the
    end of the file.

  - as you mentioned, an oidmap would be an obvious replacement for the
    existing unsorted list

  - the existing already_written() function faces the same problem. I
    don't think anybody cared because we'd usually only bulk checkin a
    few huge objects. But with --write-pack, we might write a lot of
    such objects, and the linear scan in already_written() makes it
    accidentally quadratic.

  - speaking of already_written(): it calls repo_has_object_file() to
    see if we can elide the write. It probably should be using
    freshen_*_object() in the usual way. Again, this is an existing bug
    but I suspect nobody noticed because the bulk checkin code path
    hardly ever kicks in.

-Peff

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2023-11-14 22:04 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-19 17:28 [PATCH v4 0/7] merge-ort: implement support for packing objects together Taylor Blau
2023-10-19 17:28 ` [PATCH v4 1/7] bulk-checkin: extract abstract `bulk_checkin_source` Taylor Blau
2023-10-20  7:35   ` Jeff King
2023-10-20 16:55     ` Junio C Hamano
2023-10-19 17:28 ` [PATCH v4 2/7] bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types Taylor Blau
2023-10-19 17:28 ` [PATCH v4 3/7] bulk-checkin: refactor deflate routine to accept a `bulk_checkin_source` Taylor Blau
2023-10-19 17:28 ` [PATCH v4 4/7] bulk-checkin: implement `SOURCE_INCORE` mode for `bulk_checkin_source` Taylor Blau
2023-10-23  9:19   ` Patrick Steinhardt
2023-10-23 18:58     ` Jeff King
2023-10-24  6:34       ` Patrick Steinhardt
2023-10-24 17:08         ` Junio C Hamano
2023-10-19 17:28 ` [PATCH v4 5/7] bulk-checkin: introduce `index_blob_bulk_checkin_incore()` Taylor Blau
2023-10-19 17:28 ` [PATCH v4 6/7] bulk-checkin: introduce `index_tree_bulk_checkin_incore()` Taylor Blau
2023-10-19 17:29 ` [PATCH v4 7/7] builtin/merge-tree.c: implement support for `--write-pack` Taylor Blau
2023-10-19 21:47 ` [PATCH v4 0/7] merge-ort: implement support for packing objects together Junio C Hamano
2023-10-20  7:29 ` Jeff King
2023-10-20 16:53   ` Junio C Hamano
2023-10-23  9:19 ` Patrick Steinhardt
2023-10-23 22:44 ` [PATCH v5 0/5] " Taylor Blau
2023-10-23 22:44   ` [PATCH v5 1/5] bulk-checkin: extract abstract `bulk_checkin_source` Taylor Blau
2023-10-25  7:37     ` Jeff King
2023-10-25 15:39       ` Taylor Blau
2023-10-27 23:12       ` Junio C Hamano
2023-10-23 22:44   ` [PATCH v5 2/5] bulk-checkin: generify `stream_blob_to_pack()` for arbitrary types Taylor Blau
2023-10-23 22:45   ` [PATCH v5 3/5] bulk-checkin: introduce `index_blob_bulk_checkin_incore()` Taylor Blau
2023-10-25  7:58     ` Patrick Steinhardt
2023-10-25 15:44       ` Taylor Blau
2023-10-25 17:21         ` Eric Sunshine
2023-10-26  8:16           ` Patrick Steinhardt
2023-11-11  0:17           ` Elijah Newren
2023-10-23 22:45   ` [PATCH v5 4/5] bulk-checkin: introduce `index_tree_bulk_checkin_incore()` Taylor Blau
2023-10-23 22:45   ` [PATCH v5 5/5] builtin/merge-tree.c: implement support for `--write-pack` Taylor Blau
2023-10-25  7:58     ` Patrick Steinhardt
2023-10-25 15:46       ` Taylor Blau
2023-11-10 23:51     ` Elijah Newren
2023-11-11  0:27       ` Junio C Hamano
2023-11-11  1:34         ` Taylor Blau
2023-11-11  1:24       ` Taylor Blau
2023-11-13 22:05         ` Jeff King
2023-11-14  1:40           ` Junio C Hamano
2023-11-14  2:54             ` Elijah Newren
2023-11-14 21:55             ` Jeff King
2023-11-14  3:08           ` Elijah Newren
2023-11-13 22:02       ` Jeff King
2023-11-13 22:34         ` Taylor Blau
2023-11-14  2:50           ` Elijah Newren
2023-11-14 21:53             ` Jeff King
2023-11-14 22:04           ` Jeff King
2023-10-23 23:31   ` [PATCH v5 0/5] merge-ort: implement support for packing objects together Junio C Hamano
2023-11-06 15:46     ` Johannes Schindelin
2023-11-06 23:19       ` Junio C Hamano
2023-11-07  3:42       ` Jeff King
2023-11-07 15:58       ` Taylor Blau
2023-11-07 18:22         ` [RFC PATCH 0/3] replay: implement support for writing new objects to a pack Taylor Blau
2023-11-07 18:22           ` [RFC PATCH 1/3] merge-ort.c: finalize ODB transactions after each step Taylor Blau
2023-11-11  3:45             ` Elijah Newren
2023-11-07 18:22           ` [RFC PATCH 2/3] tmp-objdir: introduce `tmp_objdir_repack()` Taylor Blau
2023-11-08  7:05             ` Patrick Steinhardt
2023-11-09 19:26               ` Taylor Blau
2023-11-07 18:23           ` [RFC PATCH 3/3] builtin/replay.c: introduce `--write-pack` Taylor Blau
2023-11-11  3:42           ` [RFC PATCH 0/3] replay: implement support for writing new objects to a pack Elijah Newren
2023-11-11  4:04           ` Elijah Newren
2023-10-25  7:58   ` [PATCH v5 0/5] merge-ort: implement support for packing objects together Patrick Steinhardt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.