All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/7] Add a new --remerge-diff capability to show & log
@ 2021-08-31  2:26 Elijah Newren via GitGitGadget
  2021-08-31  2:26 ` [PATCH 1/7] merge-ort: mark a few more conflict messages as omittable Elijah Newren via GitGitGadget
                   ` (9 more replies)
  0 siblings, 10 replies; 64+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-08-31  2:26 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Jonathan Nieder, Sergey Organov, Elijah Newren

Here are some patches to add a --remerge-diff capability to show & log,
which works by comparing merge commits to an automatic remerge (note that
the automatic remerge tree can contain files with conflict markers).

Here are some example commits you can try this out on (with git show
--remerge-diff $COMMIT):

 * git.git conflicted merge: 07601b5b36
 * git.git non-conflicted change: bf04590ecd
 * linux.git conflicted merge: eab3540562fb
 * linux.git non-conflicted change: 223cea6a4f05

Many more can be found by just running git log --merges --remerge-diff in
your repository of choice and searching for diffs (most merges tend to be
clean and unmodified and thus produce no diff but a search of '^diff' in the
log output tends to find the examples nicely).

Some basic high level details about this new option:

 * This option is most naturally compared to --cc, though the output seems
   to be much more understandable to most users than --cc output.
 * Since merges are often clean and unmodified, this new option results in
   an empty diff for most merges.
 * This new option shows things like the removal of conflict markers, which
   hunks users picked from the various conflicted sides to keep or remove,
   and shows changes made outside of conflict markers (which might reflect
   changes needed to resolve semantic conflicts or cleanups of e.g.
   compilation warnings or other additional changes an integrator felt
   belonged in the merged result).
 * This new option does not (currently) work for octopus merges, since
   merge-ort is specific to two-parent merges[1].
 * This option will not work on a read-only or full filesystem[2].
 * We discussed this capability at Git Merge 2020, and one of the
   suggestions was doing a periodic git gc --auto during the operation (due
   to potential new blobs and trees created during the operation). I found a
   way to avoid that; see [2].
 * This option is faster than you'd probably expect; it handles 33.5 merge
   commits per second in linux.git on my computer; see below.

In regards to the performance point above, the timing for running the
following command:

time git log --min-parents=2 --max-parents=2 $DIFF_FLAG | wc -l


in linux.git (with v5.4 checked out, since my copy of linux is very out of
date) is as follows:

DIFF_FLAG=--cc:            71m 31.536s
DIFF_FLAG=--remerge-diff:  31m  3.170s


Note that there are 62476 merges in this history. Also, output size is:

DIFF_FLAG=--cc:            2169111 lines
DIFF_FLAG=--remerge-diff:  2458020 lines


So roughly the same amount of output as --cc, as you'd expect.

As a side note: git log --remerge-diff, when run in various repositories and
allowed to run all the way back to the beginning(s) of history, is a nice
stress test of sorts for merge-ort. Especially when users run it for you on
their repositories they are working on, whether intentionally or via a bug
in a tool triggering that command to be run unexpectedly. Long story short,
such a bug in an internal tool existed last December and this command was
run on an internal repository and found a platform-specific bug in merge-ort
on some really old merge commit from that repo. I fixed that bug (a
STABLE_QSORT thing) while upstreaming all the merge-ort patches in the mean
time, but it was nice getting extra testing. Having more folks run this on
their repositories might be useful extra testing of the new merge strategy.

Also, I previously mentioned --remerge-diff-only (a flag to show how
cherry-picks or reverts differ from an automatic cherry-pick or revert, in
addition to showing how merges differ from an automatic merge). This series
does not include the patches to introduce that option; I'll submit them
later.

Two other things that might be interesting but are not included and which I
haven't investigated:

 * some mechanism for passing extra merge options through (e.g.
   -Xignore-space-change)
 * a capability to compare the automatic merge to a second automatic merge
   done with different merge options. (Not sure if this would be of interest
   to end users, but might be interesting while developing new a
   --strategy-option, or maybe checking how changing some default in the
   merge algorithm would affect historical merges in various repositories).

[1] I have nebulous ideas of how an Octopus-centric ORT strategy could be
written -- basically, just repeatedly invoking ort and trying to make sure
nested conflicts can be differentiated. For now, though, a simple warning is
printed that octopus merges are not handled and no diff will be shown. [2]
New blobs/trees can be written by the three-way merging step. These are
written to a temporary area (via tmp-objdir.c) under the git object store
that is cleaned up at the end of the operation, with the new loose objects
from the remerge being cleaned up after each individual merge.

Elijah Newren (7):
  merge-ort: mark a few more conflict messages as omittable
  merge-ort: add ability to record conflict messages in a file
  ll-merge: add API for capturing warnings in a strbuf instead of stderr
  merge-ort: capture and print ll-merge warnings in our preferred
    fashion
  tmp-objdir: new API for creating and removing primary object dirs
  show, log: provide a --remerge-diff capability
  doc/diff-options: explain the new --remerge-diff option

 Documentation/diff-options.txt |  8 +++
 builtin/log.c                  | 23 ++++++++
 diff-merges.c                  | 12 +++++
 ll-merge.c                     | 51 +++++++++++++-----
 ll-merge.h                     |  9 ++++
 log-tree.c                     | 69 ++++++++++++++++++++++++
 merge-ort.c                    | 96 +++++++++++++++++++++++++++++++---
 merge-recursive.c              |  3 ++
 merge-recursive.h              |  1 +
 revision.h                     |  6 ++-
 t/t6404-recursive-merge.sh     | 10 +++-
 t/t6406-merge-attr.sh          | 10 +++-
 tmp-objdir.c                   | 29 ++++++++++
 tmp-objdir.h                   | 16 ++++++
 14 files changed, 319 insertions(+), 24 deletions(-)


base-commit: c4203212e360b25a1c69467b5a8437d45a373cac
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1080%2Fnewren%2Fremerge-diff-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1080/newren/remerge-diff-v1
Pull-Request: https://github.com/git/git/pull/1080
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 1/7] merge-ort: mark a few more conflict messages as omittable
  2021-08-31  2:26 [PATCH 0/7] Add a new --remerge-diff capability to show & log Elijah Newren via GitGitGadget
@ 2021-08-31  2:26 ` Elijah Newren via GitGitGadget
  2021-08-31 21:06   ` Junio C Hamano
  2021-08-31  2:26 ` [PATCH 2/7] merge-ort: add ability to record conflict messages in a file Elijah Newren via GitGitGadget
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-08-31  2:26 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Jonathan Nieder, Sergey Organov, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

path_msg() has the ability to mark messages as omittable when recording
conflict messages in an in-tree file.  This conflict message file will
then be shown by diffing the merge-created tree to the actual merge
commit that is created.  While all the messages touched in this commit
are very useful when trying to create a merge initially, early use with
the --remerge-diff feature (the only user of this omittable conflict
message capability), suggests that the particular messages marked in
this commit are just noise when trying to see what changes users made to
create a merge commit.  Mark them as omittable.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index 515dc39b7f6..4dbf0a477af 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2409,7 +2409,7 @@ static void apply_directory_rename_modifications(struct merge_options *opt,
 		 */
 		ci->path_conflict = 1;
 		if (pair->status == 'A')
-			path_msg(opt, new_path, 0,
+			path_msg(opt, new_path, 1,
 				 _("CONFLICT (file location): %s added in %s "
 				   "inside a directory that was renamed in %s, "
 				   "suggesting it should perhaps be moved to "
@@ -2417,7 +2417,7 @@ static void apply_directory_rename_modifications(struct merge_options *opt,
 				 old_path, branch_with_new_path,
 				 branch_with_dir_rename, new_path);
 		else
-			path_msg(opt, new_path, 0,
+			path_msg(opt, new_path, 1,
 				 _("CONFLICT (file location): %s renamed to %s "
 				   "in %s, inside a directory that was renamed "
 				   "in %s, suggesting it should perhaps be "
@@ -3814,7 +3814,7 @@ static void process_entry(struct merge_options *opt,
 				reason = _("add/add");
 			if (S_ISGITLINK(merged_file.mode))
 				reason = _("submodule");
-			path_msg(opt, path, 0,
+			path_msg(opt, path, 1,
 				 _("CONFLICT (%s): Merge conflict in %s"),
 				 reason, path);
 		}
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 2/7] merge-ort: add ability to record conflict messages in a file
  2021-08-31  2:26 [PATCH 0/7] Add a new --remerge-diff capability to show & log Elijah Newren via GitGitGadget
  2021-08-31  2:26 ` [PATCH 1/7] merge-ort: mark a few more conflict messages as omittable Elijah Newren via GitGitGadget
@ 2021-08-31  2:26 ` Elijah Newren via GitGitGadget
  2021-09-28 22:29   ` Jeff King
  2021-08-31  2:26 ` [PATCH 3/7] ll-merge: add API for capturing warnings in a strbuf instead of stderr Elijah Newren via GitGitGadget
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-08-31  2:26 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Jonathan Nieder, Sergey Organov, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

We want to add an ability for users to run
    git show --remerge-diff $MERGE_COMMIT
or even
    git log -p --remerge-diff ...
and have git show the differences between where the merge machinery
would stop and what is recorded in merge commits.  However, in such
cases, stdout is not an appropriate location to dump conflict messages.
Write those messages to a file in the merge result (not to the working
tree or index, but just to a blob recorded in the relevant tree object).

There are several considerations here:
  * We have to pick file(s) where we write these conflict messages too
  * We want to make it as clear as possible that it's not a real file
    but a set of messages about another file
  * We want conflict messages about a file to appear near the file in
    question in a diff, preferably immediately preceding the file in
    question
  * Related to the above, per-file conflict messages are preferred
    over lumping all conflict messages into one big file

To achive the above:
  * We put the conflict messages for $filename in
      $filename[0:-1] + " " + $filename[-1] + ".conflict_msg"
    or, in words, we insert a space before the final character of
    the filename and then also add ".conflict_msg" at the end.
  * We start the file with a "== Conflict notices for $filename =="
    banner

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c       | 79 ++++++++++++++++++++++++++++++++++++++++++++++-
 merge-recursive.c |  3 ++
 merge-recursive.h |  1 +
 3 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index 4dbf0a477af..a9e69d9cbb0 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -632,7 +632,11 @@ static void path_msg(struct merge_options *opt,
 		     const char *fmt, ...)
 {
 	va_list ap;
-	struct strbuf *sb = strmap_get(&opt->priv->output, path);
+	struct strbuf *sb;
+
+	if (opt->record_conflict_msgs_in_tree && omittable_hint)
+		return; /* Do not record mere hints in tree */
+	sb = strmap_get(&opt->priv->output, path);
 	if (!sb) {
 		sb = xmalloc(sizeof(*sb));
 		strbuf_init(sb, 0);
@@ -3531,6 +3535,74 @@ static void write_completed_directory(struct merge_options *opt,
 	info->last_directory_len = strlen(info->last_directory);
 }
 
+static void put_path_msgs_in_file(struct merge_options *opt,
+				  const char *path,
+				  struct merged_info *mi,
+				  struct directory_versions *dir_metadata)
+{
+	struct strbuf tmp = STRBUF_INIT, new_path_contents = STRBUF_INIT;
+	char *new_path;
+	int new_path_basic_len, unique_counter;
+	struct merged_info *new_mi;
+	char final;
+	struct strbuf *sb = strmap_get(&opt->priv->output, path);
+
+	assert(opt->record_conflict_msgs_in_tree);
+	if (!sb)
+		return;
+
+	/*
+	 * Determine a pathname for recording conflict messages.  We'd like it
+	 * to sort just before path, but have a name very similar to what path
+	 * has.
+	 */
+	strbuf_addstr(&tmp, path);
+	final = tmp.buf[tmp.len-1];
+	strbuf_setlen(&tmp, tmp.len-1);
+	strbuf_addf(&tmp, " %c.conflict_msg", final);
+
+	/*
+	 * In extremely unlikely event this filename is not unique, modify it
+	 * with ".<integer>" suffixes until it is.
+	 */
+	new_path_basic_len = tmp.len;
+	unique_counter = 0;
+	while (strmap_contains(&opt->priv->paths, tmp.buf)) {
+		strbuf_setlen(&tmp, new_path_basic_len);
+		strbuf_addf(&tmp, ".%d", ++unique_counter);
+	}
+
+	/* Now that we have a unique string, move it to our pool */
+	new_path = mem_pool_strdup(&opt->priv->pool, tmp.buf);
+	strbuf_release(&tmp);
+
+	/* Set up contents we want to place in the file. */
+	strbuf_addf(&new_path_contents, "== Conflict notices for %s ==\n",
+		    path);
+	strbuf_addbuf(&new_path_contents, sb);
+
+	/* Set up new_mi */
+	new_mi = mem_pool_calloc(&opt->priv->pool, 1, sizeof(*new_mi));
+	new_mi->result.mode = 0100644;
+	new_mi->is_null = 0;
+	new_mi->clean = 1;
+	new_mi->basename_offset = mi->basename_offset;
+	new_mi->directory_name = mi->directory_name;
+	if (write_object_file(new_path_contents.buf, new_path_contents.len,
+			      blob_type, &new_mi->result.oid))
+		die(_("Unable to add %s to database"), new_path);
+
+	/*
+	 * Save new_mi in opt->priv->path (so that something will deallocate
+	 * it later), and record the entry for it.
+	 */
+	strmap_put(&opt->priv->paths, new_path, new_mi);
+	record_entry_for_tree(dir_metadata, new_path, new_mi);
+
+	/* Cleanup */
+	strbuf_release(&new_path_contents);
+}
+
 /* Per entry merge function */
 static void process_entry(struct merge_options *opt,
 			  const char *path,
@@ -3991,6 +4063,8 @@ static void process_entries(struct merge_options *opt,
 			struct conflict_info *ci = (struct conflict_info *)mi;
 			process_entry(opt, path, ci, &dir_metadata);
 		}
+		if (!opt->priv->call_depth && opt->record_conflict_msgs_in_tree)
+			put_path_msgs_in_file(opt, path, mi, &dir_metadata);
 	}
 	trace2_region_leave("merge", "processing", opt->repo);
 
@@ -4226,6 +4300,9 @@ void merge_switch_to_result(struct merge_options *opt,
 		struct string_list olist = STRING_LIST_INIT_NODUP;
 		int i;
 
+		if (opt->record_conflict_msgs_in_tree)
+			BUG("Either display conflict messages or record them in tree, not both");
+
 		trace2_region_enter("merge", "display messages", opt->repo);
 
 		/* Hack to pre-allocate olist to the desired size */
diff --git a/merge-recursive.c b/merge-recursive.c
index 3355d50e8ad..b14fa15be91 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -3697,6 +3697,9 @@ static int merge_start(struct merge_options *opt, struct tree *head)
 
 	assert(opt->priv == NULL);
 
+	/* Not supported; option specific to merge-ort */
+	assert(!opt->record_conflict_msgs_in_tree);
+
 	/* Sanity check on repo state; index must match head */
 	if (repo_index_has_changes(opt->repo, head, &sb)) {
 		err(opt, _("Your local changes to the following files would be overwritten by merge:\n  %s"),
diff --git a/merge-recursive.h b/merge-recursive.h
index 0795a1d3ec1..2e2fab37f46 100644
--- a/merge-recursive.h
+++ b/merge-recursive.h
@@ -46,6 +46,7 @@ struct merge_options {
 	/* miscellaneous control options */
 	const char *subtree_shift;
 	unsigned renormalize : 1;
+	unsigned record_conflict_msgs_in_tree : 1;
 
 	/* internal fields used by the implementation */
 	struct merge_options_internal *priv;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 3/7] ll-merge: add API for capturing warnings in a strbuf instead of stderr
  2021-08-31  2:26 [PATCH 0/7] Add a new --remerge-diff capability to show & log Elijah Newren via GitGitGadget
  2021-08-31  2:26 ` [PATCH 1/7] merge-ort: mark a few more conflict messages as omittable Elijah Newren via GitGitGadget
  2021-08-31  2:26 ` [PATCH 2/7] merge-ort: add ability to record conflict messages in a file Elijah Newren via GitGitGadget
@ 2021-08-31  2:26 ` Elijah Newren via GitGitGadget
  2021-09-28 22:37   ` Jeff King
  2021-08-31  2:26 ` [PATCH 4/7] merge-ort: capture and print ll-merge warnings in our preferred fashion Elijah Newren via GitGitGadget
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 64+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-08-31  2:26 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Jonathan Nieder, Sergey Organov, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Whenever ll-merge encounters a binary file, it prints a warning to
stderr.  However, for the --remerge-diff option we want to add, we need
to capture all conflict messages and show them in a diff instead of
dumping them straight to stdout or stderr.  Add some new API that will
allow us to capture this output and display it in our preferred method.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 ll-merge.c | 51 +++++++++++++++++++++++++++++++++++++++------------
 ll-merge.h |  9 +++++++++
 2 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/ll-merge.c b/ll-merge.c
index 261657578c7..4d5bdc12464 100644
--- a/ll-merge.c
+++ b/ll-merge.c
@@ -11,11 +11,13 @@
 #include "run-command.h"
 #include "ll-merge.h"
 #include "quote.h"
+#include "strbuf.h"
 
 struct ll_merge_driver;
 
 typedef int (*ll_merge_fn)(const struct ll_merge_driver *,
 			   mmbuffer_t *result,
+			   struct strbuf *warnings,
 			   const char *path,
 			   mmfile_t *orig, const char *orig_name,
 			   mmfile_t *src1, const char *name1,
@@ -51,6 +53,7 @@ void reset_merge_attributes(void)
  */
 static int ll_binary_merge(const struct ll_merge_driver *drv_unused,
 			   mmbuffer_t *result,
+			   struct strbuf *warnings,
 			   const char *path,
 			   mmfile_t *orig, const char *orig_name,
 			   mmfile_t *src1, const char *name1,
@@ -59,6 +62,7 @@ static int ll_binary_merge(const struct ll_merge_driver *drv_unused,
 			   int marker_size)
 {
 	mmfile_t *stolen;
+	const char *msg = "Cannot merge binary files: %s (%s vs. %s)";
 	assert(opts);
 
 	/*
@@ -71,8 +75,11 @@ static int ll_binary_merge(const struct ll_merge_driver *drv_unused,
 	} else {
 		switch (opts->variant) {
 		default:
-			warning("Cannot merge binary files: %s (%s vs. %s)",
-				path, name1, name2);
+			if (warnings) {
+				strbuf_addstr(warnings, "Warning: ");
+				strbuf_addf(warnings, msg, path, name1, name2);
+			} else
+				warning(msg, path, name1, name2);
 			/* fallthru */
 		case XDL_MERGE_FAVOR_OURS:
 			stolen = src1;
@@ -98,6 +105,7 @@ static int ll_binary_merge(const struct ll_merge_driver *drv_unused,
 
 static int ll_xdl_merge(const struct ll_merge_driver *drv_unused,
 			mmbuffer_t *result,
+			struct strbuf *warnings,
 			const char *path,
 			mmfile_t *orig, const char *orig_name,
 			mmfile_t *src1, const char *name1,
@@ -114,7 +122,7 @@ static int ll_xdl_merge(const struct ll_merge_driver *drv_unused,
 	    buffer_is_binary(orig->ptr, orig->size) ||
 	    buffer_is_binary(src1->ptr, src1->size) ||
 	    buffer_is_binary(src2->ptr, src2->size)) {
-		return ll_binary_merge(drv_unused, result,
+		return ll_binary_merge(drv_unused, result, warnings,
 				       path,
 				       orig, orig_name,
 				       src1, name1,
@@ -138,6 +146,7 @@ static int ll_xdl_merge(const struct ll_merge_driver *drv_unused,
 
 static int ll_union_merge(const struct ll_merge_driver *drv_unused,
 			  mmbuffer_t *result,
+			  struct strbuf *warnings,
 			  const char *path,
 			  mmfile_t *orig, const char *orig_name,
 			  mmfile_t *src1, const char *name1,
@@ -150,7 +159,7 @@ static int ll_union_merge(const struct ll_merge_driver *drv_unused,
 	assert(opts);
 	o = *opts;
 	o.variant = XDL_MERGE_FAVOR_UNION;
-	return ll_xdl_merge(drv_unused, result, path,
+	return ll_xdl_merge(drv_unused, result, warnings, path,
 			    orig, orig_name, src1, name1, src2, name2,
 			    &o, marker_size);
 }
@@ -180,6 +189,7 @@ static void create_temp(mmfile_t *src, char *path, size_t len)
  */
 static int ll_ext_merge(const struct ll_merge_driver *fn,
 			mmbuffer_t *result,
+			struct strbuf *warnings,
 			const char *path,
 			mmfile_t *orig, const char *orig_name,
 			mmfile_t *src1, const char *name1,
@@ -362,13 +372,14 @@ static void normalize_file(mmfile_t *mm, const char *path, struct index_state *i
 	}
 }
 
-int ll_merge(mmbuffer_t *result_buf,
-	     const char *path,
-	     mmfile_t *ancestor, const char *ancestor_label,
-	     mmfile_t *ours, const char *our_label,
-	     mmfile_t *theirs, const char *their_label,
-	     struct index_state *istate,
-	     const struct ll_merge_options *opts)
+int ll_merge_with_warnings(mmbuffer_t *result_buf,
+			   struct strbuf *warnings,
+			   const char *path,
+			   mmfile_t *ancestor, const char *ancestor_label,
+			   mmfile_t *ours, const char *our_label,
+			   mmfile_t *theirs, const char *their_label,
+			   struct index_state *istate,
+			   const struct ll_merge_options *opts)
 {
 	struct attr_check *check = load_merge_attributes();
 	static const struct ll_merge_options default_opts;
@@ -401,11 +412,27 @@ int ll_merge(mmbuffer_t *result_buf,
 	if (opts->extra_marker_size) {
 		marker_size += opts->extra_marker_size;
 	}
-	return driver->fn(driver, result_buf, path, ancestor, ancestor_label,
+	return driver->fn(driver, result_buf, warnings, path,
+			  ancestor, ancestor_label,
 			  ours, our_label, theirs, their_label,
 			  opts, marker_size);
 }
 
+int ll_merge(mmbuffer_t *result_buf,
+	     const char *path,
+	     mmfile_t *ancestor, const char *ancestor_label,
+	     mmfile_t *ours, const char *our_label,
+	     mmfile_t *theirs, const char *their_label,
+	     struct index_state *istate,
+	     const struct ll_merge_options *opts)
+{
+	return ll_merge_with_warnings(result_buf, NULL, path,
+				      ancestor, ancestor_label,
+				      ours, our_label,
+				      theirs, their_label,
+				      istate, opts);
+}
+
 int ll_merge_marker_size(struct index_state *istate, const char *path)
 {
 	static struct attr_check *check;
diff --git a/ll-merge.h b/ll-merge.h
index aceb1b24132..a5469918ad4 100644
--- a/ll-merge.h
+++ b/ll-merge.h
@@ -96,6 +96,15 @@ int ll_merge(mmbuffer_t *result_buf,
 	     struct index_state *istate,
 	     const struct ll_merge_options *opts);
 
+int ll_merge_with_warnings(mmbuffer_t *result_buf,
+			   struct strbuf *warnings,
+			   const char *path,
+			   mmfile_t *ancestor, const char *ancestor_label,
+			   mmfile_t *ours, const char *our_label,
+			   mmfile_t *theirs, const char *their_label,
+			   struct index_state *istate,
+			   const struct ll_merge_options *opts);
+
 int ll_merge_marker_size(struct index_state *istate, const char *path);
 void reset_merge_attributes(void);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 4/7] merge-ort: capture and print ll-merge warnings in our preferred fashion
  2021-08-31  2:26 [PATCH 0/7] Add a new --remerge-diff capability to show & log Elijah Newren via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-08-31  2:26 ` [PATCH 3/7] ll-merge: add API for capturing warnings in a strbuf instead of stderr Elijah Newren via GitGitGadget
@ 2021-08-31  2:26 ` Elijah Newren via GitGitGadget
  2021-09-28 22:39   ` Jeff King
  2021-09-30 16:53   ` Ævar Arnfjörð Bjarmason
  2021-08-31  2:26 ` [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs Elijah Newren via GitGitGadget
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 64+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-08-31  2:26 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Jonathan Nieder, Sergey Organov, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Instead of immediately printing ll-merge warnings to stderr, we save
them in our output strbuf.  Besides allowing us to move these warnings
to a special file for --remerge-diff, this has two other benefits for
regular merges done by merge-ort:

  * The deferral of messages ensures we can print all messages about
    any given path together (merge-recursive was known to sometimes
    intersperse messages about other paths, particularly when renames
    were involved).

  * The deferral of messages means we can avoid printing spurious
    conflict messages when we just end up aborting due to local user
    modifications in the way.  (In contrast to merge-recursive.c which
    prematurely checks for local modifications in the way via
    unpack_trees() and gets the check wrong both in terms of false
    positives and false negatives relative to renames, merge-ort does
    not perform the local modifications in the way check until the
    checkout() step after the full merge has been computed.)

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c                | 11 ++++++++---
 t/t6404-recursive-merge.sh | 10 ++++++++--
 t/t6406-merge-attr.sh      | 10 ++++++++--
 3 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index a9e69d9cbb0..831f8f3e049 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1741,6 +1741,7 @@ static int merge_3way(struct merge_options *opt,
 	struct ll_merge_options ll_opts = {0};
 	char *base, *name1, *name2;
 	int merge_status;
+	struct strbuf warnings = STRBUF_INIT;
 
 	if (!opt->priv->attr_index.initialized)
 		initialize_attr_index(opt);
@@ -1781,10 +1782,14 @@ static int merge_3way(struct merge_options *opt,
 	read_mmblob(&src1, a);
 	read_mmblob(&src2, b);
 
-	merge_status = ll_merge(result_buf, path, &orig, base,
-				&src1, name1, &src2, name2,
-				&opt->priv->attr_index, &ll_opts);
+	merge_status = ll_merge_with_warnings(result_buf, &warnings, path,
+					      &orig, base,
+					      &src1, name1, &src2, name2,
+					      &opt->priv->attr_index, &ll_opts);
+	if (warnings.len > 0)
+		path_msg(opt, path, 0, "%s", warnings.buf);
 
+	strbuf_release(&warnings);
 	free(base);
 	free(name1);
 	free(name2);
diff --git a/t/t6404-recursive-merge.sh b/t/t6404-recursive-merge.sh
index eaf48e941e2..a986eec0420 100755
--- a/t/t6404-recursive-merge.sh
+++ b/t/t6404-recursive-merge.sh
@@ -108,8 +108,14 @@ test_expect_success 'refuse to merge binary files' '
 	printf "\0\0" >binary-file &&
 	git add binary-file &&
 	git commit -m binary2 &&
-	test_must_fail git merge F >merge.out 2>merge.err &&
-	grep "Cannot merge binary files: binary-file (HEAD vs. F)" merge.err
+	if test "$GIT_TEST_MERGE_ALGORITHM" = ort
+	then
+		test_must_fail git merge F >merge.out &&
+		grep "Cannot merge binary files: binary-file (HEAD vs. F)" merge.out
+	else
+		test_must_fail git merge F >merge.out 2>merge.err &&
+		grep "Cannot merge binary files: binary-file (HEAD vs. F)" merge.err
+	fi
 '
 
 test_expect_success 'mark rename/delete as unmerged' '
diff --git a/t/t6406-merge-attr.sh b/t/t6406-merge-attr.sh
index 84946458371..75a7994988a 100755
--- a/t/t6406-merge-attr.sh
+++ b/t/t6406-merge-attr.sh
@@ -221,8 +221,14 @@ test_expect_success 'binary files with union attribute' '
 	printf "two\0" >bin.txt &&
 	git commit -am two &&
 
-	test_must_fail git merge bin-main 2>stderr &&
-	grep -i "warning.*cannot merge.*HEAD vs. bin-main" stderr
+	if test "$GIT_TEST_MERGE_ALGORITHM" = ort
+	then
+		test_must_fail git merge bin-main >stdout &&
+		grep -i "warning.*cannot merge.*HEAD vs. bin-main" stdout
+	else
+		test_must_fail git merge bin-main 2>stderr &&
+		grep -i "warning.*cannot merge.*HEAD vs. bin-main" stderr
+	fi
 '
 
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-08-31  2:26 [PATCH 0/7] Add a new --remerge-diff capability to show & log Elijah Newren via GitGitGadget
                   ` (3 preceding siblings ...)
  2021-08-31  2:26 ` [PATCH 4/7] merge-ort: capture and print ll-merge warnings in our preferred fashion Elijah Newren via GitGitGadget
@ 2021-08-31  2:26 ` Elijah Newren via GitGitGadget
  2021-09-28  7:55   ` Ævar Arnfjörð Bjarmason
  2021-09-28 23:17   ` Jeff King
  2021-08-31  2:26 ` [PATCH 6/7] show, log: provide a --remerge-diff capability Elijah Newren via GitGitGadget
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 64+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-08-31  2:26 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Jonathan Nieder, Sergey Organov, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

The tmp_objdir API provides the ability to create temporary object
directories, but was designed with the goal of having subprocesses
access these object stores, followed by the main process migrating
objects from it to the main object store or just deleting it.  The
subprocesses would view it as their primary datastore and write to it.

For the --remerge-diff option we want to add to show & log, we want all
writes of intermediate merge results (both blobs and trees) to go to
this alternate object store; since those writes will be done by the main
process, we need this "alternate" object store to actually be the
primary object store.  When show & log are done, they'll simply remove
this temporary object store.

We also need one more thing -- `git log --remerge-diff` can cause the
temporary object store to fill up with loose objects.  Rather than
trying to gc that are known garbage anyway, we simply want to know the
location of the temporary object store so we can purge the loose objects
after each merge.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 tmp-objdir.c | 29 +++++++++++++++++++++++++++++
 tmp-objdir.h | 16 ++++++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/tmp-objdir.c b/tmp-objdir.c
index b8d880e3626..9f75a75d1c0 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -288,7 +288,36 @@ const char **tmp_objdir_env(const struct tmp_objdir *t)
 	return t->env.v;
 }
 
+const char *tmp_objdir_path(struct tmp_objdir *t)
+{
+	return t->path.buf;
+}
+
 void tmp_objdir_add_as_alternate(const struct tmp_objdir *t)
 {
 	add_to_alternates_memory(t->path.buf);
 }
+
+void tmp_objdir_make_primary(struct repository *r, const struct tmp_objdir *t)
+{
+	struct object_directory *od;
+	od = xcalloc(1, sizeof(*od));
+
+	od->path = xstrdup(t->path.buf);
+	od->next = r->objects->odb;
+	r->objects->odb = od;
+}
+
+void tmp_objdir_remove_as_primary(struct repository *r,
+				  const struct tmp_objdir *t)
+{
+	struct object_directory *od;
+
+	od = r->objects->odb;
+	if (strcmp(t->path.buf, od->path))
+		BUG("expected %s as primary object store; found %s",
+		    t->path.buf, od->path);
+	r->objects->odb = od->next;
+	free(od->path);
+	free(od);
+}
diff --git a/tmp-objdir.h b/tmp-objdir.h
index b1e45b4c75d..6067da24e8c 100644
--- a/tmp-objdir.h
+++ b/tmp-objdir.h
@@ -19,6 +19,7 @@
  *
  */
 
+struct repository;
 struct tmp_objdir;
 
 /*
@@ -33,6 +34,11 @@ struct tmp_objdir *tmp_objdir_create(void);
  */
 const char **tmp_objdir_env(const struct tmp_objdir *);
 
+/*
+ * Return the path used for the temporary object directory.
+ */
+const char *tmp_objdir_path(struct tmp_objdir *t);
+
 /*
  * Finalize a temporary object directory by migrating its objects into the main
  * object database, removing the temporary directory, and freeing any
@@ -51,4 +57,14 @@ int tmp_objdir_destroy(struct tmp_objdir *);
  */
 void tmp_objdir_add_as_alternate(const struct tmp_objdir *);
 
+/*
+ * Add the temporary object directory as the *primary* object store in the
+ * current process, turning the previous primary object store into an
+ * alternate.
+ */
+void tmp_objdir_make_primary(struct repository *r,
+			     const struct tmp_objdir *t);
+void tmp_objdir_remove_as_primary(struct repository *r,
+				  const struct tmp_objdir *t);
+
 #endif /* TMP_OBJDIR_H */
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 6/7] show, log: provide a --remerge-diff capability
  2021-08-31  2:26 [PATCH 0/7] Add a new --remerge-diff capability to show & log Elijah Newren via GitGitGadget
                   ` (4 preceding siblings ...)
  2021-08-31  2:26 ` [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs Elijah Newren via GitGitGadget
@ 2021-08-31  2:26 ` Elijah Newren via GitGitGadget
  2021-08-31  9:19   ` Sergey Organov
  2021-09-28  8:01   ` Ævar Arnfjörð Bjarmason
  2021-08-31  2:26 ` [PATCH 7/7] doc/diff-options: explain the new --remerge-diff option Elijah Newren via GitGitGadget
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 64+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-08-31  2:26 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Jonathan Nieder, Sergey Organov, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

When this option is specified, we remerge all (two parent) merge commits
and diff the actual merge commit to the automatically created version,
in order to show how users removed conflict markers, resolved the
different conflict versions, and potentially added new changes outside
of conflict regions in order to resolve semantic merge problems (or,
possibly, just to hide other random changes).

This capability works by creating a temporary object directory and
marking it as the primary object store, so that any blobs or trees
created during the automatic merge, can be easily removed afterwards by
just deleting all objects from the temporary object directory.  We can
do this after handling each merge commit, in order to avoid the need to
worry about doing `git gc --auto` runs while running `git log
--remerge-diff`.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/log.c | 23 +++++++++++++++++
 diff-merges.c | 12 +++++++++
 log-tree.c    | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
 revision.h    |  6 ++++-
 4 files changed, 109 insertions(+), 1 deletion(-)

diff --git a/builtin/log.c b/builtin/log.c
index 3d7717ba5ca..6c7288716e6 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -35,6 +35,8 @@
 #include "repository.h"
 #include "commit-reach.h"
 #include "range-diff.h"
+#include "dir.h"
+#include "tmp-objdir.h"
 
 #define MAIL_DEFAULT_WRAP 72
 #define COVER_FROM_AUTO_MAX_SUBJECT_LEN 100
@@ -51,6 +53,7 @@ static int default_encode_email_headers = 1;
 static int decoration_style;
 static int decoration_given;
 static int use_mailmap_config = 1;
+static struct tmp_objdir *tmp_objdir;
 static const char *fmt_patch_subject_prefix = "PATCH";
 static int fmt_patch_name_max = FORMAT_PATCH_NAME_MAX_DEFAULT;
 static const char *fmt_pretty;
@@ -407,6 +410,17 @@ static int cmd_log_walk(struct rev_info *rev)
 	int saved_nrl = 0;
 	int saved_dcctc = 0;
 
+	if (rev->remerge_diff) {
+		tmp_objdir = tmp_objdir_create();
+		if (!tmp_objdir)
+			die(_("unable to create temporary object directory"));
+		tmp_objdir_make_primary(the_repository, tmp_objdir);
+
+		strbuf_init(&rev->remerge_objdir_location, 0);
+		strbuf_addstr(&rev->remerge_objdir_location,
+			      tmp_objdir_path(tmp_objdir));
+	}
+
 	if (rev->early_output)
 		setup_early_output();
 
@@ -449,6 +463,13 @@ static int cmd_log_walk(struct rev_info *rev)
 	rev->diffopt.no_free = 0;
 	diff_free(&rev->diffopt);
 
+	if (rev->remerge_diff) {
+		strbuf_release(&rev->remerge_objdir_location);
+		tmp_objdir_remove_as_primary(the_repository, tmp_objdir);
+		tmp_objdir_destroy(tmp_objdir);
+		tmp_objdir = NULL;
+	}
+
 	if (rev->diffopt.output_format & DIFF_FORMAT_CHECKDIFF &&
 	    rev->diffopt.flags.check_failed) {
 		return 02;
@@ -1943,6 +1964,8 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix)
 		die(_("--name-status does not make sense"));
 	if (rev.diffopt.output_format & DIFF_FORMAT_CHECKDIFF)
 		die(_("--check does not make sense"));
+	if (rev.remerge_diff)
+		die(_("--remerge_diff does not make sense"));
 
 	if (!use_patch_format &&
 		(!rev.diffopt.output_format ||
diff --git a/diff-merges.c b/diff-merges.c
index d897fd8a293..3a24c45b8e5 100644
--- a/diff-merges.c
+++ b/diff-merges.c
@@ -17,6 +17,7 @@ static void suppress(struct rev_info *revs)
 	revs->combined_all_paths = 0;
 	revs->merges_imply_patch = 0;
 	revs->merges_need_diff = 0;
+	revs->remerge_diff = 0;
 }
 
 static void set_separate(struct rev_info *revs)
@@ -45,6 +46,12 @@ static void set_dense_combined(struct rev_info *revs)
 	revs->dense_combined_merges = 1;
 }
 
+static void set_remerge_diff(struct rev_info *revs)
+{
+	suppress(revs);
+	revs->remerge_diff = 1;
+}
+
 static diff_merges_setup_func_t func_by_opt(const char *optarg)
 {
 	if (!strcmp(optarg, "off") || !strcmp(optarg, "none"))
@@ -57,6 +64,8 @@ static diff_merges_setup_func_t func_by_opt(const char *optarg)
 		return set_combined;
 	else if (!strcmp(optarg, "cc") || !strcmp(optarg, "dense-combined"))
 		return set_dense_combined;
+	else if (!strcmp(optarg, "r") || !strcmp(optarg, "remerge"))
+		return set_remerge_diff;
 	else if (!strcmp(optarg, "m") || !strcmp(optarg, "on"))
 		return set_to_default;
 	return NULL;
@@ -113,6 +122,9 @@ int diff_merges_parse_opts(struct rev_info *revs, const char **argv)
 	} else if (!strcmp(arg, "--cc")) {
 		set_dense_combined(revs);
 		revs->merges_imply_patch = 1;
+	} else if (!strcmp(arg, "--remerge-diff")) {
+		set_remerge_diff(revs);
+		revs->merges_imply_patch = 1;
 	} else if (!strcmp(arg, "--no-diff-merges")) {
 		suppress(revs);
 	} else if (!strcmp(arg, "--combined-all-paths")) {
diff --git a/log-tree.c b/log-tree.c
index 6dc4412268b..ed69a4da140 100644
--- a/log-tree.c
+++ b/log-tree.c
@@ -1,4 +1,5 @@
 #include "cache.h"
+#include "commit-reach.h"
 #include "config.h"
 #include "diff.h"
 #include "object-store.h"
@@ -7,6 +8,7 @@
 #include "tag.h"
 #include "graph.h"
 #include "log-tree.h"
+#include "merge-ort.h"
 #include "reflog-walk.h"
 #include "refs.h"
 #include "string-list.h"
@@ -16,6 +18,7 @@
 #include "line-log.h"
 #include "help.h"
 #include "range-diff.h"
+#include "dir.h"
 
 static struct decoration name_decoration = { "object names" };
 static int decoration_loaded;
@@ -902,6 +905,60 @@ static int do_diff_combined(struct rev_info *opt, struct commit *commit)
 	return !opt->loginfo;
 }
 
+static int do_remerge_diff(struct rev_info *opt,
+			   struct commit_list *parents,
+			   struct object_id *oid,
+			   struct commit *commit)
+{
+	struct merge_options o;
+	struct commit_list *bases;
+	struct merge_result res;
+	struct pretty_print_context ctx = {0};
+	struct strbuf commit1 = STRBUF_INIT;
+	struct strbuf commit2 = STRBUF_INIT;
+
+	/* Setup merge options */
+	init_merge_options(&o, the_repository);
+	memset(&res, 0, sizeof(res));
+	o.show_rename_progress = 0;
+
+	ctx.abbrev = DEFAULT_ABBREV;
+	format_commit_message(parents->item,       "%h (%s)", &commit1, &ctx);
+	format_commit_message(parents->next->item, "%h (%s)", &commit2, &ctx);
+	o.branch1 = commit1.buf;
+	o.branch2 = commit2.buf;
+	o.record_conflict_msgs_in_tree = 1;
+
+	/* Parse the relevant commits and get the merge bases */
+	parse_commit_or_die(parents->item);
+	parse_commit_or_die(parents->next->item);
+	bases = get_merge_bases(parents->item, parents->next->item);
+
+	/* Re-merge the parents */
+	merge_incore_recursive(&o,
+			       bases, parents->item, parents->next->item,
+			       &res);
+
+	/* Show the diff */
+	diff_tree_oid(&res.tree->object.oid, oid, "", &opt->diffopt);
+	log_tree_diff_flush(opt);
+
+	/* Cleanup */
+	strbuf_release(&commit1);
+	strbuf_release(&commit2);
+	merge_finalize(&o, &res);
+
+	/* Clean up the temporary object directory */
+	if (opt->remerge_objdir_location.buf != NULL &&
+	    *opt->remerge_objdir_location.buf != '\0')
+		remove_dir_recursively(&opt->remerge_objdir_location,
+				       REMOVE_DIR_KEEP_TOPLEVEL);
+	else
+		BUG("unable to remove temporary object directory");
+
+	return !opt->loginfo;
+}
+
 /*
  * Show the diff of a commit.
  *
@@ -936,6 +993,18 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log
 	}
 
 	if (is_merge) {
+		int octopus = (parents->next->next != NULL);
+
+		if (opt->remerge_diff) {
+			if (octopus) {
+				show_log(opt);
+				fprintf(opt->diffopt.file,
+					"diff: warning: Skipping remerge-diff "
+					"for octopus merges.\n");
+				return 1;
+			}
+			return do_remerge_diff(opt, parents, oid, commit);
+		}
 		if (opt->combine_merges)
 			return do_diff_combined(opt, commit);
 		if (opt->separate_merges) {
diff --git a/revision.h b/revision.h
index fbb068da9fb..d6562c52252 100644
--- a/revision.h
+++ b/revision.h
@@ -198,7 +198,8 @@ struct rev_info {
 			combine_merges:1,
 			combined_all_paths:1,
 			dense_combined_merges:1,
-			first_parent_merges:1;
+			first_parent_merges:1,
+			remerge_diff:1;
 
 	/* Format info */
 	int		show_notes;
@@ -320,6 +321,9 @@ struct rev_info {
 
 	/* misc. flags related to '--no-kept-objects' */
 	unsigned keep_pack_cache_flags;
+
+	/* Location where temporary objects for remerge-diff are written. */
+	struct strbuf remerge_objdir_location;
 };
 
 int ref_excluded(struct string_list *, const char *path);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 7/7] doc/diff-options: explain the new --remerge-diff option
  2021-08-31  2:26 [PATCH 0/7] Add a new --remerge-diff capability to show & log Elijah Newren via GitGitGadget
                   ` (5 preceding siblings ...)
  2021-08-31  2:26 ` [PATCH 6/7] show, log: provide a --remerge-diff capability Elijah Newren via GitGitGadget
@ 2021-08-31  2:26 ` Elijah Newren via GitGitGadget
  2021-08-31 11:05 ` [PATCH 0/7] Add a new --remerge-diff capability to show & log Bagas Sanjaya
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 64+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-08-31  2:26 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Jonathan Nieder, Sergey Organov, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/diff-options.txt | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index c89d530d3d1..b05f1c9f1c9 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -64,6 +64,14 @@ ifdef::git-log[]
 	each of the parents. Separate log entry and diff is generated
 	for each parent.
 +
+--diff-merges=remerge:::
+--diff-merges=r:::
+--remerge-diff:::
+	With this option, two-parent merge commits are remerged to
+	create a temporary tree object -- potentially containing files
+	with conflict markers and such.  A diff is then shown between
+	that temporary tree and the actual merge commit.
++
 --diff-merges=combined:::
 --diff-merges=c:::
 -c:::
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH 6/7] show, log: provide a --remerge-diff capability
  2021-08-31  2:26 ` [PATCH 6/7] show, log: provide a --remerge-diff capability Elijah Newren via GitGitGadget
@ 2021-08-31  9:19   ` Sergey Organov
  2021-09-28  8:01   ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 64+ messages in thread
From: Sergey Organov @ 2021-08-31  9:19 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Jeff King, Jonathan Nieder, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Elijah Newren <newren@gmail.com>

[...]

> diff --git a/diff-merges.c b/diff-merges.c
> index d897fd8a293..3a24c45b8e5 100644
> --- a/diff-merges.c
> +++ b/diff-merges.c
> @@ -17,6 +17,7 @@ static void suppress(struct rev_info *revs)
>  	revs->combined_all_paths = 0;
>  	revs->merges_imply_patch = 0;
>  	revs->merges_need_diff = 0;
> +	revs->remerge_diff = 0;
>  }
>  
>  static void set_separate(struct rev_info *revs)
> @@ -45,6 +46,12 @@ static void set_dense_combined(struct rev_info *revs)
>  	revs->dense_combined_merges = 1;
>  }
>  
> +static void set_remerge_diff(struct rev_info *revs)
> +{
> +	suppress(revs);
> +	revs->remerge_diff = 1;
> +}
> +
>  static diff_merges_setup_func_t func_by_opt(const char *optarg)
>  {
>  	if (!strcmp(optarg, "off") || !strcmp(optarg, "none"))
> @@ -57,6 +64,8 @@ static diff_merges_setup_func_t func_by_opt(const char *optarg)
>  		return set_combined;
>  	else if (!strcmp(optarg, "cc") || !strcmp(optarg, "dense-combined"))
>  		return set_dense_combined;
> +	else if (!strcmp(optarg, "r") || !strcmp(optarg, "remerge"))
> +		return set_remerge_diff;
>  	else if (!strcmp(optarg, "m") || !strcmp(optarg, "on"))
>  		return set_to_default;
>  	return NULL;
> @@ -113,6 +122,9 @@ int diff_merges_parse_opts(struct rev_info *revs, const char **argv)
>  	} else if (!strcmp(arg, "--cc")) {
>  		set_dense_combined(revs);
>  		revs->merges_imply_patch = 1;
> +	} else if (!strcmp(arg, "--remerge-diff")) {
> +		set_remerge_diff(revs);
> +		revs->merges_imply_patch = 1;
>  	} else if (!strcmp(arg, "--no-diff-merges")) {
>  		suppress(revs);
>  	} else if (!strcmp(arg, "--combined-all-paths")) {

The diff-merges options handling looks fine to me.

Thanks,
-- Sergey Organov

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/7] Add a new --remerge-diff capability to show & log
  2021-08-31  2:26 [PATCH 0/7] Add a new --remerge-diff capability to show & log Elijah Newren via GitGitGadget
                   ` (6 preceding siblings ...)
  2021-08-31  2:26 ` [PATCH 7/7] doc/diff-options: explain the new --remerge-diff option Elijah Newren via GitGitGadget
@ 2021-08-31 11:05 ` Bagas Sanjaya
  2021-08-31 16:16   ` Elijah Newren
  2021-08-31 20:03 ` Junio C Hamano
  2021-09-01 21:07 ` Junio C Hamano
  9 siblings, 1 reply; 64+ messages in thread
From: Bagas Sanjaya @ 2021-08-31 11:05 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Jeff King, Jonathan Nieder, Sergey Organov, Elijah Newren

On 31/08/21 09.26, Elijah Newren via GitGitGadget wrote:
> Here are some patches to add a --remerge-diff capability to show & log,
> which works by comparing merge commits to an automatic remerge (note that
> the automatic remerge tree can contain files with conflict markers).
> 
> Here are some example commits you can try this out on (with git show
> --remerge-diff $COMMIT):
> 
>   * git.git conflicted merge: 07601b5b36
>   * git.git non-conflicted change: bf04590ecd
>   * linux.git conflicted merge: eab3540562fb
>   * linux.git non-conflicted change: 223cea6a4f05
> 
<snip>...
> In regards to the performance point above, the timing for running the
> following command:
> 
> time git log --min-parents=2 --max-parents=2 $DIFF_FLAG | wc -l
> 
> 
> in linux.git (with v5.4 checked out, since my copy of linux is very out of
> date) is as follows:
> 
> DIFF_FLAG=--cc:            71m 31.536s
> DIFF_FLAG=--remerge-diff:  31m  3.170s
> 
> 
> Note that there are 62476 merges in this history. Also, output size is:
> 
> DIFF_FLAG=--cc:            2169111 lines
> DIFF_FLAG=--remerge-diff:  2458020 lines
> 

Which repo did you mean by linux.git? Kernel developers often work 
against Linus' mainline tree [1], while end-users (including myself) 
prefer stable tree (which is mainline + stable release branches and 
tags) [2].

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git

-- 
An old man doll... just what I always wanted! - Clara

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/7] Add a new --remerge-diff capability to show & log
  2021-08-31 11:05 ` [PATCH 0/7] Add a new --remerge-diff capability to show & log Bagas Sanjaya
@ 2021-08-31 16:16   ` Elijah Newren
  0 siblings, 0 replies; 64+ messages in thread
From: Elijah Newren @ 2021-08-31 16:16 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Jeff King,
	Jonathan Nieder, Sergey Organov

On Tue, Aug 31, 2021 at 4:05 AM Bagas Sanjaya <bagasdotme@gmail.com> wrote:
>
> On 31/08/21 09.26, Elijah Newren via GitGitGadget wrote:
> > Here are some patches to add a --remerge-diff capability to show & log,
> > which works by comparing merge commits to an automatic remerge (note that
> > the automatic remerge tree can contain files with conflict markers).
> >
> > Here are some example commits you can try this out on (with git show
> > --remerge-diff $COMMIT):
> >
> >   * git.git conflicted merge: 07601b5b36
> >   * git.git non-conflicted change: bf04590ecd
> >   * linux.git conflicted merge: eab3540562fb
> >   * linux.git non-conflicted change: 223cea6a4f05
> >
> <snip>...
> > In regards to the performance point above, the timing for running the
> > following command:
> >
> > time git log --min-parents=2 --max-parents=2 $DIFF_FLAG | wc -l
> >
> >
> > in linux.git (with v5.4 checked out, since my copy of linux is very out of
> > date) is as follows:
> >
> > DIFF_FLAG=--cc:            71m 31.536s
> > DIFF_FLAG=--remerge-diff:  31m  3.170s
> >
> >
> > Note that there are 62476 merges in this history. Also, output size is:
> >
> > DIFF_FLAG=--cc:            2169111 lines
> > DIFF_FLAG=--remerge-diff:  2458020 lines
> >
>
> Which repo did you mean by linux.git? Kernel developers often work
> against Linus' mainline tree [1], while end-users (including myself)
> prefer stable tree (which is mainline + stable release branches and
> tags) [2].
>
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> [2]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git

$ git remote -v
origin git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
(fetch)
origin git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
(push)
$ git log -1 origin/master
commit 11a48a5a18c63fd7621bb050228cebf13566e4d8 (tag: v5.6-rc2,
origin/master, origin/HEAD, master)
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Sun Feb 16 13:16:59 2020 -0800

    Linux 5.6-rc2
$ git log -1 HEAD
commit 219d54332a09e8d8741c1e1982f5eae56099de85 (HEAD, tag: v5.4)
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Sun Nov 24 16:32:01 2019 -0800

    Linux 5.4

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/7] Add a new --remerge-diff capability to show & log
  2021-08-31  2:26 [PATCH 0/7] Add a new --remerge-diff capability to show & log Elijah Newren via GitGitGadget
                   ` (7 preceding siblings ...)
  2021-08-31 11:05 ` [PATCH 0/7] Add a new --remerge-diff capability to show & log Bagas Sanjaya
@ 2021-08-31 20:03 ` Junio C Hamano
  2021-08-31 20:23   ` Elijah Newren
  2021-09-01 21:07 ` Junio C Hamano
  9 siblings, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2021-08-31 20:03 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Jeff King, Jonathan Nieder, Sergey Organov, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Here are some patches to add a --remerge-diff capability to show & log,
> which works by comparing merge commits to an automatic remerge (note that
> the automatic remerge tree can contain files with conflict markers).

Excited ;-)

>  * This new option does not (currently) work for octopus merges, since
>    merge-ort is specific to two-parent merges[1].

Unless you do so manually, the native "octopus" backend does not let
you create non-trivial merges anyway, so punting on them should not
be a big loss.  Falling back to --cc might be a usable alternative.

>  * This option will not work on a read-only or full filesystem[2].

OK.  I am not sure if it is worth doing the "temporary objects"
trick, though---would it risk repository corruption if somebody is
creating a new blob that happens to be identical to the one that is
involved in the remerge operation at the same time, or there is no
visibility of the temporary area to these "somebody" outside so
there is no risk?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/7] Add a new --remerge-diff capability to show & log
  2021-08-31 20:03 ` Junio C Hamano
@ 2021-08-31 20:23   ` Elijah Newren
  0 siblings, 0 replies; 64+ messages in thread
From: Elijah Newren @ 2021-08-31 20:23 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Jeff King,
	Jonathan Nieder, Sergey Organov

On Tue, Aug 31, 2021 at 1:03 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > Here are some patches to add a --remerge-diff capability to show & log,
> > which works by comparing merge commits to an automatic remerge (note that
> > the automatic remerge tree can contain files with conflict markers).
>
> Excited ;-)
>
> >  * This new option does not (currently) work for octopus merges, since
> >    merge-ort is specific to two-parent merges[1].
>
> Unless you do so manually, the native "octopus" backend does not let
> you create non-trivial merges anyway, so punting on them should not
> be a big loss.  Falling back to --cc might be a usable alternative.
>
> >  * This option will not work on a read-only or full filesystem[2].
>
> OK.  I am not sure if it is worth doing the "temporary objects"
> trick, though---would it risk repository corruption if somebody is
> creating a new blob that happens to be identical to the one that is
> involved in the remerge operation at the same time, or there is no
> visibility of the temporary area to these "somebody" outside so
> there is no risk?

The temporary area is only used by the process running --remerge-diff,
so there's no risk of corruption.  If you have two `git log
--remerge-diff ...` processes running at the same time, they each have
their own temporary areas.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/7] merge-ort: mark a few more conflict messages as omittable
  2021-08-31  2:26 ` [PATCH 1/7] merge-ort: mark a few more conflict messages as omittable Elijah Newren via GitGitGadget
@ 2021-08-31 21:06   ` Junio C Hamano
  2021-09-01  0:03     ` Elijah Newren
  0 siblings, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2021-08-31 21:06 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Jeff King, Jonathan Nieder, Sergey Organov, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Elijah Newren <newren@gmail.com>
>
> path_msg() has the ability to mark messages as omittable when recording
> conflict messages in an in-tree file.  This conflict message file will
> then be shown by diffing the merge-created tree to the actual merge
> commit that is created.  While all the messages touched in this commit
> are very useful when trying to create a merge initially, early use with
> the --remerge-diff feature (the only user of this omittable conflict
> message capability), suggests that the particular messages marked in
> this commit are just noise when trying to see what changes users made to
> create a merge commit.  Mark them as omittable.

Sorry for asking something that may be obvious, but if you recreate
an automated and potentially conflicting merge in-core, in order to
then compare the recorded merge result, is there *any* message that
is worth showing while doing the first step, and where in the output
do users see them?


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/7] merge-ort: mark a few more conflict messages as omittable
  2021-08-31 21:06   ` Junio C Hamano
@ 2021-09-01  0:03     ` Elijah Newren
  2021-09-01 17:19       ` Junio C Hamano
  0 siblings, 1 reply; 64+ messages in thread
From: Elijah Newren @ 2021-09-01  0:03 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Jeff King,
	Jonathan Nieder, Sergey Organov

On Tue, Aug 31, 2021 at 2:06 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Elijah Newren <newren@gmail.com>
> >
> > path_msg() has the ability to mark messages as omittable when recording
> > conflict messages in an in-tree file.  This conflict message file will
> > then be shown by diffing the merge-created tree to the actual merge
> > commit that is created.  While all the messages touched in this commit
> > are very useful when trying to create a merge initially, early use with
> > the --remerge-diff feature (the only user of this omittable conflict
> > message capability), suggests that the particular messages marked in
> > this commit are just noise when trying to see what changes users made to
> > create a merge commit.  Mark them as omittable.
>
> Sorry for asking something that may be obvious

No, it's not obvious.  I had a hard time figuring out the right way to
split up this series and order the patches in part because so many
little pieces are needed before I can show the solution.  So some of
this comes from the later patches, but let me see if I can motivate
the issue and solution I picked a bit more right here.

> but if you recreate
> an automated and potentially conflicting merge in-core, in order to
> then compare the recorded merge result, is there *any* message that
> is worth showing while doing the first step,

Yes, absolutely.  While three-way *content* conflicts are easily
representable within a file in the working tree using conflict
markers, *non-content* conflicts are usually not easily representable
in such a fashion.  For example, a rename/delete conflict will not
result in any conflict markers, and will result in the renamed version
of the file being left in the working tree; the only way you know
there is a conflict for that file is either because of the stage in
the index or the the message that is printed to the terminal:

    CONFLICT (rename/delete): %s renamed to %s in %s, but deleted in %s.

But relying on higher order stages in the index and messages printed
to the terminal present a bit of a problem for --remerge-diff; as
described, it ignores both those sources of input.  Here's a few other
conflict types that face similar issues:

  * modify/delete conflicts
  * failure to merge submodule
  * directory rename detection (i.e. new file added to old directory
that other side renamed; which directory should file end up in?)
  * distinct types of files (e.g. regular file and symlink) located at
the same path
  * rename/rename conflicts

In all these cases (and some others), relying on a diff of "what the
working directory looks like at the end of a merge" to "the tree
recorded in the merge commit" does not convey enough (if any)
information about the above types of conflicts.

> and where in the output do users see them?

For --remerge-diff, I chose to handle the fact that the working
directory didn't naturally have enough information, by augmenting the
working directory with additional information.  So, for example, if
there was a file named `dir/my.file` that had a modify/delete conflict
(and the user who did the real merge edited that file a bit as part of
conflict resolution), then I would also add a `dir/my.fil
e.conflict_msg` file whose contents are

"""
== Conflict notices for my.file ==
CONFLICT (modify/delete): dir/my.file deleted in HASH1 (SHORT
SUMMARY1) and modified in HASH2 (SHORT SUMMARY 2).  Version HASH2
(SHORT SUMMARY2) of  dir/my.file left in tree.
"""

Creating this file means you see something like this in the
remerge-diff (note there are diffs for two files, with the synthetic
file appearing just before the file it has messages about):

diff --git a/dir/my.fil e.conflict_msg b/dir/my.fil e.conflict_msg
deleted file mode 100644
index 2bd215a32f06..000000000000
--- a/dir/my.fil e.conflict_msg
+++ /dev/null
@@ -1,2 +0,0 @@
-== Conflict notices for my.file ==
-CONFLICT (modify/delete): dir/my.file deleted in HASH1 (SHORT
SUMMARY1) and modified in HASH2 (SHORT SUMMARY 2).  Version HASH2
(SHORT SUMMARY2) of  dir/my.file left in tree.
diff --git a/dir/my.file b/dir/my.file
index 09c78c725a3b..79f9d1fb7611 100644
--- a/dir/my.file
+++ b/dir/my.file
@@ -950,15 +912,15 @@ int my_func(struct stuff *data,
                                        rq->timeline,
                                        rq);

-               old_func(data, 1);
+               new_func("for funsies", data, VALID);
                obj->value = 8;
                obj->read_attempts = 0;
        } else {
-               err = old_func(data, 0);
+               err = new_func("for funsies", data, INVALID);
                if (unlikely(err))
                        return err;

-               another_old_func(data, obj->value);
+               another_new_func(data, obj->value, INVALID);
                obj->value++;
        }
        obj->read_attempts += 1;


The `dir/my.fil e.conflict_msg` file is definitely slightly weird, but
any solution here would be.  I think it's fairly intuitive what is
meant.  No one has commented on this choice in the 9+ months it's been
in use internally by ~50 users (even with -p implying --remerge-diff
automatically to make it more likely people have used this option), so
either it really is intuitive, or it doesn't come up much.  It could
well be the latter.

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/7] merge-ort: mark a few more conflict messages as omittable
  2021-09-01  0:03     ` Elijah Newren
@ 2021-09-01 17:19       ` Junio C Hamano
  0 siblings, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2021-09-01 17:19 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Jeff King,
	Jonathan Nieder, Sergey Organov

Elijah Newren <newren@gmail.com> writes:

> Creating this file means you see something like this in the
> remerge-diff (note there are diffs for two files, with the synthetic
> file appearing just before the file it has messages about):
>
> diff --git a/dir/my.fil e.conflict_msg b/dir/my.fil e.conflict_msg
> deleted file mode 100644
> index 2bd215a32f06..000000000000
> --- a/dir/my.fil e.conflict_msg
> +++ /dev/null
> @@ -1,2 +0,0 @@
> -== Conflict notices for my.file ==
> -CONFLICT (modify/delete): dir/my.file deleted in HASH1 (SHORT
> SUMMARY1) and modified in HASH2 (SHORT SUMMARY 2).  Version HASH2
> (SHORT SUMMARY2) of  dir/my.file left in tree.

I do not know if my reaction should be "Cute" or "Yuck" ;-)

Hopefully nobody uses .conflict_msg as a suffix for their files.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/7] Add a new --remerge-diff capability to show & log
  2021-08-31  2:26 [PATCH 0/7] Add a new --remerge-diff capability to show & log Elijah Newren via GitGitGadget
                   ` (8 preceding siblings ...)
  2021-08-31 20:03 ` Junio C Hamano
@ 2021-09-01 21:07 ` Junio C Hamano
  2021-09-01 21:42   ` Elijah Newren
  9 siblings, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2021-09-01 21:07 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Jeff King, Jonathan Nieder, Sergey Organov, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Here are some patches to add a --remerge-diff capability to show & log,

One sad omission from the maintainer usecase is that we do not seem
to know "git diff --remerge-diff" yet during a conflicted merge.

"git diff [-- <path>]" before recording the resolution for the path
with "git add <path>" shows combined patch to give a final sanity
check before committing it to the rerere database.  I am wondering
if viewing it in the --remrege-diff format instead would help this
step even more.




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/7] Add a new --remerge-diff capability to show & log
  2021-09-01 21:07 ` Junio C Hamano
@ 2021-09-01 21:42   ` Elijah Newren
  2021-09-01 21:55     ` Junio C Hamano
  0 siblings, 1 reply; 64+ messages in thread
From: Elijah Newren @ 2021-09-01 21:42 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Jeff King,
	Jonathan Nieder, Sergey Organov

On Wed, Sep 1, 2021 at 2:08 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > Here are some patches to add a --remerge-diff capability to show & log,
>
> One sad omission from the maintainer usecase is that we do not seem
> to know "git diff --remerge-diff" yet during a conflicted merge.
>
> "git diff [-- <path>]" before recording the resolution for the path
> with "git add <path>" shows combined patch to give a final sanity
> check before committing it to the rerere database.  I am wondering
> if viewing it in the --remrege-diff format instead would help this
> step even more.

We do have `git diff AUTO_MERGE`, though.  It's not quite the same as
it doesn't include all the "CONFLICT" messages shown in the terminal
like --remerge-diff does with log/show, but otherwise it's the same.
Perhaps we could even alias `git diff --remerge-diff` to `git diff
AUTO_MERGE`?

See commit 5291828df838 (merge-ort: write $GIT_DIR/AUTO_MERGE whenever
we hit a conflict, 2021-03-20) for more details.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/7] Add a new --remerge-diff capability to show & log
  2021-09-01 21:42   ` Elijah Newren
@ 2021-09-01 21:55     ` Junio C Hamano
  0 siblings, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2021-09-01 21:55 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Jeff King,
	Jonathan Nieder, Sergey Organov

Elijah Newren <newren@gmail.com> writes:

> We do have `git diff AUTO_MERGE`, though.  It's not quite the same as
> it doesn't include all the "CONFLICT" messages shown in the terminal
> like --remerge-diff does with log/show, but otherwise it's the same.

Ah, forgot about that one, so we are good.

> Perhaps we could even alias `git diff --remerge-diff` to `git diff
> AUTO_MERGE`?

I do not think it is a good idea to hide AUTO_MERGE behind an
option.  It is a feature that deserves more user awareness by
itself.

Thanks.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-08-31  2:26 ` [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs Elijah Newren via GitGitGadget
@ 2021-09-28  7:55   ` Ævar Arnfjörð Bjarmason
  2021-09-29  4:22     ` Elijah Newren
  2021-09-28 23:17   ` Jeff King
  1 sibling, 1 reply; 64+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-28  7:55 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Jeff King, Jonathan Nieder, Sergey Organov, Elijah Newren,
	Neeraj Singh


On Tue, Aug 31 2021, Elijah Newren via GitGitGadget wrote:

I commented just now on how this API is duplicated between here &
another in-flight series in
https://lore.kernel.org/git/87v92lxhh4.fsf@evledraar.gmail.com/; Just
comments on this patch here:

> diff --git a/tmp-objdir.c b/tmp-objdir.c
> index b8d880e3626..9f75a75d1c0 100644
> --- a/tmp-objdir.c
> +++ b/tmp-objdir.c
> @@ -288,7 +288,36 @@ const char **tmp_objdir_env(const struct tmp_objdir *t)
>  	return t->env.v;
>  }
>  
> +const char *tmp_objdir_path(struct tmp_objdir *t)
> +{
> +	return t->path.buf;
> +}

Not your fault & this pre-dates your patch, but FWIW I prefer our APIs
that don't have these "hidden struct" shenanigans (like say "struct
string_list") so a caller could just access this, we can just declare it
"const" appropriately.

We're also all in-tree friends here, so having various accessors for no
reason other than to access struct members seems a bit too much.

All of which is to say if you + Neeraj come up with some common API I
for one wouldn't mind moving the struct deceleration to tmp-objdir.h,
but it looks like in his version it can stay "private" more easily.

In this particular case I hadn't understood on a previous reading of
tmp-objdir.c why this "path" isn't a statically allocated "char
path[PATH_MAX]", or a least we that hardcoded "1024" should be a
PATH_MAX, as it is in some other cases.

But I digress :)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 6/7] show, log: provide a --remerge-diff capability
  2021-08-31  2:26 ` [PATCH 6/7] show, log: provide a --remerge-diff capability Elijah Newren via GitGitGadget
  2021-08-31  9:19   ` Sergey Organov
@ 2021-09-28  8:01   ` Ævar Arnfjörð Bjarmason
  2021-09-29  4:00     ` Elijah Newren
  1 sibling, 1 reply; 64+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-28  8:01 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Jeff King, Jonathan Nieder, Sergey Organov, Elijah Newren,
	Neeraj K. Singh


On Tue, Aug 31 2021, Elijah Newren via GitGitGadget wrote:

>  static int decoration_given;
>  static int use_mailmap_config = 1;
> +static struct tmp_objdir *tmp_objdir;
>  static const char *fmt_patch_subject_prefix = "PATCH";
>  static int fmt_patch_name_max = FORMAT_PATCH_NAME_MAX_DEFAULT;
>  static const char *fmt_pretty;

So here we make this static file-level etc...

> @@ -407,6 +410,17 @@ static int cmd_log_walk(struct rev_info *rev)
>  	int saved_nrl = 0;
>  	int saved_dcctc = 0;
>  
> +	if (rev->remerge_diff) {
> +		tmp_objdir = tmp_objdir_create();
> +		if (!tmp_objdir)
> +			die(_("unable to create temporary object directory"));
> +		tmp_objdir_make_primary(the_repository, tmp_objdir);
> +
> +		strbuf_init(&rev->remerge_objdir_location, 0);
> +		strbuf_addstr(&rev->remerge_objdir_location,
> +			      tmp_objdir_path(tmp_objdir));
> +	}
> +
>  	if (rev->early_output)
>  		setup_early_output();
>  
> @@ -449,6 +463,13 @@ static int cmd_log_walk(struct rev_info *rev)
>  	rev->diffopt.no_free = 0;
>  	diff_free(&rev->diffopt);
>  
> +	if (rev->remerge_diff) {
> +		strbuf_release(&rev->remerge_objdir_location);
> +		tmp_objdir_remove_as_primary(the_repository, tmp_objdir);
> +		tmp_objdir_destroy(tmp_objdir);
> +		tmp_objdir = NULL;

...but all of the "tmp_objdir" usage is in one function, can't the
variable be declared here instead?

We need to hand the "remerge_objdir_location" off to the "rev_info"
struct, but that seems separate from its lifetime.

Re my [1] & [2] I like Neeraj's "atexit cleanup" approach better,
perhaps that makes your cleanup in log-tree.c redundant or easier?

Per [2] it looks like you need to "hand off" the
"remerge_objdir_location", so having the struct live in tmp-objdir.h as
I suggested in [2] might make that work...

1. https://lore.kernel.org/git/87v92lxhh4.fsf@evledraar.gmail.com/
2. https://lore.kernel.org/git/87r1d9xh71.fsf@evledraar.gmail.com/

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/7] merge-ort: add ability to record conflict messages in a file
  2021-08-31  2:26 ` [PATCH 2/7] merge-ort: add ability to record conflict messages in a file Elijah Newren via GitGitGadget
@ 2021-09-28 22:29   ` Jeff King
  2021-09-29  6:25     ` Elijah Newren
  0 siblings, 1 reply; 64+ messages in thread
From: Jeff King @ 2021-09-28 22:29 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Jonathan Nieder, Sergey Organov, Elijah Newren

On Tue, Aug 31, 2021 at 02:26:35AM +0000, Elijah Newren via GitGitGadget wrote:

> There are several considerations here:
>   * We have to pick file(s) where we write these conflict messages too
>   * We want to make it as clear as possible that it's not a real file
>     but a set of messages about another file
>   * We want conflict messages about a file to appear near the file in
>     question in a diff, preferably immediately preceding the file in
>     question
>   * Related to the above, per-file conflict messages are preferred
>     over lumping all conflict messages into one big file
> 
> To achive the above:
>   * We put the conflict messages for $filename in
>       $filename[0:-1] + " " + $filename[-1] + ".conflict_msg"
>     or, in words, we insert a space before the final character of
>     the filename and then also add ".conflict_msg" at the end.

It took me a minute to understand the space thing. I thought at first it
was about avoiding conflicts with existing names (and while it might
help in practice, it's not a guarantee). But I think it's about the
"appear preceding the file" goal. The space sorts before any other
printable character in the final position.

That's...simultaneously clever and gross. My biggest complaint is that
the space looks like a bug in the output.

Using another character like "." might not be too bad, as it's also
fairly early in the ascii table. But it's really this "do it before the
last character" thing that is key to getting the ordering right.

Just brainstorming some alternatives:

 - we have diff.orderFile, etc. Could we stuff this data into a less
   confusing name (even just "$filename.conflict_msg"), and then provide
   a custom ordering to the diff code? I think it could be done by
   generating a static ordering ahead of time, but it might even just be
   possible to tell diffcore_order() to take the ".conflict_msg"
   extension into account in its comparison function.

 - there can be other non-diff data between the individual segments. For
   example, "patch" will skip over non-diff lines. And certainly in Git
   we have our own custom headers. I'm wondering if we could attach
   these annotations to the diff-pair somehow, and then show something
   like:

     diff --git a/foo.c b/foo.c
     index 1234abcd..5678cdef 100644
     conflict modify/delete foo.c
     --- a/foo.c
     +++ b/foo.c
     @@ some actual diff starts here @@

Obviously such a thing can't really be applied. But then you wouldn't
want to apply the addition of "my.file e.conflict_msg" either.

I dunno. The latter especially is definitely more work, and requires a
bit more cooperation between the merge and diff code. In particular, you
can't just feed a straight tree to the diff anymore. We have to hold
back the annotations, and then apply them to the resulting diff. But I
think the output is much more pleasing to the eye.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/7] ll-merge: add API for capturing warnings in a strbuf instead of stderr
  2021-08-31  2:26 ` [PATCH 3/7] ll-merge: add API for capturing warnings in a strbuf instead of stderr Elijah Newren via GitGitGadget
@ 2021-09-28 22:37   ` Jeff King
  2021-09-28 23:49     ` Junio C Hamano
  2021-09-29  4:03     ` Elijah Newren
  0 siblings, 2 replies; 64+ messages in thread
From: Jeff King @ 2021-09-28 22:37 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Jonathan Nieder, Sergey Organov, Elijah Newren

On Tue, Aug 31, 2021 at 02:26:36AM +0000, Elijah Newren via GitGitGadget wrote:

> From: Elijah Newren <newren@gmail.com>
> 
> Whenever ll-merge encounters a binary file, it prints a warning to
> stderr.  However, for the --remerge-diff option we want to add, we need
> to capture all conflict messages and show them in a diff instead of
> dumping them straight to stdout or stderr.  Add some new API that will
> allow us to capture this output and display it in our preferred method.

This is a reasonable strategy for error-handling in general (though some
more thoughts below).

> @@ -71,8 +75,11 @@ static int ll_binary_merge(const struct ll_merge_driver *drv_unused,
>  	} else {
>  		switch (opts->variant) {
>  		default:
> -			warning("Cannot merge binary files: %s (%s vs. %s)",
> -				path, name1, name2);
> +			if (warnings) {
> +				strbuf_addstr(warnings, "Warning: ");
> +				strbuf_addf(warnings, msg, path, name1, name2);
> +			} else
> +				warning(msg, path, name1, name2);

The usual warn_builtin() has a lowercase "warning: " prefix, but your
strbuf variant uses uppercase.

> @@ -98,6 +105,7 @@ static int ll_binary_merge(const struct ll_merge_driver *drv_unused,
>  
>  static int ll_xdl_merge(const struct ll_merge_driver *drv_unused,
>  			mmbuffer_t *result,
> +			struct strbuf *warnings,
>  			const char *path,
>  			mmfile_t *orig, const char *orig_name,
>  			mmfile_t *src1, const char *name1,

There's a lot of plumbing this variable through. This is probably too
gross, but another option would be to call set_warn_routine() to
override it temporarily. It's gross because it affects everything, not
just this call stack (and I also think we'd need to beef up the warn
routine code to handle some of the rough edges).

I do wonder if the ll_merge() code should avoid calling warning() in the
first place. It is after all, meant to be "low-level". We already return
an error code from the function. I wonder if returning a more detailed
code instead, like:

  enum LL_MERGE_RESULT {
	LL_MERGE_OK = 0,
	LL_MERGE_CONFLICT,
	LL_MERGE_BINARY_CONFLICT,
  };

would let the caller do the sensible thing.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 4/7] merge-ort: capture and print ll-merge warnings in our preferred fashion
  2021-08-31  2:26 ` [PATCH 4/7] merge-ort: capture and print ll-merge warnings in our preferred fashion Elijah Newren via GitGitGadget
@ 2021-09-28 22:39   ` Jeff King
  2021-09-30 16:53   ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 64+ messages in thread
From: Jeff King @ 2021-09-28 22:39 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Jonathan Nieder, Sergey Organov, Elijah Newren

On Tue, Aug 31, 2021 at 02:26:37AM +0000, Elijah Newren via GitGitGadget wrote:

> From: Elijah Newren <newren@gmail.com>
> 
> Instead of immediately printing ll-merge warnings to stderr, we save
> them in our output strbuf.  Besides allowing us to move these warnings
> to a special file for --remerge-diff, this has two other benefits for
> regular merges done by merge-ort:
> 
>   * The deferral of messages ensures we can print all messages about
>     any given path together (merge-recursive was known to sometimes
>     intersperse messages about other paths, particularly when renames
>     were involved).
> 
>   * The deferral of messages means we can avoid printing spurious
>     conflict messages when we just end up aborting due to local user
>     modifications in the way.  (In contrast to merge-recursive.c which
>     prematurely checks for local modifications in the way via
>     unpack_trees() and gets the check wrong both in terms of false
>     positives and false negatives relative to renames, merge-ort does
>     not perform the local modifications in the way check until the
>     checkout() step after the full merge has been computed.)

Yeah, this is a good example of why having ll_merge() print warnings in
the first place is probably wrong. If you buy my argument that
ll_merge() should be returning an enum, then this code becomes IMHO even
nicer, as you can generate your own sensible message in the caller.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-08-31  2:26 ` [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs Elijah Newren via GitGitGadget
  2021-09-28  7:55   ` Ævar Arnfjörð Bjarmason
@ 2021-09-28 23:17   ` Jeff King
  2021-09-29  4:08     ` Junio C Hamano
  2021-09-29  5:05     ` Elijah Newren
  1 sibling, 2 replies; 64+ messages in thread
From: Jeff King @ 2021-09-28 23:17 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Jonathan Nieder, Sergey Organov, Elijah Newren

On Tue, Aug 31, 2021 at 02:26:38AM +0000, Elijah Newren via GitGitGadget wrote:

> From: Elijah Newren <newren@gmail.com>
> 
> The tmp_objdir API provides the ability to create temporary object
> directories, but was designed with the goal of having subprocesses
> access these object stores, followed by the main process migrating
> objects from it to the main object store or just deleting it.  The
> subprocesses would view it as their primary datastore and write to it.
> 
> For the --remerge-diff option we want to add to show & log, we want all
> writes of intermediate merge results (both blobs and trees) to go to
> this alternate object store; since those writes will be done by the main
> process, we need this "alternate" object store to actually be the
> primary object store.  When show & log are done, they'll simply remove
> this temporary object store.

I think this is consistent with the original design of tmp_objdir. I
just never needed to do anything in-process before, so overriding the
environment of sub-processes was sufficient.

I do think these are dangerous and may cause bugs, though. Anything you
write while the tmp_objdir is marked as "primary" is going to go away
when you remove it. So any object names you reference outside of that,
either:

  - by another object that you create after the tmp_objdir is removed;
    or

  - in a ref

are going to turn into repository corruption. Of course that's true for
the existing sub-processes, too, but here we're touching a wider variety
of code.

Obviously the objects we write as part of remerge-diff are meant to be
temporary, and we'll never reference them in any other way. And usually
we would not expect a diff to write any other objects. But one thing
that comes to mind if textconv caching.

If you do a remerge diff on a blob that uses textconv, and the caching
feature is turned on, then we'll write out a new blob with the cached
value, and eventually a new tree and refs/notes/ pointer referencing it.
I'm not sure of the timing of all of that, but it seems likely to me
that at least some of that will end up in your tmp_objdir.

If you remove the tmp_objdir as the primary as soon as you're done with
the merge, but before you run the diff, you might be OK, though.

If not, then I think the solution is probably not to install this as the
"primary", but rather:

  - do the specific remerge-diff writes we want using a special "write
    to this object dir" variant of write_object_file()

  - install the tmp_objdir as an alternate, so the reading side (which
    is diff code that doesn't otherwise know about our remerge-diff
    trickery) can find them

And that lets other writers avoid writing into the temporary area
accidentally.

In that sense this is kind of like the pretend_object_file() interface,
except that it's storing the objects on disk instead of in memory. Of
course another way of doing that would be to stuff the object data into
tempfiles and just put pointers into the in-memory cached_objects array.

It's also not entirely foolproof (nor is the existing
pretend_object_file()). Any operation which is fooled into thinking we
have object X because it's in the fake-object list or the tmp_objdir may
reference that object erroneously, creating a corruption. But it's still
safer than allowing arbitrary writes into the tmp_objdir.

  Side note: The pretend_object_file() approach is actually even better,
  because we know the object is fake. So it does not confuse
  write_object_file()'s "do we already have this object" freshening
  check.

  I suspect it could even be made faster than the tmp_objdir approach.
  From our perspective, these objects really are tempfiles. So we could
  write them as such, not worrying about things like fsyncing them,
  naming them into place, etc. We could just write them out, then mmap
  the results, and put the pointers into cached_objects (currently it
  insists on malloc-ing a copy of the input buffer, but that seems like
  an easy extension to add).

  In fact, I think you could get away with just _one_ tempfile per
  merge. Open up one tempfile. Write out all of the objects you want to
  "store" into it in sequence, and record the lseek() offsets before and
  after for each object. Then mmap the whole result, and stuff the
  appropriate pointers (based on arithmetic with the offsets) into the
  cached_objects list.

  As a bonus, this tempfile can easily be in $TMPDIR, meaning
  remerge-diff could work on a read-only repository (and the OS can
  perhaps handle the data better, especially if $TMPDIR isn't a regular
  filesystem).

> We also need one more thing -- `git log --remerge-diff` can cause the
> temporary object store to fill up with loose objects.  Rather than
> trying to gc that are known garbage anyway, we simply want to know the
> location of the temporary object store so we can purge the loose objects
> after each merge.

This paragraph confused me. At first I thought you were talking about
how to avoid calling "gc --auto" because we don't want to waste time
thinking all those loose objects were worth gc-ing. But we wouldn't do
that anyway (git-log does not expect to make objects so doesn't call it
at all, and anyway you'd expect it to happen at the end of a process,
after we've already removed the tmp_objdir).

But I think you just mean: we can build up a bunch of loose objects,
which is inefficient. We don't want to gc them, because that's even less
efficient. So we want to clean them out between merges.

But I don't see any code to do that here. I guess that's maybe why
you've added tmp_objdir_path(), to find them later. But IMHO this would
be much better encapsulated as tmp_objdir_clear_objects() or something.

But simpler still: is there any reason not to just drop and re-create
the tmp-objdir for each merge? It's not very expensive to do so (and
certainly not compared to the cost of actually writing out the objects).

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/7] ll-merge: add API for capturing warnings in a strbuf instead of stderr
  2021-09-28 22:37   ` Jeff King
@ 2021-09-28 23:49     ` Junio C Hamano
  2021-09-29  4:03     ` Elijah Newren
  1 sibling, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2021-09-28 23:49 UTC (permalink / raw)
  To: Jeff King
  Cc: Elijah Newren via GitGitGadget, git, Jonathan Nieder,
	Sergey Organov, Elijah Newren

Jeff King <peff@peff.net> writes:

> I do wonder if the ll_merge() code should avoid calling warning() in the
> first place. It is after all, meant to be "low-level". We already return
> an error code from the function. I wonder if returning a more detailed
> code instead, like:
>
>   enum LL_MERGE_RESULT {
> 	LL_MERGE_OK = 0,
> 	LL_MERGE_CONFLICT,
> 	LL_MERGE_BINARY_CONFLICT,
>   };
>
> would let the caller do the sensible thing.

I like it better as a longer-term direction.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 6/7] show, log: provide a --remerge-diff capability
  2021-09-28  8:01   ` Ævar Arnfjörð Bjarmason
@ 2021-09-29  4:00     ` Elijah Newren
  0 siblings, 0 replies; 64+ messages in thread
From: Elijah Newren @ 2021-09-29  4:00 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Jeff King,
	Jonathan Nieder, Sergey Organov, Neeraj K. Singh

On Tue, Sep 28, 2021 at 1:05 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Tue, Aug 31 2021, Elijah Newren via GitGitGadget wrote:
>
> >  static int decoration_given;
> >  static int use_mailmap_config = 1;
> > +static struct tmp_objdir *tmp_objdir;
> >  static const char *fmt_patch_subject_prefix = "PATCH";
> >  static int fmt_patch_name_max = FORMAT_PATCH_NAME_MAX_DEFAULT;
> >  static const char *fmt_pretty;
>
> So here we make this static file-level etc...
>
> > @@ -407,6 +410,17 @@ static int cmd_log_walk(struct rev_info *rev)
> >       int saved_nrl = 0;
> >       int saved_dcctc = 0;
> >
> > +     if (rev->remerge_diff) {
> > +             tmp_objdir = tmp_objdir_create();
> > +             if (!tmp_objdir)
> > +                     die(_("unable to create temporary object directory"));
> > +             tmp_objdir_make_primary(the_repository, tmp_objdir);
> > +
> > +             strbuf_init(&rev->remerge_objdir_location, 0);
> > +             strbuf_addstr(&rev->remerge_objdir_location,
> > +                           tmp_objdir_path(tmp_objdir));
> > +     }
> > +
> >       if (rev->early_output)
> >               setup_early_output();
> >
> > @@ -449,6 +463,13 @@ static int cmd_log_walk(struct rev_info *rev)
> >       rev->diffopt.no_free = 0;
> >       diff_free(&rev->diffopt);
> >
> > +     if (rev->remerge_diff) {
> > +             strbuf_release(&rev->remerge_objdir_location);
> > +             tmp_objdir_remove_as_primary(the_repository, tmp_objdir);
> > +             tmp_objdir_destroy(tmp_objdir);
> > +             tmp_objdir = NULL;
>
> ...but all of the "tmp_objdir" usage is in one function, can't the
> variable be declared here instead?

That's a very good point.

> We need to hand the "remerge_objdir_location" off to the "rev_info"
> struct, but that seems separate from its lifetime.

Given Peff's suggestion elsewhere, though, to destroy the tmp_objdir
after each merge and create a new one, I wonder if I should actually
be passing a tmp_objdir** to rev_info (allowing log-tree to do the
work of destroying and creating a new one after each merge, instead of
using the "remerge_objdir_location" to run a recursive delete of
files).  That'd still work with your idea to remove the statically
scoped variable, though.

> Re my [1] & [2] I like Neeraj's "atexit cleanup" approach better,
> perhaps that makes your cleanup in log-tree.c redundant or easier?

Having an atexit cleanup as a safety measure seems fine.  However, I
don't like avoiding the manual cleanup step and relying on atexit
cleanup; I'd go so far as to say I think that'd be a bug, at least for
my usage.  It presumes one-shot usage, whereas I'd rather move git to
being more library-like.

However, fully destroying the tmp_objdir probably makes the cleanup in
log-tree.c easier.

> Per [2] it looks like you need to "hand off" the
> "remerge_objdir_location", so having the struct live in tmp-objdir.h as
> I suggested in [2] might make that work...
>
> 1. https://lore.kernel.org/git/87v92lxhh4.fsf@evledraar.gmail.com/
> 2. https://lore.kernel.org/git/87r1d9xh71.fsf@evledraar.gmail.com/

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/7] ll-merge: add API for capturing warnings in a strbuf instead of stderr
  2021-09-28 22:37   ` Jeff King
  2021-09-28 23:49     ` Junio C Hamano
@ 2021-09-29  4:03     ` Elijah Newren
  1 sibling, 0 replies; 64+ messages in thread
From: Elijah Newren @ 2021-09-29  4:03 UTC (permalink / raw)
  To: Jeff King
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

On Tue, Sep 28, 2021 at 3:37 PM Jeff King <peff@peff.net> wrote:
>
> On Tue, Aug 31, 2021 at 02:26:36AM +0000, Elijah Newren via GitGitGadget wrote:
>
> > From: Elijah Newren <newren@gmail.com>
> >
> > Whenever ll-merge encounters a binary file, it prints a warning to
> > stderr.  However, for the --remerge-diff option we want to add, we need
> > to capture all conflict messages and show them in a diff instead of
> > dumping them straight to stdout or stderr.  Add some new API that will
> > allow us to capture this output and display it in our preferred method.
>
> This is a reasonable strategy for error-handling in general (though some
> more thoughts below).
>
> > @@ -71,8 +75,11 @@ static int ll_binary_merge(const struct ll_merge_driver *drv_unused,
> >       } else {
> >               switch (opts->variant) {
> >               default:
> > -                     warning("Cannot merge binary files: %s (%s vs. %s)",
> > -                             path, name1, name2);
> > +                     if (warnings) {
> > +                             strbuf_addstr(warnings, "Warning: ");
> > +                             strbuf_addf(warnings, msg, path, name1, name2);
> > +                     } else
> > +                             warning(msg, path, name1, name2);
>
> The usual warn_builtin() has a lowercase "warning: " prefix, but your
> strbuf variant uses uppercase.

Ah, thanks for catching that.  I can fix it up.

> > @@ -98,6 +105,7 @@ static int ll_binary_merge(const struct ll_merge_driver *drv_unused,
> >
> >  static int ll_xdl_merge(const struct ll_merge_driver *drv_unused,
> >                       mmbuffer_t *result,
> > +                     struct strbuf *warnings,
> >                       const char *path,
> >                       mmfile_t *orig, const char *orig_name,
> >                       mmfile_t *src1, const char *name1,
>
> There's a lot of plumbing this variable through. This is probably too
> gross, but another option would be to call set_warn_routine() to
> override it temporarily. It's gross because it affects everything, not
> just this call stack (and I also think we'd need to beef up the warn
> routine code to handle some of the rough edges).
>
> I do wonder if the ll_merge() code should avoid calling warning() in the
> first place. It is after all, meant to be "low-level". We already return
> an error code from the function. I wonder if returning a more detailed
> code instead, like:
>
>   enum LL_MERGE_RESULT {
>         LL_MERGE_OK = 0,
>         LL_MERGE_CONFLICT,
>         LL_MERGE_BINARY_CONFLICT,
>   };
>
> would let the caller do the sensible thing.

Ooh, I like this idea.  I'll have to change all the callers, but
there's only about half a dozen or so...

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-28 23:17   ` Jeff King
@ 2021-09-29  4:08     ` Junio C Hamano
  2021-09-30  7:33       ` Jeff King
  2021-09-29  5:05     ` Elijah Newren
  1 sibling, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2021-09-29  4:08 UTC (permalink / raw)
  To: Jeff King
  Cc: Elijah Newren via GitGitGadget, git, Jonathan Nieder,
	Sergey Organov, Elijah Newren

Jeff King <peff@peff.net> writes:

>   Side note: The pretend_object_file() approach is actually even better,
>   because we know the object is fake. So it does not confuse
>   write_object_file()'s "do we already have this object" freshening
>   check.
>
>   I suspect it could even be made faster than the tmp_objdir approach.
>   From our perspective, these objects really are tempfiles. So we could
>   write them as such, not worrying about things like fsyncing them,
>   naming them into place, etc. We could just write them out, then mmap
>   the results, and put the pointers into cached_objects (currently it
>   insists on malloc-ing a copy of the input buffer, but that seems like
>   an easy extension to add).
>
>   In fact, I think you could get away with just _one_ tempfile per
>   merge. Open up one tempfile. Write out all of the objects you want to
>   "store" into it in sequence, and record the lseek() offsets before and
>   after for each object. Then mmap the whole result, and stuff the
>   appropriate pointers (based on arithmetic with the offsets) into the
>   cached_objects list.

Cute.  The remerge diff code path creates a full tree that records
the mechanical merge result.  By hooking into the lowest layer of
write_object() interface, we'd serialize all objects in such a tree
in the order they are computed (bottom up from the leaf level, I'd
presume) into a single flat file ;-)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-28  7:55   ` Ævar Arnfjörð Bjarmason
@ 2021-09-29  4:22     ` Elijah Newren
  2021-09-30  7:41       ` Jeff King
  2021-09-30 14:17       ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 64+ messages in thread
From: Elijah Newren @ 2021-09-29  4:22 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Jeff King,
	Jonathan Nieder, Sergey Organov, Neeraj Singh

On Tue, Sep 28, 2021 at 1:00 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Tue, Aug 31 2021, Elijah Newren via GitGitGadget wrote:
>
> I commented just now on how this API is duplicated between here &
> another in-flight series in
> https://lore.kernel.org/git/87v92lxhh4.fsf@evledraar.gmail.com/; Just
> comments on this patch here:
>
> > diff --git a/tmp-objdir.c b/tmp-objdir.c
> > index b8d880e3626..9f75a75d1c0 100644
> > --- a/tmp-objdir.c
> > +++ b/tmp-objdir.c
> > @@ -288,7 +288,36 @@ const char **tmp_objdir_env(const struct tmp_objdir *t)
> >       return t->env.v;
> >  }
> >
> > +const char *tmp_objdir_path(struct tmp_objdir *t)
> > +{
> > +     return t->path.buf;
> > +}
>
> Not your fault & this pre-dates your patch, but FWIW I prefer our APIs
> that don't have these "hidden struct" shenanigans (like say "struct
> string_list") so a caller could just access this, we can just declare it
> "const" appropriately.
>
> We're also all in-tree friends here, so having various accessors for no
> reason other than to access struct members seems a bit too much.

That's a fair point, but just to play counterpoint for a minute...

FWIW, I dislike when our public facing APIs are polluted with all
kinds of internal details.  merge-recursive being a case in point.
When writing merge-ort, although I had a struct with public fields,
that struct also contained an opaque struct (pointer) within it to
hide several private fields.  (I would have liked to hide or remove a
few more fields, but couldn't do so while the merge_recursive_options
struct was shared between both merge-ort and merge-recursive.)

That said, I agree it can certainly be overdone, and tmp_objdir is
pretty simple.  However, sometimes even in simple cases I like when
folks make use of an opaque struct because it means folks put some
effort into thinking more about the API that should be exposed.
That's something we as a project have often overlooked in the past, as
evidenced both by our one-shot usage mentality, and the existence of
external projects like libgit2 attempting to correct this design
shortcoming.  I'd like git to move more towards being structured as a
reusable library as well as a useful command-line tool.

> All of which is to say if you + Neeraj come up with some common API I
> for one wouldn't mind moving the struct deceleration to tmp-objdir.h,
> but it looks like in his version it can stay "private" more easily.

Peff's ideas to just delete and recreate the temporary object store
after every merge would let me reduce the API extensions and keep
things more private.  So I think our ideas are converging a bit.  :-)

> In this particular case I hadn't understood on a previous reading of
> tmp-objdir.c why this "path" isn't a statically allocated "char
> path[PATH_MAX]", or a least we that hardcoded "1024" should be a
> PATH_MAX, as it is in some other cases.

Actually, PATH_MAX shouldn't be used:

https://lore.kernel.org/git/7d1237c7-5a83-d766-7d93-5f0d59166067@web.de/
https://lore.kernel.org/git/20180720180427.GE22486@sigill.intra.peff.net/

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-28 23:17   ` Jeff King
  2021-09-29  4:08     ` Junio C Hamano
@ 2021-09-29  5:05     ` Elijah Newren
  2021-09-30  7:26       ` Jeff King
  1 sibling, 1 reply; 64+ messages in thread
From: Elijah Newren @ 2021-09-29  5:05 UTC (permalink / raw)
  To: Jeff King
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

On Tue, Sep 28, 2021 at 4:17 PM Jeff King <peff@peff.net> wrote:
>
> On Tue, Aug 31, 2021 at 02:26:38AM +0000, Elijah Newren via GitGitGadget wrote:
>
> > From: Elijah Newren <newren@gmail.com>
> >
> > The tmp_objdir API provides the ability to create temporary object
> > directories, but was designed with the goal of having subprocesses
> > access these object stores, followed by the main process migrating
> > objects from it to the main object store or just deleting it.  The
> > subprocesses would view it as their primary datastore and write to it.
> >
> > For the --remerge-diff option we want to add to show & log, we want all
> > writes of intermediate merge results (both blobs and trees) to go to
> > this alternate object store; since those writes will be done by the main
> > process, we need this "alternate" object store to actually be the
> > primary object store.  When show & log are done, they'll simply remove
> > this temporary object store.
>
> I think this is consistent with the original design of tmp_objdir. I
> just never needed to do anything in-process before, so overriding the
> environment of sub-processes was sufficient.
>
> I do think these are dangerous and may cause bugs, though. Anything you
> write while the tmp_objdir is marked as "primary" is going to go away
> when you remove it. So any object names you reference outside of that,
> either:
>
>   - by another object that you create after the tmp_objdir is removed;
>     or
>
>   - in a ref
>
> are going to turn into repository corruption. Of course that's true for
> the existing sub-processes, too, but here we're touching a wider variety
> of code.
>
> Obviously the objects we write as part of remerge-diff are meant to be
> temporary, and we'll never reference them in any other way. And usually
> we would not expect a diff to write any other objects. But one thing
> that comes to mind if textconv caching.
>
> If you do a remerge diff on a blob that uses textconv, and the caching
> feature is turned on, then we'll write out a new blob with the cached
> value, and eventually a new tree and refs/notes/ pointer referencing it.
> I'm not sure of the timing of all of that, but it seems likely to me
> that at least some of that will end up in your tmp_objdir.

Interesting.  Thanks for explaining this case, and for the testcase in
the other thread.

So there are other parts of running `git log` that are not read-only,
besides what I'm doing.  That's unfortunate.  Can we just disable
those things in this section of the code?

Also, I think it makes sense that textconv caching causes a problem
via writing new blobs and trees *and* refs that reference these blobs
and trees.  However, I'm not sure I'm really grasping it, because I
don't see how your solutions are safe either.  I'll mention more after
each one.

> If you remove the tmp_objdir as the primary as soon as you're done with
> the merge, but before you run the diff, you might be OK, though.

It has to be after I run the diff, because the diff needs access to
the temporary files to diff against them.

> If not, then I think the solution is probably not to install this as the
> "primary", but rather:
>
>   - do the specific remerge-diff writes we want using a special "write
>     to this object dir" variant of write_object_file()
>
>   - install the tmp_objdir as an alternate, so the reading side (which
>     is diff code that doesn't otherwise know about our remerge-diff
>     trickery) can find them
>
> And that lets other writers avoid writing into the temporary area
> accidentally.

Doesn't this have the same problem?  If something similar to textconv
caching were to create new trees that referenced the existing blobs
and then added a ref that referred to that tree, then you have the
exact same problem, right?  The new tree and the new ref would be
corrupt as soon as the tmp-objdir went away.  Whether or not other
callers write to the temporary area isn't the issue, it's whether they
write refs that can refer to anything from the temporary area.  Or am
I missing something?

> In that sense this is kind of like the pretend_object_file() interface,
> except that it's storing the objects on disk instead of in memory. Of
> course another way of doing that would be to stuff the object data into
> tempfiles and just put pointers into the in-memory cached_objects array.
>
> It's also not entirely foolproof (nor is the existing
> pretend_object_file()). Any operation which is fooled into thinking we
> have object X because it's in the fake-object list or the tmp_objdir may
> reference that object erroneously, creating a corruption. But it's still
> safer than allowing arbitrary writes into the tmp_objdir.

Why is the pretend_object_file() interface safer?  Does it disable the
textconv caching somehow?  I don't see why it helps avoid this
problem.

>   Side note: The pretend_object_file() approach is actually even better,
>   because we know the object is fake. So it does not confuse
>   write_object_file()'s "do we already have this object" freshening
>   check.

Oh, that's interesting.  That would be helpful.

>   I suspect it could even be made faster than the tmp_objdir approach.
>   From our perspective, these objects really are tempfiles. So we could
>   write them as such, not worrying about things like fsyncing them,
>   naming them into place, etc. We could just write them out, then mmap
>   the results, and put the pointers into cached_objects (currently it
>   insists on malloc-ing a copy of the input buffer, but that seems like
>   an easy extension to add).
>
>   In fact, I think you could get away with just _one_ tempfile per
>   merge. Open up one tempfile. Write out all of the objects you want to
>   "store" into it in sequence, and record the lseek() offsets before and
>   after for each object. Then mmap the whole result, and stuff the
>   appropriate pointers (based on arithmetic with the offsets) into the
>   cached_objects list.
>
>   As a bonus, this tempfile can easily be in $TMPDIR, meaning
>   remerge-diff could work on a read-only repository (and the OS can
>   perhaps handle the data better, especially if $TMPDIR isn't a regular
>   filesystem).

Interesting.  I'm not familiar with the pretend_object_file() code,
but I think I catch what you're saying.  I don't see anything in
object-file.c that allows removing items from the cache, as I would
want to do, but I'm sure I could add something.

(I'm still not sure how this avoids the problem of things like
textconv caching trying to write an object that references one of
these, though.  Should we just manually disable textconv caching
during a remerge-diff?)

> > We also need one more thing -- `git log --remerge-diff` can cause the
> > temporary object store to fill up with loose objects.  Rather than
> > trying to gc that are known garbage anyway, we simply want to know the
> > location of the temporary object store so we can purge the loose objects
> > after each merge.
>
> This paragraph confused me. At first I thought you were talking about
> how to avoid calling "gc --auto" because we don't want to waste time
> thinking all those loose objects were worth gc-ing. But we wouldn't do
> that anyway (git-log does not expect to make objects so doesn't call it
> at all, and anyway you'd expect it to happen at the end of a process,
> after we've already removed the tmp_objdir).
>
> But I think you just mean: we can build up a bunch of loose objects,
> which is inefficient. We don't want to gc them, because that's even less
> efficient. So we want to clean them out between merges.

Yes, sorry that wasn't clear.

> But I don't see any code to do that here. I guess that's maybe why
> you've added tmp_objdir_path(), to find them later. But IMHO this would
> be much better encapsulated as tmp_objdir_clear_objects() or something.
>
> But simpler still: is there any reason not to just drop and re-create
> the tmp-objdir for each merge? It's not very expensive to do so (and
> certainly not compared to the cost of actually writing out the objects).

Ooh, indeed.  I like that, and it turns out that the majority of the
cost of dropping the tmp_objdir is the recursive directory removal,
that I wanted anyway as part of me inbetween-merges cleanup.

But, of course, if we go the pretend_object_file() route then the
tmp_objdir() stuff probably becomes moot.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/7] merge-ort: add ability to record conflict messages in a file
  2021-09-28 22:29   ` Jeff King
@ 2021-09-29  6:25     ` Elijah Newren
  2021-09-29 16:14       ` Junio C Hamano
  2021-09-30  7:58       ` Jeff King
  0 siblings, 2 replies; 64+ messages in thread
From: Elijah Newren @ 2021-09-29  6:25 UTC (permalink / raw)
  To: Jeff King
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

On Tue, Sep 28, 2021 at 3:29 PM Jeff King <peff@peff.net> wrote:
>
> On Tue, Aug 31, 2021 at 02:26:35AM +0000, Elijah Newren via GitGitGadget wrote:
>
> > There are several considerations here:
> >   * We have to pick file(s) where we write these conflict messages too
> >   * We want to make it as clear as possible that it's not a real file
> >     but a set of messages about another file
> >   * We want conflict messages about a file to appear near the file in
> >     question in a diff, preferably immediately preceding the file in
> >     question
> >   * Related to the above, per-file conflict messages are preferred
> >     over lumping all conflict messages into one big file
> >
> > To achive the above:
> >   * We put the conflict messages for $filename in
> >       $filename[0:-1] + " " + $filename[-1] + ".conflict_msg"
> >     or, in words, we insert a space before the final character of
> >     the filename and then also add ".conflict_msg" at the end.
>
> It took me a minute to understand the space thing. I thought at first it
> was about avoiding conflicts with existing names (and while it might
> help in practice, it's not a guarantee). But I think it's about the
> "appear preceding the file" goal. The space sorts before any other
> printable character in the final position.

Yeah, it's all about the ordering.  I guess it helps slightly with
conflict avoidance, but I cannot rely on it; I have to check for
colliding files and potentially tweak the filename further.

> That's...simultaneously clever and gross. My biggest complaint is that
> the space looks like a bug in the output.

Junio had basically the same reaction[*].  :-)

[*] https://lore.kernel.org/git/xmqqk0k0qkmv.fsf@gitster.g/

> Using another character like "." might not be too bad, as it's also
> fairly early in the ascii table. But it's really this "do it before the
> last character" thing that is key to getting the ordering right.
>
> Just brainstorming some alternatives:
>
>  - we have diff.orderFile, etc. Could we stuff this data into a less
>    confusing name (even just "$filename.conflict_msg"), and then provide
>    a custom ordering to the diff code? I think it could be done by
>    generating a static ordering ahead of time, but it might even just be
>    possible to tell diffcore_order() to take the ".conflict_msg"
>    extension into account in its comparison function.

I can't just go on the ".conflict_msg" extension.  As you noted above,
this scheme is not sufficient for avoiding collisions.  So I need to
append extra "cruft" to the name in the case of collisions -- meaning
we can't special case on just that extension.

I also don't like how diff.orderFile provides a global ordering of the
files listed, rather than providing some scheme for relative
orderings.  That'd either force me to precompute the diff to determine
all the files that were different so I can list _all_ of them, or put
up with the fact that the files with non-content conflicts will be
listed very first in the output, even if their name is
'zee-last-file.c' -- surprising users at the output ordering.

This also means that if the user had their own ordering defined, then
I'm overriding it and messing up their ordering, which might be
problematic.

So, I'm not so sure about this solution; it feels like it introduces
bigger holes than the ugly space character it is fixing.

>  - there can be other non-diff data between the individual segments. For
>    example, "patch" will skip over non-diff lines. And certainly in Git
>    we have our own custom headers. I'm wondering if we could attach
>    these annotations to the diff-pair somehow, and then show something
>    like:
>
>      diff --git a/foo.c b/foo.c
>      index 1234abcd..5678cdef 100644
>      conflict modify/delete foo.c

A couple things here...

First, I'm not so sure I like the abbreviation here.  Just knowing
"modify/delete" might be enough in some cases, but I'd rather have the
full messages that would have been printed to the console, e.g.:

CONFLICT (modify/delete): foo.c deleted in HASH1 (SHORT
SUMMARY1) and modified in HASH2 (SHORT SUMMARY 2).  Version HASH2
(SHORT SUMMARY2) of  foo.c left in tree.

because I think the commit references are useful context.  That extra
context might be of little use for many modify/delete conflicts, but
is much more important for conflicts involving renames; e.g.
"rename/rename" is much less useful than being able to know the
original name of the file and with which parent commit each filename
is associated.  So, that raises the question: could we pack all that
information from the full conflict notice into these conflict
header(s)?  (And do we have to special case the code to print it all
on one line when doing the remerge-diff since the diff output needs
them to be one-line headers?)

Second, what about when there are multiple non-content conflict types
for a single file, e.g. rename/delete + rename/add + modify/delete +
mode conflict + unmergeable binary?  (Yes, I think it's possible for
one path to have all five of those: (1) source file deleted on one
side, renamed on the other, (2) rename target on one side matches new
file added on other side of history, (3) renamed file also had its
contents modified, thus modify vs. delete, (4) added file on other
side of history had a different mode, (5) added file on other side of
history is a binary.)  Do we just use multiple conflict headers?

Third, what about the cases where there is no diff, just conflict
headers?  (I suspect many modify/delete or rename/delete or binary
files may end up in such a situation.)

I don't think any of those are deal breakers, but it means more work,
and maybe also other forms of ugliness.

>      --- a/foo.c
>      +++ b/foo.c
>      @@ some actual diff starts here @@
>
> Obviously such a thing can't really be applied. But then you wouldn't
> want to apply the addition of "my.file e.conflict_msg" either.

Nit: "my.fil e.conflict_msg", not "my.file e.conflict_msg" (the 'e' in
'file' is not repeated, otherwise the auxiliary file wouldn't sort
before its companion file)

> I dunno. The latter especially is definitely more work, and requires a
> bit more cooperation between the merge and diff code. In particular, you
> can't just feed a straight tree to the diff anymore. We have to hold
> back the annotations, and then apply them to the resulting diff. But I
> think the output is much more pleasing to the eye.

It's certainly an interesting idea.  It's a lot more work, it involves
the inability to feed a straight tree to a diff would require piping
things through several different layers (merge -> log -> diff, and
possible multiple diff layers), it may mean we need special handling
for when there are only conflict headers for a file with no file
differences, the length of the conflict headers could be comically
long, and it's all essentially for what is a rather uncommon case
anyway.  But, on the plus side, it does avoid the rather ugly space.

I'll have to think about it.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/7] merge-ort: add ability to record conflict messages in a file
  2021-09-29  6:25     ` Elijah Newren
@ 2021-09-29 16:14       ` Junio C Hamano
  2021-09-29 16:31         ` Elijah Newren
  2021-09-30  7:58       ` Jeff King
  1 sibling, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2021-09-29 16:14 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Jeff King, Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

Elijah Newren <newren@gmail.com> writes:

> I also don't like how diff.orderFile provides a global ordering of the
> files listed, rather than providing some scheme for relative
> orderings.  That'd either force me to precompute the diff to determine
> all the files that were different so I can list _all_ of them,...

Don't we determine all the files that would be listed in the diff
anyway?  The diffcore pipeline collects all the different filepairs,
matches them up for rename/copy detection, and finally do the output
formatting for each individual filepair.

> So, I'm not so sure about this solution; it feels like it introduces
> bigger holes than the ugly space character it is fixing.
>
>>  - there can be other non-diff data between the individual segments. For
>>    example, "patch" will skip over non-diff lines. And certainly in Git
>>    we have our own custom headers. I'm wondering if we could attach
>>    these annotations to the diff-pair somehow, and then show something
>>    like:
>>
>>      diff --git a/foo.c b/foo.c
>>      index 1234abcd..5678cdef 100644
>>      conflict modify/delete foo.c
>
> A couple things here...
>
> First, I'm not so sure I like the abbreviation here.  Just knowing
> "modify/delete" might be enough in some cases, but I'd rather have the
> full messages that would have been printed to the console...
>
> Second, what about when there are multiple ...
>
> Third, what about the cases where there is no diff, ...

None of the above seems like a problem to me at least from the
presentation and consumption sides.  There is no rule that extended
diff headers has to fit on a 72-char line, cannot use line folding
by inserting LF-SP in the middle of a logical line, and there
already is at least one case we give an extended diff header without
a single line of content change (namely, "chmod +x README").

Thanks.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/7] merge-ort: add ability to record conflict messages in a file
  2021-09-29 16:14       ` Junio C Hamano
@ 2021-09-29 16:31         ` Elijah Newren
  0 siblings, 0 replies; 64+ messages in thread
From: Elijah Newren @ 2021-09-29 16:31 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

On Wed, Sep 29, 2021 at 9:14 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > I also don't like how diff.orderFile provides a global ordering of the
> > files listed, rather than providing some scheme for relative
> > orderings.  That'd either force me to precompute the diff to determine
> > all the files that were different so I can list _all_ of them,...
>
> Don't we determine all the files that would be listed in the diff
> anyway?  The diffcore pipeline collects all the different filepairs,
> matches them up for rename/copy detection, and finally do the output
> formatting for each individual filepair.

merge-ort is missing one side of the diff.  It only has the parents of
the merge commit, and their merge base.  So, merge-ort would now need
to know about an additional commit (the original merge), and compute
the diff between that and the merge-result it is creating, in advance
of an external caller (log) calling diff to compute the result between
those two trees.

> > So, I'm not so sure about this solution; it feels like it introduces
> > bigger holes than the ugly space character it is fixing.
> >
> >>  - there can be other non-diff data between the individual segments. For
> >>    example, "patch" will skip over non-diff lines. And certainly in Git
> >>    we have our own custom headers. I'm wondering if we could attach
> >>    these annotations to the diff-pair somehow, and then show something
> >>    like:
> >>
> >>      diff --git a/foo.c b/foo.c
> >>      index 1234abcd..5678cdef 100644
> >>      conflict modify/delete foo.c
> >
> > A couple things here...
> >
> > First, I'm not so sure I like the abbreviation here.  Just knowing
> > "modify/delete" might be enough in some cases, but I'd rather have the
> > full messages that would have been printed to the console...
> >
> > Second, what about when there are multiple ...
> >
> > Third, what about the cases where there is no diff, ...
>
> None of the above seems like a problem to me at least from the
> presentation and consumption sides.  There is no rule that extended
> diff headers has to fit on a 72-char line, cannot use line folding
> by inserting LF-SP in the middle of a logical line, and there
> already is at least one case we give an extended diff header without
> a single line of content change (namely, "chmod +x README").

Cool, that's good to hear.  I'll look into it.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-29  5:05     ` Elijah Newren
@ 2021-09-30  7:26       ` Jeff King
  2021-09-30  7:46         ` Jeff King
                           ` (3 more replies)
  0 siblings, 4 replies; 64+ messages in thread
From: Jeff King @ 2021-09-30  7:26 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

On Tue, Sep 28, 2021 at 10:05:31PM -0700, Elijah Newren wrote:

> So there are other parts of running `git log` that are not read-only,
> besides what I'm doing.  That's unfortunate.  Can we just disable
> those things in this section of the code?

It might be possible for the cache (we'd still generate entries, but
just not write them). It's "just" an optimization, though it's possible
there are cases where it's really annoying to do so. I'd imagine those
are a stretch. I do use textconv to decrypt gpg blobs with a key on a
hardware token, for example. I don't require a touch for each one,
though, nor would I use textconv caching anyway (since the point is not
to store the unencrypted contents).

It wouldn't necessarily be OK to disable object writes in general.
Though I'd think since a diff is _conceptually_ a read-only operation
that any writes it wants to do would be in the same caching /
optimization boat.  I'd also worry about the maintenance cost of having
to annotate any such case.

> Also, I think it makes sense that textconv caching causes a problem
> via writing new blobs and trees *and* refs that reference these blobs
> and trees.  However, I'm not sure I'm really grasping it, because I
> don't see how your solutions are safe either.  I'll mention more after
> each one.

Heh. So I originally wrote more on this in my earlier email, but worried
I was getting too into the weeds and would confuse you, so I pared it
down. But I managed to be confusing anyway. :)

Here's one mental model. If you have a temporary object store, then
these are the bad things that can happen:

  1. Some code may write an object to it, thinking the data is saved,
     but we later throw it away.

  2. Some code may _skip_ writing an object, thinking that we already
     have that object available (when in fact we'll later throw it
     away).

  3. Some code may record the fact that we have an object in memory, and
     then later (after the temporary object store is not in use), may
     point to it when writing an object or reference.

Pointing the primary object-dir at a temporary directory (like your
patch does) means we're subject to all three.

Problem (1) happens because unrelated code may not realize we've swapped
out the primary object writing code for this throw-away directory. But
we could require writers to explicitly indicate that they want to write
to the temporary area, while allowing other writes to go to the regular
object store. This could be done with a flag or variant of
write_object_file(). Or if the temporary area isn't a regular object
directory at all (like the whole tempfile / pretend thing), then this
happens automatically, because the regular object-write paths don't even
know about our temporary area.

Problem (2) is mostly handled inside write_object_file(). It will
optimize out writes of objects we already have. So it needs to know when
we have an object "for real", and when we just have it in our temporary
store. If we make the tmp_objdir the primary, then it doesn't know this
(though we could teach it to check. In the pretend_object_file() system,
it already knows this (it uses the "freshen" code paths which actually
want to find the entry in the filesystem).

Problem (3) is the hardest one, because we don't really distinguish
between readers and writers. So if I call has_object_file(), is it
because I want to access the object? Or is it because I want to generate
a new reference to it, which must ensure that it exists? One of those is
OK to look at the tmp_objdir (and indeed, is the whole point of making
it in the first place), and the other is the first step to corrupting
the repository. ;)

With the environment variables we set up for running external commands
(like hooks) in the receive-pack quarantine, one bit of safety we have
is to prevent ref writes entirely in those commands. That works
reasonably well. Even if they may write new objects that might get
thrown away (if the push is rejected), they'd ultimately need to point
to those objects with a ref.  And because those hooks spend their
_entire_ run-time in the quarantine state, they can't ever write a ref.

Whereas doing it in-process is a little dicier. Right now the textconv
cache writes out a new tree for every entry we add, so it happens to do
the ref write while the tmp-objdir is still in place. But as this
comment notes:

  $ git grep -B6 notes_cache_write diff.c
  diff.c-         /*
  diff.c-          * we could save up changes and flush them all at the end,
  diff.c-          * but we would need an extra call after all diffing is done.
  diff.c-          * Since generating a cache entry is the slow path anyway,
  diff.c-          * this extra overhead probably isn't a big deal.
  diff.c-          */
  diff.c:         notes_cache_write(textconv->cache);

this is slow, and it would be perfectly reasonable to flush the cache at
the end of the process (e.g., using an atexit() handler). In which case
we'd hit this problem (3) exactly: we'd generate and remember objects
during the tmp-objdir period, but only later actually reference them.
This is more likely than the receive-pack hook case, because we're doing
it all in process.

> > If you remove the tmp_objdir as the primary as soon as you're done with
> > the merge, but before you run the diff, you might be OK, though.
> 
> It has to be after I run the diff, because the diff needs access to
> the temporary files to diff against them.

Right, of course. I was too fixated on the object-write part, forgetting
that the whole point of the exercise is to later read them back. :)

> > If not, then I think the solution is probably not to install this as the
> > "primary", but rather:
> >
> >   - do the specific remerge-diff writes we want using a special "write
> >     to this object dir" variant of write_object_file()
> >
> >   - install the tmp_objdir as an alternate, so the reading side (which
> >     is diff code that doesn't otherwise know about our remerge-diff
> >     trickery) can find them
> >
> > And that lets other writers avoid writing into the temporary area
> > accidentally.
> 
> Doesn't this have the same problem?  If something similar to textconv
> caching were to create new trees that referenced the existing blobs
> and then added a ref that referred to that tree, then you have the
> exact same problem, right?  The new tree and the new ref would be
> corrupt as soon as the tmp-objdir went away.  Whether or not other
> callers write to the temporary area isn't the issue, it's whether they
> write refs that can refer to anything from the temporary area.  Or am
> I missing something?

The key thing here is in the first step, where remerge-diff is
explicitly saying "I want to write to this temporary object-dir". But
the textconv-cache code does not; it gets to write to the normal spot.
So it avoids problem (1).

You're right that it does not avoid problem (3) exactly. But triggering
that would require some code not just writing other objects or
references while the tmp-objdir is in place, but specifically
referencing the objects that remerge-diff did put into the tmp-objdir.
That seems a lot less likely to me (because the thing we're most worried
about is unrelated code that just happens to write while the tmp-objdir
is in place).

> > In that sense this is kind of like the pretend_object_file() interface,
> > except that it's storing the objects on disk instead of in memory. Of
> > course another way of doing that would be to stuff the object data into
> > tempfiles and just put pointers into the in-memory cached_objects array.
> >
> > It's also not entirely foolproof (nor is the existing
> > pretend_object_file()). Any operation which is fooled into thinking we
> > have object X because it's in the fake-object list or the tmp_objdir may
> > reference that object erroneously, creating a corruption. But it's still
> > safer than allowing arbitrary writes into the tmp_objdir.
> 
> Why is the pretend_object_file() interface safer?  Does it disable the
> textconv caching somehow?  I don't see why it helps avoid this
> problem.

So hopefully it's clearer now from what I wrote above, but just
connecting the dots:

  1. Unrelated code calling write_object_file() would still write real,
     durable objects, as usual.

  2. The write_object_file() "skip this write" optimization already
     ignores the pretend_object_file() objects while checking "do we
     have this object".

> Interesting.  I'm not familiar with the pretend_object_file() code,
> but I think I catch what you're saying.  I don't see anything in
> object-file.c that allows removing items from the cache, as I would
> want to do, but I'm sure I could add something.

Right, it's pretty limited now, and there's only one caller. And in fact
I've wanted to get rid of it, exactly because it can create the danger
we're discussing here. But I still think it's _less_ dangerous than
fully replacing the object directory.

> (I'm still not sure how this avoids the problem of things like
> textconv caching trying to write an object that references one of
> these, though.  Should we just manually disable textconv caching
> during a remerge-diff?)

It can't avoid it completely, but in general, the textconv cache should
be writing and referencing its own objects. Even if they happen to be
the same as something remerge-diff generated, it won't know that until
it has tried to write it (and so that's why having write_object_file()
continue to really write the object is important).


Hopefully that clears up my line of thinking. I do think a lot of this
is kind of theoretical, in the sense that:

  - textconv caching is an obscure feature that we _could_ just disable
    during remerge-diffs

  - there's a reasonable chance that there isn't any other code that
    wants to write during a diff

But this is all scary and error-prone enough that I'd prefer an approach
that has the "least surprise". So any solution where random code calling
write_object_file() _can't_ accidentally write to a throw-away directory
seems like a safer less surprising thing.

It does mean that the remerge-diff code needs to be explicit in its
object writes to say "this needs to go to the temporary area" (whether
it's a flag, or calling into a totally different function or subsystem).
I'm hoping that's not hard to do (because its writes are done explicitly
by the remerge-diff code), but I worry that it may be (because you are
relying on more generic tree code to the write under the hood).

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-29  4:08     ` Junio C Hamano
@ 2021-09-30  7:33       ` Jeff King
  2021-09-30 13:16         ` Ævar Arnfjörð Bjarmason
  2021-10-01  4:26         ` Elijah Newren
  0 siblings, 2 replies; 64+ messages in thread
From: Jeff King @ 2021-09-30  7:33 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, git, Jonathan Nieder,
	Sergey Organov, Elijah Newren

On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >   Side note: The pretend_object_file() approach is actually even better,
> >   because we know the object is fake. So it does not confuse
> >   write_object_file()'s "do we already have this object" freshening
> >   check.
> >
> >   I suspect it could even be made faster than the tmp_objdir approach.
> >   From our perspective, these objects really are tempfiles. So we could
> >   write them as such, not worrying about things like fsyncing them,
> >   naming them into place, etc. We could just write them out, then mmap
> >   the results, and put the pointers into cached_objects (currently it
> >   insists on malloc-ing a copy of the input buffer, but that seems like
> >   an easy extension to add).
> >
> >   In fact, I think you could get away with just _one_ tempfile per
> >   merge. Open up one tempfile. Write out all of the objects you want to
> >   "store" into it in sequence, and record the lseek() offsets before and
> >   after for each object. Then mmap the whole result, and stuff the
> >   appropriate pointers (based on arithmetic with the offsets) into the
> >   cached_objects list.
> 
> Cute.  The remerge diff code path creates a full tree that records
> the mechanical merge result.  By hooking into the lowest layer of
> write_object() interface, we'd serialize all objects in such a tree
> in the order they are computed (bottom up from the leaf level, I'd
> presume) into a single flat file ;-)

I do still like this approach, but just two possible gotchas I was
thinking of:

 - This side-steps all of our usual code for getting object data into
   memory. In general, I'd expect this content to not be too enormous,
   but it _could_ be if there are many / large blobs in the result. So
   we may end up with large maps. Probably not a big deal on modern
   64-bit systems. Maybe an issue on 32-bit systems, just because of
   virtual address space.

   Likewise, we do support systems with NO_MMAP. They'd work here, but
   it would probably mean putting all that object data into the heap. I
   could live with that, given how rare such systems are these days, and
   that it only matters if you're using --remerge-diff with big blobs.

 - I wonder to what degree --remerge-diff benefits from omitting writes
   for objects we already have. I.e., if you are writing out a whole
   tree representing the conflicted state, then you don't want to write
   all of the trees that aren't interesting. Hopefully the code is
   already figuring out which paths the merge even touched, and ignoring
   the rest. It probably benefits performance-wise from
   write_object_file() deciding to skip some object writes, as well
   (e.g., for resolutions which the final tree already took, as they'd
   be in the merge commit). The whole pretend-we-have-this-object thing
   may want to likewise make sure we don't write out objects that we
   already have in the real odb.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-29  4:22     ` Elijah Newren
@ 2021-09-30  7:41       ` Jeff King
  2021-09-30 14:17       ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 64+ messages in thread
From: Jeff King @ 2021-09-30  7:41 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Ævar Arnfjörð Bjarmason,
	Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov, Neeraj Singh

On Tue, Sep 28, 2021 at 09:22:32PM -0700, Elijah Newren wrote:

> > Not your fault & this pre-dates your patch, but FWIW I prefer our APIs
> > that don't have these "hidden struct" shenanigans (like say "struct
> > string_list") so a caller could just access this, we can just declare it
> > "const" appropriately.
> >
> > We're also all in-tree friends here, so having various accessors for no
> > reason other than to access struct members seems a bit too much.
> 
> That's a fair point, but just to play counterpoint for a minute...
> 
> FWIW, I dislike when our public facing APIs are polluted with all
> kinds of internal details.  merge-recursive being a case in point.
> When writing merge-ort, although I had a struct with public fields,
> that struct also contained an opaque struct (pointer) within it to
> hide several private fields.  (I would have liked to hide or remove a
> few more fields, but couldn't do so while the merge_recursive_options
> struct was shared between both merge-ort and merge-recursive.)
> 
> That said, I agree it can certainly be overdone, and tmp_objdir is
> pretty simple.  However, sometimes even in simple cases I like when
> folks make use of an opaque struct because it means folks put some
> effort into thinking more about the API that should be exposed.
> That's something we as a project have often overlooked in the past, as
> evidenced both by our one-shot usage mentality, and the existence of
> external projects like libgit2 attempting to correct this design
> shortcoming.  I'd like git to move more towards being structured as a
> reusable library as well as a useful command-line tool.

Right, it was definitely a conscious decision to keep the tmp-objdir API
as slim as possible, just because it's such a complicated and confusing
thing in the first place.

For something like a strbuf, giving direct access to the fields makes
sense. Exposing the details of how the struct works (like accessing
".buf" as a NUL-terminated string) are part of its usefulness.

But tmp_objdir represents a more abstract concept, and I wanted to
insulate callers from the details.

That said, the notion of "this is the path of the objdir" is not that
contentious, so I don't mind it too much (but it would be a jump to
exposing the details at all).

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-30  7:26       ` Jeff King
@ 2021-09-30  7:46         ` Jeff King
  2021-09-30 20:06           ` Junio C Hamano
  2021-10-01  2:31           ` Elijah Newren
  2021-09-30 19:25         ` Neeraj Singh
                           ` (2 subsequent siblings)
  3 siblings, 2 replies; 64+ messages in thread
From: Jeff King @ 2021-09-30  7:46 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

On Thu, Sep 30, 2021 at 03:26:42AM -0400, Jeff King wrote:

> > > If you remove the tmp_objdir as the primary as soon as you're done with
> > > the merge, but before you run the diff, you might be OK, though.
> > 
> > It has to be after I run the diff, because the diff needs access to
> > the temporary files to diff against them.
> 
> Right, of course. I was too fixated on the object-write part, forgetting
> that the whole point of the exercise is to later read them back. :)

Ah, no, I remember what I was trying to say here. The distinction is
between "remove the tmp_objdir" and "remove it as the primary".

I.e., if you do this:

  1. create tmp_objdir

  2. make tmp_objdir primary for writes

  3. run the "merge" half of remerge-diff, writing objects into the
     temporary space

  4. stop having tmp_objdir as the primary; instead make it an alternate

  5. run the diff

  6. remove tmp_objdir totally

Then step 5 can't accidentally write objects into the temporary space,
but it can still read them. So it's not entirely safe, but it's safer,
and it would be a much smaller change.

Some ways it could go wrong:

  - is it possible for the merge code to ever write an object? I kind of
    wonder if we'd ever do any cache-able transformations as part of a
    content-level merge. I don't think we do now, though.

  - in step 5, write_object_file() may still be confused by the presence
    of the to-be-thrown-away objects in the alternate. This is pretty
    unlikely, as it implies that the remerge-diff wrote a blob or tree
    that is byte-identical to something that the diff wants to write.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/7] merge-ort: add ability to record conflict messages in a file
  2021-09-29  6:25     ` Elijah Newren
  2021-09-29 16:14       ` Junio C Hamano
@ 2021-09-30  7:58       ` Jeff King
  2021-09-30  8:09         ` Ævar Arnfjörð Bjarmason
  2021-10-01  2:07         ` Elijah Newren
  1 sibling, 2 replies; 64+ messages in thread
From: Jeff King @ 2021-09-30  7:58 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

On Tue, Sep 28, 2021 at 11:25:20PM -0700, Elijah Newren wrote:

> > Just brainstorming some alternatives:
> >
> >  - we have diff.orderFile, etc. Could we stuff this data into a less
> >    confusing name (even just "$filename.conflict_msg"), and then provide
> >    a custom ordering to the diff code? I think it could be done by
> >    generating a static ordering ahead of time, but it might even just be
> >    possible to tell diffcore_order() to take the ".conflict_msg"
> >    extension into account in its comparison function.
> 
> I can't just go on the ".conflict_msg" extension.  As you noted above,
> this scheme is not sufficient for avoiding collisions.  So I need to
> append extra "cruft" to the name in the case of collisions -- meaning
> we can't special case on just that extension.

Sure, but we can call it filename.conflict_msg.1, etc, and the sort code
can pattern-match. It can never be fully robust (if you really did have
a foo.conflict_msg, we'd sort it differently), but I think it's OK if
the worst-case is that pathological trees get ordered slightly
sub-optimally).

That said, I think it could also make sense to annotate the conflict
files by giving them an unusual set of mode bits. That would be easier
to recognize (and while real trees _could_ have silly modes, we do
complain about them in fsck, so it shouldn't happen in practice).

> I also don't like how diff.orderFile provides a global ordering of the
> files listed, rather than providing some scheme for relative
> orderings.  That'd either force me to precompute the diff to determine
> all the files that were different so I can list _all_ of them, or put
> up with the fact that the files with non-content conflicts will be
> listed very first in the output, even if their name is
> 'zee-last-file.c' -- surprising users at the output ordering.
> 
> This also means that if the user had their own ordering defined, then
> I'm overriding it and messing up their ordering, which might be
> problematic.

Agreed. I do think it may be too horrible unless you teach
diffcore_order() to actually understand your annotations in code.  But
that wouldn't be too hard if it's done in the mode bits.

But I do think anything that avoids these pseudo-files is going to be a
lot cleaner overall.

> >  - there can be other non-diff data between the individual segments. For
> >    example, "patch" will skip over non-diff lines. And certainly in Git
> >    we have our own custom headers. I'm wondering if we could attach
> >    these annotations to the diff-pair somehow, and then show something
> >    like:
> >
> >      diff --git a/foo.c b/foo.c
> >      index 1234abcd..5678cdef 100644
> >      conflict modify/delete foo.c
> 
> A couple things here...

I think Junio already indicated that we can pretty much make this look
like whatever we want. As soon as any reader sees "conflict", it should
happily ignore the rest as something it doesn't know about. And my short
example here was just meant to be illustrative. I agree it probably
needs more details (and the whole CONFLICT line that usually goes to
stderr is probably the best thing).

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/7] merge-ort: add ability to record conflict messages in a file
  2021-09-30  7:58       ` Jeff King
@ 2021-09-30  8:09         ` Ævar Arnfjörð Bjarmason
  2021-10-01  2:07         ` Elijah Newren
  1 sibling, 0 replies; 64+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-30  8:09 UTC (permalink / raw)
  To: Jeff King
  Cc: Elijah Newren, Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov


On Thu, Sep 30 2021, Jeff King wrote:

> [...]
> That said, I think it could also make sense to annotate the conflict
> files by giving them an unusual set of mode bits. That would be easier
> to recognize (and while real trees _could_ have silly modes, we do
> complain about them in fsck, so it shouldn't happen in practice).

We don't complain about them in fsck. I've been meaning to get to fixing
it, but my fixes were contingent on the rather big "tree walk mode bits"
series: https://lore.kernel.org/git/87wnu0r8tq.fsf@evledraar.gmail.com/

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-30  7:33       ` Jeff King
@ 2021-09-30 13:16         ` Ævar Arnfjörð Bjarmason
  2021-09-30 21:00           ` Jeff King
  2021-10-01  3:11           ` Elijah Newren
  2021-10-01  4:26         ` Elijah Newren
  1 sibling, 2 replies; 64+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-30 13:16 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Elijah Newren via GitGitGadget, git,
	Jonathan Nieder, Sergey Organov, Elijah Newren


On Thu, Sep 30 2021, Jeff King wrote:

> On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote:
>
>> Jeff King <peff@peff.net> writes:
>> 
>> >   Side note: The pretend_object_file() approach is actually even better,
>> >   because we know the object is fake. So it does not confuse
>> >   write_object_file()'s "do we already have this object" freshening
>> >   check.
>> >
>> >   I suspect it could even be made faster than the tmp_objdir approach.
>> >   From our perspective, these objects really are tempfiles. So we could
>> >   write them as such, not worrying about things like fsyncing them,
>> >   naming them into place, etc. We could just write them out, then mmap
>> >   the results, and put the pointers into cached_objects (currently it
>> >   insists on malloc-ing a copy of the input buffer, but that seems like
>> >   an easy extension to add).
>> >
>> >   In fact, I think you could get away with just _one_ tempfile per
>> >   merge. Open up one tempfile. Write out all of the objects you want to
>> >   "store" into it in sequence, and record the lseek() offsets before and
>> >   after for each object. Then mmap the whole result, and stuff the
>> >   appropriate pointers (based on arithmetic with the offsets) into the
>> >   cached_objects list.
>> 
>> Cute.  The remerge diff code path creates a full tree that records
>> the mechanical merge result.  By hooking into the lowest layer of
>> write_object() interface, we'd serialize all objects in such a tree
>> in the order they are computed (bottom up from the leaf level, I'd
>> presume) into a single flat file ;-)
>
> I do still like this approach, but just two possible gotchas I was
> thinking of:
>
>  - This side-steps all of our usual code for getting object data into
>    memory. In general, I'd expect this content to not be too enormous,
>    but it _could_ be if there are many / large blobs in the result. So
>    we may end up with large maps. Probably not a big deal on modern
>    64-bit systems. Maybe an issue on 32-bit systems, just because of
>    virtual address space.
>
>    Likewise, we do support systems with NO_MMAP. They'd work here, but
>    it would probably mean putting all that object data into the heap. I
>    could live with that, given how rare such systems are these days, and
>    that it only matters if you're using --remerge-diff with big blobs.
>
>  - I wonder to what degree --remerge-diff benefits from omitting writes
>    for objects we already have. I.e., if you are writing out a whole
>    tree representing the conflicted state, then you don't want to write
>    all of the trees that aren't interesting. Hopefully the code is
>    already figuring out which paths the merge even touched, and ignoring
>    the rest. It probably benefits performance-wise from
>    write_object_file() deciding to skip some object writes, as well
>    (e.g., for resolutions which the final tree already took, as they'd
>    be in the merge commit). The whole pretend-we-have-this-object thing
>    may want to likewise make sure we don't write out objects that we
>    already have in the real odb.

I haven't benchmarked since my core.checkCollisions RFC patch[1]
resulted in the somewhat related loose object cache patch from you, and
not with something like the midx, but just a note that on some setups
just writing things out is faster than exhaustively checking if we
absolutely need to write things out.

I also wonder how much if anything writing out the one file v.s. lots of
loose objects is worthwhile on systems where we could write out those
loose objects on a ramdisk, which is commonly available on e.g. Linux
distros these days out of the box. If you care about performance but not
about your transitory data using a ramdisk is generally much better than
any other potential I/O optimization.

Finally, and I don't mean to throw a monkey wrench into this whole
discussion, so take this as a random musing: I wonder how much faster
this thing could be on its second run if instead of avoiding writing to
the store & cleaning up, it just wrote to the store, and then wrote
another object keyed on the git version and any revision paramaters
etc., and then pretty much just had to do a "git cat-file -p <that-obj>"
to present the result to the user :)

I suppose that would be throwing a lot more work at an eventual "git gc"
than we ever do now, so maybe it's a bit crazy, but I think it might be
an interesting direction in general to (ab)use either the primary or
some secondary store in the .git dir as a semi-permanent cache of
resolved queries from the likes of "git log".

1. https://lore.kernel.org/git/20181028225023.26427-5-avarab@gmail.com/

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-29  4:22     ` Elijah Newren
  2021-09-30  7:41       ` Jeff King
@ 2021-09-30 14:17       ` Ævar Arnfjörð Bjarmason
  2021-10-01  3:55         ` Elijah Newren
  1 sibling, 1 reply; 64+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-30 14:17 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Jeff King,
	Jonathan Nieder, Sergey Organov, Neeraj Singh


On Tue, Sep 28 2021, Elijah Newren wrote:

> On Tue, Sep 28, 2021 at 1:00 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> On Tue, Aug 31 2021, Elijah Newren via GitGitGadget wrote:
>>
>> I commented just now on how this API is duplicated between here &
>> another in-flight series in
>> https://lore.kernel.org/git/87v92lxhh4.fsf@evledraar.gmail.com/; Just
>> comments on this patch here:
>>
>> > diff --git a/tmp-objdir.c b/tmp-objdir.c
>> > index b8d880e3626..9f75a75d1c0 100644
>> > --- a/tmp-objdir.c
>> > +++ b/tmp-objdir.c
>> > @@ -288,7 +288,36 @@ const char **tmp_objdir_env(const struct tmp_objdir *t)
>> >       return t->env.v;
>> >  }
>> >
>> > +const char *tmp_objdir_path(struct tmp_objdir *t)
>> > +{
>> > +     return t->path.buf;
>> > +}
>>
>> Not your fault & this pre-dates your patch, but FWIW I prefer our APIs
>> that don't have these "hidden struct" shenanigans (like say "struct
>> string_list") so a caller could just access this, we can just declare it
>> "const" appropriately.
>>
>> We're also all in-tree friends here, so having various accessors for no
>> reason other than to access struct members seems a bit too much.
>
> That's a fair point, but just to play counterpoint for a minute...
>
> FWIW, I dislike when our public facing APIs are polluted with all
> kinds of internal details.  merge-recursive being a case in point.
> When writing merge-ort, although I had a struct with public fields,
> that struct also contained an opaque struct (pointer) within it to
> hide several private fields.  (I would have liked to hide or remove a
> few more fields, but couldn't do so while the merge_recursive_options
> struct was shared between both merge-ort and merge-recursive.)
>
> That said, I agree it can certainly be overdone, and tmp_objdir is
> pretty simple.  However, sometimes even in simple cases I like when
> folks make use of an opaque struct because it means folks put some
> effort into thinking more about the API that should be exposed.
> That's something we as a project have often overlooked in the past, as
> evidenced both by our one-shot usage mentality, and the existence of
> external projects like libgit2 attempting to correct this design
> shortcoming.  I'd like git to move more towards being structured as a
> reusable library as well as a useful command-line tool.

[This is somewhat of a continuation of
https://lore.kernel.org/git/87lf3etaih.fsf@evledraar.gmail.com/].

I like our APIs being usable, but right now we're not promising any
maintenance of APIs that work the same as git v2.30.0 or whatever.

I think some external users directly link to libgit.a, but they're
tracking whatever churn we have from release to release, just like
someone maintaining out-of-tree linux kernel drivers.

For maintenance of git.git itself I think it's going to stay like that
for any foreseeable future[1].

And that's before we even get into the practicalities of git's license
likely not winning over many (if any) libgit2 users. I think those users
are if anything migrating from libgit2 to git's plumbing CLI, or in some
cases folding what they needed into git.git itself (or just staying with
libgit2).

So assuming all of that I really don't see the point in general of
having a "foo" field that has the equivalent of get_foo() and set_foo()
accessors in the context of git.git.

That's something that's really useful if you're maintaining API and ABI
compatibility, but for us it's usually just verbosity. All the users are
in-tree, so just add "const" as appropriate, or perhaps we should give
the field a "priv" name or whatever.

Obviously not everything should be done that way, e.g. being able to
have trace2 start/end points for something you'd otherwise push/clear
from a "struct string_list" or whatever directly *is* usable, so it's
not black and white.

I'm just saying thaht using convoluted patterns *just* to get in the way
of some hypothetical code that's going to mess with your internals can
make things harder in other areas. E.g. the init/release pattern via
macros I mentioned in the linked E-Mail above.

> [...]
>> In this particular case I hadn't understood on a previous reading of
>> tmp-objdir.c why this "path" isn't a statically allocated "char
>> path[PATH_MAX]", or a least we that hardcoded "1024" should be a
>> PATH_MAX, as it is in some other cases.
>
> Actually, PATH_MAX shouldn't be used:
>
> https://lore.kernel.org/git/7d1237c7-5a83-d766-7d93-5f0d59166067@web.de/
> https://lore.kernel.org/git/20180720180427.GE22486@sigill.intra.peff.net/

Thanks. I didn't know about that. I've fixed (some of that) up in some
semi-related WIP work I've got.

1. Aside: I *do* think it would be really useful for us to expose a
   stable C API eventually, using some of the same license
   workarounds/shenanigans (depending on your view) as the Linux kernel
   did. I.e. to say that "we're GPLv2, but this small API is the
   equivalent of our syscall interface".

   Rather than that API being something that would look anything like
   internal git.git users it would basically be a run_command() + Perl's
   unpack/pack + output formatter on steroids.

   I.e. we'd be able to invoke say a 'git cat-file --batch' in the same
   process as the caller by essentially calling cmd_cat_file()
   directly. Just with some callback replacements for "provide this on
   stdin" and "this line accepts stdout" without actually involving
   different processes or I/O.

   You'd then have the equivalent of a --format output that would just
   so happen to be in your OS's / C ABI's understood struct
   format. Similar to "perldoc -f pack".

   The maintenance burden of such an API would be trivial. It would
   basically be a handful of functions just to setup input/output
   callbacks.

   Most of the interface would be a rather convoluted construction of
   command-line parameters (like our own run_command() use), not ideal,
   but most users are after structured data and eliminating process
   overhead, and they'd get that.

   We would need to replace fprintf() and the like with some structured
   formatting mechanism, but that would be useful for non-API users as
   new commands learning a shiny new --format parameter. You could even
   use such a --format to have a non-C-API user get the same struct
   output, perhaps for direct mapping into memory, say if you didn't
   trust the built-in/command in question to not leak memory yet.

   I don't know if this has been suggested before, just my 0.02. It's
   also partially why I'm interested in making even-built-ins not leak
   memory with some of my parallel SANITIZE=leak work.

   Once you've got that you can usually call the cmd_whatever() in a
   loop without much effort, and once you've got that, well, see above
   :)

   

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 4/7] merge-ort: capture and print ll-merge warnings in our preferred fashion
  2021-08-31  2:26 ` [PATCH 4/7] merge-ort: capture and print ll-merge warnings in our preferred fashion Elijah Newren via GitGitGadget
  2021-09-28 22:39   ` Jeff King
@ 2021-09-30 16:53   ` Ævar Arnfjörð Bjarmason
  2021-10-01  1:54     ` Elijah Newren
  1 sibling, 1 reply; 64+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-30 16:53 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Jeff King, Jonathan Nieder, Sergey Organov, Elijah Newren


On Tue, Aug 31 2021, Elijah Newren via GitGitGadget wrote:

> @@ -108,8 +108,14 @@ test_expect_success 'refuse to merge binary files' '
>  	printf "\0\0" >binary-file &&
>  	git add binary-file &&
>  	git commit -m binary2 &&
> -	test_must_fail git merge F >merge.out 2>merge.err &&
> -	grep "Cannot merge binary files: binary-file (HEAD vs. F)" merge.err
> +	if test "$GIT_TEST_MERGE_ALGORITHM" = ort
> +	then
> +		test_must_fail git merge F >merge.out &&
> +		grep "Cannot merge binary files: binary-file (HEAD vs. F)" merge.out
> +	else
> +		test_must_fail git merge F >merge.out 2>merge.err &&
> +		grep "Cannot merge binary files: binary-file (HEAD vs. F)" merge.err
> +	fi
>  '

To save readers from eyeballing if a single character has changed here,
which I don't think it has, just do:

if ...
then
	cmd >actual
else
	other cmd >actual
fi &&
grep [...]

I.e. no need to duplicate the "grep" here just because of merge.out v.s. merge.err.

[...]

> -	test_must_fail git merge bin-main 2>stderr &&
> -	grep -i "warning.*cannot merge.*HEAD vs. bin-main" stderr
> +	if test "$GIT_TEST_MERGE_ALGORITHM" = ort
> +	then
> +		test_must_fail git merge bin-main >stdout &&
> +		grep -i "warning.*cannot merge.*HEAD vs. bin-main" stdout
> +	else
> +		test_must_fail git merge bin-main 2>stderr &&
> +		grep -i "warning.*cannot merge.*HEAD vs. bin-main" stderr
> +	fi


ditto.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-30  7:26       ` Jeff King
  2021-09-30  7:46         ` Jeff King
@ 2021-09-30 19:25         ` Neeraj Singh
  2021-09-30 20:19           ` Junio C Hamano
  2021-09-30 20:00         ` Junio C Hamano
  2021-10-01  2:23         ` Elijah Newren
  3 siblings, 1 reply; 64+ messages in thread
From: Neeraj Singh @ 2021-09-30 19:25 UTC (permalink / raw)
  To: Jeff King
  Cc: Elijah Newren, Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

On Thu, Sep 30, 2021 at 03:26:42AM -0400, Jeff King wrote:
> On Tue, Sep 28, 2021 at 10:05:31PM -0700, Elijah Newren wrote:
> 
> > So there are other parts of running `git log` that are not read-only,
> > besides what I'm doing.  That's unfortunate.  Can we just disable
> > those things in this section of the code?
> 
> It might be possible for the cache (we'd still generate entries, but
> just not write them). It's "just" an optimization, though it's possible
> there are cases where it's really annoying to do so. I'd imagine those
> are a stretch. I do use textconv to decrypt gpg blobs with a key on a
> hardware token, for example. I don't require a touch for each one,
> though, nor would I use textconv caching anyway (since the point is not
> to store the unencrypted contents).
> 
> It wouldn't necessarily be OK to disable object writes in general.
> Though I'd think since a diff is _conceptually_ a read-only operation
> that any writes it wants to do would be in the same caching /
> optimization boat.  I'd also worry about the maintenance cost of having
> to annotate any such case.
> 
> > Also, I think it makes sense that textconv caching causes a problem
> > via writing new blobs and trees *and* refs that reference these blobs
> > and trees.  However, I'm not sure I'm really grasping it, because I
> > don't see how your solutions are safe either.  I'll mention more after
> > each one.
> 
> Heh. So I originally wrote more on this in my earlier email, but worried
> I was getting too into the weeds and would confuse you, so I pared it
> down. But I managed to be confusing anyway. :)
> 
> Here's one mental model. If you have a temporary object store, then
> these are the bad things that can happen:
> 
>   1. Some code may write an object to it, thinking the data is saved,
>      but we later throw it away.
> 
>   2. Some code may _skip_ writing an object, thinking that we already
>      have that object available (when in fact we'll later throw it
>      away).
> 
>   3. Some code may record the fact that we have an object in memory, and
>      then later (after the temporary object store is not in use), may
>      point to it when writing an object or reference.
> 
> Pointing the primary object-dir at a temporary directory (like your
> patch does) means we're subject to all three.
> 
> Problem (1) happens because unrelated code may not realize we've swapped
> out the primary object writing code for this throw-away directory. But
> we could require writers to explicitly indicate that they want to write
> to the temporary area, while allowing other writes to go to the regular
> object store. This could be done with a flag or variant of
> write_object_file(). Or if the temporary area isn't a regular object
> directory at all (like the whole tempfile / pretend thing), then this
> happens automatically, because the regular object-write paths don't even
> know about our temporary area.
> 
> Problem (2) is mostly handled inside write_object_file(). It will
> optimize out writes of objects we already have. So it needs to know when
> we have an object "for real", and when we just have it in our temporary
> store. If we make the tmp_objdir the primary, then it doesn't know this
> (though we could teach it to check. In the pretend_object_file() system,
> it already knows this (it uses the "freshen" code paths which actually
> want to find the entry in the filesystem).
> 
> Problem (3) is the hardest one, because we don't really distinguish
> between readers and writers. So if I call has_object_file(), is it
> because I want to access the object? Or is it because I want to generate
> a new reference to it, which must ensure that it exists? One of those is
> OK to look at the tmp_objdir (and indeed, is the whole point of making
> it in the first place), and the other is the first step to corrupting
> the repository. ;)
> 

This is a good taxonomy. I think these problems apply to any transactional
system. If you keep state outside of the transaction, then you may wind up
with inconsistency. Problem (3) is the worst and kind of subsumes (2). If
we think an object is durable with a specific oid, we probably are going to
write that oid out somewhere else that is outside of the transaction. If
we write it to the ODB (e.g. in a tree or some blob) without first caching
it in memory, the references in the ODB will be equally durable as the referees.


> With the environment variables we set up for running external commands
> (like hooks) in the receive-pack quarantine, one bit of safety we have
> is to prevent ref writes entirely in those commands. That works
> reasonably well. Even if they may write new objects that might get
> thrown away (if the push is rejected), they'd ultimately need to point
> to those objects with a ref.  And because those hooks spend their
> _entire_ run-time in the quarantine state, they can't ever write a ref.
> 
> Whereas doing it in-process is a little dicier. Right now the textconv
> cache writes out a new tree for every entry we add, so it happens to do
> the ref write while the tmp-objdir is still in place. But as this
> comment notes:
> 
>   $ git grep -B6 notes_cache_write diff.c
>   diff.c-         /*
>   diff.c-          * we could save up changes and flush them all at the end,
>   diff.c-          * but we would need an extra call after all diffing is done.
>   diff.c-          * Since generating a cache entry is the slow path anyway,
>   diff.c-          * this extra overhead probably isn't a big deal.
>   diff.c-          */
>   diff.c:         notes_cache_write(textconv->cache);
> 
> this is slow, and it would be perfectly reasonable to flush the cache at
> the end of the process (e.g., using an atexit() handler). In which case
> we'd hit this problem (3) exactly: we'd generate and remember objects
> during the tmp-objdir period, but only later actually reference them.
> This is more likely than the receive-pack hook case, because we're doing
> it all in process.
> 

Is there state we care about besides objects and refs? Maybe we need to
introduce a transactional system for the entire repository like we have
for the refs-db.  In-process code that cares about oids that might be
only transactionally present can register a callback for when the transaction
commits or rollsback.  I'm not volunteering to write this, btw :).

Right now, disabling Ref updates is a pretty good compromise, since that's the
one definite piece of durable state outside the ODB that can point in. In memory
state is generally hard to solve and could wind up persisting in either the ODB
or the Refs or both.

> > > If you remove the tmp_objdir as the primary as soon as you're done with
> > > the merge, but before you run the diff, you might be OK, though.
> > 
> > It has to be after I run the diff, because the diff needs access to
> > the temporary files to diff against them.
> 
> Right, of course. I was too fixated on the object-write part, forgetting
> that the whole point of the exercise is to later read them back. :)
> 
> > > If not, then I think the solution is probably not to install this as the
> > > "primary", but rather:
> > >
> > >   - do the specific remerge-diff writes we want using a special "write
> > >     to this object dir" variant of write_object_file()
> > >
> > >   - install the tmp_objdir as an alternate, so the reading side (which
> > >     is diff code that doesn't otherwise know about our remerge-diff
> > >     trickery) can find them
> > >
> > > And that lets other writers avoid writing into the temporary area
> > > accidentally.
> > 
> > Doesn't this have the same problem?  If something similar to textconv
> > caching were to create new trees that referenced the existing blobs
> > and then added a ref that referred to that tree, then you have the
> > exact same problem, right?  The new tree and the new ref would be
> > corrupt as soon as the tmp-objdir went away.  Whether or not other
> > callers write to the temporary area isn't the issue, it's whether they
> > write refs that can refer to anything from the temporary area.  Or am
> > I missing something?
> 
> The key thing here is in the first step, where remerge-diff is
> explicitly saying "I want to write to this temporary object-dir". But
> the textconv-cache code does not; it gets to write to the normal spot.
> So it avoids problem (1).
> 
> You're right that it does not avoid problem (3) exactly. But triggering
> that would require some code not just writing other objects or
> references while the tmp-objdir is in place, but specifically
> referencing the objects that remerge-diff did put into the tmp-objdir.
> That seems a lot less likely to me (because the thing we're most worried
> about is unrelated code that just happens to write while the tmp-objdir
> is in place).
> 
> > > In that sense this is kind of like the pretend_object_file() interface,
> > > except that it's storing the objects on disk instead of in memory. Of
> > > course another way of doing that would be to stuff the object data into
> > > tempfiles and just put pointers into the in-memory cached_objects array.
> > >
> > > It's also not entirely foolproof (nor is the existing
> > > pretend_object_file()). Any operation which is fooled into thinking we
> > > have object X because it's in the fake-object list or the tmp_objdir may
> > > reference that object erroneously, creating a corruption. But it's still
> > > safer than allowing arbitrary writes into the tmp_objdir.
> > 
> > Why is the pretend_object_file() interface safer?  Does it disable the
> > textconv caching somehow?  I don't see why it helps avoid this
> > problem.
> 
> So hopefully it's clearer now from what I wrote above, but just
> connecting the dots:
> 
>   1. Unrelated code calling write_object_file() would still write real,
>      durable objects, as usual.
> 
>   2. The write_object_file() "skip this write" optimization already
>      ignores the pretend_object_file() objects while checking "do we
>      have this object".
> 
> > Interesting.  I'm not familiar with the pretend_object_file() code,
> > but I think I catch what you're saying.  I don't see anything in
> > object-file.c that allows removing items from the cache, as I would
> > want to do, but I'm sure I could add something.
> 
> Right, it's pretty limited now, and there's only one caller. And in fact
> I've wanted to get rid of it, exactly because it can create the danger
> we're discussing here. But I still think it's _less_ dangerous than
> fully replacing the object directory.
> 
> > (I'm still not sure how this avoids the problem of things like
> > textconv caching trying to write an object that references one of
> > these, though.  Should we just manually disable textconv caching
> > during a remerge-diff?)
> 
> It can't avoid it completely, but in general, the textconv cache should
> be writing and referencing its own objects. Even if they happen to be
> the same as something remerge-diff generated, it won't know that until
> it has tried to write it (and so that's why having write_object_file()
> continue to really write the object is important).
> 
> 
> Hopefully that clears up my line of thinking. I do think a lot of this
> is kind of theoretical, in the sense that:
> 
>   - textconv caching is an obscure feature that we _could_ just disable
>     during remerge-diffs
> 
>   - there's a reasonable chance that there isn't any other code that
>     wants to write during a diff
> 
> But this is all scary and error-prone enough that I'd prefer an approach
> that has the "least surprise". So any solution where random code calling
> write_object_file() _can't_ accidentally write to a throw-away directory
> seems like a safer less surprising thing.
> 

There is something nice to be said for having the cached state go away
with the ephemeral objects, since that caching refers to objects that
may never be seen again.

Any solution which makes in-flight transactional state visible (either as
the primary writing spot or as a readable alternate) will suffer from state
leakage for non-transactionally-aware readers/writers.

The bulk-fsync code has some advantage in that it is more constrained in what
it's calling while the tmp objdir is active. Also, in typical cases, the
tmp objdir will be committed rather than discarded. But I believe it suffers
from the same essential issue.

Can we press forward, and just try to make sure all the in-process stuff
is aware of temporary object state?  Maybe we can hook up the temp odb stuff
with the refsdb transactions into a single system.

We can make the tmp object directory stuff more prevalent by using it in
git-apply and the merge stategy, and by using bulk-checkin while writing out
the cached tree and commits. This should flush out any issues and keep other
parts of the codebase honest.

> It does mean that the remerge-diff code needs to be explicit in its
> object writes to say "this needs to go to the temporary area" (whether
> it's a flag, or calling into a totally different function or subsystem).
> I'm hoping that's not hard to do (because its writes are done explicitly
> by the remerge-diff code), but I worry that it may be (because you are
> relying on more generic tree code to the write under the hood).
> 

I don't think that really helps the trickiest problems. The textconv cache,
for instance, may now be writing notes that refer to nonexistent objects.
That's fine from a correctness point of view, but now we're polluting the repo
with state that won't be gc'ed and won't ever be accessed.

IMO, it might be better for the textconv cache to have a single notes root
tree (rather than a notes 'branch' with a commit history). We would want to
put a ref to this tree somewhere inside the object directory itself so that
it's transactionally consistent with the rest of the object state. For instance,
we could store the current textconv notes tree root in .git/objects/info/textconv-cache.

Thanks,
Neeraj


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-30  7:26       ` Jeff King
  2021-09-30  7:46         ` Jeff King
  2021-09-30 19:25         ` Neeraj Singh
@ 2021-09-30 20:00         ` Junio C Hamano
  2021-10-01  2:23         ` Elijah Newren
  3 siblings, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2021-09-30 20:00 UTC (permalink / raw)
  To: Jeff King
  Cc: Elijah Newren, Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

Jeff King <peff@peff.net> writes:

>> > If not, then I think the solution is probably not to install this as the
>> > "primary", but rather:
>> >
>> >   - do the specific remerge-diff writes we want using a special "write
>> >     to this object dir" variant of write_object_file()
>> >
>> >   - install the tmp_objdir as an alternate, so the reading side (which
>> >     is diff code that doesn't otherwise know about our remerge-diff
>> >     trickery) can find them
>> >
>> > And that lets other writers avoid writing into the temporary area
>> > accidentally.
>> 
>> ...
>
> The key thing here is in the first step, where remerge-diff is
> explicitly saying "I want to write to this temporary object-dir". But
> the textconv-cache code does not; it gets to write to the normal spot.
> So it avoids problem (1).
>
> You're right that it does not avoid problem (3) exactly. But triggering
> that would require some code not just writing other objects or
> references while the tmp-objdir is in place, but specifically
> referencing the objects that remerge-diff did put into the tmp-objdir.
> That seems a lot less likely to me (because the thing we're most worried
> about is unrelated code that just happens to write while the tmp-objdir
> is in place).

I do not offhand remember if a new object creation codepath that
avoids writing an object that we already have considers an object we
find in an alternate as "we have it".  If it does not and we make a
duplicated copy in our primary object store, then it would be OK,
but otherwise, those that know tmp_objdir is an alternate may still
try to write a new object and find that the same object already
exists in an alternate.  Once tmp_objdir is gone, we would end up
with a corrupt repository, right?

freshen_loose_object() seems to go to check_and_freshen() which
consult check_and_freshen_nonlocal() before returning yes/no, so
having the same object in an alternate does seem to count as having
one.

> So hopefully it's clearer now from what I wrote above, but just
> connecting the dots:
>
>   1. Unrelated code calling write_object_file() would still write real,
>      durable objects, as usual.
>
>   2. The write_object_file() "skip this write" optimization already
>      ignores the pretend_object_file() objects while checking "do we
>      have this object".

Yup for (2)---the pretend interface is safer than alternate in that
sense.

But the pretend interface was designed to just hold a handful of
phoney objects in core.  I am not sure if it is a good fit for this
use case.

> But this is all scary and error-prone enough that I'd prefer an approach
> that has the "least surprise". So any solution where random code calling
> write_object_file() _can't_ accidentally write to a throw-away directory
> seems like a safer less surprising thing.
>
> It does mean that the remerge-diff code needs to be explicit in its
> object writes to say "this needs to go to the temporary area" (whether
> it's a flag, or calling into a totally different function or subsystem).

... and also writes by others that happens to write the same object
as what is already in the temporary area should be made to the real
object store.  Then we would be much safer.

> I'm hoping that's not hard to do (because its writes are done explicitly
> by the remerge-diff code), but I worry that it may be (because you are
> relying on more generic tree code to the write under the hood).

Thanks.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-30  7:46         ` Jeff King
@ 2021-09-30 20:06           ` Junio C Hamano
  2021-10-01  3:59             ` Elijah Newren
  2021-10-01  2:31           ` Elijah Newren
  1 sibling, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2021-09-30 20:06 UTC (permalink / raw)
  To: Jeff King
  Cc: Elijah Newren, Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

Jeff King <peff@peff.net> writes:

>   - is it possible for the merge code to ever write an object? I kind of
>     wonder if we'd ever do any cache-able transformations as part of a
>     content-level merge. I don't think we do now, though.

How is virtual merge base, result of an inner merge that recursively
merging two merge bases, fed to an outer merge as the ancestor?
Isn't it written as a tree object by the inner merge as a tree with
full of blob objects, so the outer merge does not have to treat a
merge base tree that is virtual any specially from a sole merge
base?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-30 19:25         ` Neeraj Singh
@ 2021-09-30 20:19           ` Junio C Hamano
  0 siblings, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2021-09-30 20:19 UTC (permalink / raw)
  To: Neeraj Singh
  Cc: Jeff King, Elijah Newren, Elijah Newren via GitGitGadget,
	Git Mailing List, Jonathan Nieder, Sergey Organov

Neeraj Singh <nksingh85@gmail.com> writes:

> Is there state we care about besides objects and refs? Maybe we need to
> introduce a transactional system for the entire repository like we have
> for the refs-db.  In-process code that cares about oids that might be
> only transactionally present can register a callback for when the transaction
> commits or rollsback.  I'm not volunteering to write this, btw :).

I take it to mean by "objects and refs" that we care only about
objects that are reachable from the refs?  The index (the primary
index array plus extensions that record object names) also has
anchoring points we care about (ask "git fsck" what they are).

> We can make the tmp object directory stuff more prevalent by using it in
> git-apply and the merge stategy, and by using bulk-checkin while writing out
> the cached tree and commits. This should flush out any issues and keep other
> parts of the codebase honest.

I wonder what's so wrong about cruft objects in the main object
store that may be gc-ed away laster, though.  IOW, I wonder if it is
simpler to treat the quarantine while accepting dubious foreign data
via "git push" in receive-pack as an exception, not a norm.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-30 13:16         ` Ævar Arnfjörð Bjarmason
@ 2021-09-30 21:00           ` Jeff King
  2021-10-01  3:11           ` Elijah Newren
  1 sibling, 0 replies; 64+ messages in thread
From: Jeff King @ 2021-09-30 21:00 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, Elijah Newren via GitGitGadget, git,
	Jonathan Nieder, Sergey Organov, Elijah Newren

On Thu, Sep 30, 2021 at 03:16:19PM +0200, Ævar Arnfjörð Bjarmason wrote:

> I also wonder how much if anything writing out the one file v.s. lots of
> loose objects is worthwhile on systems where we could write out those
> loose objects on a ramdisk, which is commonly available on e.g. Linux
> distros these days out of the box. If you care about performance but not
> about your transitory data using a ramdisk is generally much better than
> any other potential I/O optimization.

I'd think in general we won't be using a ramdisk, because tmp_objdir is
putting its directory inside $GIT_DIR/objects. It doesn't _have_ to, but
using a ramdisk works against its original purpose (which was to store
potentially quite a lot of data from an incoming push, and to be able to
rename it cheaply into its final resting place).

It would probably not be too hard to provide a flag that indicates the
intended use, though (and then we decide where to create the temporary
directory based on that).

> Finally, and I don't mean to throw a monkey wrench into this whole
> discussion, so take this as a random musing: I wonder how much faster
> this thing could be on its second run if instead of avoiding writing to
> the store & cleaning up, it just wrote to the store, and then wrote
> another object keyed on the git version and any revision paramaters
> etc., and then pretty much just had to do a "git cat-file -p <that-obj>"
> to present the result to the user :)
> 
> I suppose that would be throwing a lot more work at an eventual "git gc"
> than we ever do now, so maybe it's a bit crazy, but I think it might be
> an interesting direction in general to (ab)use either the primary or
> some secondary store in the .git dir as a semi-permanent cache of
> resolved queries from the likes of "git log".

I don't think it's crazy to just write the objects to the main object
store. We already generate cruft objects for some other operations
(Junio asked elsewhere in the thread about virtual trees for recursive
merges; I don't know the answer offhand, but I'd guess we do there).
They do get cleaned up eventually.

I'm not sure it helps performance much by itself. In a merge (or even
just writing a tree out from the index), by the time you realize you
already have the object, you've done most of the work to generate it.

I think what you're describing is to make some kind of cache structure
on top. That might be sensible (and indeed, the index already does this
with the cachetree extension). But it can also easily come later if the
objects are just in the regular odb.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 4/7] merge-ort: capture and print ll-merge warnings in our preferred fashion
  2021-09-30 16:53   ` Ævar Arnfjörð Bjarmason
@ 2021-10-01  1:54     ` Elijah Newren
  2021-10-01  7:23       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 64+ messages in thread
From: Elijah Newren @ 2021-10-01  1:54 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Jeff King,
	Jonathan Nieder, Sergey Organov

On Thu, Sep 30, 2021 at 9:53 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Tue, Aug 31 2021, Elijah Newren via GitGitGadget wrote:
>
> > @@ -108,8 +108,14 @@ test_expect_success 'refuse to merge binary files' '
> >       printf "\0\0" >binary-file &&
> >       git add binary-file &&
> >       git commit -m binary2 &&
> > -     test_must_fail git merge F >merge.out 2>merge.err &&
> > -     grep "Cannot merge binary files: binary-file (HEAD vs. F)" merge.err
> > +     if test "$GIT_TEST_MERGE_ALGORITHM" = ort
> > +     then
> > +             test_must_fail git merge F >merge.out &&
> > +             grep "Cannot merge binary files: binary-file (HEAD vs. F)" merge.out
> > +     else
> > +             test_must_fail git merge F >merge.out 2>merge.err &&
> > +             grep "Cannot merge binary files: binary-file (HEAD vs. F)" merge.err
> > +     fi
> >  '
>
> To save readers from eyeballing if a single character has changed here,
> which I don't think it has, just do:
>
> if ...
> then
>         cmd >actual
> else
>         other cmd >actual

Do you mean
   other cmd >/dev/null 2>actual
?


> fi &&
> grep [...]
>
> I.e. no need to duplicate the "grep" here just because of merge.out v.s. merge.err.
>
> [...]
>
> > -     test_must_fail git merge bin-main 2>stderr &&
> > -     grep -i "warning.*cannot merge.*HEAD vs. bin-main" stderr
> > +     if test "$GIT_TEST_MERGE_ALGORITHM" = ort
> > +     then
> > +             test_must_fail git merge bin-main >stdout &&
> > +             grep -i "warning.*cannot merge.*HEAD vs. bin-main" stdout
> > +     else
> > +             test_must_fail git merge bin-main 2>stderr &&
> > +             grep -i "warning.*cannot merge.*HEAD vs. bin-main" stderr
> > +     fi
>
>
> ditto.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/7] merge-ort: add ability to record conflict messages in a file
  2021-09-30  7:58       ` Jeff King
  2021-09-30  8:09         ` Ævar Arnfjörð Bjarmason
@ 2021-10-01  2:07         ` Elijah Newren
  2021-10-01  5:28           ` Jeff King
  1 sibling, 1 reply; 64+ messages in thread
From: Elijah Newren @ 2021-10-01  2:07 UTC (permalink / raw)
  To: Jeff King
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

On Thu, Sep 30, 2021 at 12:58 AM Jeff King <peff@peff.net> wrote:
>
> On Tue, Sep 28, 2021 at 11:25:20PM -0700, Elijah Newren wrote:
>
> > > Just brainstorming some alternatives:
> > >
> > >  - we have diff.orderFile, etc. Could we stuff this data into a less
> > >    confusing name (even just "$filename.conflict_msg"), and then provide
> > >    a custom ordering to the diff code? I think it could be done by
> > >    generating a static ordering ahead of time, but it might even just be
> > >    possible to tell diffcore_order() to take the ".conflict_msg"
> > >    extension into account in its comparison function.
> >
> > I can't just go on the ".conflict_msg" extension.  As you noted above,
> > this scheme is not sufficient for avoiding collisions.  So I need to
> > append extra "cruft" to the name in the case of collisions -- meaning
> > we can't special case on just that extension.
>
> Sure, but we can call it filename.conflict_msg.1, etc, and the sort code
> can pattern-match. It can never be fully robust (if you really did have
> a foo.conflict_msg, we'd sort it differently), but I think it's OK if
> the worst-case is that pathological trees get ordered slightly
> sub-optimally).
>
> That said, I think it could also make sense to annotate the conflict
> files by giving them an unusual set of mode bits. That would be easier
> to recognize (and while real trees _could_ have silly modes, we do
> complain about them in fsck, so it shouldn't happen in practice).

I tried giving things unusual mode bits in early versions of merge-ort
for other reasons.  It doesn't work: canon_mode() will "fix" the
unusualness so there's no way for the reader to know that they had
unusual bits.

But, as you said later, avoiding these pseudo-files is going to be
cleaner anyway, so let's just do that.



> > I also don't like how diff.orderFile provides a global ordering of the
> > files listed, rather than providing some scheme for relative
> > orderings.  That'd either force me to precompute the diff to determine
> > all the files that were different so I can list _all_ of them, or put
> > up with the fact that the files with non-content conflicts will be
> > listed very first in the output, even if their name is
> > 'zee-last-file.c' -- surprising users at the output ordering.
> >
> > This also means that if the user had their own ordering defined, then
> > I'm overriding it and messing up their ordering, which might be
> > problematic.
>
> Agreed. I do think it may be too horrible unless you teach
> diffcore_order() to actually understand your annotations in code.  But
> that wouldn't be too hard if it's done in the mode bits.
>
> But I do think anything that avoids these pseudo-files is going to be a
> lot cleaner overall.
>
> > >  - there can be other non-diff data between the individual segments. For
> > >    example, "patch" will skip over non-diff lines. And certainly in Git
> > >    we have our own custom headers. I'm wondering if we could attach
> > >    these annotations to the diff-pair somehow, and then show something
> > >    like:
> > >
> > >      diff --git a/foo.c b/foo.c
> > >      index 1234abcd..5678cdef 100644
> > >      conflict modify/delete foo.c
> >
> > A couple things here...
>
> I think Junio already indicated that we can pretty much make this look
> like whatever we want. As soon as any reader sees "conflict", it should
> happily ignore the rest as something it doesn't know about. And my short
> example here was just meant to be illustrative. I agree it probably
> needs more details (and the whole CONFLICT line that usually goes to
> stderr is probably the best thing).
>
> -Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-30  7:26       ` Jeff King
                           ` (2 preceding siblings ...)
  2021-09-30 20:00         ` Junio C Hamano
@ 2021-10-01  2:23         ` Elijah Newren
  3 siblings, 0 replies; 64+ messages in thread
From: Elijah Newren @ 2021-10-01  2:23 UTC (permalink / raw)
  To: Jeff King
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

On Thu, Sep 30, 2021 at 12:26 AM Jeff King <peff@peff.net> wrote:
>
> On Tue, Sep 28, 2021 at 10:05:31PM -0700, Elijah Newren wrote:
>
> > So there are other parts of running `git log` that are not read-only,
> > besides what I'm doing.  That's unfortunate.  Can we just disable
> > those things in this section of the code?
>
> It might be possible for the cache (we'd still generate entries, but
> just not write them). It's "just" an optimization, though it's possible
> there are cases where it's really annoying to do so. I'd imagine those
> are a stretch. I do use textconv to decrypt gpg blobs with a key on a
> hardware token, for example. I don't require a touch for each one,
> though, nor would I use textconv caching anyway (since the point is not
> to store the unencrypted contents).
>
> It wouldn't necessarily be OK to disable object writes in general.
> Though I'd think since a diff is _conceptually_ a read-only operation
> that any writes it wants to do would be in the same caching /
> optimization boat.  I'd also worry about the maintenance cost of having
> to annotate any such case.
>
> > Also, I think it makes sense that textconv caching causes a problem
> > via writing new blobs and trees *and* refs that reference these blobs
> > and trees.  However, I'm not sure I'm really grasping it, because I
> > don't see how your solutions are safe either.  I'll mention more after
> > each one.
>
> Heh. So I originally wrote more on this in my earlier email, but worried
> I was getting too into the weeds and would confuse you, so I pared it
> down. But I managed to be confusing anyway. :)
>
> Here's one mental model. If you have a temporary object store, then
> these are the bad things that can happen:
>
>   1. Some code may write an object to it, thinking the data is saved,
>      but we later throw it away.
>
>   2. Some code may _skip_ writing an object, thinking that we already
>      have that object available (when in fact we'll later throw it
>      away).
>
>   3. Some code may record the fact that we have an object in memory, and
>      then later (after the temporary object store is not in use), may
>      point to it when writing an object or reference.
>
> Pointing the primary object-dir at a temporary directory (like your
> patch does) means we're subject to all three.
>
> Problem (1) happens because unrelated code may not realize we've swapped
> out the primary object writing code for this throw-away directory. But
> we could require writers to explicitly indicate that they want to write
> to the temporary area, while allowing other writes to go to the regular
> object store. This could be done with a flag or variant of
> write_object_file(). Or if the temporary area isn't a regular object
> directory at all (like the whole tempfile / pretend thing), then this
> happens automatically, because the regular object-write paths don't even
> know about our temporary area.
>
> Problem (2) is mostly handled inside write_object_file(). It will
> optimize out writes of objects we already have. So it needs to know when
> we have an object "for real", and when we just have it in our temporary
> store. If we make the tmp_objdir the primary, then it doesn't know this
> (though we could teach it to check. In the pretend_object_file() system,
> it already knows this (it uses the "freshen" code paths which actually
> want to find the entry in the filesystem).
>
> Problem (3) is the hardest one, because we don't really distinguish
> between readers and writers. So if I call has_object_file(), is it
> because I want to access the object? Or is it because I want to generate
> a new reference to it, which must ensure that it exists? One of those is
> OK to look at the tmp_objdir (and indeed, is the whole point of making
> it in the first place), and the other is the first step to corrupting
> the repository. ;)
>
> With the environment variables we set up for running external commands
> (like hooks) in the receive-pack quarantine, one bit of safety we have
> is to prevent ref writes entirely in those commands. That works
> reasonably well. Even if they may write new objects that might get
> thrown away (if the push is rejected), they'd ultimately need to point
> to those objects with a ref.  And because those hooks spend their
> _entire_ run-time in the quarantine state, they can't ever write a ref.
>
> Whereas doing it in-process is a little dicier. Right now the textconv
> cache writes out a new tree for every entry we add, so it happens to do
> the ref write while the tmp-objdir is still in place. But as this
> comment notes:
>
>   $ git grep -B6 notes_cache_write diff.c
>   diff.c-         /*
>   diff.c-          * we could save up changes and flush them all at the end,
>   diff.c-          * but we would need an extra call after all diffing is done.
>   diff.c-          * Since generating a cache entry is the slow path anyway,
>   diff.c-          * this extra overhead probably isn't a big deal.
>   diff.c-          */
>   diff.c:         notes_cache_write(textconv->cache);
>
> this is slow, and it would be perfectly reasonable to flush the cache at
> the end of the process (e.g., using an atexit() handler). In which case
> we'd hit this problem (3) exactly: we'd generate and remember objects
> during the tmp-objdir period, but only later actually reference them.
> This is more likely than the receive-pack hook case, because we're doing
> it all in process.
>
> > > If you remove the tmp_objdir as the primary as soon as you're done with
> > > the merge, but before you run the diff, you might be OK, though.
> >
> > It has to be after I run the diff, because the diff needs access to
> > the temporary files to diff against them.
>
> Right, of course. I was too fixated on the object-write part, forgetting
> that the whole point of the exercise is to later read them back. :)
>
> > > If not, then I think the solution is probably not to install this as the
> > > "primary", but rather:
> > >
> > >   - do the specific remerge-diff writes we want using a special "write
> > >     to this object dir" variant of write_object_file()
> > >
> > >   - install the tmp_objdir as an alternate, so the reading side (which
> > >     is diff code that doesn't otherwise know about our remerge-diff
> > >     trickery) can find them
> > >
> > > And that lets other writers avoid writing into the temporary area
> > > accidentally.
> >
> > Doesn't this have the same problem?  If something similar to textconv
> > caching were to create new trees that referenced the existing blobs
> > and then added a ref that referred to that tree, then you have the
> > exact same problem, right?  The new tree and the new ref would be
> > corrupt as soon as the tmp-objdir went away.  Whether or not other
> > callers write to the temporary area isn't the issue, it's whether they
> > write refs that can refer to anything from the temporary area.  Or am
> > I missing something?
>
> The key thing here is in the first step, where remerge-diff is
> explicitly saying "I want to write to this temporary object-dir". But
> the textconv-cache code does not; it gets to write to the normal spot.
> So it avoids problem (1).
>
> You're right that it does not avoid problem (3) exactly. But triggering
> that would require some code not just writing other objects or
> references while the tmp-objdir is in place, but specifically
> referencing the objects that remerge-diff did put into the tmp-objdir.
> That seems a lot less likely to me (because the thing we're most worried
> about is unrelated code that just happens to write while the tmp-objdir
> is in place).
>
> > > In that sense this is kind of like the pretend_object_file() interface,
> > > except that it's storing the objects on disk instead of in memory. Of
> > > course another way of doing that would be to stuff the object data into
> > > tempfiles and just put pointers into the in-memory cached_objects array.
> > >
> > > It's also not entirely foolproof (nor is the existing
> > > pretend_object_file()). Any operation which is fooled into thinking we
> > > have object X because it's in the fake-object list or the tmp_objdir may
> > > reference that object erroneously, creating a corruption. But it's still
> > > safer than allowing arbitrary writes into the tmp_objdir.
> >
> > Why is the pretend_object_file() interface safer?  Does it disable the
> > textconv caching somehow?  I don't see why it helps avoid this
> > problem.
>
> So hopefully it's clearer now from what I wrote above, but just
> connecting the dots:
>
>   1. Unrelated code calling write_object_file() would still write real,
>      durable objects, as usual.
>
>   2. The write_object_file() "skip this write" optimization already
>      ignores the pretend_object_file() objects while checking "do we
>      have this object".
>
> > Interesting.  I'm not familiar with the pretend_object_file() code,
> > but I think I catch what you're saying.  I don't see anything in
> > object-file.c that allows removing items from the cache, as I would
> > want to do, but I'm sure I could add something.
>
> Right, it's pretty limited now, and there's only one caller. And in fact
> I've wanted to get rid of it, exactly because it can create the danger
> we're discussing here. But I still think it's _less_ dangerous than
> fully replacing the object directory.
>
> > (I'm still not sure how this avoids the problem of things like
> > textconv caching trying to write an object that references one of
> > these, though.  Should we just manually disable textconv caching
> > during a remerge-diff?)
>
> It can't avoid it completely, but in general, the textconv cache should
> be writing and referencing its own objects. Even if they happen to be
> the same as something remerge-diff generated, it won't know that until
> it has tried to write it (and so that's why having write_object_file()
> continue to really write the object is important).
>
>
> Hopefully that clears up my line of thinking. I do think a lot of this
> is kind of theoretical, in the sense that:
>
>   - textconv caching is an obscure feature that we _could_ just disable
>     during remerge-diffs
>
>   - there's a reasonable chance that there isn't any other code that
>     wants to write during a diff
>
> But this is all scary and error-prone enough that I'd prefer an approach
> that has the "least surprise". So any solution where random code calling
> write_object_file() _can't_ accidentally write to a throw-away directory
> seems like a safer less surprising thing.

Yes, thanks for taking the time to go into this detail.

> It does mean that the remerge-diff code needs to be explicit in its
> object writes to say "this needs to go to the temporary area" (whether
> it's a flag, or calling into a totally different function or subsystem).
> I'm hoping that's not hard to do (because its writes are done explicitly
> by the remerge-diff code), but I worry that it may be (because you are
> relying on more generic tree code to the write under the hood).

All blob and tree objects written during merge-ort are done by
explicit write_object_file() calls that exist within merge-ort.c.  So
this approach should be doable, it just needs the external caller to
set some flag saying to write these objects elsewhere.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-30  7:46         ` Jeff King
  2021-09-30 20:06           ` Junio C Hamano
@ 2021-10-01  2:31           ` Elijah Newren
  2021-10-01  5:21             ` Jeff King
  1 sibling, 1 reply; 64+ messages in thread
From: Elijah Newren @ 2021-10-01  2:31 UTC (permalink / raw)
  To: Jeff King
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

On Thu, Sep 30, 2021 at 12:46 AM Jeff King <peff@peff.net> wrote:
>
> On Thu, Sep 30, 2021 at 03:26:42AM -0400, Jeff King wrote:
>
> > > > If you remove the tmp_objdir as the primary as soon as you're done with
> > > > the merge, but before you run the diff, you might be OK, though.
> > >
> > > It has to be after I run the diff, because the diff needs access to
> > > the temporary files to diff against them.
> >
> > Right, of course. I was too fixated on the object-write part, forgetting
> > that the whole point of the exercise is to later read them back. :)
>
> Ah, no, I remember what I was trying to say here. The distinction is
> between "remove the tmp_objdir" and "remove it as the primary".
>
> I.e., if you do this:
>
>   1. create tmp_objdir
>
>   2. make tmp_objdir primary for writes
>
>   3. run the "merge" half of remerge-diff, writing objects into the
>      temporary space
>
>   4. stop having tmp_objdir as the primary; instead make it an alternate
>
>   5. run the diff
>
>   6. remove tmp_objdir totally
>
> Then step 5 can't accidentally write objects into the temporary space,
> but it can still read them. So it's not entirely safe, but it's safer,
> and it would be a much smaller change.

Interesting.

> Some ways it could go wrong:
>
>   - is it possible for the merge code to ever write an object? I kind of
>     wonder if we'd ever do any cache-able transformations as part of a
>     content-level merge. I don't think we do now, though.

Yes, of course -- otherwise there would have been no need for the
tmp_objdir in the first place.  In particular, it needs to write
three-way-content merges of individual files, and it needs to write
new tree objects.  (And it needs to do this both for creating the
virtual merge bases if the merge is recursive, as well as doing it for
the outer merge.)

It doesn't write anything for caching reasons, such as line ending
normalizations (that's all kept in-memory and done on demand).

>   - in step 5, write_object_file() may still be confused by the presence
>     of the to-be-thrown-away objects in the alternate. This is pretty
>     unlikely, as it implies that the remerge-diff wrote a blob or tree
>     that is byte-identical to something that the diff wants to write.

That's one reason it could be confused.  The textconv filtering in
particular was creating a new object based on an existing one, and a
tree, and a ref.  What if there was some other form of caching or
statistic gathering that didn't write a new object based on an
existing one, but did add trees and especially refs that referenced
the existing object?  It's not that the diff wanted to write something
byte-identical to what the merge wrote, it's just that the diff wants
to reference the object somehow.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-30 13:16         ` Ævar Arnfjörð Bjarmason
  2021-09-30 21:00           ` Jeff King
@ 2021-10-01  3:11           ` Elijah Newren
  2021-10-01  7:30             ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 64+ messages in thread
From: Elijah Newren @ 2021-10-01  3:11 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff King, Junio C Hamano, Elijah Newren via GitGitGadget,
	Git Mailing List, Jonathan Nieder, Sergey Organov

On Thu, Sep 30, 2021 at 6:31 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Thu, Sep 30 2021, Jeff King wrote:
>
> > On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote:
> >
> >> Jeff King <peff@peff.net> writes:
> >>
> >> >   Side note: The pretend_object_file() approach is actually even better,
> >> >   because we know the object is fake. So it does not confuse
> >> >   write_object_file()'s "do we already have this object" freshening
> >> >   check.
> >> >
> >> >   I suspect it could even be made faster than the tmp_objdir approach.
> >> >   From our perspective, these objects really are tempfiles. So we could
> >> >   write them as such, not worrying about things like fsyncing them,
> >> >   naming them into place, etc. We could just write them out, then mmap
> >> >   the results, and put the pointers into cached_objects (currently it
> >> >   insists on malloc-ing a copy of the input buffer, but that seems like
> >> >   an easy extension to add).
> >> >
> >> >   In fact, I think you could get away with just _one_ tempfile per
> >> >   merge. Open up one tempfile. Write out all of the objects you want to
> >> >   "store" into it in sequence, and record the lseek() offsets before and
> >> >   after for each object. Then mmap the whole result, and stuff the
> >> >   appropriate pointers (based on arithmetic with the offsets) into the
> >> >   cached_objects list.
> >>
> >> Cute.  The remerge diff code path creates a full tree that records
> >> the mechanical merge result.  By hooking into the lowest layer of
> >> write_object() interface, we'd serialize all objects in such a tree
> >> in the order they are computed (bottom up from the leaf level, I'd
> >> presume) into a single flat file ;-)
> >
> > I do still like this approach, but just two possible gotchas I was
> > thinking of:
> >
> >  - This side-steps all of our usual code for getting object data into
> >    memory. In general, I'd expect this content to not be too enormous,
> >    but it _could_ be if there are many / large blobs in the result. So
> >    we may end up with large maps. Probably not a big deal on modern
> >    64-bit systems. Maybe an issue on 32-bit systems, just because of
> >    virtual address space.
> >
> >    Likewise, we do support systems with NO_MMAP. They'd work here, but
> >    it would probably mean putting all that object data into the heap. I
> >    could live with that, given how rare such systems are these days, and
> >    that it only matters if you're using --remerge-diff with big blobs.
> >
> >  - I wonder to what degree --remerge-diff benefits from omitting writes
> >    for objects we already have. I.e., if you are writing out a whole
> >    tree representing the conflicted state, then you don't want to write
> >    all of the trees that aren't interesting. Hopefully the code is
> >    already figuring out which paths the merge even touched, and ignoring
> >    the rest. It probably benefits performance-wise from
> >    write_object_file() deciding to skip some object writes, as well
> >    (e.g., for resolutions which the final tree already took, as they'd
> >    be in the merge commit). The whole pretend-we-have-this-object thing
> >    may want to likewise make sure we don't write out objects that we
> >    already have in the real odb.
>
> I haven't benchmarked since my core.checkCollisions RFC patch[1]
> resulted in the somewhat related loose object cache patch from you, and
> not with something like the midx, but just a note that on some setups
> just writing things out is faster than exhaustively checking if we
> absolutely need to write things out.
>
> I also wonder how much if anything writing out the one file v.s. lots of
> loose objects is worthwhile on systems where we could write out those
> loose objects on a ramdisk, which is commonly available on e.g. Linux
> distros these days out of the box. If you care about performance but not
> about your transitory data using a ramdisk is generally much better than
> any other potential I/O optimization.
>
> Finally, and I don't mean to throw a monkey wrench into this whole
> discussion, so take this as a random musing: I wonder how much faster
> this thing could be on its second run if instead of avoiding writing to
> the store & cleaning up, it just wrote to the store, and then wrote

It'd be _much_ slower.  My first implementation in fact did that; it
just wrote objects to the store, left them there, and didn't bother to
do any auto-gcs.  It slowed down quite a bit as it ran.  Adding
auto-gc's during the run were really slow too. But stepping back,
gc'ing objects that I already knew were garbage seemed like a waste;
why not just prune them pre-emptively?  To do so, though, I'd have to
track all the individual objects I added to make sure I didn't prune
something else.  Following that idea and a few different attempts
eventually led me to the discovery of tmp_objdir.

In case it's not clear to you why just writing all the objects to the
normal store and leaving them there slows things down so much...

Let's say 1 in 10 merges had originally needed some kind of conflict
resolution (either in the outer merge or in the inner merge for the
virtual merge bases), meaning that 9 out of 10 merges traversed by
--remerge-diff don't write any objects.  Now for each merge for which
--remerge-diff does need to create conflicted blobs and new trees,
let's say it writes on average 3 blobs and 7 trees.  (I don't know the
real average numbers, it could well be ~5 total, but ~10 seems a
realistic first order approximation and it makes the math easy.)
Then, if we keep all objects we write, then `git log --remerge-diff`
on a history with 100,000 merge commits, will have added 100,000 loose
objects by the time it finishes.  That means that all diff and merge
operations slow down considerably as it runs due to all the extra
loose objects.

> another object keyed on the git version and any revision paramaters
> etc., and then pretty much just had to do a "git cat-file -p <that-obj>"
> to present the result to the user :)

So caching the full `git log ...` output, based on a hash of the
command line flags, and then merely re-showing it later?  And having
that output be invalidated as soon as any head advances?  Or are you
thinking of caching the output per-commit based on a hash of the other
command line flags...potentially slowing non-log operations down with
the huge number of loose objects?

> I suppose that would be throwing a lot more work at an eventual "git gc"
> than we ever do now, so maybe it's a bit crazy, but I think it might be
> an interesting direction in general to (ab)use either the primary or
> some secondary store in the .git dir as a semi-permanent cache of
> resolved queries from the likes of "git log".

If you do per-commit caching, and the user scrolls through enough
output (not hard to do, just searching the output for some string
often is enough), that "eventual" git-gc will be the very next git
operation.  If you cache the entire output, it'll be invalidated
pretty quickly.  So I don't see how this works.  Or am I
misunderstanding something you're suggesting here?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-30 14:17       ` Ævar Arnfjörð Bjarmason
@ 2021-10-01  3:55         ` Elijah Newren
  0 siblings, 0 replies; 64+ messages in thread
From: Elijah Newren @ 2021-10-01  3:55 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Jeff King,
	Jonathan Nieder, Sergey Organov, Neeraj Singh

On Thu, Sep 30, 2021 at 7:39 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Tue, Sep 28 2021, Elijah Newren wrote:
>
> > On Tue, Sep 28, 2021 at 1:00 AM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >>
> >> On Tue, Aug 31 2021, Elijah Newren via GitGitGadget wrote:
> >>
> >> I commented just now on how this API is duplicated between here &
> >> another in-flight series in
> >> https://lore.kernel.org/git/87v92lxhh4.fsf@evledraar.gmail.com/; Just
> >> comments on this patch here:
> >>
> >> > diff --git a/tmp-objdir.c b/tmp-objdir.c
> >> > index b8d880e3626..9f75a75d1c0 100644
> >> > --- a/tmp-objdir.c
> >> > +++ b/tmp-objdir.c
> >> > @@ -288,7 +288,36 @@ const char **tmp_objdir_env(const struct tmp_objdir *t)
> >> >       return t->env.v;
> >> >  }
> >> >
> >> > +const char *tmp_objdir_path(struct tmp_objdir *t)
> >> > +{
> >> > +     return t->path.buf;
> >> > +}
> >>
> >> Not your fault & this pre-dates your patch, but FWIW I prefer our APIs
> >> that don't have these "hidden struct" shenanigans (like say "struct
> >> string_list") so a caller could just access this, we can just declare it
> >> "const" appropriately.
> >>
> >> We're also all in-tree friends here, so having various accessors for no
> >> reason other than to access struct members seems a bit too much.
> >
> > That's a fair point, but just to play counterpoint for a minute...
> >
> > FWIW, I dislike when our public facing APIs are polluted with all
> > kinds of internal details.  merge-recursive being a case in point.
> > When writing merge-ort, although I had a struct with public fields,
> > that struct also contained an opaque struct (pointer) within it to
> > hide several private fields.  (I would have liked to hide or remove a
> > few more fields, but couldn't do so while the merge_recursive_options
> > struct was shared between both merge-ort and merge-recursive.)
> >
> > That said, I agree it can certainly be overdone, and tmp_objdir is
> > pretty simple.  However, sometimes even in simple cases I like when
> > folks make use of an opaque struct because it means folks put some
> > effort into thinking more about the API that should be exposed.
> > That's something we as a project have often overlooked in the past, as
> > evidenced both by our one-shot usage mentality, and the existence of
> > external projects like libgit2 attempting to correct this design
> > shortcoming.  I'd like git to move more towards being structured as a
> > reusable library as well as a useful command-line tool.
>
> [This is somewhat of a continuation of
> https://lore.kernel.org/git/87lf3etaih.fsf@evledraar.gmail.com/].
>
> I like our APIs being usable, but right now we're not promising any
> maintenance of APIs that work the same as git v2.30.0 or whatever.
>
> I think some external users directly link to libgit.a, but they're
> tracking whatever churn we have from release to release, just like
> someone maintaining out-of-tree linux kernel drivers.
>
> For maintenance of git.git itself I think it's going to stay like that
> for any foreseeable future[1].
>
> And that's before we even get into the practicalities of git's license
> likely not winning over many (if any) libgit2 users. I think those users
> are if anything migrating from libgit2 to git's plumbing CLI, or in some
> cases folding what they needed into git.git itself (or just staying with
> libgit2).
>
> So assuming all of that I really don't see the point in general of
> having a "foo" field that has the equivalent of get_foo() and set_foo()
> accessors in the context of git.git.
>
> That's something that's really useful if you're maintaining API and ABI
> compatibility, but for us it's usually just verbosity. All the users are
> in-tree, so just add "const" as appropriate, or perhaps we should give
> the field a "priv" name or whatever.

It wasn't just a get or set accessors, and there was more involved
than just adding const:
  * There was a typechange from struct strbuf -> const char *; strbuf != char *
  * Having an API that _enforces_ accesses to be const provides more
safety.  Relying on callers to remember to declare their variable
access as const provides virtually none.

But more importantly than either of the above, tmp_objdir was locked
down because it was too easy to make big mistakes with, as Peff
pointed out.  That meant that folks would only use it through the API,
and if folks tried to do unusual things, they'd do so by adding new
API, which is easier to flag in code reviews and make sure the right
people get cc'ed than if folks just reach into some internals from
some other file.  And the new API that I added resulted in a
discussion that pointed out I didn't need to access that field at all
and should instead create and delete the tmp_objdir after every merge.

> Obviously not everything should be done that way, e.g. being able to
> have trace2 start/end points for something you'd otherwise push/clear
> from a "struct string_list" or whatever directly *is* usable, so it's
> not black and white.
>
> I'm just saying thaht using convoluted patterns *just* to get in the way
> of some hypothetical code that's going to mess with your internals can
> make things harder in other areas. E.g. the init/release pattern via
> macros I mentioned in the linked E-Mail above.

Fair points, but I just don't think that's relevant to the context
here; i.e. these aren't convoluted patterns *just* to get in the way.

Further, I don't see how this would adversely affect the *_INIT macros
you mentioned at all.  tmp_objdir is an opaque struct so you can only
declare a pointer to it.  Pointers being initialized to NULL is
handled quite nicely by the *_INIT macros.

> > [...]
> >> In this particular case I hadn't understood on a previous reading of
> >> tmp-objdir.c why this "path" isn't a statically allocated "char
> >> path[PATH_MAX]", or a least we that hardcoded "1024" should be a
> >> PATH_MAX, as it is in some other cases.
> >
> > Actually, PATH_MAX shouldn't be used:
> >
> > https://lore.kernel.org/git/7d1237c7-5a83-d766-7d93-5f0d59166067@web.de/
> > https://lore.kernel.org/git/20180720180427.GE22486@sigill.intra.peff.net/
>
> Thanks. I didn't know about that. I've fixed (some of that) up in some
> semi-related WIP work I've got.
>
> 1. Aside: I *do* think it would be really useful for us to expose a
>    stable C API eventually, using some of the same license
>    workarounds/shenanigans (depending on your view) as the Linux kernel
>    did. I.e. to say that "we're GPLv2, but this small API is the
>    equivalent of our syscall interface".
>
>    Rather than that API being something that would look anything like
>    internal git.git users it would basically be a run_command() + Perl's
>    unpack/pack + output formatter on steroids.
>
>    I.e. we'd be able to invoke say a 'git cat-file --batch' in the same
>    process as the caller by essentially calling cmd_cat_file()
>    directly. Just with some callback replacements for "provide this on
>    stdin" and "this line accepts stdout" without actually involving
>    different processes or I/O.
>
>    You'd then have the equivalent of a --format output that would just
>    so happen to be in your OS's / C ABI's understood struct
>    format. Similar to "perldoc -f pack".
>
>    The maintenance burden of such an API would be trivial. It would
>    basically be a handful of functions just to setup input/output
>    callbacks.
>
>    Most of the interface would be a rather convoluted construction of
>    command-line parameters (like our own run_command() use), not ideal,
>    but most users are after structured data and eliminating process
>    overhead, and they'd get that.
>
>    We would need to replace fprintf() and the like with some structured
>    formatting mechanism, but that would be useful for non-API users as
>    new commands learning a shiny new --format parameter. You could even
>    use such a --format to have a non-C-API user get the same struct
>    output, perhaps for direct mapping into memory, say if you didn't
>    trust the built-in/command in question to not leak memory yet.
>
>    I don't know if this has been suggested before, just my 0.02. It's
>    also partially why I'm interested in making even-built-ins not leak
>    memory with some of my parallel SANITIZE=leak work.
>
>    Once you've got that you can usually call the cmd_whatever() in a
>    loop without much effort, and once you've got that, well, see above
>    :)

This sounds really cool.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-30 20:06           ` Junio C Hamano
@ 2021-10-01  3:59             ` Elijah Newren
  2021-10-01 16:36               ` Junio C Hamano
  0 siblings, 1 reply; 64+ messages in thread
From: Elijah Newren @ 2021-10-01  3:59 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

On Thu, Sep 30, 2021 at 1:06 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Jeff King <peff@peff.net> writes:
>
> >   - is it possible for the merge code to ever write an object? I kind of
> >     wonder if we'd ever do any cache-able transformations as part of a
> >     content-level merge. I don't think we do now, though.
>
> How is virtual merge base, result of an inner merge that recursively
> merging two merge bases, fed to an outer merge as the ancestor?
> Isn't it written as a tree object by the inner merge as a tree with
> full of blob objects, so the outer merge does not have to treat a
> merge base tree that is virtual any specially from a sole merge
> base?

Yes.  And that's not unique to the inner merges; it also writes new
blob and tree objects for the outer merge, and then passes the
resulting toplevel tree id out as the result.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-09-30  7:33       ` Jeff King
  2021-09-30 13:16         ` Ævar Arnfjörð Bjarmason
@ 2021-10-01  4:26         ` Elijah Newren
  2021-10-01  5:27           ` Jeff King
  1 sibling, 1 reply; 64+ messages in thread
From: Elijah Newren @ 2021-10-01  4:26 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

On Thu, Sep 30, 2021 at 12:33 AM Jeff King <peff@peff.net> wrote:
>
> On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote:
>
> > Jeff King <peff@peff.net> writes:
> >
> > >   Side note: The pretend_object_file() approach is actually even better,
> > >   because we know the object is fake. So it does not confuse
> > >   write_object_file()'s "do we already have this object" freshening
> > >   check.
> > >
> > >   I suspect it could even be made faster than the tmp_objdir approach.
> > >   From our perspective, these objects really are tempfiles. So we could
> > >   write them as such, not worrying about things like fsyncing them,
> > >   naming them into place, etc. We could just write them out, then mmap
> > >   the results, and put the pointers into cached_objects (currently it
> > >   insists on malloc-ing a copy of the input buffer, but that seems like
> > >   an easy extension to add).
> > >
> > >   In fact, I think you could get away with just _one_ tempfile per
> > >   merge. Open up one tempfile. Write out all of the objects you want to
> > >   "store" into it in sequence, and record the lseek() offsets before and
> > >   after for each object. Then mmap the whole result, and stuff the
> > >   appropriate pointers (based on arithmetic with the offsets) into the
> > >   cached_objects list.
> >
> > Cute.  The remerge diff code path creates a full tree that records
> > the mechanical merge result.  By hooking into the lowest layer of
> > write_object() interface, we'd serialize all objects in such a tree
> > in the order they are computed (bottom up from the leaf level, I'd
> > presume) into a single flat file ;-)
>
> I do still like this approach, but just two possible gotchas I was
> thinking of:
>
>  - This side-steps all of our usual code for getting object data into
>    memory. In general, I'd expect this content to not be too enormous,
>    but it _could_ be if there are many / large blobs in the result. So
>    we may end up with large maps. Probably not a big deal on modern
>    64-bit systems. Maybe an issue on 32-bit systems, just because of
>    virtual address space.
>
>    Likewise, we do support systems with NO_MMAP. They'd work here, but
>    it would probably mean putting all that object data into the heap. I
>    could live with that, given how rare such systems are these days, and
>    that it only matters if you're using --remerge-diff with big blobs.

Um, I'm starting to get uncomfortable with this pretend_object stuff.
Part of the reason that merge-ort isn't truly "in memory" despite
attempting to do exactly that, was because for large enough repos with
enough files modified on both sides, I wasn't comfortable assuming
that all new files from three-way content merges and all new trees fit
into memory.  I'm sure we'd be fine with current-day linux kernel
sized repos.  No big deal.  In fact, most merges probably don't add
more than a few dozen new files.  But for microsoft-sized repos, and
with repos tending to grow over time, more so when the tools
themselves scale nicely (which we've all been working on enabling),
makes me worry there might be enough new objects within a single merge
(especially given the recursive inner merges) that we might need to
worry about this.

>  - I wonder to what degree --remerge-diff benefits from omitting writes
>    for objects we already have. I.e., if you are writing out a whole
>    tree representing the conflicted state, then you don't want to write
>    all of the trees that aren't interesting. Hopefully the code is
>    already figuring out which paths the merge even touched, and ignoring
>    the rest.

Not only do you want to avoid writing all of the trees that aren't
interesting, you also want to avoid traversing into them in the first
place and avoid doing trivial file merges for each entry underneath.
Sadly, merge-recursive did anyway.  Because renames and directory
renames can sometimes throw a big wrench in the desire to avoid
traversing into such directories.  Figuring out how to avoid it was
kind of tricky, but merge-ort definitely handles this when it can
safely do so; see the cover letter for my trivial directory resolution
series[1].

[1] https://lore.kernel.org/git/pull.988.v4.git.1626841444.gitgitgadget@gmail.com/

>    It probably benefits performance-wise from
>    write_object_file() deciding to skip some object writes, as well
>    (e.g., for resolutions which the final tree already took, as they'd
>    be in the merge commit).

Yes, it will also benefit from that, but the much bigger side is
avoiding needlessly recursing into directories unchanged on (at least)
one side.

>    The whole pretend-we-have-this-object thing
>    may want to likewise make sure we don't write out objects that we
>    already have in the real odb.

Right, so I'd have to copy the relevant logic from write_object_file()
-- I think that means instead of write_object_file()'s calls to
freshen_packed_object() and freshen_loose_object() that I instead call
find_pack_entry() and make has_loose_object() in object-file.c not be
static and then call it.  Does that sound right?

Of course, that's assuming we're okay with this pretend_object thing,
which I'm starting to worry about.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-10-01  2:31           ` Elijah Newren
@ 2021-10-01  5:21             ` Jeff King
  0 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2021-10-01  5:21 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

On Thu, Sep 30, 2021 at 07:31:44PM -0700, Elijah Newren wrote:

> > Some ways it could go wrong:
> >
> >   - is it possible for the merge code to ever write an object? I kind of
> >     wonder if we'd ever do any cache-able transformations as part of a
> >     content-level merge. I don't think we do now, though.
> 
> Yes, of course -- otherwise there would have been no need for the
> tmp_objdir in the first place.  In particular, it needs to write
> three-way-content merges of individual files, and it needs to write
> new tree objects.  (And it needs to do this both for creating the
> virtual merge bases if the merge is recursive, as well as doing it for
> the outer merge.)

Right, sorry, I should have added: ...in addition to the merge-results
we're writing. What I'm getting at is whether there might be _other_
parts of the code that the merge code would invoke. Say, if a
.gitattribute caused us to convert a file's encoding in order to perform
a more semantic merge, and we wanted to store that result somehow (e.g.,
in a similar cache).

I don't think anything like that exists now, but I don't find it outside
the realm of possibility in the future. It's sort of the merge
equivalent of "textconv"; we have external diff and external merge
drivers, but being able to convert contents and feed them to the regular
text diff/merge code simplifies things.

> >   - in step 5, write_object_file() may still be confused by the presence
> >     of the to-be-thrown-away objects in the alternate. This is pretty
> >     unlikely, as it implies that the remerge-diff wrote a blob or tree
> >     that is byte-identical to something that the diff wants to write.
> 
> That's one reason it could be confused.  The textconv filtering in
> particular was creating a new object based on an existing one, and a
> tree, and a ref.  What if there was some other form of caching or
> statistic gathering that didn't write a new object based on an
> existing one, but did add trees and especially refs that referenced
> the existing object?  It's not that the diff wanted to write something
> byte-identical to what the merge wrote, it's just that the diff wants
> to reference the object somehow.

Yes, good point. It can help a little if the alternate-odb added by
tmp_objdir was marked with a flag to say "hey, this is temporary". That
would help write_object_file() decide not to depend on it. But the
problem is so much wider than that that I think it will always be a bit
dangerous; any code could say "can I access this object" and we don't
know if its intent is simply to read it, or to create a new object that
references it.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-10-01  4:26         ` Elijah Newren
@ 2021-10-01  5:27           ` Jeff King
  2021-10-01  7:43             ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 64+ messages in thread
From: Jeff King @ 2021-10-01  5:27 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Junio C Hamano, Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

On Thu, Sep 30, 2021 at 09:26:37PM -0700, Elijah Newren wrote:

> >  - This side-steps all of our usual code for getting object data into
> >    memory. In general, I'd expect this content to not be too enormous,
> >    but it _could_ be if there are many / large blobs in the result. So
> >    we may end up with large maps. Probably not a big deal on modern
> >    64-bit systems. Maybe an issue on 32-bit systems, just because of
> >    virtual address space.
> >
> >    Likewise, we do support systems with NO_MMAP. They'd work here, but
> >    it would probably mean putting all that object data into the heap. I
> >    could live with that, given how rare such systems are these days, and
> >    that it only matters if you're using --remerge-diff with big blobs.
> 
> Um, I'm starting to get uncomfortable with this pretend_object stuff.
> Part of the reason that merge-ort isn't truly "in memory" despite
> attempting to do exactly that, was because for large enough repos with
> enough files modified on both sides, I wasn't comfortable assuming
> that all new files from three-way content merges and all new trees fit
> into memory.  I'm sure we'd be fine with current-day linux kernel
> sized repos.  No big deal.  In fact, most merges probably don't add
> more than a few dozen new files.  But for microsoft-sized repos, and
> with repos tending to grow over time, more so when the tools
> themselves scale nicely (which we've all been working on enabling),
> makes me worry there might be enough new objects within a single merge
> (especially given the recursive inner merges) that we might need to
> worry about this.

I do think we need to consider that the content might be larger than
will comfortably fit in memory. But the point of using mmap is that we
don't have to care. The OS is taking care of it for us (just like it
would in regular object files).

The question is just whether we're comfortable assuming that mmap
exists if you're working on such a large repository. I'd guess that big
repos are pretty painful with out it (and again, I'm not even clear
which systems Git runs on even lack mmap these days). So IMHO this isn't
really a blocker for going in this direction.

> >    The whole pretend-we-have-this-object thing
> >    may want to likewise make sure we don't write out objects that we
> >    already have in the real odb.
> 
> Right, so I'd have to copy the relevant logic from write_object_file()
> -- I think that means instead of write_object_file()'s calls to
> freshen_packed_object() and freshen_loose_object() that I instead call
> find_pack_entry() and make has_loose_object() in object-file.c not be
> static and then call it.  Does that sound right?

Yes. You _could_ call the same freshening functions, but I think since
we know that our objects are throw-away anyway, we do not even need to
perform that freshening operation. I think just has_object_file() would
be sufficient for your needs.

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/7] merge-ort: add ability to record conflict messages in a file
  2021-10-01  2:07         ` Elijah Newren
@ 2021-10-01  5:28           ` Jeff King
  0 siblings, 0 replies; 64+ messages in thread
From: Jeff King @ 2021-10-01  5:28 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

On Thu, Sep 30, 2021 at 07:07:21PM -0700, Elijah Newren wrote:

> > That said, I think it could also make sense to annotate the conflict
> > files by giving them an unusual set of mode bits. That would be easier
> > to recognize (and while real trees _could_ have silly modes, we do
> > complain about them in fsck, so it shouldn't happen in practice).
> 
> I tried giving things unusual mode bits in early versions of merge-ort
> for other reasons.  It doesn't work: canon_mode() will "fix" the
> unusualness so there's no way for the reader to know that they had
> unusual bits.

Ah, right, I totally forgot about that.

> But, as you said later, avoiding these pseudo-files is going to be
> cleaner anyway, so let's just do that.

I'm quite happy if you're on board with that direction. :)

-Peff

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 4/7] merge-ort: capture and print ll-merge warnings in our preferred fashion
  2021-10-01  1:54     ` Elijah Newren
@ 2021-10-01  7:23       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 64+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-01  7:23 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Jeff King,
	Jonathan Nieder, Sergey Organov


On Thu, Sep 30 2021, Elijah Newren wrote:

> On Thu, Sep 30, 2021 at 9:53 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>
>> On Tue, Aug 31 2021, Elijah Newren via GitGitGadget wrote:
>>
>> > @@ -108,8 +108,14 @@ test_expect_success 'refuse to merge binary files' '
>> >       printf "\0\0" >binary-file &&
>> >       git add binary-file &&
>> >       git commit -m binary2 &&
>> > -     test_must_fail git merge F >merge.out 2>merge.err &&
>> > -     grep "Cannot merge binary files: binary-file (HEAD vs. F)" merge.err
>> > +     if test "$GIT_TEST_MERGE_ALGORITHM" = ort
>> > +     then
>> > +             test_must_fail git merge F >merge.out &&
>> > +             grep "Cannot merge binary files: binary-file (HEAD vs. F)" merge.out
>> > +     else
>> > +             test_must_fail git merge F >merge.out 2>merge.err &&
>> > +             grep "Cannot merge binary files: binary-file (HEAD vs. F)" merge.err
>> > +     fi
>> >  '
>>
>> To save readers from eyeballing if a single character has changed here,
>> which I don't think it has, just do:
>>
>> if ...
>> then
>>         cmd >actual
>> else
>>         other cmd >actual
>
> Do you mean
>    other cmd >/dev/null 2>actual
> ?

Yes, sorry. I was going for lazy pseudocode. I meant maybe the "grep"
can be shared between the two.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-10-01  3:11           ` Elijah Newren
@ 2021-10-01  7:30             ` Ævar Arnfjörð Bjarmason
  2021-10-01  8:03               ` Elijah Newren
  0 siblings, 1 reply; 64+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-01  7:30 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Jeff King, Junio C Hamano, Elijah Newren via GitGitGadget,
	Git Mailing List, Jonathan Nieder, Sergey Organov


On Thu, Sep 30 2021, Elijah Newren wrote:

> On Thu, Sep 30, 2021 at 6:31 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> On Thu, Sep 30 2021, Jeff King wrote:
>>
>> > On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote:
>> >
>> >> Jeff King <peff@peff.net> writes:
>> >>
>> >> >   Side note: The pretend_object_file() approach is actually even better,
>> >> >   because we know the object is fake. So it does not confuse
>> >> >   write_object_file()'s "do we already have this object" freshening
>> >> >   check.
>> >> >
>> >> >   I suspect it could even be made faster than the tmp_objdir approach.
>> >> >   From our perspective, these objects really are tempfiles. So we could
>> >> >   write them as such, not worrying about things like fsyncing them,
>> >> >   naming them into place, etc. We could just write them out, then mmap
>> >> >   the results, and put the pointers into cached_objects (currently it
>> >> >   insists on malloc-ing a copy of the input buffer, but that seems like
>> >> >   an easy extension to add).
>> >> >
>> >> >   In fact, I think you could get away with just _one_ tempfile per
>> >> >   merge. Open up one tempfile. Write out all of the objects you want to
>> >> >   "store" into it in sequence, and record the lseek() offsets before and
>> >> >   after for each object. Then mmap the whole result, and stuff the
>> >> >   appropriate pointers (based on arithmetic with the offsets) into the
>> >> >   cached_objects list.
>> >>
>> >> Cute.  The remerge diff code path creates a full tree that records
>> >> the mechanical merge result.  By hooking into the lowest layer of
>> >> write_object() interface, we'd serialize all objects in such a tree
>> >> in the order they are computed (bottom up from the leaf level, I'd
>> >> presume) into a single flat file ;-)
>> >
>> > I do still like this approach, but just two possible gotchas I was
>> > thinking of:
>> >
>> >  - This side-steps all of our usual code for getting object data into
>> >    memory. In general, I'd expect this content to not be too enormous,
>> >    but it _could_ be if there are many / large blobs in the result. So
>> >    we may end up with large maps. Probably not a big deal on modern
>> >    64-bit systems. Maybe an issue on 32-bit systems, just because of
>> >    virtual address space.
>> >
>> >    Likewise, we do support systems with NO_MMAP. They'd work here, but
>> >    it would probably mean putting all that object data into the heap. I
>> >    could live with that, given how rare such systems are these days, and
>> >    that it only matters if you're using --remerge-diff with big blobs.
>> >
>> >  - I wonder to what degree --remerge-diff benefits from omitting writes
>> >    for objects we already have. I.e., if you are writing out a whole
>> >    tree representing the conflicted state, then you don't want to write
>> >    all of the trees that aren't interesting. Hopefully the code is
>> >    already figuring out which paths the merge even touched, and ignoring
>> >    the rest. It probably benefits performance-wise from
>> >    write_object_file() deciding to skip some object writes, as well
>> >    (e.g., for resolutions which the final tree already took, as they'd
>> >    be in the merge commit). The whole pretend-we-have-this-object thing
>> >    may want to likewise make sure we don't write out objects that we
>> >    already have in the real odb.
>>
>> I haven't benchmarked since my core.checkCollisions RFC patch[1]
>> resulted in the somewhat related loose object cache patch from you, and
>> not with something like the midx, but just a note that on some setups
>> just writing things out is faster than exhaustively checking if we
>> absolutely need to write things out.
>>
>> I also wonder how much if anything writing out the one file v.s. lots of
>> loose objects is worthwhile on systems where we could write out those
>> loose objects on a ramdisk, which is commonly available on e.g. Linux
>> distros these days out of the box. If you care about performance but not
>> about your transitory data using a ramdisk is generally much better than
>> any other potential I/O optimization.
>>
>> Finally, and I don't mean to throw a monkey wrench into this whole
>> discussion, so take this as a random musing: I wonder how much faster
>> this thing could be on its second run if instead of avoiding writing to
>> the store & cleaning up, it just wrote to the store, and then wrote
>
> It'd be _much_ slower.  My first implementation in fact did that; it
> just wrote objects to the store, left them there, and didn't bother to
> do any auto-gcs.  It slowed down quite a bit as it ran.  Adding
> auto-gc's during the run were really slow too. But stepping back,
> gc'ing objects that I already knew were garbage seemed like a waste;
> why not just prune them pre-emptively?  To do so, though, I'd have to
> track all the individual objects I added to make sure I didn't prune
> something else.  Following that idea and a few different attempts
> eventually led me to the discovery of tmp_objdir.
>
> In case it's not clear to you why just writing all the objects to the
> normal store and leaving them there slows things down so much...
>
> Let's say 1 in 10 merges had originally needed some kind of conflict
> resolution (either in the outer merge or in the inner merge for the
> virtual merge bases), meaning that 9 out of 10 merges traversed by
> --remerge-diff don't write any objects.  Now for each merge for which
> --remerge-diff does need to create conflicted blobs and new trees,
> let's say it writes on average 3 blobs and 7 trees.  (I don't know the
> real average numbers, it could well be ~5 total, but ~10 seems a
> realistic first order approximation and it makes the math easy.)
> Then, if we keep all objects we write, then `git log --remerge-diff`
> on a history with 100,000 merge commits, will have added 100,000 loose
> objects by the time it finishes.  That means that all diff and merge
> operations slow down considerably as it runs due to all the extra
> loose objects.

...

>> another object keyed on the git version and any revision paramaters
>> etc., and then pretty much just had to do a "git cat-file -p <that-obj>"
>> to present the result to the user :)
>
> So caching the full `git log ...` output, based on a hash of the
> command line flags, and then merely re-showing it later?  And having
> that output be invalidated as soon as any head advances?  Or are you
> thinking of caching the output per-commit based on a hash of the other
> command line flags...potentially slowing non-log operations down with
> the huge number of loose objects?

Yeah I meant caching the full 'git log' output. Anyway, I think what you
said above about number of loose objects clearly makes that a stupid
suggestion on my part.

FWIW I meant that more as a "maybe in the future we can ..." musing than
anything for this series. I.e. in general I've often wanted say the
second run of a 'git log -G<rx>"' to be faster than it is, which could
use such on-the-fly caching.

But if we had such caching then something like --remerge-diff would need
to (from my understanding of what you're saying) need both this
temporary staging area for objects and perhaps to cache the output in
the main (or some cache) store. I.e. they're orthagonal.

>> I suppose that would be throwing a lot more work at an eventual "git gc"
>> than we ever do now, so maybe it's a bit crazy, but I think it might be
>> an interesting direction in general to (ab)use either the primary or
>> some secondary store in the .git dir as a semi-permanent cache of
>> resolved queries from the likes of "git log".
>
> If you do per-commit caching, and the user scrolls through enough
> output (not hard to do, just searching the output for some string
> often is enough), that "eventual" git-gc will be the very next git
> operation.  If you cache the entire output, it'll be invalidated
> pretty quickly.  So I don't see how this works.  Or am I
> misunderstanding something you're suggesting here?

With the caveat above that I think this is all a pretty stupid
suggestion on my part:

Just on the narrow aspect of how "git gc" behaves in that scenario: We'd
keep those objects around for 2 weeks by default, or whatever you have
gc.pruneExpire set to.

So such a setup would certainly cause lots of pathological issues (see
my [1] for one), but I don't think over-eager expiry of loose objects
would be one.

I'm not sure but are you referring to "invalidated pretty quickly"
because it would go over the gc.auto limit? If so that's now how it
works in this scenario, see [2]. I.e. it's moderated by the
gc.pruneExpire setting.

1. https://lore.kernel.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/
2. https://lore.kernel.org/git/87inc89j38.fsf@evledraar.gmail.com/

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-10-01  5:27           ` Jeff King
@ 2021-10-01  7:43             ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 64+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-01  7:43 UTC (permalink / raw)
  To: Jeff King
  Cc: Elijah Newren, Junio C Hamano, Elijah Newren via GitGitGadget,
	Git Mailing List, Jonathan Nieder, Sergey Organov


On Fri, Oct 01 2021, Jeff King wrote:

> On Thu, Sep 30, 2021 at 09:26:37PM -0700, Elijah Newren wrote:
>
>> >  - This side-steps all of our usual code for getting object data into
>> >    memory. In general, I'd expect this content to not be too enormous,
>> >    but it _could_ be if there are many / large blobs in the result. So
>> >    we may end up with large maps. Probably not a big deal on modern
>> >    64-bit systems. Maybe an issue on 32-bit systems, just because of
>> >    virtual address space.
>> >
>> >    Likewise, we do support systems with NO_MMAP. They'd work here, but
>> >    it would probably mean putting all that object data into the heap. I
>> >    could live with that, given how rare such systems are these days, and
>> >    that it only matters if you're using --remerge-diff with big blobs.
>> 
>> Um, I'm starting to get uncomfortable with this pretend_object stuff.
>> Part of the reason that merge-ort isn't truly "in memory" despite
>> attempting to do exactly that, was because for large enough repos with
>> enough files modified on both sides, I wasn't comfortable assuming
>> that all new files from three-way content merges and all new trees fit
>> into memory.  I'm sure we'd be fine with current-day linux kernel
>> sized repos.  No big deal.  In fact, most merges probably don't add
>> more than a few dozen new files.  But for microsoft-sized repos, and
>> with repos tending to grow over time, more so when the tools
>> themselves scale nicely (which we've all been working on enabling),
>> makes me worry there might be enough new objects within a single merge
>> (especially given the recursive inner merges) that we might need to
>> worry about this.
>
> I do think we need to consider that the content might be larger than
> will comfortably fit in memory. But the point of using mmap is that we
> don't have to care. The OS is taking care of it for us (just like it
> would in regular object files).
>
> The question is just whether we're comfortable assuming that mmap
> exists if you're working on such a large repository. I'd guess that big
> repos are pretty painful with out it (and again, I'm not even clear
> which systems Git runs on even lack mmap these days). So IMHO this isn't
> really a blocker for going in this direction.

On the not a blocker: Even without mmap() such a user also has the out
of increasing the size of their swap/page file.

And generally I agree that it's fair for us to say that if you've got
such outsized performance needs you're going to need something more than
the lowest common denominator git can be ported to.

I also wouldn't be surprised if in that scenario we'd run faster if we
were using memory (and no mmap) than if we tried to fallback to the FS,
i.e. your suggestion of replacing N loose objects with one file with
index offsets is basically what the OS would be doing for you as it
pages out memory it can't fit in RAM to disk.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-10-01  7:30             ` Ævar Arnfjörð Bjarmason
@ 2021-10-01  8:03               ` Elijah Newren
  0 siblings, 0 replies; 64+ messages in thread
From: Elijah Newren @ 2021-10-01  8:03 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff King, Junio C Hamano, Elijah Newren via GitGitGadget,
	Git Mailing List, Jonathan Nieder, Sergey Organov

On Fri, Oct 1, 2021 at 12:41 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Thu, Sep 30 2021, Elijah Newren wrote:
>
> > On Thu, Sep 30, 2021 at 6:31 AM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >>
> >> On Thu, Sep 30 2021, Jeff King wrote:
> >>
> >> > On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote:
> >> >
> >> >> Jeff King <peff@peff.net> writes:
> >> >>
> >> >> >   Side note: The pretend_object_file() approach is actually even better,
> >> >> >   because we know the object is fake. So it does not confuse
> >> >> >   write_object_file()'s "do we already have this object" freshening
> >> >> >   check.
> >> >> >
> >> >> >   I suspect it could even be made faster than the tmp_objdir approach.
> >> >> >   From our perspective, these objects really are tempfiles. So we could
> >> >> >   write them as such, not worrying about things like fsyncing them,
> >> >> >   naming them into place, etc. We could just write them out, then mmap
> >> >> >   the results, and put the pointers into cached_objects (currently it
> >> >> >   insists on malloc-ing a copy of the input buffer, but that seems like
> >> >> >   an easy extension to add).
> >> >> >
> >> >> >   In fact, I think you could get away with just _one_ tempfile per
> >> >> >   merge. Open up one tempfile. Write out all of the objects you want to
> >> >> >   "store" into it in sequence, and record the lseek() offsets before and
> >> >> >   after for each object. Then mmap the whole result, and stuff the
> >> >> >   appropriate pointers (based on arithmetic with the offsets) into the
> >> >> >   cached_objects list.
> >> >>
> >> >> Cute.  The remerge diff code path creates a full tree that records
> >> >> the mechanical merge result.  By hooking into the lowest layer of
> >> >> write_object() interface, we'd serialize all objects in such a tree
> >> >> in the order they are computed (bottom up from the leaf level, I'd
> >> >> presume) into a single flat file ;-)
> >> >
> >> > I do still like this approach, but just two possible gotchas I was
> >> > thinking of:
> >> >
> >> >  - This side-steps all of our usual code for getting object data into
> >> >    memory. In general, I'd expect this content to not be too enormous,
> >> >    but it _could_ be if there are many / large blobs in the result. So
> >> >    we may end up with large maps. Probably not a big deal on modern
> >> >    64-bit systems. Maybe an issue on 32-bit systems, just because of
> >> >    virtual address space.
> >> >
> >> >    Likewise, we do support systems with NO_MMAP. They'd work here, but
> >> >    it would probably mean putting all that object data into the heap. I
> >> >    could live with that, given how rare such systems are these days, and
> >> >    that it only matters if you're using --remerge-diff with big blobs.
> >> >
> >> >  - I wonder to what degree --remerge-diff benefits from omitting writes
> >> >    for objects we already have. I.e., if you are writing out a whole
> >> >    tree representing the conflicted state, then you don't want to write
> >> >    all of the trees that aren't interesting. Hopefully the code is
> >> >    already figuring out which paths the merge even touched, and ignoring
> >> >    the rest. It probably benefits performance-wise from
> >> >    write_object_file() deciding to skip some object writes, as well
> >> >    (e.g., for resolutions which the final tree already took, as they'd
> >> >    be in the merge commit). The whole pretend-we-have-this-object thing
> >> >    may want to likewise make sure we don't write out objects that we
> >> >    already have in the real odb.
> >>
> >> I haven't benchmarked since my core.checkCollisions RFC patch[1]
> >> resulted in the somewhat related loose object cache patch from you, and
> >> not with something like the midx, but just a note that on some setups
> >> just writing things out is faster than exhaustively checking if we
> >> absolutely need to write things out.
> >>
> >> I also wonder how much if anything writing out the one file v.s. lots of
> >> loose objects is worthwhile on systems where we could write out those
> >> loose objects on a ramdisk, which is commonly available on e.g. Linux
> >> distros these days out of the box. If you care about performance but not
> >> about your transitory data using a ramdisk is generally much better than
> >> any other potential I/O optimization.
> >>
> >> Finally, and I don't mean to throw a monkey wrench into this whole
> >> discussion, so take this as a random musing: I wonder how much faster
> >> this thing could be on its second run if instead of avoiding writing to
> >> the store & cleaning up, it just wrote to the store, and then wrote
> >
> > It'd be _much_ slower.  My first implementation in fact did that; it
> > just wrote objects to the store, left them there, and didn't bother to
> > do any auto-gcs.  It slowed down quite a bit as it ran.  Adding
> > auto-gc's during the run were really slow too. But stepping back,
> > gc'ing objects that I already knew were garbage seemed like a waste;
> > why not just prune them pre-emptively?  To do so, though, I'd have to
> > track all the individual objects I added to make sure I didn't prune
> > something else.  Following that idea and a few different attempts
> > eventually led me to the discovery of tmp_objdir.
> >
> > In case it's not clear to you why just writing all the objects to the
> > normal store and leaving them there slows things down so much...
> >
> > Let's say 1 in 10 merges had originally needed some kind of conflict
> > resolution (either in the outer merge or in the inner merge for the
> > virtual merge bases), meaning that 9 out of 10 merges traversed by
> > --remerge-diff don't write any objects.  Now for each merge for which
> > --remerge-diff does need to create conflicted blobs and new trees,
> > let's say it writes on average 3 blobs and 7 trees.  (I don't know the
> > real average numbers, it could well be ~5 total, but ~10 seems a
> > realistic first order approximation and it makes the math easy.)
> > Then, if we keep all objects we write, then `git log --remerge-diff`
> > on a history with 100,000 merge commits, will have added 100,000 loose
> > objects by the time it finishes.  That means that all diff and merge
> > operations slow down considerably as it runs due to all the extra
> > loose objects.
>
> ...
>
> >> another object keyed on the git version and any revision paramaters
> >> etc., and then pretty much just had to do a "git cat-file -p <that-obj>"
> >> to present the result to the user :)
> >
> > So caching the full `git log ...` output, based on a hash of the
> > command line flags, and then merely re-showing it later?  And having
> > that output be invalidated as soon as any head advances?  Or are you
> > thinking of caching the output per-commit based on a hash of the other
> > command line flags...potentially slowing non-log operations down with
> > the huge number of loose objects?
>
> Yeah I meant caching the full 'git log' output. Anyway, I think what you
> said above about number of loose objects clearly makes that a stupid
> suggestion on my part.

Most my ideas for merge-ort were stupid.  I didn't know until I tried
them, and then I just discarded them and only showed off the good
ideas.  :-)

Brainstorming new ways of doing things is a useful exercise, I think.

> FWIW I meant that more as a "maybe in the future we can ..." musing than
> anything for this series. I.e. in general I've often wanted say the
> second run of a 'git log -G<rx>"' to be faster than it is, which could
> use such on-the-fly caching.
>
> But if we had such caching then something like --remerge-diff would need
> to (from my understanding of what you're saying) need both this
> temporary staging area for objects and perhaps to cache the output in
> the main (or some cache) store. I.e. they're orthagonal.
>
> >> I suppose that would be throwing a lot more work at an eventual "git gc"
> >> than we ever do now, so maybe it's a bit crazy, but I think it might be
> >> an interesting direction in general to (ab)use either the primary or
> >> some secondary store in the .git dir as a semi-permanent cache of
> >> resolved queries from the likes of "git log".
> >
> > If you do per-commit caching, and the user scrolls through enough
> > output (not hard to do, just searching the output for some string
> > often is enough), that "eventual" git-gc will be the very next git
> > operation.  If you cache the entire output, it'll be invalidated
> > pretty quickly.  So I don't see how this works.  Or am I
> > misunderstanding something you're suggesting here?
>
> With the caveat above that I think this is all a pretty stupid
> suggestion on my part:
>
> Just on the narrow aspect of how "git gc" behaves in that scenario: We'd
> keep those objects around for 2 weeks by default, or whatever you have
> gc.pruneExpire set to.
>
> So such a setup would certainly cause lots of pathological issues (see
> my [1] for one), but I don't think over-eager expiry of loose objects
> would be one.
>
> I'm not sure but are you referring to "invalidated pretty quickly"
> because it would go over the gc.auto limit? If so that's now how it
> works in this scenario, see [2]. I.e. it's moderated by the
> gc.pruneExpire setting.

By "invalidated pretty quickly" I meant that if the user ran "git log
--all" and you cached the entire output in a single cache file, then
as soon as any ref is updated, that cached output is invalid because
"git log --all" should now show different output than before.

It seems to be a case of pick your poison: either (a) cache output per
commit so that it can be re-used, and suffer from extreme numbers of
loose objects, OR (b) cache the entire output to avoid the large
number of loose objects, resulting in the cache only being useful if
the user passed the _exact_ same arguments and revisions to git-log --
and also having the cache become stale as soon as any involved ref is
updated.

>
> 1. https://lore.kernel.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/
> 2. https://lore.kernel.org/git/87inc89j38.fsf@evledraar.gmail.com/

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs
  2021-10-01  3:59             ` Elijah Newren
@ 2021-10-01 16:36               ` Junio C Hamano
  0 siblings, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2021-10-01 16:36 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Jeff King, Elijah Newren via GitGitGadget, Git Mailing List,
	Jonathan Nieder, Sergey Organov

Elijah Newren <newren@gmail.com> writes:

> On Thu, Sep 30, 2021 at 1:06 PM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Jeff King <peff@peff.net> writes:
>>
>> >   - is it possible for the merge code to ever write an object? I kind of
>> >     wonder if we'd ever do any cache-able transformations as part of a
>> >     content-level merge. I don't think we do now, though.
>>
>> How is virtual merge base, result of an inner merge that recursively
>> merging two merge bases, fed to an outer merge as the ancestor?
>> Isn't it written as a tree object by the inner merge as a tree with
>> full of blob objects, so the outer merge does not have to treat a
>> merge base tree that is virtual any specially from a sole merge
>> base?
>
> Yes.  And that's not unique to the inner merges; it also writes new
> blob and tree objects for the outer merge, and then passes the
> resulting toplevel tree id out as the result.

I think Peff's question was "other than what is contained in the
merge result, are there objects we create during the merge process
that we may want to keep?"  And my inner merge example was objects
that do not appear in the merge result, but they are discardable,
so it wasn't a good example.

A tentative merge to recreate what the manual resolver would have
seen while showing remerge-diff needs to be recorded somewhere
before the diff gets shown, but after showing the diff, we can lose
them unless we need them for some other reason (e.g. they already
existed, another process happened to have created the same blob
while we are running remerge-diff, etc.).  So in that sense, the
objects that represent the result of the outermost merge is also
discardable.

The question is if there are other things similar to the textconv
cache, where we need to create during the operation and may keep the
result for later use.  The closest thing I can think of is the
rename cache we've talked about from time to time without actually
adding one ;-).

I guess using cached textconv result, without putting newly
discovered textconv result back to the cache, would be the best we
can do?

The blobs (hence their textconv results) involved in showing a merge
M with --remerge-diff are the ones in the trees of M^@ (all parents)
and $(git merge-base $(git rev-parse M^@)), and they may have been
created when running "git show" on or when letting "git log -p"
pass, these commits, so it is not like we are disabling the cache
entirely, and allowing read-only access may still have value.

Thanks.


^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2021-10-01 16:36 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-31  2:26 [PATCH 0/7] Add a new --remerge-diff capability to show & log Elijah Newren via GitGitGadget
2021-08-31  2:26 ` [PATCH 1/7] merge-ort: mark a few more conflict messages as omittable Elijah Newren via GitGitGadget
2021-08-31 21:06   ` Junio C Hamano
2021-09-01  0:03     ` Elijah Newren
2021-09-01 17:19       ` Junio C Hamano
2021-08-31  2:26 ` [PATCH 2/7] merge-ort: add ability to record conflict messages in a file Elijah Newren via GitGitGadget
2021-09-28 22:29   ` Jeff King
2021-09-29  6:25     ` Elijah Newren
2021-09-29 16:14       ` Junio C Hamano
2021-09-29 16:31         ` Elijah Newren
2021-09-30  7:58       ` Jeff King
2021-09-30  8:09         ` Ævar Arnfjörð Bjarmason
2021-10-01  2:07         ` Elijah Newren
2021-10-01  5:28           ` Jeff King
2021-08-31  2:26 ` [PATCH 3/7] ll-merge: add API for capturing warnings in a strbuf instead of stderr Elijah Newren via GitGitGadget
2021-09-28 22:37   ` Jeff King
2021-09-28 23:49     ` Junio C Hamano
2021-09-29  4:03     ` Elijah Newren
2021-08-31  2:26 ` [PATCH 4/7] merge-ort: capture and print ll-merge warnings in our preferred fashion Elijah Newren via GitGitGadget
2021-09-28 22:39   ` Jeff King
2021-09-30 16:53   ` Ævar Arnfjörð Bjarmason
2021-10-01  1:54     ` Elijah Newren
2021-10-01  7:23       ` Ævar Arnfjörð Bjarmason
2021-08-31  2:26 ` [PATCH 5/7] tmp-objdir: new API for creating and removing primary object dirs Elijah Newren via GitGitGadget
2021-09-28  7:55   ` Ævar Arnfjörð Bjarmason
2021-09-29  4:22     ` Elijah Newren
2021-09-30  7:41       ` Jeff King
2021-09-30 14:17       ` Ævar Arnfjörð Bjarmason
2021-10-01  3:55         ` Elijah Newren
2021-09-28 23:17   ` Jeff King
2021-09-29  4:08     ` Junio C Hamano
2021-09-30  7:33       ` Jeff King
2021-09-30 13:16         ` Ævar Arnfjörð Bjarmason
2021-09-30 21:00           ` Jeff King
2021-10-01  3:11           ` Elijah Newren
2021-10-01  7:30             ` Ævar Arnfjörð Bjarmason
2021-10-01  8:03               ` Elijah Newren
2021-10-01  4:26         ` Elijah Newren
2021-10-01  5:27           ` Jeff King
2021-10-01  7:43             ` Ævar Arnfjörð Bjarmason
2021-09-29  5:05     ` Elijah Newren
2021-09-30  7:26       ` Jeff King
2021-09-30  7:46         ` Jeff King
2021-09-30 20:06           ` Junio C Hamano
2021-10-01  3:59             ` Elijah Newren
2021-10-01 16:36               ` Junio C Hamano
2021-10-01  2:31           ` Elijah Newren
2021-10-01  5:21             ` Jeff King
2021-09-30 19:25         ` Neeraj Singh
2021-09-30 20:19           ` Junio C Hamano
2021-09-30 20:00         ` Junio C Hamano
2021-10-01  2:23         ` Elijah Newren
2021-08-31  2:26 ` [PATCH 6/7] show, log: provide a --remerge-diff capability Elijah Newren via GitGitGadget
2021-08-31  9:19   ` Sergey Organov
2021-09-28  8:01   ` Ævar Arnfjörð Bjarmason
2021-09-29  4:00     ` Elijah Newren
2021-08-31  2:26 ` [PATCH 7/7] doc/diff-options: explain the new --remerge-diff option Elijah Newren via GitGitGadget
2021-08-31 11:05 ` [PATCH 0/7] Add a new --remerge-diff capability to show & log Bagas Sanjaya
2021-08-31 16:16   ` Elijah Newren
2021-08-31 20:03 ` Junio C Hamano
2021-08-31 20:23   ` Elijah Newren
2021-09-01 21:07 ` Junio C Hamano
2021-09-01 21:42   ` Elijah Newren
2021-09-01 21:55     ` Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.