git.vger.kernel.org archive mirror
* [PATCH v2 00/20] fundamentals of merge-ort implementation
@ 2020-11-02 20:43 Elijah Newren
  2020-11-02 20:43 ` [PATCH v2 01/20] merge-ort: setup basic internal data structures Elijah Newren
                   ` (21 more replies)
  0 siblings, 22 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

This series depends on a merge of en/strmap (after updating to v3) and
en/merge-ort-api-null-impl.

As promised, here's the update of the series due to the strmap
updates...and two other tiny updates.

Changes since v1:
  * updates needed based on changes made in v3 of the strmap series
  * fixed a typo in a comment
  * tiny tweak to move a strmap_put() into setup_paths()

Elijah Newren (20):
  merge-ort: setup basic internal data structures
  merge-ort: add some high-level algorithm structure
  merge-ort: port merge_start() from merge-recursive
  merge-ort: use histogram diff
  merge-ort: add an err() function similar to one from merge-recursive
  merge-ort: implement a very basic collect_merge_info()
  merge-ort: avoid repeating fill_tree_descriptor() on the same tree
  merge-ort: compute a few more useful fields for collect_merge_info
  merge-ort: record stage and auxiliary info for every path
  merge-ort: avoid recursing into identical trees
  merge-ort: add a preliminary simple process_entries() implementation
  merge-ort: have process_entries operate in a defined order
  merge-ort: step 1 of tree writing -- record basenames, modes, and oids
  merge-ort: step 2 of tree writing -- function to create tree object
  merge-ort: step 3 of tree writing -- handling subdirectories as we go
  merge-ort: basic outline for merge_switch_to_result()
  merge-ort: add implementation of checkout()
  tree: enable cmp_cache_name_compare() to be used elsewhere
  merge-ort: add implementation of record_unmerged_index_entries()
  merge-ort: free data structures in merge_finalize()

 merge-ort.c | 929 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 tree.c      |   2 +-
 tree.h      |   2 +
 3 files changed, 929 insertions(+), 4 deletions(-)

-- 
2.29.0.471.ga4f56089c0



* [PATCH v2 01/20] merge-ort: setup basic internal data structures
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-06 22:05   ` Jonathan Tan
  2020-11-02 20:43 ` [PATCH v2 02/20] merge-ort: add some high-level algorithm structure Elijah Newren
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

Set up some basic internal data structures.  The only carry-over from
merge-recursive.c is call_depth, though needed_rename_limit will be
added later.

The central piece of data will definitely be the strmap "paths", which
will map every relevant pathname under consideration to either a
merged_info or a conflict_info.  ("unmerged" is a strmap that is a
subset of "paths".)

merged_info contains all relevant information for a non-conflicted
entry.  conflict_info contains a merged_info, plus any additional
information about a conflict such as the higher order stages involved
and the names of the paths those came from (handy once renames get
involved).  If an entry remains conflicted, the merged_info portion of a
conflict_info will later be filled with whatever version of the file
should be placed in the working directory (e.g. an as-merged-as-possible
variation that contains conflict markers).
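
For illustration only (this is not code from the patch): a minimal
standalone sketch of the relationship described above.  Because a
conflict_info begins with a merged_info, a value pulled out of "paths"
can be inspected through a merged_info pointer first and treated as a
conflict_info only when its clean bit is unset.  The toy_* names and
the trimmed-down fields are assumptions made for this example.

```c
#include <assert.h>

/* Trimmed-down stand-ins for the structs in this patch (hypothetical). */
struct toy_merged_info {
	unsigned short mode;
	unsigned clean:1;
};

struct toy_conflict_info {
	struct toy_merged_info merged; /* must stay the first member */
	unsigned short stage_modes[3];
	unsigned filemask:3;
};

/*
 * Return how many sides are recorded in filemask for a conflicted
 * entry, or 0 for a clean entry.  A clean entry is only a bare
 * toy_merged_info, so none of the conflict-only fields may be read.
 */
static int toy_conflicting_sides(void *entry)
{
	struct toy_merged_info *mi = entry;
	struct toy_conflict_info *ci;
	int i, count = 0;

	if (mi->clean)
		return 0;

	/* A pointer to a struct also points at its first member. */
	ci = entry;
	assert((void *)&ci->merged == (void *)ci);
	for (i = 0; i < 3; i++)
		if (ci->filemask & (1u << i))
			count++;
	return count;
}
```

The point of the layout is that code which only needs the merged
result never has to know which of the two structs it was handed.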

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index b487901d3e..9d5ea0930d 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -17,6 +17,46 @@
 #include "cache.h"
 #include "merge-ort.h"
 
+#include "strmap.h"
+
+struct merge_options_internal {
+	struct strmap paths;    /* maps path -> (merged|conflict)_info */
+	struct strmap unmerged; /* maps path -> conflict_info */
+	const char *current_dir_name;
+	int call_depth;
+};
+
+struct version_info {
+	struct object_id oid;
+	unsigned short mode;
+};
+
+struct merged_info {
+	struct version_info result;
+	unsigned is_null:1;
+	unsigned clean:1;
+	size_t basename_offset;
+	 /*
+	  * Containing directory name.  Note that we assume directory_name is
+	  * constructed such that
+	  *    strcmp(dir1_name, dir2_name) == 0 iff dir1_name == dir2_name,
+	  * i.e. string equality is equivalent to pointer equality.  For this
+	  * to hold, we have to be careful setting directory_name.
+	  */
+	const char *directory_name;
+};
+
+struct conflict_info {
+	struct merged_info merged;
+	struct version_info stages[3];
+	const char *pathnames[3];
+	unsigned df_conflict:1;
+	unsigned path_conflict:1;
+	unsigned filemask:3;
+	unsigned dirmask:3;
+	unsigned match_mask:3;
+};
+
 void merge_switch_to_result(struct merge_options *opt,
 			    struct tree *head,
 			    struct merge_result *result,
-- 
2.29.0.471.ga4f56089c0



* [PATCH v2 02/20] merge-ort: add some high-level algorithm structure
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
  2020-11-02 20:43 ` [PATCH v2 01/20] merge-ort: setup basic internal data structures Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-02 20:43 ` [PATCH v2 03/20] merge-ort: port merge_start() from merge-recursive Elijah Newren
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

merge_ort_nonrecursive_internal() will be used by both
merge_incore_nonrecursive() and merge_incore_recursive(); let's
focus on it for now.  It involves some setup -- merge_start() --
followed by this chain of functions:

  collect_merge_info()
    This function will populate merge_options_internal's paths field,
    via a call to traverse_trees() and a new callback that will be added
    later.

  detect_and_process_renames()
    This function will detect renames, and then adjust entries in paths
    to move conflict stages from old pathnames into those for new
    pathnames, so that the next step doesn't have to think about renames
    and can just do three-way content merging and such.

  process_entries()
    This function determines how to take the various stages (versions of
    a file from the three different sides) and merge them, and whether
    to mark the result as conflicted or cleanly merged.  It also writes
    out these merged file versions as it goes to create a tree.
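
As a rough sketch of how the three phases chain together and how the
final "clean" status is derived, consider the toy driver below.  It is
not the code from this patch: the toy_* names and the integer counters
are hypothetical stand-ins for the real strmaps, and the early return
on a failed collect phase is something this patch does not do yet.

```c
#include <assert.h>

/* Hypothetical state shared by the phases; not git's real structures. */
struct toy_merge_state {
	int paths_collected; /* filled in by the collect phase */
	int unmerged_count;  /* filled in by the process phase */
};

/* Phase 1: gather every relevant path; nonzero return means failure. */
static int toy_collect(struct toy_merge_state *st, int npaths)
{
	if (npaths < 0)
		return -1;
	st->paths_collected = npaths;
	return 0;
}

/* Phase 2: rename handling; nothing is detected, so always clean (1). */
static int toy_renames(struct toy_merge_state *st)
{
	(void)st;
	return 1;
}

/* Phase 3: resolve entries, recording how many stay conflicted. */
static void toy_process(struct toy_merge_state *st, int conflicts)
{
	if (conflicts > st->paths_collected)
		conflicts = st->paths_collected;
	st->unmerged_count = conflicts;
}

/* Returns 1 if clean, 0 if conflicted, -1 on error. */
static int toy_merge(int npaths, int conflicts)
{
	struct toy_merge_state st = { 0, 0 };
	int clean;

	if (toy_collect(&st, npaths) != 0)
		return -1;
	clean = toy_renames(&st);
	toy_process(&st, conflicts);
	/* existence of unmerged entries implies unclean */
	clean &= (st.unmerged_count == 0);
	return clean;
}
```

Keeping rename handling as its own phase is what lets the last phase
look only at per-path stages.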

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 66 insertions(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index 9d5ea0930d..b53cd80104 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -18,6 +18,7 @@
 #include "merge-ort.h"
 
 #include "strmap.h"
+#include "tree.h"
 
 struct merge_options_internal {
 	struct strmap paths;    /* maps path -> (merged|conflict)_info */
@@ -57,6 +58,37 @@ struct conflict_info {
 	unsigned match_mask:3;
 };
 
+static int collect_merge_info(struct merge_options *opt,
+			      struct tree *merge_base,
+			      struct tree *side1,
+			      struct tree *side2)
+{
+	die("Not yet implemented.");
+}
+
+static int detect_and_process_renames(struct merge_options *opt,
+				      struct tree *merge_base,
+				      struct tree *side1,
+				      struct tree *side2)
+{
+	int clean = 1;
+
+	/*
+	 * Rename detection works by detecting file similarity.  Here we use
+	 * a really easy-to-implement scheme: files are similar IFF they have
+	 * the same filename.  Therefore, by this scheme, there are no renames.
+	 *
+	 * TODO: Actually implement a real rename detection scheme.
+	 */
+	return clean;
+}
+
+static void process_entries(struct merge_options *opt,
+			    struct object_id *result_oid)
+{
+	die("Not yet implemented.");
+}
+
 void merge_switch_to_result(struct merge_options *opt,
 			    struct tree *head,
 			    struct merge_result *result,
@@ -73,13 +105,46 @@ void merge_finalize(struct merge_options *opt,
 	die("Not yet implemented");
 }
 
+static void merge_start(struct merge_options *opt, struct merge_result *result)
+{
+	die("Not yet implemented.");
+}
+
+/*
+ * Originally from merge_trees_internal(); heavily adapted, though.
+ */
+static void merge_ort_nonrecursive_internal(struct merge_options *opt,
+					    struct tree *merge_base,
+					    struct tree *side1,
+					    struct tree *side2,
+					    struct merge_result *result)
+{
+	struct object_id working_tree_oid;
+
+	collect_merge_info(opt, merge_base, side1, side2);
+	result->clean = detect_and_process_renames(opt, merge_base,
+						   side1, side2);
+	process_entries(opt, &working_tree_oid);
+
+	/* Set return values */
+	result->tree = parse_tree_indirect(&working_tree_oid);
+	/* existence of unmerged entries implies unclean */
+	result->clean &= strmap_empty(&opt->priv->unmerged);
+	if (!opt->priv->call_depth) {
+		result->priv = opt->priv;
+		opt->priv = NULL;
+	}
+}
+
 void merge_incore_nonrecursive(struct merge_options *opt,
 			       struct tree *merge_base,
 			       struct tree *side1,
 			       struct tree *side2,
 			       struct merge_result *result)
 {
-	die("Not yet implemented");
+	assert(opt->ancestor != NULL);
+	merge_start(opt, result);
+	merge_ort_nonrecursive_internal(opt, merge_base, side1, side2, result);
 }
 
 void merge_incore_recursive(struct merge_options *opt,
-- 
2.29.0.471.ga4f56089c0



* [PATCH v2 03/20] merge-ort: port merge_start() from merge-recursive
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
  2020-11-02 20:43 ` [PATCH v2 01/20] merge-ort: setup basic internal data structures Elijah Newren
  2020-11-02 20:43 ` [PATCH v2 02/20] merge-ort: add some high-level algorithm structure Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-11 13:52   ` Derrick Stolee
  2020-11-02 20:43 ` [PATCH v2 04/20] merge-ort: use histogram diff Elijah Newren
                   ` (18 subsequent siblings)
  21 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

merge_start() basically does a bunch of sanity checks, then allocates
and initializes opt->priv -- a struct merge_options_internal.

Most of the sanity checks are usable as-is.  The allocation/initialization
is a bit different since merge-ort has a very different
merge_options_internal than merge-recursive, but the idea is the same.

The weirdest part here is that merge-ort and merge-recursive use the
same struct merge_options, even though merge_options has a number of
fields that are oddly specific to merge-recursive's internal
implementation and don't even make sense with merge-ort's high-level
design (e.g. buffer_output, which merge-ort has to always do).  I reused
the same data structure because:
  * most of the fields made sense for both merge algorithms
  * making a new struct would have required making new enums or somehow
    externalizing them, and that was getting messy.
  * it simplifies converting the existing callers by not having to
    have different code paths for merge_options setup.

I also marked detect_renames as ignored.  We can revisit that later, but
in short: merge-recursive allowed turning off rename detection because
it was sometimes glacially slow.  When you speed something up by a few
orders of magnitude, it's worth revisiting whether that justification is
still relevant.  Besides, if folks find it's still too slow, perhaps
they have a better scaling case than I could find and maybe it turns up
some more optimizations we can add.  If it still is needed as an option,
it is easy to add later.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index b53cd80104..f5460a8a52 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -17,6 +17,8 @@
 #include "cache.h"
 #include "merge-ort.h"
 
+#include "diff.h"
+#include "diffcore.h"
 #include "strmap.h"
 #include "tree.h"
 
@@ -107,7 +109,47 @@ void merge_finalize(struct merge_options *opt,
 
 static void merge_start(struct merge_options *opt, struct merge_result *result)
 {
-	die("Not yet implemented.");
+	/* Sanity checks on opt */
+	assert(opt->repo);
+
+	assert(opt->branch1 && opt->branch2);
+
+	assert(opt->detect_directory_renames >= MERGE_DIRECTORY_RENAMES_NONE &&
+	       opt->detect_directory_renames <= MERGE_DIRECTORY_RENAMES_TRUE);
+	assert(opt->rename_limit >= -1);
+	assert(opt->rename_score >= 0 && opt->rename_score <= MAX_SCORE);
+	assert(opt->show_rename_progress >= 0 && opt->show_rename_progress <= 1);
+
+	assert(opt->xdl_opts >= 0);
+	assert(opt->recursive_variant >= MERGE_VARIANT_NORMAL &&
+	       opt->recursive_variant <= MERGE_VARIANT_THEIRS);
+
+	/*
+	 * detect_renames, verbosity, buffer_output, and obuf are ignored
+	 * fields that were used by "recursive" rather than "ort" -- but
+	 * sanity check them anyway.
+	 */
+	assert(opt->detect_renames >= -1 &&
+	       opt->detect_renames <= DIFF_DETECT_COPY);
+	assert(opt->verbosity >= 0 && opt->verbosity <= 5);
+	assert(opt->buffer_output <= 2);
+	assert(opt->obuf.len == 0);
+
+	assert(opt->priv == NULL);
+
+	/* Initialization of opt->priv, our internal merge data */
+	opt->priv = xcalloc(1, sizeof(*opt->priv));
+	/*
+	 * Although we initialize opt->priv->paths with strdup_strings=0,
+	 * that's just to avoid making yet another copy of an allocated
+	 * string.  Putting the entry into paths means we are taking
+	 * ownership, so we will later free it.
+	 *
+	 * In contrast, unmerged just has a subset of keys from paths, so
+	 * we don't want to free those (it'd be a duplicate free).
+	 */
+	strmap_init_with_options(&opt->priv->paths, NULL, 0);
+	strmap_init_with_options(&opt->priv->unmerged, NULL, 0);
 }
 
 /*
-- 
2.29.0.471.ga4f56089c0



* [PATCH v2 04/20] merge-ort: use histogram diff
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (2 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 03/20] merge-ort: port merge_start() from merge-recursive Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-11 13:54   ` Derrick Stolee
  2020-11-02 20:43 ` [PATCH v2 05/20] merge-ort: add an err() function similar to one from merge-recursive Elijah Newren
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

I have some ideas for using a histogram diff to improve content merges,
which fundamentally rely on the idea of a histogram.  Since the diffs
are never displayed to the user but just used internally for merging,
the typical user preference shouldn't matter anyway, and I want to make
sure that all my testing works with this algorithm.

Granted, I don't yet know if those ideas will pan out and I haven't even
tried any of them out yet, but it's easy to change the diff algorithm in
the future if needed or wanted.  For now, just set it to histogram.
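
A small sketch of what hardcoding the algorithm amounts to, on the
assumption that DIFF_WITH_ALG() works by clearing the
algorithm-selection bits in xdl_opts and setting the requested one
while leaving unrelated flags alone.  The TOY_* names and bit
positions below are made up for the example; the real definitions live
in git's diff and xdiff headers.

```c
#include <assert.h>

/* Hypothetical bit assignments, chosen only for this example. */
#define TOY_IGNORE_WHITESPACE (1u << 1)
#define TOY_PATIENCE_DIFF     (1u << 14)
#define TOY_HISTOGRAM_DIFF    (1u << 15)
#define TOY_ALGORITHM_MASK    (TOY_PATIENCE_DIFF | TOY_HISTOGRAM_DIFF)

/* Replace whichever algorithm bits are set, leaving other flags alone. */
static unsigned toy_with_algorithm(unsigned xdl_opts, unsigned algorithm)
{
	assert((algorithm & ~TOY_ALGORITHM_MASK) == 0);
	return (xdl_opts & ~TOY_ALGORITHM_MASK) | algorithm;
}
```

Under that assumption, a caller's whitespace options would survive the
override while any previously selected algorithm would not.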

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index f5460a8a52..df97a54773 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -21,6 +21,7 @@
 #include "diffcore.h"
 #include "strmap.h"
 #include "tree.h"
+#include "xdiff-interface.h"
 
 struct merge_options_internal {
 	struct strmap paths;    /* maps path -> (merged|conflict)_info */
@@ -137,6 +138,9 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 
 	assert(opt->priv == NULL);
 
+	/* Default to histogram diff.  Actually, just hardcode it...for now. */
+	opt->xdl_opts = DIFF_WITH_ALG(opt, HISTOGRAM_DIFF);
+
 	/* Initialization of opt->priv, our internal merge data */
 	opt->priv = xcalloc(1, sizeof(*opt->priv));
 	/*
-- 
2.29.0.471.ga4f56089c0



* [PATCH v2 05/20] merge-ort: add an err() function similar to one from merge-recursive
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (3 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 04/20] merge-ort: use histogram diff Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-11 13:58   ` Derrick Stolee
  2020-11-02 20:43 ` [PATCH v2 06/20] merge-ort: implement a very basic collect_merge_info() Elijah Newren
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

Various places in merge-recursive used an err() function when they hit
some kind of unrecoverable error.  That code was from the reusable bits
of merge-recursive.c that we liked, such as merge_3way, writing object
files to the object store, reading blobs from the object store, etc.  So
create a similar function to allow us to port that code over, and use it
for when we detect problems returned from collect_merge_info()'s
traverse_trees() call, which we will be adding next.
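
The calling pattern this enables can be sketched in standalone C as
follows.  This is only an illustration: toy_err() and
toy_check_tree_count() are hypothetical, and the sketch formats into a
caller-supplied buffer rather than reporting through git's error().

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a message into buf and return -1, so callers can write
 * "return toy_err(...)" on failure paths.
 */
static int toy_err(char *buf, size_t buflen, const char *fmt, ...)
{
	va_list params;

	va_start(params, fmt);
	vsnprintf(buf, buflen, fmt, params);
	va_end(params);

	return -1;
}

/* Example caller: hand the failure code straight back to our caller. */
static int toy_check_tree_count(char *buf, size_t buflen, int n)
{
	if (n != 3)
		return toy_err(buf, buflen, "expected 3 trees, got %d", n);
	buf[0] = '\0';
	return 0;
}
```

Returning a fixed negative value from the reporting helper keeps each
failure path to a single statement.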

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index df97a54773..537da9f6df 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -61,11 +61,28 @@ struct conflict_info {
 	unsigned match_mask:3;
 };
 
+static int err(struct merge_options *opt, const char *err, ...)
+{
+	va_list params;
+	struct strbuf sb = STRBUF_INIT;
+
+	strbuf_addstr(&sb, "error: ");
+	va_start(params, err);
+	strbuf_vaddf(&sb, err, params);
+	va_end(params);
+
+	error("%s", sb.buf);
+	strbuf_release(&sb);
+
+	return -1;
+}
+
 static int collect_merge_info(struct merge_options *opt,
 			      struct tree *merge_base,
 			      struct tree *side1,
 			      struct tree *side2)
 {
+	/* TODO: Implement this using traverse_trees() */
 	die("Not yet implemented.");
 }
 
@@ -167,7 +184,15 @@ static void merge_ort_nonrecursive_internal(struct merge_options *opt,
 {
 	struct object_id working_tree_oid;
 
-	collect_merge_info(opt, merge_base, side1, side2);
+	if (collect_merge_info(opt, merge_base, side1, side2) != 0) {
+		err(opt, _("collecting merge info failed for trees %s, %s, %s"),
+		    oid_to_hex(&merge_base->object.oid),
+		    oid_to_hex(&side1->object.oid),
+		    oid_to_hex(&side2->object.oid));
+		result->clean = -1;
+		return;
+	}
+
 	result->clean = detect_and_process_renames(opt, merge_base,
 						   side1, side2);
 	process_entries(opt, &working_tree_oid);
-- 
2.29.0.471.ga4f56089c0



* [PATCH v2 06/20] merge-ort: implement a very basic collect_merge_info()
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (4 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 05/20] merge-ort: add an err() function similar to one from merge-recursive Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-06 22:19   ` Jonathan Tan
  2020-11-11 14:38   ` Derrick Stolee
  2020-11-02 20:43 ` [PATCH v2 07/20] merge-ort: avoid repeating fill_tree_descriptor() on the same tree Elijah Newren
                   ` (15 subsequent siblings)
  21 siblings, 2 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

This does not actually collect any necessary info other than the
pathnames involved, since it just allocates an all-zero conflict_info
and stuffs that into paths.  However, it invokes the traverse_trees()
machinery to walk over all the paths and sets up the basic
infrastructure we need.

I have left out a few obvious optimizations to try to make this patch as
short and obvious as possible.  A subsequent patch will add some of
those back in with some more useful data fields before we introduce a
patch that actually sets up the conflict_info fields.
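
To make the mask conventions used by the callback in the diff below
easier to follow, here is a standalone sketch of the same bit
arithmetic.  The struct and function names are hypothetical; only the
meaning of the bits (1 = merge base, 2 = side 1, 4 = side 2) is taken
from the patch.

```c
#include <assert.h>

/* Bit i of a mask refers to tree i: 0 = merge base, 1 = side 1, 2 = side 2. */
struct toy_path_masks {
	unsigned filemask;   /* trees where the path is a non-directory */
	unsigned mbase_null; /* path absent from the merge base */
	unsigned side1_null; /* path absent from side 1 */
	unsigned side2_null; /* path absent from side 2 */
};

static struct toy_path_masks toy_decode_masks(unsigned long mask,
					      unsigned long dirmask)
{
	struct toy_path_masks m;

	/* The path exists somewhere, and every directory bit is in mask. */
	assert(mask > 0 && mask < 8);
	assert((dirmask & ~mask) == 0);

	m.filemask = mask & ~dirmask;
	m.mbase_null = !(mask & 1);
	m.side1_null = !(mask & 2);
	m.side2_null = !(mask & 4);
	return m;
}
```

For example, mask 6 with dirmask 4 describes a path that is missing
from the merge base, is a file on side 1, and is a directory on side 2.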

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 121 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 119 insertions(+), 2 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index 537da9f6df..626eb9713e 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -77,13 +77,130 @@ static int err(struct merge_options *opt, const char *err, ...)
 	return -1;
 }
 
+static int collect_merge_info_callback(int n,
+				       unsigned long mask,
+				       unsigned long dirmask,
+				       struct name_entry *names,
+				       struct traverse_info *info)
+{
+	/*
+	 * n is 3.  Always.
+	 * common ancestor (mbase) has mask 1, and stored in index 0 of names
+	 * head of side 1  (side1) has mask 2, and stored in index 1 of names
+	 * head of side 2  (side2) has mask 4, and stored in index 2 of names
+	 */
+	struct merge_options *opt = info->data;
+	struct merge_options_internal *opti = opt->priv;
+	struct conflict_info *ci;
+	struct name_entry *p;
+	size_t len;
+	char *fullpath;
+	unsigned filemask = mask & ~dirmask;
+	unsigned mbase_null = !(mask & 1);
+	unsigned side1_null = !(mask & 2);
+	unsigned side2_null = !(mask & 4);
+
+	/* n = 3 is a fundamental assumption. */
+	if (n != 3)
+		BUG("Called collect_merge_info_callback wrong");
+
+	/*
+	 * A bunch of sanity checks verifying that traverse_trees() calls
+	 * us the way I expect.  Could just remove these at some point,
+	 * though maybe they are helpful to future code readers.
+	 */
+	assert(mbase_null == is_null_oid(&names[0].oid));
+	assert(side1_null == is_null_oid(&names[1].oid));
+	assert(side2_null == is_null_oid(&names[2].oid));
+	assert(!mbase_null || !side1_null || !side2_null);
+	assert(mask > 0 && mask < 8);
+
+	/* Other invariant checks, mostly for documentation purposes. */
+	assert(mask == (dirmask | filemask));
+
+	/*
+	 * Get the name of the relevant filepath, which we'll pass to
+	 * setup_path_info() for tracking.
+	 */
+	p = names;
+	while (!p->mode)
+		p++;
+	len = traverse_path_len(info, p->pathlen);
+
+	/* +1 in both of the following lines to include the NUL byte */
+	fullpath = xmalloc(len+1);
+	make_traverse_path(fullpath, len+1, info, p->path, p->pathlen);
+
+	/*
+	 * TODO: record information about the path other than all zeros,
+	 * so we can resolve later in process_entries.
+	 */
+	ci = xcalloc(1, sizeof(struct conflict_info));
+	strmap_put(&opti->paths, fullpath, ci);
+
+	/* If dirmask, recurse into subdirectories */
+	if (dirmask) {
+		struct traverse_info newinfo;
+		struct tree_desc t[3];
+		void *buf[3] = {NULL,};
+		const char *original_dir_name;
+		int i, ret;
+
+		ci->match_mask &= filemask;
+		newinfo = *info;
+		newinfo.prev = info;
+		newinfo.name = p->path;
+		newinfo.namelen = p->pathlen;
+		newinfo.pathlen = st_add3(newinfo.pathlen, p->pathlen, 1);
+
+		for (i = 0; i < 3; i++, dirmask >>= 1) {
+			const struct object_id *oid = NULL;
+			if (dirmask & 1)
+				oid = &names[i].oid;
+			buf[i] = fill_tree_descriptor(opt->repo, t + i, oid);
+		}
+
+		original_dir_name = opti->current_dir_name;
+		opti->current_dir_name = fullpath;
+		ret = traverse_trees(NULL, 3, t, &newinfo);
+		opti->current_dir_name = original_dir_name;
+
+		for (i = 0; i < 3; i++)
+			free(buf[i]);
+
+		if (ret < 0)
+			return -1;
+	}
+
+	return mask;
+}
+
 static int collect_merge_info(struct merge_options *opt,
 			      struct tree *merge_base,
 			      struct tree *side1,
 			      struct tree *side2)
 {
-	/* TODO: Implement this using traverse_trees() */
-	die("Not yet implemented.");
+	int ret;
+	struct tree_desc t[3];
+	struct traverse_info info;
+	char *toplevel_dir_placeholder = "";
+
+	opt->priv->current_dir_name = toplevel_dir_placeholder;
+	setup_traverse_info(&info, toplevel_dir_placeholder);
+	info.fn = collect_merge_info_callback;
+	info.data = opt;
+	info.show_all_errors = 1;
+
+	parse_tree(merge_base);
+	parse_tree(side1);
+	parse_tree(side2);
+	init_tree_desc(t+0, merge_base->buffer, merge_base->size);
+	init_tree_desc(t+1, side1->buffer, side1->size);
+	init_tree_desc(t+2, side2->buffer, side2->size);
+
+	ret = traverse_trees(NULL, 3, t, &info);
+
+	return ret;
 }
 
 static int detect_and_process_renames(struct merge_options *opt,
-- 
2.29.0.471.ga4f56089c0



* [PATCH v2 07/20] merge-ort: avoid repeating fill_tree_descriptor() on the same tree
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (5 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 06/20] merge-ort: implement a very basic collect_merge_info() Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-11 14:51   ` Derrick Stolee
  2020-11-02 20:43 ` [PATCH v2 08/20] merge-ort: compute a few more useful fields for collect_merge_info Elijah Newren
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

Three-way merges, by their nature, are going to often have two or more
trees match at a given subdirectory.  We can avoid calling
fill_tree_descriptor() on the same tree by checking when these trees
match.  Noting when various oids match will also be useful in other
calculations and optimizations.
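
The idea can be sketched outside of git as follows.  This is a
simplified illustration, not the code in the diff: the toy_* types are
hypothetical, toy_load_tree() stands in for the expensive read, and the
sketch compares ids only, whereas the patch also compares modes and
skips missing entries.

```c
#include <assert.h>
#include <string.h>

#define TOY_ID_LEN 20

struct toy_id {
	unsigned char hash[TOY_ID_LEN];
};

/* Hypothetical loader state: counts how many trees were actually read. */
struct toy_loader {
	int loads;
};

/* Stand-in for an expensive read of a tree object. */
static int toy_load_tree(struct toy_loader *ld, const struct toy_id *id)
{
	ld->loads++;
	return id->hash[0]; /* pretend the first byte is the parsed content */
}

/*
 * Fill out[0..2] for merge base, side 1, and side 2, copying an earlier
 * slot whenever the ids match instead of loading the same tree again.
 */
static void toy_fill_descriptors(struct toy_loader *ld,
				 const struct toy_id ids[3],
				 int out[3])
{
	int i, j;

	for (i = 0; i < 3; i++) {
		for (j = 0; j < i; j++)
			if (!memcmp(ids[i].hash, ids[j].hash, TOY_ID_LEN))
				break;
		if (j < i)
			out[i] = out[j];
		else
			out[i] = toy_load_tree(ld, &ids[i]);
	}
}
```

When side 1 is unchanged relative to the merge base, only two loads
happen instead of three.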

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index 626eb9713e..d3c1d00fc6 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -99,6 +99,15 @@ static int collect_merge_info_callback(int n,
 	unsigned mbase_null = !(mask & 1);
 	unsigned side1_null = !(mask & 2);
 	unsigned side2_null = !(mask & 4);
+	unsigned side1_matches_mbase = (!side1_null && !mbase_null &&
+					names[0].mode == names[1].mode &&
+					oideq(&names[0].oid, &names[1].oid));
+	unsigned side2_matches_mbase = (!side2_null && !mbase_null &&
+					names[0].mode == names[2].mode &&
+					oideq(&names[0].oid, &names[2].oid));
+	unsigned sides_match = (!side1_null && !side2_null &&
+				names[1].mode == names[2].mode &&
+				oideq(&names[1].oid, &names[2].oid));
 
 	/* n = 3 is a fundamental assumption. */
 	if (n != 3)
@@ -154,10 +163,19 @@ static int collect_merge_info_callback(int n,
 		newinfo.pathlen = st_add3(newinfo.pathlen, p->pathlen, 1);
 
 		for (i = 0; i < 3; i++, dirmask >>= 1) {
-			const struct object_id *oid = NULL;
-			if (dirmask & 1)
-				oid = &names[i].oid;
-			buf[i] = fill_tree_descriptor(opt->repo, t + i, oid);
+			if (i == 1 && side1_matches_mbase)
+				t[1] = t[0];
+			else if (i == 2 && side2_matches_mbase)
+				t[2] = t[0];
+			else if (i == 2 && sides_match)
+				t[2] = t[1];
+			else {
+				const struct object_id *oid = NULL;
+				if (dirmask & 1)
+					oid = &names[i].oid;
+				buf[i] = fill_tree_descriptor(opt->repo,
+							      t + i, oid);
+			}
 		}
 
 		original_dir_name = opti->current_dir_name;
-- 
2.29.0.471.ga4f56089c0



* [PATCH v2 08/20] merge-ort: compute a few more useful fields for collect_merge_info
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (6 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 07/20] merge-ort: avoid repeating fill_tree_descriptor() on the same tree Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-06 22:52   ` Jonathan Tan
  2020-11-02 20:43 ` [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path Elijah Newren
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index d3c1d00fc6..0ff90981cf 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -96,6 +96,7 @@ static int collect_merge_info_callback(int n,
 	size_t len;
 	char *fullpath;
 	unsigned filemask = mask & ~dirmask;
+	unsigned match_mask = 0; /* will be updated below */
 	unsigned mbase_null = !(mask & 1);
 	unsigned side1_null = !(mask & 2);
 	unsigned side2_null = !(mask & 4);
@@ -108,6 +109,13 @@ static int collect_merge_info_callback(int n,
 	unsigned sides_match = (!side1_null && !side2_null &&
 				names[1].mode == names[2].mode &&
 				oideq(&names[1].oid, &names[2].oid));
+	/*
+	 * Note: We only label files with df_conflict, not directories.
+	 * Since directories stay where they are, and files move out of the
+	 * way to make room for a directory, we don't care if there was a
+	 * directory/file conflict for a parent directory of the current path.
+	 */
+	unsigned df_conflict = (filemask != 0) && (dirmask != 0);
 
 	/* n = 3 is a fundamental assumption. */
 	if (n != 3)
@@ -127,6 +135,14 @@ static int collect_merge_info_callback(int n,
 	/* Other invariant checks, mostly for documentation purposes. */
 	assert(mask == (dirmask | filemask));
 
+	/* Determine match_mask */
+	if (side1_matches_mbase)
+		match_mask = (side2_matches_mbase ? 7 : 3);
+	else if (side2_matches_mbase)
+		match_mask = 5;
+	else if (sides_match)
+		match_mask = 6;
+
 	/*
 	 * Get the name of the relevant filepath, which we'll pass to
 	 * setup_path_info() for tracking.
@@ -145,6 +161,8 @@ static int collect_merge_info_callback(int n,
 	 * so we can resolve later in process_entries.
 	 */
 	ci = xcalloc(1, sizeof(struct conflict_info));
+	ci->df_conflict = df_conflict;
+	ci->match_mask = match_mask;
 	strmap_put(&opti->paths, fullpath, ci);
 
 	/* If dirmask, recurse into subdirectories */
@@ -161,6 +179,13 @@ static int collect_merge_info_callback(int n,
 		newinfo.name = p->path;
 		newinfo.namelen = p->pathlen;
 		newinfo.pathlen = st_add3(newinfo.pathlen, p->pathlen, 1);
+		/*
+		 * If we did care about parent directories having a D/F
+		 * conflict, then we'd include
+		 *    newinfo.df_conflicts |= (mask & ~dirmask);
+		 * here.  But we don't.  (See comment near setting of local
+		 * df_conflict variable near the beginning of this function).
+		 */
 
 		for (i = 0; i < 3; i++, dirmask >>= 1) {
 			if (i == 1 && side1_matches_mbase)
-- 
2.29.0.471.ga4f56089c0



* [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (7 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 08/20] merge-ort: compute a few more useful fields for collect_merge_info Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-06 22:58   ` Jonathan Tan
  2020-11-11 15:26   ` Derrick Stolee
  2020-11-02 20:43 ` [PATCH v2 10/20] merge-ort: avoid recursing into identical trees Elijah Newren
                   ` (12 subsequent siblings)
  21 siblings, 2 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

Create a helper function, setup_path_info(), which can be used to record
all the information we want in a merged_info or conflict_info.  There is
currently only one caller of this new function, and it passes fixed
values for some of the parameters; more callers will be added later.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 54 insertions(+), 7 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index 0ff90981cf..bef3c648a0 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -77,6 +77,51 @@ static int err(struct merge_options *opt, const char *err, ...)
 	return -1;
 }
 
+static void setup_path_info(struct merge_options *opt,
+			    struct string_list_item *result,
+			    const char *current_dir_name,
+			    int current_dir_name_len,
+			    char *fullpath, /* we'll take over ownership */
+			    struct name_entry *names,
+			    struct name_entry *merged_version,
+			    unsigned is_null,     /* boolean */
+			    unsigned df_conflict, /* boolean */
+			    unsigned filemask,
+			    unsigned dirmask,
+			    int resolved          /* boolean */)
+{
+	struct conflict_info *path_info;
+
+	assert(!is_null || resolved);
+	assert(!df_conflict || !resolved); /* df_conflict implies !resolved */
+	assert(resolved == (merged_version != NULL));
+
+	path_info = xcalloc(1, resolved ? sizeof(struct merged_info) :
+					  sizeof(struct conflict_info));
+	path_info->merged.directory_name = current_dir_name;
+	path_info->merged.basename_offset = current_dir_name_len;
+	path_info->merged.clean = !!resolved;
+	if (resolved) {
+		path_info->merged.result.mode = merged_version->mode;
+		oidcpy(&path_info->merged.result.oid, &merged_version->oid);
+		path_info->merged.is_null = !!is_null;
+	} else {
+		int i;
+
+		for (i = 0; i < 3; i++) {
+			path_info->pathnames[i] = fullpath;
+			path_info->stages[i].mode = names[i].mode;
+			oidcpy(&path_info->stages[i].oid, &names[i].oid);
+		}
+		path_info->filemask = filemask;
+		path_info->dirmask = dirmask;
+		path_info->df_conflict = !!df_conflict;
+	}
+	strmap_put(&opt->priv->paths, fullpath, path_info);
+	result->string = fullpath;
+	result->util = path_info;
+}
+
 static int collect_merge_info_callback(int n,
 				       unsigned long mask,
 				       unsigned long dirmask,
@@ -91,10 +136,12 @@ static int collect_merge_info_callback(int n,
 	 */
 	struct merge_options *opt = info->data;
 	struct merge_options_internal *opti = opt->priv;
-	struct conflict_info *ci;
+	struct string_list_item pi;  /* Path Info */
+	struct conflict_info *ci; /* pi.util when there's a conflict */
 	struct name_entry *p;
 	size_t len;
 	char *fullpath;
+	const char *dirname = opti->current_dir_name;
 	unsigned filemask = mask & ~dirmask;
 	unsigned match_mask = 0; /* will be updated below */
 	unsigned mbase_null = !(mask & 1);
@@ -157,13 +204,13 @@ static int collect_merge_info_callback(int n,
 	make_traverse_path(fullpath, len+1, info, p->path, p->pathlen);
 
 	/*
-	 * TODO: record information about the path other than all zeros,
-	 * so we can resolve later in process_entries.
+	 * Record information about the path so we can resolve later in
+	 * process_entries.
 	 */
-	ci = xcalloc(1, sizeof(struct conflict_info));
-	ci->df_conflict = df_conflict;
+	setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
+			names, NULL, 0, df_conflict, filemask, dirmask, 0);
+	ci = pi.util;
 	ci->match_mask = match_mask;
-	strmap_put(&opti->paths, fullpath, ci);
 
 	/* If dirmask, recurse into subdirectories */
 	if (dirmask) {
@@ -204,7 +251,7 @@ static int collect_merge_info_callback(int n,
 		}
 
 		original_dir_name = opti->current_dir_name;
-		opti->current_dir_name = fullpath;
+		opti->current_dir_name = pi.string;
 		ret = traverse_trees(NULL, 3, t, &newinfo);
 		opti->current_dir_name = original_dir_name;
 
-- 
2.29.0.471.ga4f56089c0



* [PATCH v2 10/20] merge-ort: avoid recursing into identical trees
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (8 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-11 15:31   ` Derrick Stolee
  2020-11-02 20:43 ` [PATCH v2 11/20] merge-ort: add a preliminary simple process_entries() implementation Elijah Newren
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

When all three trees have the same oid, there is no need to recurse into
these trees to find that all files within them happen to match.  We can
just record any one of the trees as the resolution of merging that
particular path.

Immediately resolving trees for other types of trivial tree merges (such
as one side matches the merge base, or the two sides match each other)
would prevent us from detecting renames for some paths, and thus prevent
us from doing three-way content merges for those paths whose renames we
did not detect.
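
To illustrate the rule above, here is a minimal standalone sketch (not
code from merge-ort.c).  The short strings stand in for object ids, and
can_resolve_early() is a hypothetical name chosen for this example; the
index convention 0 = merge base, 1 = side1, 2 = side2 is assumed.

```c
#include <assert.h>
#include <string.h>

enum { MBASE = 0, SIDE1 = 1, SIDE2 = 2 };

/*
 * Hypothetical helper: return 1 only when all three versions of a path
 * are identical.  Strings stand in for object ids in this sketch.
 */
static int can_resolve_early(const char *const oids[3])
{
	int side1_matches_mbase = !strcmp(oids[SIDE1], oids[MBASE]);
	int side2_matches_mbase = !strcmp(oids[SIDE2], oids[MBASE]);

	/*
	 * Only the "everything matches" case is safe to short-circuit;
	 * the other trivial cases may still hide renames underneath.
	 */
	return side1_matches_mbase && side2_matches_mbase;
}
```

With this sketch, a path where only side1 matches the merge base, or
where the two sides match each other but not the base, is not resolved
early and would still be recursed into.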

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index bef3c648a0..9900fa1bf8 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -203,6 +203,19 @@ static int collect_merge_info_callback(int n,
 	fullpath = xmalloc(len+1);
 	make_traverse_path(fullpath, len+1, info, p->path, p->pathlen);
 
+	/*
+	 * If mbase, side1, and side2 all match, we can resolve early.  Even
+	 * if these are trees, there will be no renames or anything
+	 * underneath.
+	 */
+	if (side1_matches_mbase && side2_matches_mbase) {
+		/* mbase, side1, & side2 all match; use mbase as resolution */
+		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
+				names, names+0, mbase_null, 0,
+				filemask, dirmask, 1);
+		return mask;
+	}
+
 	/*
 	 * Record information about the path so we can resolve later in
 	 * process_entries.
-- 
2.29.0.471.ga4f56089c0



* [PATCH v2 11/20] merge-ort: add a preliminary simple process_entries() implementation
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (9 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 10/20] merge-ort: avoid recursing into identical trees Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-11 19:51   ` Jonathan Tan
  2020-11-02 20:43 ` [PATCH v2 12/20] merge-ort: have process_entries operate in a defined order Elijah Newren
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

Add a process_entries() implementation that just loops over the paths
and processes each one individually with an auxiliary process_entry()
call.  Add a basic process_entry() as well, which handles several cases
but leaves a few of the more involved ones with die-not-implemented
messages.  Also, although process_entries() is supposed to create a
tree, it does not yet have code to do so -- except in the special case
of merging completely empty trees.
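
As a rough illustration of the match_mask handling in process_entry(),
the following standalone sketch shows which side ends up as the result.
It assumes the bit convention 1 = merge base, 2 = side1, 4 = side2;
struct resolution and resolve_by_match() are hypothetical names used
only for this example.

```c
#include <assert.h>

/* Hypothetical result type for this sketch. */
struct resolution {
	int side;    /* which stage (1 or 2) supplies the result */
	int is_null; /* 1 if the result is "path does not exist" */
};

/*
 * Bits: 1 = merge base, 2 = side1, 4 = side2.
 * match_mask must be 3 (base == side1), 5 (base == side2) or
 * 6 (side1 == side2); 7 is assumed to have been handled earlier.
 */
static struct resolution resolve_by_match(unsigned filemask,
					  unsigned match_mask)
{
	struct resolution r = { 0, 0 };
	unsigned othermask;

	assert(match_mask == 3 || match_mask == 5 || match_mask == 6);
	if (match_mask == 6) {
		/* both sides agree; take side1 */
		r.side = 1;
		return r;
	}
	/* one side matches the base; take the side that changed */
	othermask = 7 & ~match_mask;
	r.side = (othermask == 4) ? 2 : 1;
	/* if only the matching versions exist, the other side deleted it */
	r.is_null = (filemask == match_mask);
	return r;
}
```

For example, filemask 3 with match_mask 3 means the file is unchanged on
side1 and absent on side2, so the sketch reports a clean deletion.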

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 106 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 105 insertions(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index 9900fa1bf8..92bbdc7255 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -323,10 +323,114 @@ static int detect_and_process_renames(struct merge_options *opt,
 	return clean;
 }
 
+/* Per entry merge function */
+static void process_entry(struct merge_options *opt,
+			  const char *path,
+			  struct conflict_info *ci)
+{
+	assert(!ci->merged.clean);
+	assert(ci->filemask >= 0 && ci->filemask <= 7);
+
+	if (ci->filemask == 0) {
+		/*
+		 * This is a placeholder for directories that were recursed
+		 * into; nothing to do in this case.
+		 */
+		return;
+	}
+
+	if (ci->df_conflict) {
+		die("Not yet implemented.");
+	}
+
+	/*
+	 * NOTE: Below there is a long switch-like if-elseif-elseif... block
+	 *       which the code goes through even for the df_conflict cases
+	 *       above.  Well, it will once we don't die-not-implemented above.
+	 */
+	if (ci->match_mask) {
+		ci->merged.clean = 1;
+		if (ci->match_mask == 6) {
+			/* stages[1] == stages[2] */
+			ci->merged.result.mode = ci->stages[1].mode;
+			oidcpy(&ci->merged.result.oid, &ci->stages[1].oid);
+		} else {
+			/* determine the mask of the side that didn't match */
+			unsigned int othermask = 7 & ~ci->match_mask;
+			int side = (othermask == 4) ? 2 : 1;
+
+			ci->merged.is_null = (ci->filemask == ci->match_mask);
+			ci->merged.result.mode = ci->stages[side].mode;
+			oidcpy(&ci->merged.result.oid, &ci->stages[side].oid);
+
+			assert(othermask == 2 || othermask == 4);
+			assert(ci->merged.is_null == !ci->merged.result.mode);
+		}
+	} else if (ci->filemask >= 6 &&
+		   (S_IFMT & ci->stages[1].mode) !=
+		   (S_IFMT & ci->stages[2].mode)) {
+		/*
+		 * Two different items from (file/submodule/symlink)
+		 */
+		die("Not yet implemented.");
+	} else if (ci->filemask >= 6) {
+		/*
+		 * TODO: Needs a two-way or three-way content merge, but we're
+		 * just being lazy and copying the version from HEAD and
+		 * leaving it as conflicted.
+		 */
+		ci->merged.clean = 0;
+		ci->merged.result.mode = ci->stages[1].mode;
+		oidcpy(&ci->merged.result.oid, &ci->stages[1].oid);
+	} else if (ci->filemask == 3 || ci->filemask == 5) {
+		/* Modify/delete */
+		die("Not yet implemented.");
+	} else if (ci->filemask == 2 || ci->filemask == 4) {
+		/* Added on one side */
+		int side = (ci->filemask == 4) ? 2 : 1;
+		ci->merged.result.mode = ci->stages[side].mode;
+		oidcpy(&ci->merged.result.oid, &ci->stages[side].oid);
+		ci->merged.clean = !ci->df_conflict && !ci->path_conflict;
+	} else if (ci->filemask == 1) {
+		/* Deleted on both sides */
+		ci->merged.is_null = 1;
+		ci->merged.result.mode = 0;
+		oidcpy(&ci->merged.result.oid, &null_oid);
+		ci->merged.clean = !ci->path_conflict;
+	}
+
+	/*
+	 * If still unmerged, record it separately.  This allows us to later
+	 * iterate over just unmerged entries when updating the index instead
+	 * of iterating over all entries.
+	 */
+	if (!ci->merged.clean)
+		strmap_put(&opt->priv->unmerged, path, ci);
+}
+
 static void process_entries(struct merge_options *opt,
 			    struct object_id *result_oid)
 {
-	die("Not yet implemented.");
+	struct hashmap_iter iter;
+	struct strmap_entry *e;
+
+	if (strmap_empty(&opt->priv->paths)) {
+		oidcpy(result_oid, opt->repo->hash_algo->empty_tree);
+		return;
+	}
+
+	strmap_for_each_entry(&opt->priv->paths, &iter, e) {
+		/*
+		 * WARNING: If ci->merged.clean is true, then ci does not
+		 * actually point to a conflict_info but a struct merged_info.
+		 */
+		struct conflict_info *ci = e->value;
+
+		if (!ci->merged.clean)
+			process_entry(opt, e->key, e->value);
+	}
+
+	die("Tree creation not yet implemented");
 }
 
 void merge_switch_to_result(struct merge_options *opt,
-- 
2.29.0.471.ga4f56089c0



* [PATCH v2 12/20] merge-ort: have process_entries operate in a defined order
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (10 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 11/20] merge-ort: add a preliminary simple process_entries() implementation Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-11 16:09   ` Derrick Stolee
  2020-11-02 20:43 ` [PATCH v2 13/20] merge-ort: step 1 of tree writing -- record basenames, modes, and oids Elijah Newren
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

We want to handle paths below a directory before needing to handle the
directory itself.  Also, we want to handle the directory immediately
after the paths below it, so we can't use simple lexicographic ordering
from strcmp (which would insert foo.txt between foo and foo/file.c).
Copy string_list_df_name_compare() from merge-recursive.c, and set up a
string list of paths sorted by that function so that we can iterate in
the desired order.
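
The ordering difference can be seen with a small standalone sketch.
df_compare_as_dirs() below is a simplified stand-in written for this
example (it is not git's df_name_compare()); it treats every name as if
it were a directory, i.e. as if it had a trailing '/', and breaks ties
by length so that 'foo' sorts before 'foo/file.c'.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in: compare names as if each ended in '/'. */
static int df_compare_as_dirs(const char *one, const char *two)
{
	size_t onelen = strlen(one), twolen = strlen(two);
	size_t len = onelen < twolen ? onelen : twolen;
	unsigned char c1, c2;
	int cmp = memcmp(one, two, len);

	if (cmp)
		return cmp;
	if (onelen == twolen)
		return 0;
	/* the shorter name is at its end here; pretend it has a '/' */
	c1 = one[len] ? (unsigned char)one[len] : '/';
	c2 = two[len] ? (unsigned char)two[len] : '/';
	if (c1 != c2)
		return c1 - c2;
	/* 'foo' vs 'foo/bar': put the directory itself first */
	return onelen < twolen ? -1 : 1;
}

static int cmp_df(const void *a, const void *b)
{
	return df_compare_as_dirs(*(const char *const *)a,
				  *(const char *const *)b);
}

static int cmp_plain(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort an array of paths with either ordering. */
static void sort_paths(const char **paths, size_t nr, int df_order)
{
	qsort(paths, nr, sizeof(*paths), df_order ? cmp_df : cmp_plain);
}
```

Sorting { "foo/file.c", "foo.txt", "foo" } with strcmp puts foo.txt
between foo and foo/file.c, whereas the directory-aware comparison keeps
foo and foo/file.c adjacent, so a reverse walk visits foo/file.c and then
foo immediately afterwards.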

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 47 insertions(+), 2 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index 92bbdc7255..3d46d62ed3 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -323,6 +323,33 @@ static int detect_and_process_renames(struct merge_options *opt,
 	return clean;
 }
 
+static int string_list_df_name_compare(const char *one, const char *two)
+{
+	int onelen = strlen(one);
+	int twolen = strlen(two);
+	/*
+	 * Here we only care that entries for D/F conflicts are
+	 * adjacent, in particular with the file of the D/F conflict
+	 * appearing before files below the corresponding directory.
+	 * The order of the rest of the list is irrelevant for us.
+	 *
+	 * To achieve this, we sort with df_name_compare and provide
+	 * the mode S_IFDIR so that D/F conflicts will sort correctly.
+	 * We use the mode S_IFDIR for everything else for simplicity,
+	 * since in other cases any changes in their order due to
+	 * sorting cause no problems for us.
+	 */
+	int cmp = df_name_compare(one, onelen, S_IFDIR,
+				  two, twolen, S_IFDIR);
+	/*
+	 * Now that 'foo' and 'foo/bar' compare equal, we have to make sure
+	 * that 'foo' comes before 'foo/bar'.
+	 */
+	if (cmp)
+		return cmp;
+	return onelen - twolen;
+}
+
 /* Per entry merge function */
 static void process_entry(struct merge_options *opt,
 			  const char *path,
@@ -413,23 +440,41 @@ static void process_entries(struct merge_options *opt,
 {
 	struct hashmap_iter iter;
 	struct strmap_entry *e;
+	struct string_list plist = STRING_LIST_INIT_NODUP;
+	struct string_list_item *entry;
 
 	if (strmap_empty(&opt->priv->paths)) {
 		oidcpy(result_oid, opt->repo->hash_algo->empty_tree);
 		return;
 	}
 
+	/* Hack to pre-allocate plist to the desired size */
+	ALLOC_GROW(plist.items, strmap_get_size(&opt->priv->paths), plist.alloc);
+
+	/* Put every entry from paths into plist, then sort */
 	strmap_for_each_entry(&opt->priv->paths, &iter, e) {
+		string_list_append(&plist, e->key)->util = e->value;
+	}
+	plist.cmp = string_list_df_name_compare;
+	string_list_sort(&plist);
+
+	/*
+	 * Iterate over the items in reverse order, so we can handle paths
+	 * below a directory before needing to handle the directory itself.
+	 */
+	for (entry = &plist.items[plist.nr-1]; entry >= plist.items; --entry) {
+		char *path = entry->string;
 		/*
 		 * WARNING: If ci->merged.clean is true, then ci does not
 		 * actually point to a conflict_info but a struct merged_info.
 		 */
-		struct conflict_info *ci = e->value;
+		struct conflict_info *ci = entry->util;
 
 		if (!ci->merged.clean)
-			process_entry(opt, e->key, e->value);
+			process_entry(opt, path, ci);
 	}
 
+	string_list_clear(&plist, 0);
 	die("Tree creation not yet implemented");
 }
 
-- 
2.29.0.471.ga4f56089c0



* [PATCH v2 13/20] merge-ort: step 1 of tree writing -- record basenames, modes, and oids
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (11 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 12/20] merge-ort: have process_entries operate in a defined order Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-11 20:01   ` Jonathan Tan
  2020-11-02 20:43 ` [PATCH v2 14/20] merge-ort: step 2 of tree writing -- function to create tree object Elijah Newren
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

As a step towards transforming the processed path->conflict_info entries
into an actual tree object, start recording basenames, modes, and oids
in a dir_metadata structure.  Subsequent commits will make use of this
to actually write a tree.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index 3d46d62ed3..ff4d455dce 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -350,10 +350,31 @@ static int string_list_df_name_compare(const char *one, const char *two)
 	return onelen - twolen;
 }
 
+struct directory_versions {
+	struct string_list versions;
+};
+
+static void record_entry_for_tree(struct directory_versions *dir_metadata,
+				  const char *path,
+				  struct conflict_info *ci)
+{
+	const char *basename;
+
+	if (ci->merged.is_null)
+		/* nothing to record */
+		return;
+
+	basename = path + ci->merged.basename_offset;
+	assert(strchr(basename, '/') == NULL);
+	string_list_append(&dir_metadata->versions,
+			   basename)->util = &ci->merged.result;
+}
+
 /* Per entry merge function */
 static void process_entry(struct merge_options *opt,
 			  const char *path,
-			  struct conflict_info *ci)
+			  struct conflict_info *ci,
+			  struct directory_versions *dir_metadata)
 {
 	assert(!ci->merged.clean);
 	assert(ci->filemask >= 0 && ci->filemask <= 7);
@@ -433,6 +454,7 @@ static void process_entry(struct merge_options *opt,
 	 */
 	if (!ci->merged.clean)
 		strmap_put(&opt->priv->unmerged, path, ci);
+	record_entry_for_tree(dir_metadata, path, ci);
 }
 
 static void process_entries(struct merge_options *opt,
@@ -442,6 +464,7 @@ static void process_entries(struct merge_options *opt,
 	struct strmap_entry *e;
 	struct string_list plist = STRING_LIST_INIT_NODUP;
 	struct string_list_item *entry;
+	struct directory_versions dir_metadata;
 
 	if (strmap_empty(&opt->priv->paths)) {
 		oidcpy(result_oid, opt->repo->hash_algo->empty_tree);
@@ -458,6 +481,9 @@ static void process_entries(struct merge_options *opt,
 	plist.cmp = string_list_df_name_compare;
 	string_list_sort(&plist);
 
+	/* other setup */
+	string_list_init(&dir_metadata.versions, 0);
+
 	/*
 	 * Iterate over the items in reverse order, so we can handle paths
 	 * below a directory before needing to handle the directory itself.
@@ -470,11 +496,14 @@ static void process_entries(struct merge_options *opt,
 		 */
 		struct conflict_info *ci = entry->util;
 
-		if (!ci->merged.clean)
-			process_entry(opt, path, ci);
+		if (ci->merged.clean)
+			record_entry_for_tree(&dir_metadata, path, ci);
+		else
+			process_entry(opt, path, ci, &dir_metadata);
 	}
 
 	string_list_clear(&plist, 0);
+	string_list_clear(&dir_metadata.versions, 0);
 	die("Tree creation not yet implemented");
 }
 
-- 
2.29.0.471.ga4f56089c0



* [PATCH v2 14/20] merge-ort: step 2 of tree writing -- function to create tree object
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (12 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 13/20] merge-ort: step 1 of tree writing -- record basenames, modes, and oids Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-11 20:47   ` Jonathan Tan
  2020-11-02 20:43 ` [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go Elijah Newren
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

Create a new function, write_tree(), which will take a list of
basenames, modes, and oids for a single directory and create a tree
object in the object-store.  We do not yet have the basenames, modes,
and oids for just a single directory (we have a mixture of entries from
all directory levels in the hierarchy), so we still die() before the
current call to write_tree(), but the next patch will rectify that.
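
For readers unfamiliar with the on-disk layout, each entry in a tree
object is the mode in octal, a space, the basename, a NUL byte, and then
the raw (binary) hash.  The standalone sketch below serializes a single
entry that way; append_tree_entry() and RAW_HASH_SIZE are names made up
for this example, and a 20-byte (SHA-1 sized) hash is assumed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define RAW_HASH_SIZE 20 /* assumption for this sketch: SHA-1 sized */

/*
 * Write one tree entry ("<octal mode> <name>\0<raw hash>") into out.
 * Returns the number of bytes written, or 0 if it does not fit.
 */
static size_t append_tree_entry(unsigned char *out, size_t outsize,
				unsigned int mode, const char *name,
				const unsigned char *raw_hash)
{
	size_t hdr;
	int n = snprintf((char *)out, outsize, "%o %s", mode, name);

	if (n < 0)
		return 0;
	hdr = (size_t)n + 1; /* keep the terminating NUL as separator */
	if (hdr + RAW_HASH_SIZE > outsize)
		return 0;
	memcpy(out + hdr, raw_hash, RAW_HASH_SIZE);
	return hdr + RAW_HASH_SIZE;
}
```

Note that a directory mode (040000) is written without a leading zero,
i.e. as "40000", because "%o" does not pad.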

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index ff4d455dce..c560dd1634 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -19,6 +19,7 @@
 
 #include "diff.h"
 #include "diffcore.h"
+#include "object-store.h"
 #include "strmap.h"
 #include "tree.h"
 #include "xdiff-interface.h"
@@ -354,6 +355,50 @@ struct directory_versions {
 	struct string_list versions;
 };
 
+static void write_tree(struct object_id *result_oid,
+		       struct string_list *versions,
+		       unsigned int offset)
+{
+	size_t maxlen = 0;
+	unsigned int nr = versions->nr - offset;
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list relevant_entries = STRING_LIST_INIT_NODUP;
+	int i;
+
+	/*
+	 * We want to sort the last (versions->nr-offset) entries in versions.
+	 * Do so by abusing the string_list API a bit: make another string_list
+	 * that contains just those entries and then sort them.
+	 *
+	 * We won't use relevant_entries again and will let it just pop off the
+	 * stack, so there won't be allocation worries or anything.
+	 */
+	relevant_entries.items = versions->items + offset;
+	relevant_entries.nr = versions->nr - offset;
+	string_list_sort(&relevant_entries);
+
+	/* Pre-allocate some space in buf */
+	for (i = 0; i < nr; i++) {
+		maxlen += strlen(versions->items[offset+i].string) + 34;
+	}
+	strbuf_reset(&buf);
+	strbuf_grow(&buf, maxlen);
+
+	/* Write each entry out to buf */
+	for (i = 0; i < nr; i++) {
+		struct merged_info *mi = versions->items[offset+i].util;
+		struct version_info *ri = &mi->result;
+		strbuf_addf(&buf, "%o %s%c",
+			    ri->mode,
+			    versions->items[offset+i].string, '\0');
+		strbuf_add(&buf, ri->oid.hash, the_hash_algo->rawsz);
+	}
+
+	/* Write this object file out, and record in result_oid */
+	write_object_file(buf.buf, buf.len, tree_type, result_oid);
+	strbuf_release(&buf);
+}
+
 static void record_entry_for_tree(struct directory_versions *dir_metadata,
 				  const char *path,
 				  struct conflict_info *ci)
@@ -502,9 +547,16 @@ static void process_entries(struct merge_options *opt,
 			process_entry(opt, path, ci, &dir_metadata);
 	}
 
+	/*
+	 * TODO: We can't actually write a tree yet, because dir_metadata just
+	 * contains all basenames of all files throughout the tree with their
+	 * mode and hash.  Not only is that a nonsensical tree, it will have
+	 * lots of duplicates for paths such as "Makefile" or ".gitignore".
+	 */
+	die("Not yet implemented; need to process subtrees separately");
+	write_tree(result_oid, &dir_metadata.versions, 0);
 	string_list_clear(&plist, 0);
 	string_list_clear(&dir_metadata.versions, 0);
-	die("Tree creation not yet implemented");
 }
 
 void merge_switch_to_result(struct merge_options *opt,
-- 
2.29.0.471.ga4f56089c0



* [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (13 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 14/20] merge-ort: step 2 of tree writing -- function to create tree object Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-12 20:15   ` Jonathan Tan
  2020-11-02 20:43 ` [PATCH v2 16/20] merge-ort: basic outline for merge_switch_to_result() Elijah Newren
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

Our order for processing of entries means that if we have a tree of
files that looks like
   Makefile
   src/moduleA/foo.c
   src/moduleA/bar.c
   src/moduleB/baz.c
   src/moduleB/umm.c
   tokens.txt

Then we will process paths in the order of the leftmost column below.  I
have added two additional columns that help explain the algorithm that
follows; the 2nd column is there to remind us we have oid & mode info we
are tracking for each of these paths (which differs between the paths,
though I'm not representing that here), and the third column annotates
the parent directory of the entry:
   tokens.txt               <version_info>    ""
   src/moduleB/umm.c        <version_info>    src/moduleB
   src/moduleB/baz.c        <version_info>    src/moduleB
   src/moduleB              <version_info>    src
   src/moduleA/foo.c        <version_info>    src/moduleA
   src/moduleA/bar.c        <version_info>    src/moduleA
   src/moduleA              <version_info>    src
   src                      <version_info>    ""
   Makefile                 <version_info>    ""

When the parent directory changes, if it's a subdirectory of the previous
parent directory (e.g. "" -> src/moduleB) then we can just keep appending.
If the parent directory differs from the previous parent directory and is
not a subdirectory, then we should process that directory.

So, for example, when we get to this point:
   tokens.txt               <version_info>    ""
   src/moduleB/umm.c        <version_info>    src/moduleB
   src/moduleB/baz.c        <version_info>    src/moduleB

and note that the next entry (src/moduleB) has a parent (src) that differs
from the previous one (src/moduleB) and isn't a subdirectory of it, we
should write out a tree for src/moduleB
   100644 blob <HASH> umm.c
   100644 blob <HASH> baz.c

then pop all the entries under that directory while recording the new
hash for that directory, leaving us with
   tokens.txt               <version_info>        ""
   src/moduleB              <new version_info>    src

This process repeats until at the end we get to
   tokens.txt               <version_info>        ""
   src                      <new version_info>    ""
   Makefile                 <version_info>        ""

and then we can write out the toplevel tree.  Our string_list can hold
entries belonging to several different parent directories at once; e.g.
a slightly different repository might have:
   whizbang.txt             <version_info>        ""
   tokens.txt               <version_info>        ""
   src/moduleD              <new version_info>    src
   src/moduleC              <new version_info>    src
   src/moduleB              <new version_info>    src
   src/moduleA/foo.c        <version_info>        src/moduleA
   src/moduleA/bar.c        <version_info>        src/moduleA

When src/moduleA is popped off, we need to know that the "last
directory" reverts back to src, and how many entries in our string_list
are associated with that parent directory.  So I use an auxiliary
offsets string_list which would have (parent_directory,offset)
information of the form
   ""             0
   src            2
   src/moduleA    5

Whenever I write out a tree for a subdirectory, I set versions.nr to
the final offset value and then decrement offsets.nr...and then add
an entry to versions with a hash for the new directory.

The idea is relatively simple; there's just a lot of accounting needed
to implement it.
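
The accounting can be tried out in isolation with the standalone sketch
below.  It is a simplified model, not the patch itself: all struct and
function names are hypothetical, directory names are compared with
strcmp()/strncmp() rather than by pointer, and write_tree_stub() only
logs which directory would be written and with how many entries.

```c
#include <assert.h>
#include <string.h>

#define MAX_ITEMS 32

struct path_entry {
	const char *path;   /* full path */
	const char *parent; /* containing directory, "" for toplevel */
	int is_dir;         /* placeholder entry for a directory */
};

struct dir_state {
	const char *versions[MAX_ITEMS];    /* basenames awaiting a tree */
	int versions_nr;
	const char *offset_dir[MAX_ITEMS];  /* stack of open directories */
	int offset_val[MAX_ITEMS];          /* start index in versions */
	int offsets_nr;
	const char *last_dir;
	const char *written_dir[MAX_ITEMS]; /* log of "written" trees */
	int written_count[MAX_ITEMS];
	int written_nr;
};

static const char *base_name(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

static void push_offset(struct dir_state *s, const char *dir)
{
	s->offset_dir[s->offsets_nr] = dir;
	s->offset_val[s->offsets_nr++] = s->versions_nr;
}

static void write_tree_stub(struct dir_state *s, const char *dir, int offset)
{
	s->written_dir[s->written_nr] = dir;
	s->written_count[s->written_nr++] = s->versions_nr - offset;
}

static void complete_directories(struct dir_state *s, const char *new_dir)
{
	int offset, wrote;

	if (s->last_dir && !strcmp(new_dir, s->last_dir))
		return;
	/* starting out, or descending: just remember where we start */
	if (!s->last_dir ||
	    !strncmp(new_dir, s->last_dir, strlen(s->last_dir))) {
		push_offset(s, new_dir);
		s->last_dir = new_dir;
		return;
	}
	/* leaving last_dir: "write" it and pop its entries */
	offset = s->offset_val[s->offsets_nr - 1];
	wrote = offset < s->versions_nr;
	if (wrote)
		write_tree_stub(s, s->last_dir, offset);
	s->offsets_nr--;
	s->versions_nr = offset;
	if (!s->offsets_nr ||
	    strcmp(new_dir, s->offset_dir[s->offsets_nr - 1]))
		push_offset(s, new_dir);
	/* the finished directory becomes an entry of its parent */
	if (wrote)
		s->versions[s->versions_nr++] = base_name(s->last_dir);
	s->last_dir = new_dir;
}

/* entries must already be in the processing order described above */
static void process_in_order(struct dir_state *s,
			     const struct path_entry *entries, int nr)
{
	int i;

	*s = (struct dir_state){0};
	for (i = 0; i < nr; i++) {
		complete_directories(s, entries[i].parent);
		if (!entries[i].is_dir)
			s->versions[s->versions_nr++] =
				base_name(entries[i].path);
	}
	assert(s->offsets_nr == 1 && s->offset_val[0] == 0);
	write_tree_stub(s, "", 0);
}
```

Feeding it the nine entries from the first example, in the order shown,
logs trees for src/moduleB, src/moduleA, src, and finally the toplevel
directory.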

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 106 insertions(+), 7 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index c560dd1634..20b7c0d8b0 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -353,6 +353,9 @@ static int string_list_df_name_compare(const char *one, const char *two)
 
 struct directory_versions {
 	struct string_list versions;
+	struct string_list offsets;
+	const char *last_directory;
+	unsigned last_directory_len;
 };
 
 static void write_tree(struct object_id *result_oid,
@@ -409,12 +412,100 @@ static void record_entry_for_tree(struct directory_versions *dir_metadata,
 		/* nothing to record */
 		return;
 
+	/*
+	 * Note: write_completed_directories() already added
+	 * entries for directories to dir_metadata->versions,
+	 * so no need to handle ci->filemask == 0 again.
+	 */
+	if (!ci->merged.clean && !ci->filemask)
+		return;
+
 	basename = path + ci->merged.basename_offset;
 	assert(strchr(basename, '/') == NULL);
 	string_list_append(&dir_metadata->versions,
 			   basename)->util = &ci->merged.result;
 }
 
+static void write_completed_directories(struct merge_options *opt,
+					const char *new_directory_name,
+					struct directory_versions *info)
+{
+	const char *prev_dir;
+	struct merged_info *dir_info = NULL;
+	unsigned int offset;
+	int wrote_a_new_tree = 0;
+
+	if (new_directory_name == info->last_directory)
+		return;
+
+	/*
+	 * If we are just starting (last_directory is NULL), or last_directory
+	 * is a prefix of the current directory, then we can just update
+	 * last_directory and record the offset where we started this directory.
+	 */
+	if (info->last_directory == NULL ||
+	    !strncmp(new_directory_name, info->last_directory,
+		     info->last_directory_len)) {
+		uintptr_t offset = info->versions.nr;
+
+		info->last_directory = new_directory_name;
+		info->last_directory_len = strlen(info->last_directory);
+		string_list_append(&info->offsets,
+				   info->last_directory)->util = (void*)offset;
+		return;
+	}
+
+	/*
+	 * At this point, the next entry is within a different directory
+	 * than the last entry, so we need to create a tree object for all
+	 * the entries in info->versions that are under info->last_directory.
+	 */
+	dir_info = strmap_get(&opt->priv->paths, info->last_directory);
+	assert(dir_info);
+	offset = (uintptr_t)info->offsets.items[info->offsets.nr-1].util;
+	if (offset == info->versions.nr) {
+		dir_info->is_null = 1;
+	} else {
+		dir_info->result.mode = S_IFDIR;
+		write_tree(&dir_info->result.oid, &info->versions, offset);
+		wrote_a_new_tree = 1;
+	}
+
+	/*
+	 * We've now used several entries from info->versions and one entry
+	 * from info->offsets, so we get rid of those values.
+	 */
+	info->offsets.nr--;
+	info->versions.nr = offset;
+
+	/*
+	 * Now we've got an OID for last_directory in dir_info.  We need to
+	 * add it to info->versions for it to be part of the computation of
+	 * its parent directories' OID.  But first, we have to find out what
+	 * its parent name was and whether that matches the previous
+	 * info->offsets or we need to set up a new one.
+	 */
+	prev_dir = info->offsets.nr == 0 ? NULL :
+		   info->offsets.items[info->offsets.nr-1].string;
+	if (new_directory_name != prev_dir) {
+		uintptr_t c = info->versions.nr;
+		string_list_append(&info->offsets,
+				   new_directory_name)->util = (void*)c;
+	}
+
+	/*
+	 * Okay, finally record OID for last_directory in info->versions,
+	 * and update last_directory.
+	 */
+	if (wrote_a_new_tree) {
+		const char *dir_name = strrchr(info->last_directory, '/');
+		dir_name = dir_name ? dir_name+1 : info->last_directory;
+		string_list_append(&info->versions, dir_name)->util = dir_info;
+	}
+	info->last_directory = new_directory_name;
+	info->last_directory_len = strlen(info->last_directory);
+}
+
 /* Per entry merge function */
 static void process_entry(struct merge_options *opt,
 			  const char *path,
@@ -528,6 +619,9 @@ static void process_entries(struct merge_options *opt,
 
 	/* other setup */
 	string_list_init(&dir_metadata.versions, 0);
+	string_list_init(&dir_metadata.offsets, 0);
+	dir_metadata.last_directory = NULL;
+	dir_metadata.last_directory_len = 0;
 
 	/*
 	 * Iterate over the items in reverse order, so we can handle paths
@@ -541,22 +635,27 @@ static void process_entries(struct merge_options *opt,
 		 */
 		struct conflict_info *ci = entry->util;
 
+		write_completed_directories(opt, ci->merged.directory_name,
+					    &dir_metadata);
 		if (ci->merged.clean)
 			record_entry_for_tree(&dir_metadata, path, ci);
 		else
 			process_entry(opt, path, ci, &dir_metadata);
 	}
 
-	/*
-	 * TODO: We can't actually write a tree yet, because dir_metadata just
-	 * contains all basenames of all files throughout the tree with their
-	 * mode and hash.  Not only is that a nonsensical tree, it will have
-	 * lots of duplicates for paths such as "Makefile" or ".gitignore".
-	 */
-	die("Not yet implemented; need to process subtrees separately");
+	if (dir_metadata.offsets.nr != 1 ||
+	    (uintptr_t)dir_metadata.offsets.items[0].util != 0) {
+		printf("dir_metadata.offsets.nr = %d (should be 1)\n",
+		       dir_metadata.offsets.nr);
+		printf("dir_metadata.offsets.items[0].util = %u (should be 0)\n",
+		       (unsigned)(uintptr_t)dir_metadata.offsets.items[0].util);
+		fflush(stdout);
+		BUG("dir_metadata accounting completely off; shouldn't happen");
+	}
 	write_tree(result_oid, &dir_metadata.versions, 0);
 	string_list_clear(&plist, 0);
 	string_list_clear(&dir_metadata.versions, 0);
+	string_list_clear(&dir_metadata.offsets, 0);
 }
 
 void merge_switch_to_result(struct merge_options *opt,
-- 
2.29.0.471.ga4f56089c0


* [PATCH v2 16/20] merge-ort: basic outline for merge_switch_to_result()
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (14 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-02 20:43 ` [PATCH v2 17/20] merge-ort: add implementation of checkout() Elijah Newren
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

This adds a basic implementation for merge_switch_to_result(), though
just in terms of a few new empty functions that will be defined in
subsequent commits.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 42 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index 20b7c0d8b0..2a60d84f1d 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -658,13 +658,53 @@ static void process_entries(struct merge_options *opt,
 	string_list_clear(&dir_metadata.offsets, 0);
 }
 
+static int checkout(struct merge_options *opt,
+		    struct tree *prev,
+		    struct tree *next)
+{
+	die("Not yet implemented.");
+}
+
+static int record_unmerged_index_entries(struct merge_options *opt,
+					 struct index_state *index,
+					 struct strmap *paths,
+					 struct strmap *unmerged)
+{
+	if (strmap_empty(unmerged))
+		return 0;
+
+	die("Not yet implemented.");
+}
+
 void merge_switch_to_result(struct merge_options *opt,
 			    struct tree *head,
 			    struct merge_result *result,
 			    int update_worktree_and_index,
 			    int display_update_msgs)
 {
-	die("Not yet implemented");
+	assert(opt->priv == NULL);
+	if (result->clean >= 0 && update_worktree_and_index) {
+		struct merge_options_internal *opti = result->priv;
+
+		if (checkout(opt, head, result->tree)) {
+			/* failure to function */
+			result->clean = -1;
+			return;
+		}
+
+		if (record_unmerged_index_entries(opt, opt->repo->index,
+						  &opti->paths,
+						  &opti->unmerged)) {
+			/* failure to function */
+			result->clean = -1;
+			return;
+		}
+	}
+
+	if (display_update_msgs) {
+		/* TODO: print out CONFLICT and other informational messages. */
+	}
+
 	merge_finalize(opt, result);
 }
 
-- 
2.29.0.471.ga4f56089c0


* [PATCH v2 17/20] merge-ort: add implementation of checkout()
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (15 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 16/20] merge-ort: basic outline for merge_switch_to_result() Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-02 20:43 ` [PATCH v2 18/20] tree: enable cmp_cache_name_compare() to be used elsewhere Elijah Newren
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

Since merge-ort creates a tree for its output, when there are no
conflicts, updating the working tree and index is as simple as using the
unpack_trees() machinery with a twoway_merge (i.e. doing the equivalent
of a "checkout" operation).

If there were conflicts in the merge, then since the tree we created
included all the conflict markers, using the unpack_trees machinery
in this manner will still update the working tree correctly.  Further,
all index entries corresponding to cleanly merged files will also be
updated correctly by this procedure.  Index entries corresponding to
unmerged entries will appear as though the user had run "git add -u"
after the merge to accept all files as-is with conflict markers.

Thus, after running unpack_trees(), there needs to be a separate step
for updating the entries in the index corresponding to unmerged files.
This will be the job for the function record_unmerged_index_entries(),
which will be implemented in a subsequent commit.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index 2a60d84f1d..b7c5973d4d 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -19,9 +19,11 @@
 
 #include "diff.h"
 #include "diffcore.h"
+#include "dir.h"
 #include "object-store.h"
 #include "strmap.h"
 #include "tree.h"
+#include "unpack-trees.h"
 #include "xdiff-interface.h"
 
 struct merge_options_internal {
@@ -662,7 +664,48 @@ static int checkout(struct merge_options *opt,
 		    struct tree *prev,
 		    struct tree *next)
 {
-	die("Not yet implemented.");
+	/* Switch the index/working copy from old to new */
+	int ret;
+	struct tree_desc trees[2];
+	struct unpack_trees_options unpack_opts;
+
+	memset(&unpack_opts, 0, sizeof(unpack_opts));
+	unpack_opts.head_idx = -1;
+	unpack_opts.src_index = opt->repo->index;
+	unpack_opts.dst_index = opt->repo->index;
+
+	setup_unpack_trees_porcelain(&unpack_opts, "merge");
+
+	/*
+	 * NOTE: if this were just "git checkout" code, we would probably
+	 * read or refresh the cache and check for an unmerged index, but
+	 * builtin/merge.c or sequencer.c really needs to read the index
+	 * and check for unmerged entries before starting merging for a
+	 * good user experience (no sense waiting for merges/rebases before
+	 * erroring out), so there's no reason to duplicate that work here.
+	 */
+
+	/* 2-way merge to the new branch */
+	unpack_opts.update = 1;
+	unpack_opts.merge = 1;
+	unpack_opts.quiet = 0; /* FIXME: sequencer might want quiet? */
+	unpack_opts.verbose_update = (opt->verbosity > 2);
+	unpack_opts.fn = twoway_merge;
+	if (1/* FIXME: opts->overwrite_ignore*/) {
+		unpack_opts.dir = xcalloc(1, sizeof(*unpack_opts.dir));
+		unpack_opts.dir->flags |= DIR_SHOW_IGNORED;
+		setup_standard_excludes(unpack_opts.dir);
+	}
+	parse_tree(prev);
+	init_tree_desc(&trees[0], prev->buffer, prev->size);
+	parse_tree(next);
+	init_tree_desc(&trees[1], next->buffer, next->size);
+
+	ret = unpack_trees(2, trees, &unpack_opts);
+	clear_unpack_trees_porcelain(&unpack_opts);
+	dir_clear(unpack_opts.dir);
+	FREE_AND_NULL(unpack_opts.dir);
+	return ret;
 }
 
 static int record_unmerged_index_entries(struct merge_options *opt,
-- 
2.29.0.471.ga4f56089c0


* [PATCH v2 18/20] tree: enable cmp_cache_name_compare() to be used elsewhere
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (16 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 17/20] merge-ort: add implementation of checkout() Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-02 20:43 ` [PATCH v2 19/20] merge-ort: add implementation of record_unmerged_index_entries() Elijah Newren
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 tree.c | 2 +-
 tree.h | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/tree.c b/tree.c
index e76517f6b1..a52479812c 100644
--- a/tree.c
+++ b/tree.c
@@ -144,7 +144,7 @@ int read_tree_recursive(struct repository *r,
 	return ret;
 }
 
-static int cmp_cache_name_compare(const void *a_, const void *b_)
+int cmp_cache_name_compare(const void *a_, const void *b_)
 {
 	const struct cache_entry *ce1, *ce2;
 
diff --git a/tree.h b/tree.h
index 9383745073..3eb0484cbf 100644
--- a/tree.h
+++ b/tree.h
@@ -28,6 +28,8 @@ void free_tree_buffer(struct tree *tree);
 /* Parses and returns the tree in the given ent, chasing tags and commits. */
 struct tree *parse_tree_indirect(const struct object_id *oid);
 
+int cmp_cache_name_compare(const void *a_, const void *b_);
+
 #define READ_TREE_RECURSIVE 1
 typedef int (*read_tree_fn_t)(const struct object_id *, struct strbuf *, const char *, unsigned int, int, void *);
 
-- 
2.29.0.471.ga4f56089c0


* [PATCH v2 19/20] merge-ort: add implementation of record_unmerged_index_entries()
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (17 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 18/20] tree: enable cmp_cache_name_compare() to be used elsewhere Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-02 20:43 ` [PATCH v2 20/20] merge-ort: free data structures in merge_finalize() Elijah Newren
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

After checkout(), the working tree has the appropriate contents, and the
index matches the working copy.  That means that all unmodified and
cleanly merged files have correct index entries, but unmerged entries
need to be updated.

We do this by looping over the unmerged entries, marking the existing
index entry for the path with CE_REMOVE, adding new higher order stages
for the path at the end of the index (ignoring normal index sort order),
and then at the end of the loop removing the CE_REMOVE-marked cache
entries and sorting the index.
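
The mark/append/compact/sort sequence described above can be sketched on a
toy array.  This is only an illustration under stated assumptions: toy_entry,
toy_index, toy_record_unmerged() and toy_finish() are hypothetical names, the
"remove" field stands in for CE_REMOVE, every unmerged path is assumed to
have all three stages, and a linear scan replaces the index_name_pos()
lookup for brevity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a cache entry; NOT git's struct cache_entry. */
struct toy_entry {
	const char *path;
	int stage;	/* 0 = merged, 1..3 = higher order stages */
	int remove;	/* stand-in for the CE_REMOVE flag */
};

struct toy_index {
	struct toy_entry e[32];
	size_t nr;
};

/* Order by path, then by stage, as a sorted index would be. */
static int toy_cmp(const void *a_, const void *b_)
{
	const struct toy_entry *a = a_, *b = b_;
	int c = strcmp(a->path, b->path);
	return c ? c : a->stage - b->stage;
}

/*
 * Mark the existing entry for 'path' for removal, looking only at the
 * first 'sorted_nr' entries (the part that is still sorted), then
 * append stages 1..3 at the end without regard to sort order.
 */
void toy_record_unmerged(struct toy_index *idx, size_t sorted_nr,
			 const char *path)
{
	size_t i;
	int stage;

	for (i = 0; i < sorted_nr; i++)
		if (!strcmp(idx->e[i].path, path))
			idx->e[i].remove = 1;
	for (stage = 1; stage <= 3; stage++) {
		assert(idx->nr < sizeof(idx->e) / sizeof(idx->e[0]));
		idx->e[idx->nr].path = path;
		idx->e[idx->nr].stage = stage;
		idx->e[idx->nr].remove = 0;
		idx->nr++;
	}
}

/* Drop the marked entries in one pass, then sort the whole array once. */
void toy_finish(struct toy_index *idx)
{
	size_t i, j = 0;

	for (i = 0; i < idx->nr; i++)
		if (!idx->e[i].remove)
			idx->e[j++] = idx->e[i];
	idx->nr = j;
	qsort(idx->e, idx->nr, sizeof(idx->e[0]), toy_cmp);
}
```

Appending and sorting once at the end avoids shifting the tail of the array
for every inserted entry, which is the cost the commit is trying to dodge.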

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 86 insertions(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index b7c5973d4d..19c30117b0 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -17,6 +17,7 @@
 #include "cache.h"
 #include "merge-ort.h"
 
+#include "cache-tree.h"
 #include "diff.h"
 #include "diffcore.h"
 #include "dir.h"
@@ -713,10 +714,94 @@ static int record_unmerged_index_entries(struct merge_options *opt,
 					 struct strmap *paths,
 					 struct strmap *unmerged)
 {
+	struct hashmap_iter iter;
+	struct strmap_entry *e;
+	int errs = 0;
+	int original_cache_nr;
+
 	if (strmap_empty(unmerged))
 		return 0;
 
-	die("Not yet implemented.");
+	original_cache_nr = index->cache_nr;
+
+	/* Update the index entries for every unmerged path */
+	strmap_for_each_entry(unmerged, &iter, e) {
+		const char *path = e->key;
+		struct conflict_info *ci = e->value;
+		int pos;
+		struct cache_entry *ce;
+		int i;
+
+		/*
+		 * The index will already have a stage=0 entry for this path,
+		 * because we created an as-merged-as-possible version of the
+		 * file and checkout() moved the working copy and index over
+		 * to that version.
+		 *
+		 * However, previous iterations through this loop will have
+		 * added unstaged entries to the end of the cache which
+		 * ignore the standard alphabetical ordering of cache
+		 * entries and break invariants needed for index_name_pos()
+		 * to work.  However, we know the entry we want is before
+		 * those appended cache entries, so do a temporary swap on
+		 * cache_nr to only look through entries of interest.
+		 */
+		SWAP(index->cache_nr, original_cache_nr);
+		pos = index_name_pos(index, path, strlen(path));
+		SWAP(index->cache_nr, original_cache_nr);
+		if (pos < 0) {
+			if (ci->filemask == 1)
+				cache_tree_invalidate_path(index, path);
+			else
+				BUG("Unmerged %s but nothing in basic working tree or index; this shouldn't happen", path);
+		} else {
+			ce = index->cache[pos];
+
+			/*
+			 * Clean paths with CE_SKIP_WORKTREE set will not be
+			 * written to the working tree by the unpack_trees()
+			 * call in checkout().  Our unmerged entries would
+			 * have appeared clean to that code since we ignored
+			 * the higher order stages.  Thus, we need to override
+			 * the CE_SKIP_WORKTREE bit and manually write those
+			 * files to the working disk here.
+			 *
+			 * TODO: Implement this CE_SKIP_WORKTREE fixup.
+			 */
+
+			/*
+			 * Mark this cache entry for removal and instead add
+			 * new stage>0 entries corresponding to the
+			 * conflicts.  If there are many unmerged entries, we
+			 * want to avoid memmove'ing O(NM) entries by
+			 * inserting the new entries one at a time.  So,
+			 * instead, we just add the new cache entries to the
+			 * end (ignoring normal index requirements on sort
+			 * order) and sort the index once we're all done.
+			 */
+			ce->ce_flags |= CE_REMOVE;
+		}
+
+		for (i = 0; i < 3; i++) {
+			struct version_info *vi;
+			if (!(ci->filemask & (1ul << i)))
+				continue;
+			vi = &ci->stages[i];
+			ce = make_cache_entry(index, vi->mode, &vi->oid,
+					      path, i+1, 0);
+			add_index_entry(index, ce, ADD_CACHE_JUST_APPEND);
+		}
+	}
+
+	/*
+	 * Remove the unused cache entries (and invalidate the relevant
+	 * cache-trees), then sort the index entries to get the unmerged
+	 * entries we added to the end into their right locations.
+	 */
+	remove_marked_cache_entries(index, 1);
+	QSORT(index->cache, index->cache_nr, cmp_cache_name_compare);
+
+	return errs;
 }
 
 void merge_switch_to_result(struct merge_options *opt,
-- 
2.29.0.471.ga4f56089c0


* [PATCH v2 20/20] merge-ort: free data structures in merge_finalize()
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (18 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 19/20] merge-ort: add implementation of record_unmerged_index_entries() Elijah Newren
@ 2020-11-02 20:43 ` Elijah Newren
  2020-11-03 14:49 ` [PATCH v2 00/20] fundamentals of merge-ort implementation Derrick Stolee
  2020-11-11 17:08 ` Derrick Stolee
  21 siblings, 0 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-02 20:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index 19c30117b0..c6a0fc388f 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -81,6 +81,16 @@ static int err(struct merge_options *opt, const char *err, ...)
 	return -1;
 }
 
+static void free_strmap_strings(struct strmap *map)
+{
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+
+	strmap_for_each_entry(map, &iter, entry) {
+		free((char*)entry->key);
+	}
+}
+
 static void setup_path_info(struct merge_options *opt,
 			    struct string_list_item *result,
 			    const char *current_dir_name,
@@ -839,7 +849,27 @@ void merge_switch_to_result(struct merge_options *opt,
 void merge_finalize(struct merge_options *opt,
 		    struct merge_result *result)
 {
-	die("Not yet implemented");
+	struct merge_options_internal *opti = result->priv;
+
+	assert(opt->priv == NULL);
+
+	/*
+	 * We marked opti->paths with strdup_strings = 0, so that we
+	 * wouldn't have to make another copy of the fullpath created by
+	 * make_traverse_path from setup_path_info().  But, now that we've
+	 * used it and have no other references to these strings, it is time
+	 * to deallocate them.
+	 */
+	free_strmap_strings(&opti->paths);
+	strmap_clear(&opti->paths, 1);
+
+	/*
+	 * All strings and util fields in opti->unmerged are a subset of
+	 * those in opti->paths.  We don't want to deallocate anything
+	 * twice, so we don't free the strings and we pass 0 for free_util.
+	 */
+	strmap_clear(&opti->unmerged, 0);
+	FREE_AND_NULL(opti);
 }
 
 static void merge_start(struct merge_options *opt, struct merge_result *result)
-- 
2.29.0.471.ga4f56089c0


* Re: [PATCH v2 00/20] fundamentals of merge-ort implementation
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (19 preceding siblings ...)
  2020-11-02 20:43 ` [PATCH v2 20/20] merge-ort: free data structures in merge_finalize() Elijah Newren
@ 2020-11-03 14:49 ` Derrick Stolee
  2020-11-03 16:36   ` Elijah Newren
  2020-11-11 17:08 ` Derrick Stolee
  21 siblings, 1 reply; 84+ messages in thread
From: Derrick Stolee @ 2020-11-03 14:49 UTC (permalink / raw)
  To: Elijah Newren, git

On 11/2/2020 3:43 PM, Elijah Newren wrote:
> This series depends on a merge of en/strmap (after updating to v3) and
> en/merge-ort-api-null-impl.
> 
> As promised, here's the update of the series due to the strmap
> updates...and two other tiny updates.

Hi Elijah,

I'm sorry that I've been unavailable to read and review your series
on this topic. I'm very excited about the opportunities here, and I
wanted to take your topic and merge it with our microsoft/git fork
so I could test the performance in a Scalar-enabled monorepo. My
branch is available in my fork [1]

[1] https://github.com/derrickstolee/git/tree/merge-ort-vfs

However, I'm unable to discover how to trigger your ort strategy,
even for a simple rebase. Perhaps you could supply a recommended
command for testing?

Thanks,
-Stolee

* Re: [PATCH v2 00/20] fundamentals of merge-ort implementation
  2020-11-03 14:49 ` [PATCH v2 00/20] fundamentals of merge-ort implementation Derrick Stolee
@ 2020-11-03 16:36   ` Elijah Newren
  2020-11-07  6:06     ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-03 16:36 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List

On Tue, Nov 3, 2020 at 6:50 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/2/2020 3:43 PM, Elijah Newren wrote:
> > This series depends on a merge of en/strmap (after updating to v3) and
> > en/merge-ort-api-null-impl.
> >
> > As promised, here's the update of the series due to the strmap
> > updates...and two other tiny updates.
>
> Hi Elijah,
>
> I'm sorry that I've been unavailable to read and review your series
> on this topic. I'm very excited about the opportunities here, and I
> wanted to take your topic and merge it with our microsoft/git fork
> so I could test the performance in a Scalar-enabled monorepo. My
> branch is available in my fork [1]
>
> [1] https://github.com/derrickstolee/git/tree/merge-ort-vfs
>
> However, I'm unable to discover how to trigger your ort strategy,
> even for a simple rebase. Perhaps you could supply a recommended
> command for testing?
>
> Thanks,
> -Stolee

If you want to test performance, you shouldn't test this particular
submission, you should test the end result which exists as the 'ort'
branch of my repo.  It actually passes all the tests rather than just
trivial cherry-picks and rebases, and has lots (and lots) of
performance work that hasn't even begun at the point of the
'ort-basics' branch.  (However, it also contains some unrelated memory
cleanup in revision.c, chdir-notify.c, and a number of other places
because I was annoyed that a rebase wouldn't run valgrind-free and
made it harder to spot my memory leaks.  And the day I went hunting
those memory "leaks", I went and grabbed some unrelated memory leaks
too.  If it causes you merge conflicts, let me know and I'll try to
create a branch for you that has the minimal changes outside of
merge-ort*.[ch] and diffcore*.[ch])

All that said, for testing either branch you just need to first set
pull.twohead=ort in your git config (see
https://lore.kernel.org/git/61217a83bd7ff0ce9016eb4df9ded4fdf29a506c.1604360734.git.gitgitgadget@gmail.com/),
or, if running regression tests, set GIT_TEST_MERGE_ALGORITHM=ort.

* Re: [PATCH v2 01/20] merge-ort: setup basic internal data structures
  2020-11-02 20:43 ` [PATCH v2 01/20] merge-ort: setup basic internal data structures Elijah Newren
@ 2020-11-06 22:05   ` Jonathan Tan
  2020-11-06 22:45     ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Jonathan Tan @ 2020-11-06 22:05 UTC (permalink / raw)
  To: newren; +Cc: git, Jonathan Tan

I'm not very familiar with the merge machinery, but I'll attempt a
review of at least the first 10 patches.

> merged_info contains all relevant information for a non-conflicted
> entry.  conflict_info contains a merged_info, plus any additional
> information about a conflict such as the higher order stages involved
> and the names of the paths those came from (handy once renames get
> involved).  If an entry remains conflicted, the merged_info portion of a
> conflict_info will later be filled with whatever version of the file
> should be placed in the working directory (e.g. an as-merged-as-possible
> variation that contains conflict markers).

I think that this information should be in the .c file.

> diff --git a/merge-ort.c b/merge-ort.c
> index b487901d3e..9d5ea0930d 100644
> --- a/merge-ort.c
> +++ b/merge-ort.c
> @@ -17,6 +17,46 @@
>  #include "cache.h"
>  #include "merge-ort.h"
>  
> +#include "strmap.h"
> +
> +struct merge_options_internal {
> +	struct strmap paths;    /* maps path -> (merged|conflict)_info */
> +	struct strmap unmerged; /* maps path -> conflict_info */
> +	const char *current_dir_name;
> +	int call_depth;
> +};

Maybe document if the path is from the root of the directory or just the
filename as it appears in a tree object?

I would have expected "unmerged" to be a "strset", but I guess it's more
convenient for it to be a map itself. Maybe just document it as "all
mappings in paths wherein the value is a struct conflict_info".

There seem to be two ways of referring to something that we couldn't
merge - "conflicted" (or "having a conflict") and "unmerged". Should we
stick to one of them?

Also, looking ahead, I see that current_dir_name is used as a temporary
variable in the recursive calls to collect_merge_info_callback(). I
would prefer if current_dir_name went in the cbdata to that function
instead, but if that's not possible, maybe document here that
current_dir_name is only used in collect_merge_info_callback(), and
temporarily at that.

> +struct version_info {
> +	struct object_id oid;
> +	unsigned short mode;
> +};

OK.

> +struct merged_info {
> +	struct version_info result;
> +	unsigned is_null:1;
> +	unsigned clean:1;
> +	size_t basename_offset;
> +	 /*
> +	  * Containing directory name.  Note that we assume directory_name is
> +	  * constructed such that
> +	  *    strcmp(dir1_name, dir2_name) == 0 iff dir1_name == dir2_name,
> +	  * i.e. string equality is equivalent to pointer equality.  For this
> +	  * to hold, we have to be careful setting directory_name.
> +	  */
> +	const char *directory_name;
> +};

I'm not sure how most of the fields in this struct are to be used, but
perhaps that will be clearer once I read the subsequent code.

> +struct conflict_info {
> +	struct merged_info merged;
> +	struct version_info stages[3];
> +	const char *pathnames[3];

Why would these be different across stages? (Rename detection?)

> +	unsigned df_conflict:1;

OK.

> +	unsigned path_conflict:1;

This doesn't seem to be assigned anywhere in this patch set?

> +	unsigned filemask:3;
> +	unsigned dirmask:3;

I wonder if this needs to be documented that the least significant bit
corresponds to stages[0], and so forth.

> +	unsigned match_mask:3;

I think this can be derived by just looking at the stages array? Maybe
document as:

  Optimization to track which stages match. Either 0 or at least 2 bits
  are set; if at least 2 bits are set, their corresponding stages match.
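
To make the "0 or at least 2 bits" property concrete, here is a rough sketch
of how such a mask could be derived from the three stages.  It is not taken
from the patch: toy_same() and toy_match_mask() are hypothetical names,
object ids are modelled as strings, and a missing entry (NULL) is assumed
never to match anything.

```c
#include <assert.h>
#include <string.h>

/* Toy object id: a string, or NULL when the path is absent on that side. */
static int toy_same(const char *a, const char *b)
{
	return a && b && !strcmp(a, b);
}

/*
 * Bit 0 = merge base (stages[0]), bit 1 = side 1, bit 2 = side 2.
 * The result is one of 0, 3, 5, 6 or 7, so a single set bit never occurs.
 */
unsigned toy_match_mask(const char *ids[3])
{
	int base_s1 = toy_same(ids[0], ids[1]);
	int base_s2 = toy_same(ids[0], ids[2]);
	int s1_s2 = toy_same(ids[1], ids[2]);

	if (base_s1 && base_s2)
		return 7;	/* all three match */
	if (base_s1)
		return 3;	/* base and side 1 */
	if (base_s2)
		return 5;	/* base and side 2 */
	if (s1_s2)
		return 6;	/* both sides, but not the base */
	return 0;
}
```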

> +};

* Re: [PATCH v2 06/20] merge-ort: implement a very basic collect_merge_info()
  2020-11-02 20:43 ` [PATCH v2 06/20] merge-ort: implement a very basic collect_merge_info() Elijah Newren
@ 2020-11-06 22:19   ` Jonathan Tan
  2020-11-06 23:10     ` Elijah Newren
  2020-11-11 14:38   ` Derrick Stolee
  1 sibling, 1 reply; 84+ messages in thread
From: Jonathan Tan @ 2020-11-06 22:19 UTC (permalink / raw)
  To: newren; +Cc: git, Jonathan Tan

> diff --git a/merge-ort.c b/merge-ort.c
> index 537da9f6df..626eb9713e 100644
> --- a/merge-ort.c
> +++ b/merge-ort.c
> @@ -77,13 +77,130 @@ static int err(struct merge_options *opt, const char *err, ...)
>  	return -1;
>  }
>  
> +static int collect_merge_info_callback(int n,
> +				       unsigned long mask,
> +				       unsigned long dirmask,
> +				       struct name_entry *names,
> +				       struct traverse_info *info)
> +{

[snip]

> +	unsigned mbase_null = !(mask & 1);
> +	unsigned side1_null = !(mask & 2);
> +	unsigned side2_null = !(mask & 4);

Should these be "int"?

> +	/*
> +	 * A bunch of sanity checks verifying that traverse_trees() calls
> +	 * us the way I expect.  Could just remove these at some point,
> +	 * though maybe they are helpful to future code readers.
> +	 */
> +	assert(mbase_null == is_null_oid(&names[0].oid));
> +	assert(side1_null == is_null_oid(&names[1].oid));
> +	assert(side2_null == is_null_oid(&names[2].oid));
> +	assert(!mbase_null || !side1_null || !side2_null);
> +	assert(mask > 0 && mask < 8);

These were helpful to me.

> +	/* Other invariant checks, mostly for documentation purposes. */
> +	assert(mask == (dirmask | filemask));

But not this - filemask was computed in this function, so I need not
look elsewhere to see that this is correct.

> +	/*
> +	 * TODO: record information about the path other than all zeros,
> +	 * so we can resolve later in process_entries.
> +	 */
> +	ci = xcalloc(1, sizeof(struct conflict_info));
> +	strmap_put(&opti->paths, fullpath, ci);

OK - so each entry is a full-size conflict_info to store all relevant
information. Presumably some of these will be converted later into what
is effectively a struct merged_info (so, the extra struct conflict_info
fields are unused but memory is still occupied).

I do see that in patch 10, there is an optimization that directly
allocates the smaller struct merged_info when it is known at this point
that there is no conflict.
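
For readers unfamiliar with the trick, allocating only the smaller struct can
look roughly like the sketch below.  It relies on the smaller struct being
the first member of the larger one; the toy_* names and fields are
hypothetical and much simpler than the real merged_info and conflict_info.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_merged_info {
	int result;
	unsigned clean:1;
};

struct toy_conflict_info {
	struct toy_merged_info merged;	/* must stay the first member */
	int stages[3];
};

/*
 * Allocate just the merged part for clean entries and the full struct
 * otherwise.  Callers always hold a toy_merged_info pointer.
 */
struct toy_merged_info *toy_alloc_info(int clean)
{
	size_t sz = clean ? sizeof(struct toy_merged_info)
			  : sizeof(struct toy_conflict_info);
	struct toy_merged_info *mi = calloc(1, sz);

	assert(mi);
	mi->clean = !!clean;
	return mi;
}

/* Only valid for entries that were allocated at full size. */
struct toy_conflict_info *toy_as_conflict(struct toy_merged_info *mi)
{
	assert(!mi->clean);
	return (struct toy_conflict_info *)mi;
}
```

The price of the saved memory is that code must check the clean bit before
touching any field beyond the merged part.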

[snip rest of function]

>  static int collect_merge_info(struct merge_options *opt,
>  			      struct tree *merge_base,
>  			      struct tree *side1,
>  			      struct tree *side2)
>  {
> -	/* TODO: Implement this using traverse_trees() */
> -	die("Not yet implemented.");
> +	int ret;
> +	struct tree_desc t[3];
> +	struct traverse_info info;
> +	char *toplevel_dir_placeholder = "";
> +
> +	opt->priv->current_dir_name = toplevel_dir_placeholder;
> +	setup_traverse_info(&info, toplevel_dir_placeholder);

I thought that this was written like this (instead of inlining the 2
double-quotes) to ensure that the string-equality-is-pointer-equality
characteristic holds, but I see that that characteristic is for
directory_name in struct merged_info, not current_dir_name in struct
merge_options_internal. Any reason for not inlining ""?
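
As background for the question: C leaves it unspecified whether two separate
"" literals share an address, so one named placeholder is the dependable way
to get pointer equality if it is ever wanted.  The sketch below only
illustrates that point; the toy_* names are hypothetical and it does not
claim to be the reason the patch is written this way.

```c
#include <assert.h>
#include <string.h>

struct toy_path_info {
	const char *directory_name;
};

/* A single shared object stands for the top-level directory. */
const char *toy_toplevel(void)
{
	static const char placeholder[] = "";
	return placeholder;
}

/* Cheap comparison; correct only if equal names share one pointer. */
int toy_same_dir(const struct toy_path_info *a,
		 const struct toy_path_info *b)
{
	return a->directory_name == b->directory_name;
}

/* Content comparison, for contrast. */
int toy_same_dir_by_content(const struct toy_path_info *a,
			    const struct toy_path_info *b)
{
	return !strcmp(a->directory_name, b->directory_name);
}
```

Two entries set from toy_toplevel() compare equal either way, while an entry
pointing at a separate copy of the empty string matches only by content.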

[snip rest of function]

* Re: [PATCH v2 01/20] merge-ort: setup basic internal data structures
  2020-11-06 22:05   ` Jonathan Tan
@ 2020-11-06 22:45     ` Elijah Newren
  2020-11-09 20:55       ` Jonathan Tan
  0 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-06 22:45 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Git Mailing List

On Fri, Nov 6, 2020 at 2:05 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> I'm not very familiar with the merge machinery, but I'll attempt a
> review of at least the first 10 patches.

Thanks for taking a look!

> > merged_info contains all relevant information for a non-conflicted
> > entry.  conflict_info contains a merged_info, plus any additional
> > information about a conflict such as the higher order stages involved
> > and the names of the paths those came from (handy once renames get
> > involved).  If an entry remains conflicted, the merged_info portion of a
> > conflict_info will later be filled with whatever version of the file
> > should be placed in the working directory (e.g. an as-merged-as-possible
> > variation that contains conflict markers).
>
> I think that this information should be in the .c file.

Okay.

> > diff --git a/merge-ort.c b/merge-ort.c
> > index b487901d3e..9d5ea0930d 100644
> > --- a/merge-ort.c
> > +++ b/merge-ort.c
> > @@ -17,6 +17,46 @@
> >  #include "cache.h"
> >  #include "merge-ort.h"
> >
> > +#include "strmap.h"
> > +
> > +struct merge_options_internal {
> > +     struct strmap paths;    /* maps path -> (merged|conflict)_info */
> > +     struct strmap unmerged; /* maps path -> conflict_info */
> > +     const char *current_dir_name;
> > +     int call_depth;
> > +};
>
> Maybe document if the path is from the root of the directory or just the
> filename as it appears in a tree object?

Yeah, full relative path from toplevel.  I'll try to add some
documentation to that effect.

> I would have expected "unmerged" to be a "strset", but I guess it's more
> convenient for it to be a map itself. Maybe just document it as "all
> mappings in paths wherein the value is a struct conflict_info".

Makes sense.  And yeah, it's not a strset just because of the simple
optimization to avoid needing to do another lookup in paths afterwards
to get the actual conflict_info; the only time it is used is as a loop
over the still-unmerged entries to try to three-way merge them and
such.

> There seem to be two ways of referring to something that we couldn't
> merge - "conflicted" (or "having a conflict") and "unmerged". Should we
> stick to one of them?

Uhm...perhaps, but it feels like I'm going to miss some while looking
over it.  Also, there are some semantic differences at play.  Since
the whole algorithm is divided around multiple stages --
collect_merge_info(), detect_and_process_renames(), process_entries() --
as of a given early stage we might just know that we couldn't merge it
*yet*.  For example, both sides modified the entry, or one side
modified and the other side is missing ("did they delete it or rename
it?").  After rename detection and three-way content merge, something
that had not been automatically mergeable as of an earlier step might
end up being so.  But we need names for stuff in the interim state.
But it's also possible for us to know at an early stage that things are
definitely going to be a conflict regardless of what later stages do
(e.g. both sides rename a path, but rename it differently).

> Also, looking ahead, I see that current_dir_name is used as a temporary
> variable in the recursive calls to collect_merge_info_callback(). I
> would prefer if current_dir_name went in the cbdata to that function
> instead, but if that's not possible, maybe document here that
> current_dir_name is only used in collect_merge_info_callback(), and
> temporarily at that.

Yeah, not possible.  collect_merge_info_callback() has to be of
traverse_callback_t type (from tree-walk.h), which provides no extra
parameters for extra callback data.  I can add a documentation
comment.

> > +struct version_info {
> > +     struct object_id oid;
> > +     unsigned short mode;
> > +};
>
> OK.
>
> > +struct merged_info {
> > +     struct version_info result;
> > +     unsigned is_null:1;
> > +     unsigned clean:1;
> > +     size_t basename_offset;
> > +      /*
> > +       * Containing directory name.  Note that we assume directory_name is
> > +       * constructed such that
> > +       *    strcmp(dir1_name, dir2_name) == 0 iff dir1_name == dir2_name,
> > +       * i.e. string equality is equivalent to pointer equality.  For this
> > +       * to hold, we have to be careful setting directory_name.
> > +       */
> > +     const char *directory_name;
> > +};
>
> I'm not sure how most of the fields in this struct are to be used, but
> perhaps that will be clearer once I read the subsequent code.
>
> > +struct conflict_info {
> > +     struct merged_info merged;
> > +     struct version_info stages[3];
> > +     const char *pathnames[3];
>
> Why would these be different across stages? (Rename detection?)

Yes, as noted in the portion of the commit message that you said you
wanted in the .c file.

>
> > +     unsigned df_conflict:1;
>
> OK.
>
> > +     unsigned path_conflict:1;
>
> This doesn't seem to be assigned anywhere in this patch set?

Oh, right, I could drop it for now and add it back later when it is
used.  It's basically any non-content conflict other than the specialized
D/F conflict.  So, things like rename/delete, moved by directory
rename, rename/rename(1to2), and rename/add/delete.  I could have
potentially lumped it in with df_conflict or made df_conflict a
subset, but df_conflict is different enough from the others that it
got a special flag.

But yeah, since it's all rename-related stuff and this patchset
doesn't have any rename handling yet, I probably should have left it
out until I added that stuff later.

> > +     unsigned filemask:3;
> > +     unsigned dirmask:3;
>
> I wonder if this needs to be documented that the least significant bit
> corresponds to stages[0], and so forth.

Maybe I should just put a comment to look at tree-walk.h?  The struct
traverse_info has a "fn" member with a big comment above it describing
mask & dirmask; filemask is just mask & ~dirmask.
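
For readers who do not want to open tree-walk.h, a small sketch of
the relationship; the sketch_* helpers are hypothetical and assume
bit 0 is the merge base, bit 1 is side1, and bit 2 is side2.

```c
#include <assert.h>

/* filemask is whatever is present (mask) but is not a directory. */
static unsigned sketch_filemask(unsigned long mask, unsigned long dirmask)
{
	unsigned filemask = mask & ~dirmask;

	/* Same invariant the patch asserts: every entry is a dir or a file. */
	assert(mask == (dirmask | filemask));
	return filemask;
}

/* Returns nonzero if side `i` (0..2) has a file at the current path. */
static int sketch_side_has_file(unsigned filemask, int i)
{
	return (filemask >> i) & 1;
}
```

With mask=7 and dirmask=3 this yields filemask=4, i.e. only side2
(stages[2]) has a file at that path.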

> > +     unsigned match_mask:3;
>
> I think this can be derived by just looking at the stages array? Maybe
> document as:
>
>   Optimization to track which stages match. Either 0 or at least 2 bits
>   are set; if at least 2 bits are set, their corresponding stages match.

Yep, I only wanted to compute the match_mask once (I always got
annoyed in merge-recursive.c that we were re-comparing what had
already been compared and computed within unpack_trees()).  I like
your suggested comment; will add it.
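
A rough sketch of computing it once, assuming a simplified stand-in
for struct version_info (a hex string instead of struct object_id);
the sketch_* names are illustrative only.

```c
#include <assert.h>
#include <string.h>

struct sketch_version {
	char oid[41];		/* hex object name; stand-in for struct object_id */
	unsigned short mode;
};

static int sketch_same(const struct sketch_version *a,
		       const struct sketch_version *b)
{
	return a->mode == b->mode && !strcmp(a->oid, b->oid);
}

/*
 * Either 0, or at least two bits set; set bits mark stages that match
 * each other (bit 0 = base, bit 1 = side1, bit 2 = side2).
 */
static unsigned sketch_match_mask(const struct sketch_version stages[3])
{
	int base_side1 = sketch_same(&stages[0], &stages[1]);
	int base_side2 = sketch_same(&stages[0], &stages[2]);
	int side1_side2 = sketch_same(&stages[1], &stages[2]);

	if (base_side1 && base_side2)
		return 7;
	if (base_side1)
		return 3;
	if (base_side2)
		return 5;
	if (side1_side2)
		return 6;
	return 0;
}
```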


* Re: [PATCH v2 08/20] merge-ort: compute a few more useful fields for collect_merge_info
  2020-11-02 20:43 ` [PATCH v2 08/20] merge-ort: compute a few more useful fields for collect_merge_info Elijah Newren
@ 2020-11-06 22:52   ` Jonathan Tan
  2020-11-06 23:41     ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Jonathan Tan @ 2020-11-06 22:52 UTC (permalink / raw)
  To: newren; +Cc: git, Jonathan Tan

> +	/*
> +	 * Note: We only label files with df_conflict, not directories.
> +	 * Since directories stay where they are, and files move out of the
> +	 * way to make room for a directory, we don't care if there was a
> +	 * directory/file conflict for a parent directory of the current path.
> +	 */
> +	unsigned df_conflict = (filemask != 0) && (dirmask != 0);

Suppose you have:

 [ours]
  foo/
    bar/
      baz
    quux
 [theirs]
  foo

By "we only label files with df_conflict, not directories", are you
referring to not labelling "foo/" in [ours], or to "bar/", "baz", and
"quux" (so, the files and directories within a directory)? At first I
thought you were referring to the former, but perhaps you are referring
to the latter.

> @@ -161,6 +179,13 @@ static int collect_merge_info_callback(int n,
>  		newinfo.name = p->path;
>  		newinfo.namelen = p->pathlen;
>  		newinfo.pathlen = st_add3(newinfo.pathlen, p->pathlen, 1);
> +		/*
> +		 * If we did care about parent directories having a D/F
> +		 * conflict, then we'd include
> +		 *    newinfo.df_conflicts |= (mask & ~dirmask);
> +		 * here.  But we don't.  (See comment near setting of local
> +		 * df_conflict variable near the beginning of this function).
> +		 */

I'm not sure how "mask" and "dirmask" contains information about parent
directories. "mask" represents the available entries, and "dirmask"
represents which of them are directories, as far as I know. So we can
notice when something is missing, but I don't see how this distinguishes
between the case that something is missing because it was in a parent
directory that got deleted, vs something is missing because it itself
got deleted.


* Re: [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path
  2020-11-02 20:43 ` [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path Elijah Newren
@ 2020-11-06 22:58   ` Jonathan Tan
  2020-11-07  0:26     ` Elijah Newren
  2020-11-11 15:26   ` Derrick Stolee
  1 sibling, 1 reply; 84+ messages in thread
From: Jonathan Tan @ 2020-11-06 22:58 UTC (permalink / raw)
  To: newren; +Cc: git, Jonathan Tan

> +static void setup_path_info(struct merge_options *opt,
> +			    struct string_list_item *result,
> +			    const char *current_dir_name,
> +			    int current_dir_name_len,
> +			    char *fullpath, /* we'll take over ownership */
> +			    struct name_entry *names,
> +			    struct name_entry *merged_version,
> +			    unsigned is_null,     /* boolean */
> +			    unsigned df_conflict, /* boolean */

Booleans could be int, I think?

> +			    unsigned filemask,
> +			    unsigned dirmask,
> +			    int resolved          /* boolean */)
> +{
> +	struct conflict_info *path_info;
> +
> +	assert(!is_null || resolved);
> +	assert(!df_conflict || !resolved); /* df_conflict implies !resolved */
> +	assert(resolved == (merged_version != NULL));
> +
> +	path_info = xcalloc(1, resolved ? sizeof(struct merged_info) :
> +					  sizeof(struct conflict_info));
> +	path_info->merged.directory_name = current_dir_name;
> +	path_info->merged.basename_offset = current_dir_name_len;
> +	path_info->merged.clean = !!resolved;
> +	if (resolved) {
> +		path_info->merged.result.mode = merged_version->mode;
> +		oidcpy(&path_info->merged.result.oid, &merged_version->oid);
> +		path_info->merged.is_null = !!is_null;
> +	} else {
> +		int i;
> +
> +		for (i = 0; i < 3; i++) {
> +			path_info->pathnames[i] = fullpath;
> +			path_info->stages[i].mode = names[i].mode;
> +			oidcpy(&path_info->stages[i].oid, &names[i].oid);
> +		}
> +		path_info->filemask = filemask;
> +		path_info->dirmask = dirmask;
> +		path_info->df_conflict = !!df_conflict;
> +	}
> +	strmap_put(&opt->priv->paths, fullpath, path_info);

So these are placed in paths but not unmerged. I'm starting to wonder if
struct merge_options_internal should be called merge_options_state or
something, and each field having documentation about when they're used
(or better yet, have functions like collect_merge_info() return their
calculations in return values (which may be "out" parameters) instead of
in this struct).

> +	result->string = fullpath;
> +	result->util = path_info;
> +}
> +
>  static int collect_merge_info_callback(int n,
>  				       unsigned long mask,
>  				       unsigned long dirmask,
> @@ -91,10 +136,12 @@ static int collect_merge_info_callback(int n,
>  	 */
>  	struct merge_options *opt = info->data;
>  	struct merge_options_internal *opti = opt->priv;
> -	struct conflict_info *ci;
> +	struct string_list_item pi;  /* Path Info */
> +	struct conflict_info *ci; /* pi.util when there's a conflict */

Looking ahead to patch 10, this seems more like "pi.util unless we know
for sure that there's no conflict".


* Re: [PATCH v2 06/20] merge-ort: implement a very basic collect_merge_info()
  2020-11-06 22:19   ` Jonathan Tan
@ 2020-11-06 23:10     ` Elijah Newren
  2020-11-09 20:59       ` Jonathan Tan
  0 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-06 23:10 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Git Mailing List

On Fri, Nov 6, 2020 at 2:19 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > diff --git a/merge-ort.c b/merge-ort.c
> > index 537da9f6df..626eb9713e 100644
> > --- a/merge-ort.c
> > +++ b/merge-ort.c
> > @@ -77,13 +77,130 @@ static int err(struct merge_options *opt, const char *err, ...)
> >       return -1;
> >  }
> >
> > +static int collect_merge_info_callback(int n,
> > +                                    unsigned long mask,
> > +                                    unsigned long dirmask,
> > +                                    struct name_entry *names,
> > +                                    struct traverse_info *info)
> > +{
>
> [snip]
>
> > +     unsigned mbase_null = !(mask & 1);
> > +     unsigned side1_null = !(mask & 2);
> > +     unsigned side2_null = !(mask & 4);
>
> Should these be "int"?

Does the type matter, particularly since "boolean" isn't available?

> > +     /*
> > +      * A bunch of sanity checks verifying that traverse_trees() calls
> > +      * us the way I expect.  Could just remove these at some point,
> > +      * though maybe they are helpful to future code readers.
> > +      */
> > +     assert(mbase_null == is_null_oid(&names[0].oid));
> > +     assert(side1_null == is_null_oid(&names[1].oid));
> > +     assert(side2_null == is_null_oid(&names[2].oid));
> > +     assert(!mbase_null || !side1_null || !side2_null);
> > +     assert(mask > 0 && mask < 8);
>
> These were helpful to me.
>
> > +     /* Other invariant checks, mostly for documentation purposes. */
> > +     assert(mask == (dirmask | filemask));
>
> But not this - filemask was computed in this function, so I need not
> look elsewhere to see that this is correct.
>
> > +     /*
> > +      * TODO: record information about the path other than all zeros,
> > +      * so we can resolve later in process_entries.
> > +      */
> > +     ci = xcalloc(1, sizeof(struct conflict_info));
> > +     strmap_put(&opti->paths, fullpath, ci);
>
> OK - so each entry is a full-size conflict_info to store all relevant
> information. Presumably some of these will be converted later into what
> is effectively a struct merged_info (so, the extra struct conflict_info
> fields are unused but memory is still occupied).
>
> I do see that in patch 10, there is an optimization that directly
> allocates the smaller struct merged_info when it is known at this point
> that there is no conflict.

Yep.  :-)
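
To illustrate the idea (a sketch with hypothetical, cut-down structs,
not the real layout): because the smaller struct is the first member
of the larger one, a resolved path can get the smaller allocation and
only unresolved paths pay for the extra fields.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_merged_info {
	unsigned clean:1;
};

struct sketch_conflict_info {
	struct sketch_merged_info merged;	/* must stay the first member */
	const char *pathnames[3];
	unsigned filemask:3;
	unsigned dirmask:3;
};

/* Allocate only as much as the path's state requires. */
static struct sketch_merged_info *sketch_alloc(int resolved)
{
	size_t size = resolved ? sizeof(struct sketch_merged_info)
			       : sizeof(struct sketch_conflict_info);
	struct sketch_merged_info *mi = calloc(1, size);

	if (!mi)
		abort();
	mi->clean = !!resolved;
	return mi;
}

/* Only entries that are not clean may be viewed as the larger struct. */
static struct sketch_conflict_info *
sketch_as_conflict(struct sketch_merged_info *mi)
{
	assert(!mi->clean);
	return (struct sketch_conflict_info *)mi;
}
```

The price is that code must check the clean bit before touching any
field beyond the embedded merged part.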

> [snip rest of function]
>
> >  static int collect_merge_info(struct merge_options *opt,
> >                             struct tree *merge_base,
> >                             struct tree *side1,
> >                             struct tree *side2)
> >  {
> > -     /* TODO: Implement this using traverse_trees() */
> > -     die("Not yet implemented.");
> > +     int ret;
> > +     struct tree_desc t[3];
> > +     struct traverse_info info;
> > +     char *toplevel_dir_placeholder = "";
> > +
> > +     opt->priv->current_dir_name = toplevel_dir_placeholder;
> > +     setup_traverse_info(&info, toplevel_dir_placeholder);
>
> I thought that this was written like this (instead of inlining the 2
> double-quotes) to ensure that the string-equality-is-pointer-equality
> characteristic holds, but I see that that characteristic is for
> directory_name in struct merged_info, not current_dir_name in struct
> merge_options_internal. Any reason for not inlining ""?

You're really digging in; I love it.  From setup_path_info(), the
directory_name is set from the current_dir_name:
        path_info->merged.directory_name = current_dir_name;
(and if you follow where the current_dir_name parameter gets its value
from, you find that it came indirectly from
opt->priv->current_dir_name), so current_dir_name must meet all the
requirements on merged_info's directory_name field.

Perhaps there's still some kind of additional simplification possible
here, but directory rename detection is an area that has to take some
special care around this requirement.  I simplified the code a little
bit in this area as I was trying to break off a good first 20 patches
to submit, but even if we can simplify it more, the structure is just
going to come back later.
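
As an illustration of the requirement only: the sketch below enforces
"string equality iff pointer equality" with a tiny intern table.
That is a swapped-in technique for the sake of a self-contained
example; the patch itself gets the property by handing the same
current_dir_name pointer to every entry in a directory.  All sketch_*
names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_DIRS 16

static char *sketch_dirs[SKETCH_MAX_DIRS];
static int sketch_dirs_nr;

/* Return the one canonical copy of `name`, creating it on first use. */
static const char *sketch_intern_dir(const char *name)
{
	int i;

	for (i = 0; i < sketch_dirs_nr; i++)
		if (!strcmp(sketch_dirs[i], name))
			return sketch_dirs[i];

	assert(sketch_dirs_nr < SKETCH_MAX_DIRS);
	sketch_dirs[sketch_dirs_nr] = malloc(strlen(name) + 1);
	if (!sketch_dirs[sketch_dirs_nr])
		abort();
	strcpy(sketch_dirs[sketch_dirs_nr], name);
	return sketch_dirs[sketch_dirs_nr++];
}

/* With canonical pointers, comparing directories needs no strcmp(). */
static int sketch_same_dir(const char *dir1, const char *dir2)
{
	return dir1 == dir2;
}
```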


* Re: [PATCH v2 08/20] merge-ort: compute a few more useful fields for collect_merge_info
  2020-11-06 22:52   ` Jonathan Tan
@ 2020-11-06 23:41     ` Elijah Newren
  2020-11-09 22:04       ` Jonathan Tan
  0 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-06 23:41 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Git Mailing List

On Fri, Nov 6, 2020 at 2:52 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > +     /*
> > +      * Note: We only label files with df_conflict, not directories.
> > +      * Since directories stay where they are, and files move out of the
> > +      * way to make room for a directory, we don't care if there was a
> > +      * directory/file conflict for a parent directory of the current path.
> > +      */
> > +     unsigned df_conflict = (filemask != 0) && (dirmask != 0);
>
> Suppose you have:
>
>  [ours]
>   foo/
>     bar/
>       baz
>     quux
>  [theirs]
>   foo
>
> By "we only label files with df_conflict, not directories", are you
> referring to not labelling "foo/" in [ours], or to "bar/", "baz", and
> "quux" (so, the files and directories within a directory)? At first I
> thought you were referring to the former, but perhaps you are referring
> to the latter.

The former.  I was drawing a distinction between how this code
operates, and how unpack_trees() operates, which probably only matters
to those familiar with unpack_trees() or who have been reading through
it recently.  unpack_trees() will note when there is a directory/file
conflict, and propagates that information to all subtrees, with every
path specially checking for the o->df_conflict_entry and then handling
it specially (e.g. keeping higher order stages instead of using
aggressive or trivial resolutions).  However, leaving both a file and
a directory at the same path, while allowed in the index, makes for
ugliness and difficulty for users to resolve.  Plus it isn't allowed
in the working tree anyway.  We decided a while ago that it'd be
better to represent these conflicts differently[1], [2].

Also, at the time you are unpacking or traversing trees, you only know
if one side had a directory where the other side had a file.  You
don't know if the final merge result will actually have a
directory/file conflict.  If the file existed in the base version and
was unmodified on one side, for example, then the file will be removed
as part of the merge.  It is similarly possible that the entire
directory of files all need to be deleted or are all renamed
elsewhere.  So, you have to keep track of a df_conflict bit, but you
can't act on it until you've processed several other things first.

Since I already know I'm not going to move a whole directory of files
out of the way so that a file can be placed in the working tree
instead of that whole directory, the directory doesn't need to be
tweaked.  I'm not going to propagate any information about a
directory/file conflict at some path down to all subpaths of the
directory.  I only track it for the file that immediately conflicts,
and then only take action on it after resolving all the paths under
the corresponding directory to see if the directory/file conflict
remains.

[1] https://lore.kernel.org/git/xmqqbmabcuhf.fsf@gitster-ct.c.googlers.com/
and the thread surrounding it
[2] https://lore.kernel.org/git/f27f12e8e50e56c010c29caa00296475d4de205b.1603731704.git.gitgitgadget@gmail.com/,
which is now commit ef52778708 ("merge tests: expect improved
directory/file conflict handling in ort", 2020-10-26)

>
> > @@ -161,6 +179,13 @@ static int collect_merge_info_callback(int n,
> >               newinfo.name = p->path;
> >               newinfo.namelen = p->pathlen;
> >               newinfo.pathlen = st_add3(newinfo.pathlen, p->pathlen, 1);
> > +             /*
> > +              * If we did care about parent directories having a D/F
> > +              * conflict, then we'd include
> > +              *    newinfo.df_conflicts |= (mask & ~dirmask);
> > +              * here.  But we don't.  (See comment near setting of local
> > +              * df_conflict variable near the beginning of this function).
> > +              */
>
> I'm not sure how "mask" and "dirmask" contains information about parent
> directories. "mask" represents the available entries, and "dirmask"
> represents which of them are directories, as far as I know. So we can
> notice when something is missing, but I don't see how this distinguishes
> between the case that something is missing because it was in a parent
> directory that got deleted, vs something is missing because it itself
> got deleted.

Yeah, this is more comparisons to unpack_trees.  This code is about to
set up a recursive call into subdirectories.  newinfo is set based on
the mask and dirmask of the current entry, and then subdirectories can
consult newinfo.df_conflicts to see if that path is within a directory
that was involved in a directory/file conflict.  For example:

Tree in base version:
    foo/
        bar
    stuff.txt
Tree on side 1: (adds foo/baz)
    foo/
        bar
        baz
    stuff.txt
Tree on side 2: (deletes foo/, adds new file foo)
   foo
   stuff.txt

When processing 'foo', we have mask=7, dirmask = 3.  So, here
unpack_trees() would have set newinfo.df_conflicts = (mask & ~dirmask)
= 4.  Then when we process foo/bar (mask=3) or foo/baz (mask=2), we have
dirmask=0, which looks like there are no directory/file conflicts.
However, we can note that these paths are under a directory involved
in a directory/file conflict via info.df_conflicts whose value is 4.
unpack_trees() cared about paths under a directory that was involved
in a directory/file conflict, and someone familiar with that code
might ask why I don't also track the same information.  The answer is
that I don't track it, even though I thought about it, because it's
useless overhead since I'm going to leave the directory alone and move
the file out of the way.

Does that make sense?


* Re: [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path
  2020-11-06 22:58   ` Jonathan Tan
@ 2020-11-07  0:26     ` Elijah Newren
  2020-11-09 22:09       ` Jonathan Tan
  0 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-07  0:26 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Git Mailing List

On Fri, Nov 6, 2020 at 2:58 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > +static void setup_path_info(struct merge_options *opt,
> > +                         struct string_list_item *result,
> > +                         const char *current_dir_name,
> > +                         int current_dir_name_len,
> > +                         char *fullpath, /* we'll take over ownership */
> > +                         struct name_entry *names,
> > +                         struct name_entry *merged_version,
> > +                         unsigned is_null,     /* boolean */
> > +                         unsigned df_conflict, /* boolean */
>
> Booleans could be int, I think?

I guess this goes back to the question on patch 6 where you suggested
that some unsigned variables (derived from bit math on other
unsigned quantities) instead be int.  I guess it could, but I have the
same question; since "boolean" isn't available in C, does int vs.
unsigned matter?
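
Whichever parameter type is used, what does matter is the "!!" the
patch applies before storing into a one-bit field.  A sketch with a
hypothetical one-field struct shows why: an unsigned one-bit field
keeps only the lowest bit of what is assigned to it.

```c
#include <assert.h>

struct sketch_flags {
	unsigned clean:1;
};

/* Stores the value as-is; anything other than bit 0 is dropped. */
static unsigned sketch_store_raw(unsigned value)
{
	struct sketch_flags f = { 0 };

	f.clean = value;
	return f.clean;
}

/* Collapses any nonzero value to 1 first, as setup_path_info() does. */
static unsigned sketch_store_normalized(unsigned value)
{
	struct sketch_flags f = { 0 };

	f.clean = !!value;
	return f.clean;
}
```

So a "true" value such as 4 reads back as 0 from sketch_store_raw()
but as 1 from sketch_store_normalized().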

> > +                         unsigned filemask,
> > +                         unsigned dirmask,
> > +                         int resolved          /* boolean */)
> > +{
> > +     struct conflict_info *path_info;
> > +
> > +     assert(!is_null || resolved);
> > +     assert(!df_conflict || !resolved); /* df_conflict implies !resolved */
> > +     assert(resolved == (merged_version != NULL));
> > +
> > +     path_info = xcalloc(1, resolved ? sizeof(struct merged_info) :
> > +                                       sizeof(struct conflict_info));
> > +     path_info->merged.directory_name = current_dir_name;
> > +     path_info->merged.basename_offset = current_dir_name_len;
> > +     path_info->merged.clean = !!resolved;
> > +     if (resolved) {
> > +             path_info->merged.result.mode = merged_version->mode;
> > +             oidcpy(&path_info->merged.result.oid, &merged_version->oid);
> > +             path_info->merged.is_null = !!is_null;
> > +     } else {
> > +             int i;
> > +
> > +             for (i = 0; i < 3; i++) {
> > +                     path_info->pathnames[i] = fullpath;
> > +                     path_info->stages[i].mode = names[i].mode;
> > +                     oidcpy(&path_info->stages[i].oid, &names[i].oid);
> > +             }
> > +             path_info->filemask = filemask;
> > +             path_info->dirmask = dirmask;
> > +             path_info->df_conflict = !!df_conflict;
> > +     }
> > +     strmap_put(&opt->priv->paths, fullpath, path_info);
>
> So these are placed in paths but not unmerged. I'm starting to wonder if
> struct merge_options_internal should be called merge_options_state or
> something, and each field having documentation about when they're used
> (or better yet, have functions like collect_merge_info() return their
> calculations in return values (which may be "out" parameters) instead of
> in this struct).

Right, unmerged is only those paths that remain unmerged after all
steps.  record_unmerged_index_entries() could simply walk over all
entries in paths and pick out the ones that were unmerged, but
process_entries() has to walk over all paths, determine whether they
can be merged, and determine what to record for the resulting tree for
each path.  So, having it stash away the unmerged stuff is a simple
optimization.
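
Roughly, and with plain arrays standing in for the strmaps (all
sketch_* names are hypothetical), the single pass looks like this:

```c
#include <assert.h>
#include <stddef.h>

struct sketch_path {
	const char *name;
	unsigned clean:1;	/* assumed already decided for this sketch */
};

/*
 * Walk every path once and copy the names of the still-unmerged ones
 * into `unmerged`, so a later step need not rescan `paths`.
 * Returns the number of unmerged paths recorded.
 */
static size_t sketch_process_entries(const struct sketch_path *paths,
				     size_t nr,
				     const char **unmerged,
				     size_t unmerged_alloc)
{
	size_t i, unmerged_nr = 0;

	for (i = 0; i < nr; i++) {
		/* ... the real function decides the tree entry here ... */
		if (paths[i].clean)
			continue;
		assert(unmerged_nr < unmerged_alloc);
		unmerged[unmerged_nr++] = paths[i].name;
	}
	return unmerged_nr;
}
```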

Renaming to merge_options_state or even just merge_state would be fine
-- but any renaming done here will also affect merge-recursive.[ch].
See the definition of merge_options in merge-recursive.  (For history,
merge-recursive.h stuffed state into merge_options, which risked funny
misusage patterns and made the API unnecessarily complex...and made it
suggest that alternative algorithms needed to have the same state.
So, the state was moved to a merge_options_internal struct.  That's
not to say we can't rename, but it does need to be done in
merge-recursive as well.)

As for having collect_merge_info() return its calculations in return
values, would that just end with me returning a struct
merge_options_internal?  Or did you want each return value added to
the function signature?  Each return value in the function signature
makes sense right now for this super-simplified initial 20 patches,
but what about when this data structure gains all kinds of
rename-related state that is collected, updated, and passed between
these areas?  I'd have a huge number of "out" and "in" fields to every
function.  Eventually, merge_options_internal (or whatever it might be
renamed to) expands to the following, where I have to first define an
extra enum and two extra structs so that you know the definitions of
new types that show up in merge_options_internal:

enum relevance {
    RELEVANT_NO_MORE = 0,
    RELEVANT_CONTENT = 1,
    RELEVANT_LOCATION = 2,
    RELEVANT_BOTH = 3
};

struct traversal_callback_data {
    unsigned long mask;
    unsigned long dirmask;
    struct name_entry names[3];
};

struct rename_info {
    /* For the next six vars, the 0th entry is ignored and unused */
    struct diff_queue_struct pairs[3]; /* input to & output from diffcore_rename */
    struct strintmap relevant_sources[3];  /* filepath => enum relevance */
    struct strintmap dirs_removed[3];      /* directory => bool */
    struct strmap dir_rename_count[3];     /* old_dir => {new_dir => int} */
    struct strintmap possible_trivial_merges[3]; /* dirname->dir_rename_mask */
    struct strset target_dirs[3];             /* set of directory paths */
    unsigned trivial_merges_okay[3];          /* 0 = no, 1 = maybe */
    /*
     * dir_rename_mask:
     *   0: optimization removing unmodified potential rename source okay
     *   2 or 4: optimization okay, but must check for files added to dir
     *   7: optimization forbidden; need rename source in case of dir rename
     */
    unsigned dir_rename_mask:3;

    /*
     * dir_rename_mask needs to be coupled with a traversal through trees
     * that iterates over all files in a given tree before all immediate
     * subdirectories within that tree.  Since traverse_trees() doesn't do
     * that naturally, we have a traverse_trees_wrapper() that stores any
     * immediate subdirectories while traversing files, then traverses the
     * immediate subdirectories later.
     */
    struct traversal_callback_data *callback_data;
    int callback_data_nr, callback_data_alloc;
    char *callback_data_traverse_path;

    /*
     * When doing repeated merges, we can re-use renaming information from
     * previous merges under special circumstances;
     */
    struct tree *merge_trees[3];
    int cached_pairs_valid_side;
    struct strmap cached_pairs[3];   /* fullnames -> {rename_path or NULL} */
    struct strset cached_irrelevant[3]; /* fullnames */
    struct strset cached_target_names[3]; /* set of target fullnames */
    /*
     * And sometimes it pays to detect renames, and then restart the merge
     * with the renames cached so that we can do trivial tree merging.
     * Values: 0 = don't bother, 1 = let's do it, 2 = we already did it.
     */
    unsigned redo_after_renames;
};

struct merge_options_internal {
    struct strmap paths;    /* maps path -> (merged|conflict)_info */
    struct strmap unmerged; /* maps path -> conflict_info */
#if USE_MEMORY_POOL
    struct mem_pool pool;
#else
    struct string_list paths_to_free; /* list of strings to free */
#endif
    struct rename_info *renames;
    struct index_state attr_index; /* renormalization weirdly needs one... */
    struct strmap output;  /* maps path -> conflict messages */
    const char *current_dir_name;
    char *toplevel_dir; /* see merge_info.directory_name comment */
    int call_depth;
    int needed_rename_limit;
};


> > +     result->string = fullpath;
> > +     result->util = path_info;
> > +}
> > +
> >  static int collect_merge_info_callback(int n,
> >                                      unsigned long mask,
> >                                      unsigned long dirmask,
> > @@ -91,10 +136,12 @@ static int collect_merge_info_callback(int n,
> >        */
> >       struct merge_options *opt = info->data;
> >       struct merge_options_internal *opti = opt->priv;
> > -     struct conflict_info *ci;
> > +     struct string_list_item pi;  /* Path Info */
> > +     struct conflict_info *ci; /* pi.util when there's a conflict */
>
> Looking ahead to patch 10, this seems more like "pi.util unless we know
> for sure that there's no conflict".

That's too long for the line to remain at 80 characters; it's 16
characters over the limit.  ;-)


* Re: [PATCH v2 00/20] fundamentals of merge-ort implementation
  2020-11-03 16:36   ` Elijah Newren
@ 2020-11-07  6:06     ` Elijah Newren
  2020-11-07 15:02       ` Derrick Stolee
  0 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-07  6:06 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List

Hi Derrick,

On Tue, Nov 3, 2020 at 8:36 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Tue, Nov 3, 2020 at 6:50 AM Derrick Stolee <stolee@gmail.com> wrote:
> >
> > On 11/2/2020 3:43 PM, Elijah Newren wrote:
> > > This series depends on a merge of en/strmap (after updating to v3) and
> > > en/merge-ort-api-null-impl.
> > >
> > > As promised, here's the update of the series due to the strmap
> > > updates...and two other tiny updates.
> >
> > Hi Elijah,
> >
> > I'm sorry that I've been unavailable to read and review your series
> > on this topic. I'm very excited about the opportunities here, and I
> > wanted to take your topic and merge it with our microsoft/git fork
> > so I could test the performance in a Scalar-enabled monorepo. My
> > branch is available in my fork [1]
> >
> > [1] https://github.com/derrickstolee/git/tree/merge-ort-vfs
> >
> > However, I'm unable to discover how to trigger your ort strategy,
> > even for a simple rebase. Perhaps you could supply a recommended
> > command for testing?
> >
> > Thanks,
> > -Stolee
>
> If you want to test performance, you shouldn't test this particular
> submission, you should test the end result which exists as the 'ort'
> branch of my repo.  It actually passes all the tests rather than just
> trivial cherry-picks and rebases, and has lots (and lots) of
> performance work that hasn't even begun at the point of the
> 'ort-basics' branch.  (However, it also contains some unrelated memory
> cleanup in revision.c, chdir-notify.c, and a number of other places
> because I was annoyed that a rebase wouldn't run valgrind-free and
> made it harder to spot my memory leaks.  And the day I went hunting
> those memory "leaks", I went and grabbed some unrelated memory leaks
> too.  If it causes you merge conflicts, let me know and I'll try to
> create a branch for you that has the minimal changes outside of
> merge-ort*.[ch] and diffcore*.[ch])
>
> All that said, for testing either branch you just need to first set
> pull.twohead=ort in your git config (see
> https://lore.kernel.org/git/61217a83bd7ff0ce9016eb4df9ded4fdf29a506c.1604360734.git.gitgitgadget@gmail.com/),
> or, if running regression tests, set GIT_TEST_MERGE_ALGORITHM=ort.

I probably also should have mentioned that merge-ort does not (yet?)
heed the merge.renames configuration setting; it always detects renames.
I know you run with merge.renames=false, so you won't quite get an
apples-to-apples comparison.  However, part of my point was I wanted
to make renames fast enough that they could be left turned on, even
for the large scale repos, so I'm very interested in your experience.
If you need an escape hatch, though, just put a "return 1" at the top
of detect_and_process_renames() to turn it off.

Oh, and I went through and re-merged all the merge commits in the
linux kernel and found a bug in merge-ort while doing that (causing it
to die, not to merge badly).  I'm kind of surprised that none of my
testcases triggered that failure earlier; if you're testing it out,
you might want to update to get the fix (commit 067e5c1a38,
"merge-ort: fix bug with cached_target_names not being initialized in
redos", 2020-11-06).


* Re: [PATCH v2 00/20] fundamentals of merge-ort implementation
  2020-11-07  6:06     ` Elijah Newren
@ 2020-11-07 15:02       ` Derrick Stolee
  2020-11-07 19:39         ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Derrick Stolee @ 2020-11-07 15:02 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List

On 11/7/20 1:06 AM, Elijah Newren wrote:
> Hi Derrick,
> 
> On Tue, Nov 3, 2020 at 8:36 AM Elijah Newren <newren@gmail.com> wrote:
>> All that said, for testing either branch you just need to first set
>> pull.twohead=ort in your git config (see
>> https://lore.kernel.org/git/61217a83bd7ff0ce9016eb4df9ded4fdf29a506c.1604360734.git.gitgitgadget@gmail.com/),
>> or, if running regression tests, set GIT_TEST_MERGE_ALGORITHM=ort.
> 
> I probably also should have mentioned that merge-ort does not (yet?)
> heed merge.renames configuration setting; it always detects renames.
> I know you run with merge.renames=false, so you won't quite get an
> apples-to-apples comparison.  However, part of my point was I wanted
> to make renames fast enough that they could be left turned on, even
> for the large scale repos, so I'm very interested in your experience.
> If you need an escape hatch, though, just put a "return 1" at the top
> of detect_and_process_renames() to turn it off.
> 
> Oh, and I went through and re-merged all the merge commits in the
> linux kernel and found a bug in merge-ort while doing that (causing it
> to die, not to merge badly).  I'm kind of surprised that none of my
> testcases triggered that failure earlier; if you're testing it out,
> you might want to update to get the fix (commit 067e5c1a38,
> "merge-ort: fix bug with cached_target_names not being initialized in
> redos", 2020-11-06).

I did manage to do some testing to see what happens with
a large repo under a small sparse-checkout. And using
trace2, I was able to see that your code is being exercised.
Unfortunately, I didn't see any performance improvement, and
that is likely due to needing to expand the index entirely
when checking out the merge commit.

Is there a command to construct a merge commit without
actually checking it out? That would reduce the time spent
expanding the index, which would allow your algorithm to
really show its benefits!

Thanks,
-Stolee

* Re: [PATCH v2 00/20] fundamentals of merge-ort implementation
  2020-11-07 15:02       ` Derrick Stolee
@ 2020-11-07 19:39         ` Elijah Newren
  2020-11-09 12:30           ` Derrick Stolee
  0 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-07 19:39 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List

On Sat, Nov 7, 2020 at 7:02 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/7/20 1:06 AM, Elijah Newren wrote:
> > Hi Derrick,
> >
> > On Tue, Nov 3, 2020 at 8:36 AM Elijah Newren <newren@gmail.com> wrote:
> >> All that said, for testing either branch you just need to first set
> >> pull.twohead=ort in your git config (see
> >> https://lore.kernel.org/git/61217a83bd7ff0ce9016eb4df9ded4fdf29a506c.1604360734.git.gitgitgadget@gmail.com/),
> >> or, if running regression tests, set GIT_TEST_MERGE_ALGORITHM=ort.
> >
> > I probably also should have mentioned that merge-ort does not (yet?)
> > heed merge.renames configuration setting; it always detects renames.
> > I know you run with merge.renames=false, so you won't quite get an
> > apples-to-apples comparison.  However, part of my point was I wanted
> > to make renames fast enough that they could be left turned on, even
> > for the large scale repos, so I'm very interested in your experience.
> > If you need an escape hatch, though, just put a "return 1" at the top
> > of detect_and_process_renames() to turn it off.
> >
> > Oh, and I went through and re-merged all the merge commits in the
> > linux kernel and found a bug in merge-ort while doing that (causing it
> > to die, not to merge badly).  I'm kind of surprised that none of my
> > testcases triggered that failure earlier; if you're testing it out,
> > you might want to update to get the fix (commit 067e5c1a38,
> > "merge-ort: fix bug with cached_target_names not being initialized in
> > redos", 2020-11-06).
>
> I did manage to do some testing to see what happens with
> a large repo under a small sparse-checkout. And using
> trace2, I was able to see that your code is being exercised.
> Unfortunately, I didn't see any performance improvement, and
> that is likely due to needing to expand the index entirely
> when checking out the merge commit.
>
> Is there a command to construct a merge commit without
> actually checking it out? That would reduce the time spent
> expanding the index, which would allow your algorithm to
> really show its benefits!

Wow, very interesting.  I am working on a --remerge-diff option for
log, which implies -p and is similar to -c or --cc in that it makes
merge commits show a diff, but which in particular remerges the two
parent commits complete with conflict markers and such and then diffs
the merge commit against that intermediate remerge.  That's a case
that constructs a merge commit without ever touching the index (or
working tree)...but there's no equivalent comparison point for
merge-recursive, so it gives you nothing to compare against.  (Also,
while the code can be used, I don't actually have a --remerge-diff
option yet; the behavior is just hardcoded on or off.)  So I'm not
sure whether you'd be interested in it.  If you are, let me know and
I'll send details.

However, I'm really surprised here, because merge-recursive always
reads and writes the index too (the index is the basis for its whole
algorithm).  In fact, merge-recursive always reads the index at least
*twice* (it unconditionally discards and re-reads the index), so you
must have some kind of specialized tweaking of merge-recursive if it
somehow avoids a full index read/write.  In order to do an
apples-to-apples comparison, we'd need to make those same tweaks to
merge-ort, but I don't have a clue what kind of tweaks you've made
here.  So, some investigation points:

*1*. Could you give me the accumulated times from the trace2_regions
so we can verify where the time is spent?  The 'summarize-perf' script
at the toplevel of the repo in my ort branch might be helpful for
this; just prefix any git command with that script and it accumulates
the trace2 region times and prints them out.  For example, I could run
'summarize-perf git merge --no-edit B^0' or 'summarize-perf test-tool
fast-rebase --onto HEAD ca76bea9 myfeature'.  Here's an example:

=== BEGIN OUTPUT ===
$ /home/newren/floss/git/summarize-perf test-tool fast-rebase --onto HEAD 4703d9119972bf586d2cca76ec6438f819ffa30e hwmon-updates
Rebasing fd8bdb23b91876ac1e624337bb88dc1dcc21d67e...
Done.
Accumulated times:
    0.031 : <unmeasured> ( 3.2%)
    0.837 : 35 : label:incore_nonrecursive
       0.003 : <unmeasured> ( 0.4%)
       0.476 : 41 : ..label:collect_merge_info
          0.001 : <unmeasured> ( 0.2%)
          0.475 : 41 : ....label:traverse_trees
       0.298 : 41 : ..label:renames
          0.015 : <unmeasured> ( 5.1%)
          0.280 : 41 : ....label:regular renames
             0.036 : <unmeasured> (12.7%)
             0.244 : 6 : ......label:diffcore_rename
                0.001 : <unmeasured> ( 0.4%)
                0.078 : 6 : ........label:dir rename setup
                0.055 : 6 : ........label:basename matches
                0.051 : 6 : ........label:exact renames
                0.031 : 6 : ........label:write back to queue
                0.017 : 6 : ........label:setup
                0.009 : 6 : ........label:cull basename
                0.003 : 6 : ........label:cull exact
          0.002 : 35 : ....label:directory renames
          0.001 : 35 : ....label:process renames
       0.052 : 35 : ..label:process_entries
          0.001 : <unmeasured> ( 1.7%)
          0.033 : 35 : ....label:processing
          0.017 : 35 : ....label:process_entries setup
             0.001 : <unmeasured> ( 5.8%)
             0.008 : 35 : ......label:plist copy
             0.008 : 35 : ......label:plist sort
             0.000 : 35 : ......label:plist grow
          0.001 : 35 : ....label:finalize
       0.005 : 35 : ..label:merge_start
          0.001 : <unmeasured> (18.8%)
          0.004 : 34 : ....label:reset_maps
          0.000 : 35 : ....label:sanity checks
          0.000 : 1 : ....label:allocate/init
       0.003 : 6 : ..label:reset_maps
    0.035 : 1 : label:do_write_index /home/newren/floss/linux-stable/.git/index.lock
    0.034 : 1 : label:checkout
       0.034 : <unmeasured> (99.9%)
       0.000 : 1 : ..label:Filtering content
    0.009 : 1 : label:do_read_index .git/index
    0.000 : 1 : label:write_auto_merge
    0.000 : 1 : label:record_unmerged
Estimated measurement overhead (.010 ms/region-measure * 679): 0.006790000000000001
Timing including forking:  0.960 (0.013 additional seconds)
=== END OUTPUT ===

This was a run that took just under 1s (and was a hot-cache case; I
had just done the same rebase before to warm the caches), and the
combination of index/working tree bits (everything at and after
do_write_index in the output) was 0.035+0.034+0.009+0+0=0.078 seconds,
corresponding to just over 8.1% of overall time.  I'm curious where
that lands for your repository testcase; if the larger time ends up
somewhere under the indented label:incore_nonrecursive region, then
it's due to something other than index reading/updating/writing.

*2*. If it really is due to index reading/updating/writing, then index
handling in merge-ort is confined to two functions: checkout() and
record_unmerged_index_entries().  Both functions are fairly short, and
neither one calls into any other function within merge-ort.c.
(Further, checkout() is a near copy of code from merge_working_tree()
in builtin/checkout.c, or at least a copy of that function from a year
or so ago.)  As such, it's possible you can go in and make whatever
special tweaks you have for partial index reading/writing to those
functions.
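
For anyone without the summarize-perf script at hand, the accumulation
described in point 1 can be approximated as follows.  This is a minimal
sketch, not the actual script: it assumes trace2 event-format output
(one JSON object per line, as written when GIT_TRACE2_EVENT points at a
file) and sums the "t_rel" field of each "region_leave" record by
label.  The function names are hypothetical, and unlike summarize-perf
it prints a flat list rather than a nested tree.

```python
import json
from collections import defaultdict


def accumulate_regions(lines):
    """Sum elapsed seconds and call counts per trace2 region label.

    Assumption: `lines` holds trace2 event-format records, one JSON
    object per line.  Only "region_leave" records are used; their
    "t_rel" field is taken as the time spent inside the region.
    """
    totals = defaultdict(lambda: [0.0, 0])  # label -> [seconds, count]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the trace file
        record = json.loads(line)
        if record.get("event") != "region_leave":
            continue
        entry = totals[record.get("label", "<unlabeled>")]
        entry[0] += float(record.get("t_rel", 0.0))
        entry[1] += 1
    return {label: (secs, count) for label, (secs, count) in totals.items()}


def report(totals):
    """Print regions by descending accumulated time (flat, not nested)."""
    ordered = sorted(totals.items(), key=lambda item: -item[1][0])
    for label, (seconds, count) in ordered:
        print(f"{seconds:9.3f} : {count} : label:{label}")
```

To try it on real data, running a command with GIT_TRACE2_EVENT set to
an absolute file path and then passing the opened file to
accumulate_regions() should be enough.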

I'm curious to hear back more on this.

Elijah

* Re: [PATCH v2 00/20] fundamentals of merge-ort implementation
  2020-11-07 19:39         ` Elijah Newren
@ 2020-11-09 12:30           ` Derrick Stolee
  2020-11-09 17:13             ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Derrick Stolee @ 2020-11-09 12:30 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List

On 11/7/2020 2:39 PM, Elijah Newren wrote:
> On Sat, Nov 7, 2020 at 7:02 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 11/7/20 1:06 AM, Elijah Newren wrote:
>>> Hi Derrick,
>>>
>>> On Tue, Nov 3, 2020 at 8:36 AM Elijah Newren <newren@gmail.com> wrote:
>>>> All that said, for testing either branch you just need to first set
>>>> pull.twohead=ort in your git config (see
>>>> https://lore.kernel.org/git/61217a83bd7ff0ce9016eb4df9ded4fdf29a506c.1604360734.git.gitgitgadget@gmail.com/),
>>>> or, if running regression tests, set GIT_TEST_MERGE_ALGORITHM=ort.
>>>
>>> I probably also should have mentioned that merge-ort does not (yet?)
>>> heed merge.renames configuration setting; it always detects renames.
>>> I know you run with merge.renames=false, so you won't quite get an
>>> apples-to-apples comparison.  However, part of my point was I wanted
>>> to make renames fast enough that they could be left turned on, even
>>> for the large scale repos, so I'm very interested in your experience.
>>> If you need an escape hatch, though, just put a "return 1" at the top
>>> of detect_and_process_renames() to turn it off.
>>>
>>> Oh, and I went through and re-merged all the merge commits in the
>>> linux kernel and found a bug in merge-ort while doing that (causing it
>>> to die, not to merge badly).  I'm kind of surprised that none of my
>>> testcases triggered that failure earlier; if you're testing it out,
>>> you might want to update to get the fix (commit 067e5c1a38,
>>> "merge-ort: fix bug with cached_target_names not being initialized in
>>> redos", 2020-11-06).
>>
>> I did manage to do some testing to see what happens with
>> a large repo under a small sparse-checkout. And using
>> trace2, I was able to see that your code is being exercised.
>> Unfortunately, I didn't see any performance improvement, and
>> that is likely due to needing to expand the index entirely
>> when checking out the merge commit.
>>
>> Is there a command to construct a merge commit without
>> actually checking it out? That would reduce the time spent
>> expanding the index, which would allow your algorithm to
>> really show its benefits!
> 
> Wow, very interesting.  I am working on a --remerge-diff option for
> log, which implies -p and is similar to -c or --cc in that it makes
> merge commits show a diff, but which in particular remerges the two
> parent commits complete with conflict markers and such and then diffs
> the merge commit against that intermediate remerge.  That's a case
> that constructs a merge commit without ever touching the index (or
> working tree)...but there's no equivalent comparison point for
> merge-recursive.  So, it doesn't provide something to compare against
> (and while the code can be used I don't actually have a --remerge-diff
> option yet -- it just hardcodes the behavior on if wanted or not), so
> I'm not sure if you'd be interested in it.  If you are, let me know
> though, and I'll send details.
> 
> However, I'm really surprised here, because merge-recursive always
> reads and writes the index too (the index is the basis for its whole
> algorithm).  In fact, merge-recursive always reads the index at least
> *twice* (it unconditionally discards and re-reads the index), so you
> must have some kind of specialized tweaking of merge-recursive if it
> somehow avoids a full index read/write.  In order to do an
> apples-to-apples comparison, we'd need to make those same tweaks to
> merge-ort, but I don't have a clue what kind of tweaks you've made
> here.  So, some investigation points:
> 
> *1*. Could you give me the accumulated times from the trace2_regions
> so we can verify where the time is spent?  The 'summarize-perf' script
> at the toplevel of the repo in my ort branch might be helpful for
> this; just prefix any git command with that script and it accumulates
> the trace2 region times and prints them out.  For example, I could run
> 'summarize-perf git merge --no-edit B^0' or 'summarize-perf test-tool
> fast-rebase --onto HEAD ca76bea9 myfeature'.  Here's an example:
> 
> === BEGIN OUTPUT ===
> $ /home/newren/floss/git/summarize-perf test-tool fast-rebase --onto HEAD 4703d9119972bf586d2cca76ec6438f819ffa30e hwmon-updates
> Rebasing fd8bdb23b91876ac1e624337bb88dc1dcc21d67e...
> Done.
> Accumulated times:
>     0.031 : <unmeasured> ( 3.2%)
>     0.837 : 35 : label:incore_nonrecursive
>        0.003 : <unmeasured> ( 0.4%)
>        0.476 : 41 : ..label:collect_merge_info
>           0.001 : <unmeasured> ( 0.2%)
>           0.475 : 41 : ....label:traverse_trees
>        0.298 : 41 : ..label:renames
>           0.015 : <unmeasured> ( 5.1%)
>           0.280 : 41 : ....label:regular renames
>              0.036 : <unmeasured> (12.7%)
>              0.244 : 6 : ......label:diffcore_rename
>                 0.001 : <unmeasured> ( 0.4%)
>                 0.078 : 6 : ........label:dir rename setup
>                 0.055 : 6 : ........label:basename matches
>                 0.051 : 6 : ........label:exact renames
>                 0.031 : 6 : ........label:write back to queue
>                 0.017 : 6 : ........label:setup
>                 0.009 : 6 : ........label:cull basename
>                 0.003 : 6 : ........label:cull exact
>           0.002 : 35 : ....label:directory renames
>           0.001 : 35 : ....label:process renames
>        0.052 : 35 : ..label:process_entries
>           0.001 : <unmeasured> ( 1.7%)
>           0.033 : 35 : ....label:processing
>           0.017 : 35 : ....label:process_entries setup
>              0.001 : <unmeasured> ( 5.8%)
>              0.008 : 35 : ......label:plist copy
>              0.008 : 35 : ......label:plist sort
>              0.000 : 35 : ......label:plist grow
>           0.001 : 35 : ....label:finalize
>        0.005 : 35 : ..label:merge_start
>           0.001 : <unmeasured> (18.8%)
>           0.004 : 34 : ....label:reset_maps
>           0.000 : 35 : ....label:sanity checks
>           0.000 : 1 : ....label:allocate/init
>        0.003 : 6 : ..label:reset_maps
>     0.035 : 1 : label:do_write_index /home/newren/floss/linux-stable/.git/index.lock
>     0.034 : 1 : label:checkout
>        0.034 : <unmeasured> (99.9%)
>        0.000 : 1 : ..label:Filtering content
>     0.009 : 1 : label:do_read_index .git/index
>     0.000 : 1 : label:write_auto_merge
>     0.000 : 1 : label:record_unmerged
> Estimated measurement overhead (.010 ms/region-measure * 679): 0.006790000000000001
> Timing including forking:  0.960 (0.013 additional seconds)
> === END OUTPUT ===
> 
> This was a run that took just under 1s (and was a hot-cache case; I
> had just done the same rebase before to warm the caches), and the
> combination of index/working tree bits (everything at and after
> do_write_index in the output) was 0.035+0.034+0.009+0+0=0.078 seconds,
> corresponding to just over 8.1% of overall time.  I'm curious where
> that lands for your repository testcase; if the larger time ends up
> somewhere under the indented label:incore_nonrecursive region, then
> it's due to something other than index reading/updating/writing.
> 
> *2*. If it really is due to index reading/updating/writing, then index
> handling in merge-ort is confined to two functions: checkout() and
> record_unmerged_index_entries().  Both functions aren't too long, and
> neither one calls into any other function within merge-ort.c.
> (Further, checkout() is a near copy of code from merge_working_tree()
> in builtin/checkout.c, or at least a copy of that function from a year
> or so ago.)  As such, it's possible you can go in and make whatever
> special tweaks you have for partial index reading/writing to those
> functions.
> 
> I'm curious to hear back more on this.

I don't have a lot of time to dig into this right now, but here are
the stats for my rebases and merges with and without your option.

The first thing I notice for each is that there is a significant
amount of "unmeasured" time at the beginning of each, and that
could possibly be improved separately.

First, try a rebase forward and backward.

$ /_git/git/summarize-perf git rebase --onto to from test
Successfully rebased and updated refs/heads/test.
Accumulated times:
    8.511 : <unmeasured> (74.9%)
    1.331 : 1 : ......label:unpack_trees
       0.200 : <unmeasured> (15.1%)
       0.580 : 1 : ........label:traverse_trees
       0.403 : 1 : ........label:clear_ce_flags/0x00000000_0x02000000
       0.126 : 1 : ........label:check_updates
          0.126 : <unmeasured> (100.0%)
          0.000 : 1 : ..........label:Filtering content
       0.021 : 1 : ........label:clear_ce_flags/0x00080000_0x42000000
       0.000 : 1 : ........label:fully_valid
    1.059 : 1 : ......label:do_write_index /_git/office/src/.git/index.lock
       0.930 : <unmeasured> (87.9%)
       0.128 : 1 : ........label:write/extension/cache_tree
    0.455 : 2 : ......label:fully_valid
    0.001 : 1 : ......label:traverse_trees
    0.000 : 1 : ......label:check_updates
Estimated measurement overhead (.010 ms/region-measure * 41): 0.00041000000000000005
Timing including forking: 11.382 (0.026 additional seconds)

$ /_git/git/summarize-perf git rebase --onto from to test
Successfully rebased and updated refs/heads/test.
Accumulated times:
    8.556 : <unmeasured> (75.2%)
    1.315 : 1 : ......label:unpack_trees
       0.197 : <unmeasured> (15.0%)
       0.580 : 1 : ........label:traverse_trees
       0.391 : 1 : ........label:clear_ce_flags/0x00000000_0x02000000
       0.126 : 1 : ........label:check_updates
          0.126 : <unmeasured> (100.0%)
          0.000 : 1 : ..........label:Filtering content
       0.021 : 1 : ........label:clear_ce_flags/0x00080000_0x42000000
       0.000 : 1 : ........label:fully_valid
    1.071 : 1 : ......label:do_write_index /_git/office/src/.git/index.lock
       0.942 : <unmeasured> (88.0%)
       0.129 : 1 : ........label:write/extension/cache_tree
    0.431 : 2 : ......label:fully_valid
    0.001 : 1 : ......label:traverse_trees
    0.000 : 1 : ......label:check_updates
Estimated measurement overhead (.010 ms/region-measure * 41): 0.00041000000000000005
Timing including forking: 11.399 (0.026 additional seconds)

Then do the same with the ort strategy.

$ /_git/git/summarize-perf git -c pull.twohead=ort rebase --onto to from test
Successfully rebased and updated refs/heads/test.
Accumulated times:
    8.350 : <unmeasured> (73.2%)
    1.403 : 1 : ....label:checkout  
       0.000 : <unmeasured> ( 0.0%)
       1.403 : 1 : ......label:unpack_trees
          0.312 : <unmeasured> (22.3%)
          0.539 : 1 : ........label:traverse_trees
          0.401 : 1 : ........label:clear_ce_flags/0x00000000_0x02000000
          0.128 : 1 : ........label:check_updates
             0.128 : <unmeasured> (100.0%)
             0.000 : 1 : ..........label:Filtering content
          0.021 : 1 : ........label:clear_ce_flags/0x00080000_0x42000000
          0.000 : 1 : ........label:fully_valid
    1.081 : 1 : ....label:do_write_index /_git/office/src/.git/index.lock
       0.951 : <unmeasured> (88.1%)
       0.129 : 1 : ......label:write/extension/cache_tree
    0.432 : 2 : ....label:fully_valid
    0.143 : 1 : ....label:do_read_index .git/index
       0.019 : <unmeasured> (13.1%)
       0.125 : 1 : label:read/extension/cache_tree
    0.004 : 1 : ....label:incore_nonrecursive
       0.001 : <unmeasured> (25.8%)
       0.002 : 1 : ......label:process_entries
          0.000 : <unmeasured> ( 2.6%)
          0.001 : 1 : ........label:finalize
          0.001 : 1 : ........label:process_entries setup
             0.000 : <unmeasured> ( 6.7%)
             0.001 : 1 : ..........label:plist sort
             0.000 : 1 : ..........label:plist copy
             0.000 : 1 : ..........label:plist grow
          0.000 : 1 : ........label:processing
       0.001 : 1 : ......label:collect_merge_info
          0.000 : <unmeasured> (35.3%)
          0.001 : 1 : ........label:traverse_trees
       0.000 : 1 : ......label:merge_start
          0.000 : <unmeasured> (42.3%)
          0.000 : 1 : ........label:allocate/init
          0.000 : 1 : ........label:sanity checks
       0.000 : 1 : ......label:renames 
    0.001 : 1 : ....label:traverse_trees
    0.000 : 1 : ....label:write_auto_merge
    0.000 : 1 : ....label:check_updates
    0.000 : 1 : ....label:record_unmerged
Estimated measurement overhead (.010 ms/region-measure * 56): 0.0005600000000000001
Timing including forking: 11.442 (0.027 additional seconds)

$ /_git/git/summarize-perf git -c pull.twohead=ort rebase --onto from to test
Successfully rebased and updated refs/heads/test.
Accumulated times:
    8.337 : <unmeasured> (73.2%)
    1.395 : 1 : ....label:checkout  
       0.000 : <unmeasured> ( 0.0%)
       1.395 : 1 : ......label:unpack_trees
          0.309 : <unmeasured> (22.1%)
          0.537 : 1 : ........label:traverse_trees
          0.403 : 1 : ........label:clear_ce_flags/0x00000000_0x02000000
          0.124 : 1 : ........label:check_updates
             0.124 : <unmeasured> (100.0%)
             0.000 : 1 : ..........label:Filtering content
          0.021 : 1 : ........label:clear_ce_flags/0x00080000_0x42000000
          0.000 : 1 : ........label:fully_valid
    1.084 : 1 : ....label:do_write_index /_git/office/src/.git/index.lock
       0.955 : <unmeasured> (88.1%)
       0.129 : 1 : ......label:write/extension/cache_tree
    0.436 : 2 : ....label:fully_valid
    0.137 : 1 : ....label:do_read_index .git/index
       0.013 : <unmeasured> ( 9.3%)
       0.125 : 1 : label:read/extension/cache_tree
    0.004 : 1 : ....label:incore_nonrecursive
       0.001 : <unmeasured> (24.5%)
       0.002 : 1 : ......label:process_entries
          0.000 : <unmeasured> ( 2.5%)
          0.001 : 1 : ........label:finalize
          0.001 : 1 : ........label:process_entries setup
             0.000 : <unmeasured> ( 6.5%)
             0.001 : 1 : ..........label:plist sort
             0.000 : 1 : ..........label:plist copy
             0.000 : 1 : ..........label:plist grow
          0.000 : 1 : ........label:processing
       0.001 : 1 : ......label:collect_merge_info
          0.000 : <unmeasured> (26.5%)
          0.001 : 1 : ........label:traverse_trees
       0.000 : 1 : ......label:merge_start
          0.000 : <unmeasured> (43.1%)
          0.000 : 1 : ........label:allocate/init
          0.000 : 1 : ........label:sanity checks
       0.000 : 1 : ......label:renames 
    0.001 : 1 : ....label:traverse_trees
    0.000 : 1 : ....label:write_auto_merge
    0.000 : 1 : ....label:check_updates
    0.000 : 1 : ....label:record_unmerged
Estimated measurement overhead (.010 ms/region-measure * 56): 0.0005600000000000001
Timing including forking: 11.418 (0.024 additional seconds)

And here are timings for a simple merge. Two files at root were changed in the
commits I made, but there are also some larger changes from the commit history.
These should all be seen as "this tree updated in one of the two, so take that
tree".

$ git reset --hard test2 && /_git/git/summarize-perf git merge test -m test
Merge made by the 'recursive' strategy.
Accumulated times:
    2.647 : <unmeasured> (48.6%)
    1.384 : 1 : ..label:unpack_trees
       0.267 : <unmeasured> (19.3%)
       0.582 : 1 : ....label:traverse_trees
       0.391 : 1 : ....label:clear_ce_flags/0x00000000_0x02000000
       0.124 : 1 : ....label:check_updates
          0.124 : <unmeasured> (100.0%)
          0.000 : 1 : ......label:Filtering content
       0.021 : 1 : ....label:clear_ce_flags/0x00080000_0x42000000
       0.000 : 1 : ....label:fully_valid
    1.060 : 1 : ..label:do_write_index /_git/office/src/.git/index.lock
       0.931 : <unmeasured> (87.9%)
       0.128 : 1 : ....label:write/extension/cache_tree
    0.226 : 1 : ..label:fully_valid 
    0.134 : 1 : ..label:do_read_index .git/index
       0.008 : <unmeasured> ( 5.8%)
       0.126 : 1 : label:read/extension/cache_tree
    0.001 : 1 : ..label:traverse_trees
    0.000 : 1 : ..label:check_updates
    0.000 : 1 : ..label:setup       
    0.000 : 1 : ..label:write back to queue
Estimated measurement overhead (.010 ms/region-measure * 20): 0.0002
Timing including forking:  5.466 (0.015 additional seconds)

$ git reset --hard test2 && /_git/git/summarize-perf git -c pull.twohead=ort merge test -m test
Merge made by the 'ort' strategy.
Accumulated times:
    2.531 : <unmeasured> (49.1%)
    1.328 : 1 : ..label:checkout    
       0.000 : <unmeasured> ( 0.0%)
       1.328 : 1 : ....label:unpack_trees
          0.228 : <unmeasured> (17.2%)
          0.566 : 1 : ......label:traverse_trees
          0.388 : 1 : ......label:clear_ce_flags/0x00000000_0x02000000
          0.125 : 1 : ......label:check_updates
             0.125 : <unmeasured> (100.0%)
             0.000 : 1 : ........label:Filtering content
          0.021 : 1 : ......label:clear_ce_flags/0x00080000_0x42000000
          0.000 : 1 : ......label:fully_valid
    1.067 : 1 : ..label:do_write_index /_git/office/src/.git/index.lock
       0.938 : <unmeasured> (87.9%)
       0.129 : 1 : ....label:write/extension/cache_tree
    0.230 : 1 : ..label:fully_valid 
    0.002 : 1 : ..label:incore_recursive
       0.001 : <unmeasured> (22.3%)
       0.001 : 1 : ....label:collect_merge_info
          0.001 : <unmeasured> (60.2%)
          0.000 : 1 : ......label:traverse_trees
       0.001 : 1 : ....label:process_entries
          0.000 : <unmeasured> ( 2.8%)
          0.001 : 1 : ......label:finalize
          0.000 : 1 : ......label:process_entries setup
             0.000 : <unmeasured> ( 6.9%)
             0.000 : 1 : ........label:plist sort
             0.000 : 1 : ........label:plist copy
             0.000 : 1 : ........label:plist grow
          0.000 : 1 : ......label:processing
       0.000 : 1 : ....label:merge_start
          0.000 : <unmeasured> (50.0%)
          0.000 : 1 : ......label:allocate/init
          0.000 : 1 : ......label:sanity checks
       0.000 : 1 : ....label:renames   
    0.001 : 1 : ..label:traverse_trees
    0.000 : 1 : ..label:write_auto_merge
    0.000 : 1 : ..label:check_updates
    0.000 : 1 : ..label:setup       
    0.000 : 1 : ..label:display messages
    0.000 : 1 : ..label:write back to queue
    0.000 : 1 : ..label:record_unmerged
Estimated measurement overhead (.010 ms/region-measure * 36): 0.00036
Timing including forking:  5.174 (0.015 additional seconds)
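
As a rough cross-check of where the time goes, the top-level figures
from the ort merge run above can be turned into shares of the
accumulated total.  This is only arithmetic on the numbers already
shown; the dictionary and function name are made up for illustration.

```python
# Top-level region times (seconds) copied from the 'ort' merge run above.
ORT_MERGE_TOP_LEVEL = {
    "<unmeasured>": 2.531,
    "checkout": 1.328,
    "do_write_index": 1.067,
    "fully_valid": 0.230,
    "incore_recursive": 0.002,
    "traverse_trees": 0.001,
}


def shares(regions):
    """Return each region's percentage of the summed top-level time."""
    total = sum(regions.values())
    return {name: 100.0 * secs / total for name, secs in regions.items()}


for name, pct in shares(ORT_MERGE_TOP_LEVEL).items():
    print(f"{pct:6.2f}% : {name}")
```

The in-core merge (incore_recursive) accounts for well under 0.1% of
the accumulated time in this run.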

Thanks,
-Stolee

* Re: [PATCH v2 00/20] fundamentals of merge-ort implementation
  2020-11-09 12:30           ` Derrick Stolee
@ 2020-11-09 17:13             ` Elijah Newren
  2020-11-09 19:51               ` Derrick Stolee
  0 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-09 17:13 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List

Hi Derrick,

On Mon, Nov 9, 2020 at 4:30 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/7/2020 2:39 PM, Elijah Newren wrote:

> > *1*. Could you give me the accumulated times from the trace2_regions
> > so we can verify where the time is spent?  The 'summarize-perf' script
> > at the toplevel of the repo in my ort branch might be helpful for
> > this; just prefix any git command with that script and it accumulates
> > the trace2 region times and prints them out.  For example, I could run
> > 'summarize-perf git merge --no-edit B^0' or 'summarize-perf test-tool
> > fast-rebase --onto HEAD ca76bea9 myfeature'.  Here's an example:
> >
> > === BEGIN OUTPUT ===
> > $ /home/newren/floss/git/summarize-perf test-tool fast-rebase --onto HEAD 4703d9119972bf586d2cca76ec6438f819ffa30e hwmon-updates
> > Rebasing fd8bdb23b91876ac1e624337bb88dc1dcc21d67e...
> > Done.
> > Accumulated times:
> >     0.031 : <unmeasured> ( 3.2%)
> >     0.837 : 35 : label:incore_nonrecursive
> >        0.003 : <unmeasured> ( 0.4%)
> >        0.476 : 41 : ..label:collect_merge_info
> >           0.001 : <unmeasured> ( 0.2%)
> >           0.475 : 41 : ....label:traverse_trees
> >        0.298 : 41 : ..label:renames
> >           0.015 : <unmeasured> ( 5.1%)
> >           0.280 : 41 : ....label:regular renames
> >              0.036 : <unmeasured> (12.7%)
> >              0.244 : 6 : ......label:diffcore_rename
> >                 0.001 : <unmeasured> ( 0.4%)
> >                 0.078 : 6 : ........label:dir rename setup
> >                 0.055 : 6 : ........label:basename matches
> >                 0.051 : 6 : ........label:exact renames
> >                 0.031 : 6 : ........label:write back to queue
> >                 0.017 : 6 : ........label:setup
> >                 0.009 : 6 : ........label:cull basename
> >                 0.003 : 6 : ........label:cull exact
> >           0.002 : 35 : ....label:directory renames
> >           0.001 : 35 : ....label:process renames
> >        0.052 : 35 : ..label:process_entries
> >           0.001 : <unmeasured> ( 1.7%)
> >           0.033 : 35 : ....label:processing
> >           0.017 : 35 : ....label:process_entries setup
> >              0.001 : <unmeasured> ( 5.8%)
> >              0.008 : 35 : ......label:plist copy
> >              0.008 : 35 : ......label:plist sort
> >              0.000 : 35 : ......label:plist grow
> >           0.001 : 35 : ....label:finalize
> >        0.005 : 35 : ..label:merge_start
> >           0.001 : <unmeasured> (18.8%)
> >           0.004 : 34 : ....label:reset_maps
> >           0.000 : 35 : ....label:sanity checks
> >           0.000 : 1 : ....label:allocate/init
> >        0.003 : 6 : ..label:reset_maps
> >     0.035 : 1 : label:do_write_index /home/newren/floss/linux-stable/.git/index.lock
> >     0.034 : 1 : label:checkout
> >        0.034 : <unmeasured> (99.9%)
> >        0.000 : 1 : ..label:Filtering content
> >     0.009 : 1 : label:do_read_index .git/index
> >     0.000 : 1 : label:write_auto_merge
> >     0.000 : 1 : label:record_unmerged
> > Estimated measurement overhead (.010 ms/region-measure * 679): 0.006790000000000001
> > Timing including forking:  0.960 (0.013 additional seconds)
> > === END OUTPUT ===
> >
> > This was a run that took just under 1s (and was a hot-cache case; I
> > had just done the same rebase before to warm the caches), and the
> > combination of index/working tree bits (everything at and after
> > do_write_index in the output) was 0.035+0.034+0.009+0+0=0.078 seconds,
> > corresponding to just over 8.1% of overall time.  I'm curious where
> > that lands for your repository testcase; if the larger time ends up
> > somewhere under the indented label:incore_nonrecursive region, then
> > it's due to something other than index reading/updating/writing.
> >
> > *2*. If it really is due to index reading/updating/writing, then index
> > handling in merge-ort is confined to two functions: checkout() and
> > record_unmerged_index_entries().  Neither function is very long, and
> > neither one calls into any other function within merge-ort.c.
> > (Further, checkout() is a near copy of code from merge_working_tree()
> > in builtin/checkout.c, or at least a copy of that function from a year
> > or so ago.)  As such, it's possible you can go in and make whatever
> > special tweaks you have for partial index reading/writing to those
> > functions.
> >
> > I'm curious to hear back more on this.
>
> I don't have a lot of time to dig into this right now, but here are
> the stats for my rebases and merges with and without your option.

Actually, this was pretty enlightening.  I think I know about what's
happening...

First, a few years ago, Ben said that merges in the Microsoft repos
took about an hour[1]:
"For the repro that I have been using this drops the merge time from ~1 hour to
~5 minutes and the unmerged entries goes down from ~40,000 to 1."
The change he made to drop it that far was to turn off rename detection.

[1] https://lore.kernel.org/git/20180426205202.23056-1-benpeart@microsoft.com/

Keep that in mind, especially since your times are actually
significantly less than 5 minutes...

> The first thing I notice for each is that there is a significant
> amount of "unmeasured" time at the beginning of each, and that
> could possibly be improved separately.
>
> First, try a rebase forward and backward.
>
> $ /_git/git/summarize-perf git rebase --onto to from test
> Successfully rebased and updated refs/heads/test.
> Accumulated times:
>     8.511 : <unmeasured> (74.9%)

Wild guess: This is setup_git_directory() loading your ~3 million entry index.

>     1.331 : 1 : ......label:unpack_trees
>        0.200 : <unmeasured> (15.1%)
>        0.580 : 1 : ........label:traverse_trees
>        0.403 : 1 : ........label:clear_ce_flags/0x00000000_0x02000000
>        0.126 : 1 : ........label:check_updates
>           0.126 : <unmeasured> (100.0%)
>           0.000 : 1 : ..........label:Filtering content
>        0.021 : 1 : ........label:clear_ce_flags/0x00080000_0x42000000
>        0.000 : 1 : ........label:fully_valid
>     1.059 : 1 : ......label:do_write_index /_git/office/src/.git/index.lock
>        0.930 : <unmeasured> (87.9%)
>        0.128 : 1 : ........label:write/extension/cache_tree
>     0.455 : 2 : ......label:fully_valid
>     0.001 : 1 : ......label:traverse_trees
>     0.000 : 1 : ......label:check_updates
> Estimated measurement overhead (.010 ms/region-measure * 41): 0.00041000000000000005
> Timing including forking: 11.382 (0.026 additional seconds)
>
> $ /_git/git/summarize-perf git rebase --onto from to test
> Successfully rebased and updated refs/heads/test.
> Accumulated times:
>     8.556 : <unmeasured> (75.2%)
>     1.315 : 1 : ......label:unpack_trees
>        0.197 : <unmeasured> (15.0%)
>        0.580 : 1 : ........label:traverse_trees
>        0.391 : 1 : ........label:clear_ce_flags/0x00000000_0x02000000
>        0.126 : 1 : ........label:check_updates
>           0.126 : <unmeasured> (100.0%)
>           0.000 : 1 : ..........label:Filtering content
>        0.021 : 1 : ........label:clear_ce_flags/0x00080000_0x42000000
>        0.000 : 1 : ........label:fully_valid
>     1.071 : 1 : ......label:do_write_index /_git/office/src/.git/index.lock
>        0.942 : <unmeasured> (88.0%)
>        0.129 : 1 : ........label:write/extension/cache_tree
>     0.431 : 2 : ......label:fully_valid
>     0.001 : 1 : ......label:traverse_trees
>     0.000 : 1 : ......label:check_updates
> Estimated measurement overhead (.010 ms/region-measure * 41): 0.00041000000000000005
> Timing including forking: 11.399 (0.026 additional seconds)

Did you include two runs of recursive and two runs of ort just to show
that the timings were stable and thus there weren't warm or cold disk
cache issues affecting things?  If so, good plan.  (If there was
another reason, let me know; I missed it.)

> Then do the same with the ort strategy.
>
> $ /_git/git/summarize-perf git -c pull.twohead=ort rebase --onto to from test
> Successfully rebased and updated refs/heads/test.
> Accumulated times:
>     8.350 : <unmeasured> (73.2%)
>     1.403 : 1 : ....label:checkout
>        0.000 : <unmeasured> ( 0.0%)
>        1.403 : 1 : ......label:unpack_trees
>           0.312 : <unmeasured> (22.3%)
>           0.539 : 1 : ........label:traverse_trees
>           0.401 : 1 : ........label:clear_ce_flags/0x00000000_0x02000000
>           0.128 : 1 : ........label:check_updates
>              0.128 : <unmeasured> (100.0%)
>              0.000 : 1 : ..........label:Filtering content
>           0.021 : 1 : ........label:clear_ce_flags/0x00080000_0x42000000
>           0.000 : 1 : ........label:fully_valid
>     1.081 : 1 : ....label:do_write_index /_git/office/src/.git/index.lock
>        0.951 : <unmeasured> (88.1%)
>        0.129 : 1 : ......label:write/extension/cache_tree
>     0.432 : 2 : ....label:fully_valid
>     0.143 : 1 : ....label:do_read_index .git/index
>        0.019 : <unmeasured> (13.1%)
>        0.125 : 1 : label:read/extension/cache_tree
>     0.004 : 1 : ....label:incore_nonrecursive
>        0.001 : <unmeasured> (25.8%)
>        0.002 : 1 : ......label:process_entries
>           0.000 : <unmeasured> ( 2.6%)
>           0.001 : 1 : ........label:finalize
>           0.001 : 1 : ........label:process_entries setup
>              0.000 : <unmeasured> ( 6.7%)
>              0.001 : 1 : ..........label:plist sort
>              0.000 : 1 : ..........label:plist copy
>              0.000 : 1 : ..........label:plist grow
>           0.000 : 1 : ........label:processing
>        0.001 : 1 : ......label:collect_merge_info
>           0.000 : <unmeasured> (35.3%)
>           0.001 : 1 : ........label:traverse_trees
>        0.000 : 1 : ......label:merge_start
>           0.000 : <unmeasured> (42.3%)
>           0.000 : 1 : ........label:allocate/init
>           0.000 : 1 : ........label:sanity checks
>        0.000 : 1 : ......label:renames
>     0.001 : 1 : ....label:traverse_trees
>     0.000 : 1 : ....label:write_auto_merge
>     0.000 : 1 : ....label:check_updates
>     0.000 : 1 : ....label:record_unmerged
> Estimated measurement overhead (.010 ms/region-measure * 56): 0.0005600000000000001
> Timing including forking: 11.442 (0.027 additional seconds)

.004s on label:incore_nonrecursive -- that's the actual merge
operation.  This was a trivial rebase, and the merging took just 4
milliseconds.  But the overall run took 11.442 seconds because working
with 3M+ entries in the index just takes forever, and my code didn't
touch any on-disk formats, certainly not the index format.

_All_ of my optimization work was on the merging piece, not the stuff
outside.  But for what you're testing here, it appears to be
irrelevant compared to the overhead.
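
To put a number on that ratio, here is a minimal C sketch.  The helper
name region_share_pct is hypothetical (it is not part of git or of the
summarize-perf script); it only divides one region's time by the total
reported in the output above.

```c
/* Share of total wall-clock time spent in one region, in percent. */
static double region_share_pct(double region_secs, double total_secs)
{
	if (total_secs <= 0.0)
		return 0.0;	/* avoid dividing by zero on a bogus total */
	return 100.0 * region_secs / total_secs;
}
```

With the figures from this run, region_share_pct(0.004, 11.442) comes
out at roughly 0.035, i.e. the merge itself accounts for about 0.035%
of the wall-clock time.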

> $ /_git/git/summarize-perf git -c pull.twohead=ort rebase --onto from to test
> Successfully rebased and updated refs/heads/test.
> Accumulated times:
>     8.337 : <unmeasured> (73.2%)
>     1.395 : 1 : ....label:checkout
>        0.000 : <unmeasured> ( 0.0%)
>        1.395 : 1 : ......label:unpack_trees
>           0.309 : <unmeasured> (22.1%)
>           0.537 : 1 : ........label:traverse_trees
>           0.403 : 1 : ........label:clear_ce_flags/0x00000000_0x02000000
>           0.124 : 1 : ........label:check_updates
>              0.124 : <unmeasured> (100.0%)
>              0.000 : 1 : ..........label:Filtering content
>           0.021 : 1 : ........label:clear_ce_flags/0x00080000_0x42000000
>           0.000 : 1 : ........label:fully_valid
>     1.084 : 1 : ....label:do_write_index /_git/office/src/.git/index.lock
>        0.955 : <unmeasured> (88.1%)
>        0.129 : 1 : ......label:write/extension/cache_tree
>     0.436 : 2 : ....label:fully_valid
>     0.137 : 1 : ....label:do_read_index .git/index
>        0.013 : <unmeasured> ( 9.3%)
>        0.125 : 1 : label:read/extension/cache_tree
>     0.004 : 1 : ....label:incore_nonrecursive
>        0.001 : <unmeasured> (24.5%)
>        0.002 : 1 : ......label:process_entries
>           0.000 : <unmeasured> ( 2.5%)
>           0.001 : 1 : ........label:finalize
>           0.001 : 1 : ........label:process_entries setup
>              0.000 : <unmeasured> ( 6.5%)
>              0.001 : 1 : ..........label:plist sort
>              0.000 : 1 : ..........label:plist copy
>              0.000 : 1 : ..........label:plist grow
>           0.000 : 1 : ........label:processing
>        0.001 : 1 : ......label:collect_merge_info
>           0.000 : <unmeasured> (26.5%)
>           0.001 : 1 : ........label:traverse_trees
>        0.000 : 1 : ......label:merge_start
>           0.000 : <unmeasured> (43.1%)
>           0.000 : 1 : ........label:allocate/init
>           0.000 : 1 : ........label:sanity checks
>        0.000 : 1 : ......label:renames
>     0.001 : 1 : ....label:traverse_trees
>     0.000 : 1 : ....label:write_auto_merge
>     0.000 : 1 : ....label:check_updates
>     0.000 : 1 : ....label:record_unmerged
> Estimated measurement overhead (.010 ms/region-measure * 56): 0.0005600000000000001
> Timing including forking: 11.418 (0.024 additional seconds)

Ah, you included two copies for merge-ort too.  I'm guessing you did
that just to show there weren't any cold-cache issues or something and
that the runs showed consistent timings?


> And here are timings for a simple merge. Two files at root were changed in the
> commits I made, but there are also some larger changes from the commit history.
> These should all be seen as "this tree updated in one of the two, so take that
> tree".

Ahah!  That's a Microsoft-specific optimization you guys made in the
recursive strategy, yes?  It does NOT exist in upstream git.  It's
also one that is nearly incompatible with rename detection; it turns
out you can only do that optimization in the face of rename detection
if you do a HUGE amount of specialized work and tracking in order to
determine when it's safe _despite_ needing to detect renames.  I
thought that optimization was totally incompatible with rename
detection for a long time; I tried it a couple times while working on
ort and watched it break all kinds of rename tests...but I eventually
discovered some tricks involving a lot of work to be able to run that
optimization.

So, you aren't comparing upstream "recursive" to "ort", you're
comparing a tweaked version of recursive, and one that is incompatible
with how recursive's rename detection works.  In fact, just to be clear
in case you go looking, I suspect that this tweak is to be found
within unpack_trees.c (which recursive relies on heavily).

Further, you've set it up so there are only a few files changed after
unpack_trees returns.

In total, you have: (1) turned off rename detection (most of my
optimizations are for improving this factor, meaning I can't show an
advantage), (2) you took advantage of no rename detection to implement
trivial-tree merges (thus killing the main second advantage my
algorithm has), and (3) you are looking at a case with a tiny number
of changes for the merge algorithm to process (thus killing a third
optimization that removes quadratic performance).  Those are my three
big optimizations, and you've made them all irrelevant.  In fact,
you're in an area I would have been worried that ort would do _worse_
than recursive.  I track an awful lot of things and there is overhead
in checking and filling all that information in; if there are only a
few entries to merge, then all that information was a waste to collect
and ort might be slower than recursive.  But then again, that should
be a case where both algorithms are "nearly instantaneous" (or would
be if it weren't for your 3M+ index entry repo causing run_builtin()'s
call to setup_git_directory() in git.c to take a huge amount of time
before the builtin is even called).


> $ git reset --hard test2 && /_git/git/summarize-perf git merge test -m test
> Merge made by the 'recursive' strategy.
> Accumulated times:
>     2.647 : <unmeasured> (48.6%)
>     1.384 : 1 : ..label:unpack_trees
>        0.267 : <unmeasured> (19.3%)
>        0.582 : 1 : ....label:traverse_trees
>        0.391 : 1 : ....label:clear_ce_flags/0x00000000_0x02000000
>        0.124 : 1 : ....label:check_updates
>           0.124 : <unmeasured> (100.0%)
>           0.000 : 1 : ......label:Filtering content
>        0.021 : 1 : ....label:clear_ce_flags/0x00080000_0x42000000
>        0.000 : 1 : ....label:fully_valid
>     1.060 : 1 : ..label:do_write_index /_git/office/src/.git/index.lock
>        0.931 : <unmeasured> (87.9%)
>        0.128 : 1 : ....label:write/extension/cache_tree
>     0.226 : 1 : ..label:fully_valid
>     0.134 : 1 : ..label:do_read_index .git/index
>        0.008 : <unmeasured> ( 5.8%)
>        0.126 : 1 : label:read/extension/cache_tree
>     0.001 : 1 : ..label:traverse_trees
>     0.000 : 1 : ..label:check_updates
>     0.000 : 1 : ..label:setup
>     0.000 : 1 : ..label:write back to queue
> Estimated measurement overhead (.010 ms/region-measure * 20): 0.0002
> Timing including forking:  5.466 (0.015 additional seconds)

5 seconds.  I do have to hand it to Ben and anyone else involved,
though.  From 1 hour down to 5 seconds is pretty good, even if it was
done by hacks (turning off rename detection, and then implementing
trivial-tree merging that would have broken rename detection).  I
suspect that whoever did that work might have found the unconditional
discarding and re-reading of the index and fixed it as well?

> $ git reset --hard test2 && /_git/git/summarize-perf git -c pull.twohead=ort merge test -m test
> Merge made by the 'ort' strategy.
> Accumulated times:
>     2.531 : <unmeasured> (49.1%)
>     1.328 : 1 : ..label:checkout
>        0.000 : <unmeasured> ( 0.0%)
>        1.328 : 1 : ....label:unpack_trees
>           0.228 : <unmeasured> (17.2%)
>           0.566 : 1 : ......label:traverse_trees
>           0.388 : 1 : ......label:clear_ce_flags/0x00000000_0x02000000
>           0.125 : 1 : ......label:check_updates
>              0.125 : <unmeasured> (100.0%)
>              0.000 : 1 : ........label:Filtering content
>           0.021 : 1 : ......label:clear_ce_flags/0x00080000_0x42000000
>           0.000 : 1 : ......label:fully_valid
>     1.067 : 1 : ..label:do_write_index /_git/office/src/.git/index.lock
>        0.938 : <unmeasured> (87.9%)
>        0.129 : 1 : ....label:write/extension/cache_tree
>     0.230 : 1 : ..label:fully_valid
>     0.002 : 1 : ..label:incore_recursive
>        0.001 : <unmeasured> (22.3%)
>        0.001 : 1 : ....label:collect_merge_info
>           0.001 : <unmeasured> (60.2%)
>           0.000 : 1 : ......label:traverse_trees
>        0.001 : 1 : ....label:process_entries
>           0.000 : <unmeasured> ( 2.8%)
>           0.001 : 1 : ......label:finalize
>           0.000 : 1 : ......label:process_entries setup
>              0.000 : <unmeasured> ( 6.9%)
>              0.000 : 1 : ........label:plist sort
>              0.000 : 1 : ........label:plist copy
>              0.000 : 1 : ........label:plist grow
>           0.000 : 1 : ......label:processing
>        0.000 : 1 : ....label:merge_start
>           0.000 : <unmeasured> (50.0%)
>           0.000 : 1 : ......label:allocate/init
>           0.000 : 1 : ......label:sanity checks
>        0.000 : 1 : ....label:renames
>     0.001 : 1 : ..label:traverse_trees
>     0.000 : 1 : ..label:write_auto_merge
>     0.000 : 1 : ..label:check_updates
>     0.000 : 1 : ..label:setup
>     0.000 : 1 : ..label:display messages
>     0.000 : 1 : ..label:write back to queue
>     0.000 : 1 : ..label:record_unmerged
> Estimated measurement overhead (.010 ms/region-measure * 36): 0.00036
> Timing including forking:  5.174 (0.015 additional seconds)

Heh, yeah 0.002 seconds for ..label:incore_recursive.  Only 2
milliseconds to create the actual merge tree.  That does suggest you
might have fun with 'git log -p --remerge-diff'; if you can redo
merges in 2 milliseconds, showing them in git log output is very
reasonable.  :-)


Could we have some fun, though?  What if you have some merge or rebase
involving lots of changes, and you turn rename detection back on, and
you disable that trivial-tree resolution optimization that breaks
recursive's rename detection handling...and then compare recursive and
ort?  (It might be easiest to just compare upstream recursive rather
than the one with all the Microsoft changes to make sure you undid
whatever trivial tree handling work exists.)

For example, my testcase in the linux kernel was finding a series of a
few dozen patches I could rebase back to an older version, but
tweaking the "older" version by renaming drivers/ -> pilots/ (with
about 26K files under that directory, that meant about 26K renames).
So, I got to see rebasing of dozens of real changes across a massive
rename boundary -- and the massive rename boundary also guaranteed
there were lots of entries for the merge algorithm to deal with.


In the end, though, 4 milliseconds for the rebase and 2 milliseconds
for the merge, with the rest all being overhead of interfacing to the
index and working tree actually seems pretty good to me.  I'm just
curious if we can check how things work for more involved cases.

* Re: [PATCH v2 00/20] fundamentals of merge-ort implementation
  2020-11-09 17:13             ` Elijah Newren
@ 2020-11-09 19:51               ` Derrick Stolee
  2020-11-09 22:44                 ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Derrick Stolee @ 2020-11-09 19:51 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List, git

On 11/9/20 12:13 PM, Elijah Newren wrote:
> Actually, this was pretty enlightening.  I think I know about what's
> happening...
> 
> First, a few years ago, Ben said that merges in the Microsoft repos
> took about an hour[1]:
> "For the repro that I have been using this drops the merge time from ~1 hour to
> ~5 minutes and the unmerged entries goes down from ~40,000 to 1."
> The change he made to drop it that far was to turn off rename detection.
> 
> [1] https://lore.kernel.org/git/20180426205202.23056-1-benpeart@microsoft.com/
> 
> Keep that in mind, especially since your times are actually
> significantly less than 5 minutes...

Yes, the other thing to keep in mind is that this is
a Scalar repo with the default cone-mode sparse-checkout
of only the files at root. For this repo, that means that
there are only ~10 files actually present.

I wanted to remove as many working directory updates/checks
from the performance check as possible.

>> $ /_git/git/summarize-perf git rebase --onto to from test
>> Successfully rebased and updated refs/heads/test.
>> Accumulated times:
>>     8.511 : <unmeasured> (74.9%)
> 
> Wild guess: This is setup_git_directory() loading your ~3 million entry index.

I think there is also some commit walking happening, but
it shouldn't be too much. 'from' and 'to' are not very
far away.

> Did you include two runs of recursive and two runs of ort just to show
> that the timings were stable and thus there weren't warm or cold disk
> cache issues affecting things?  If so, good plan.  (If there was
> another reason, let me know; I missed it.)

For the rebase, I did "--onto to from test" and "--onto from to test"
to show both directions of the rebase. The merge I did twice for the
cache issues ;)

> .004s on label:incore_nonrecursive -- that's the actual merge
> operation.  This was a trivial rebase, and the merging took just 4
> milliseconds.  But the overall run took 11.442 seconds because working
> with 3M+ entries in the index just takes forever, and my code didn't
> touch any on-disk formats, certainly not the index format.
> 
> _All_ of my optimization work was on the merging piece, not the stuff
> outside.  But for what you're testing here, it appears to be
> irrelevant compared to the overhead.

OK, so since we already disable rename detection through config,
the machinery that you are changing is already fast with the old
algorithm in these trivial cases.

To actually show any benefits, we would need to re-enable rename
detection or use a larger change.

>> And here are timings for a simple merge. Two files at root were changed in the
>> commits I made, but there are also some larger changes from the commit history.
>> These should all be seen as "this tree updated in one of the two, so take that
>> tree".
> 
> Ahah!  That's a Microsoft-specific optimization you guys made in the
> recursive strategy, yes? 

I'm not aware of any logic we have that's different from core Git.
The config we use [1] includes "merge.stat = false" and "merge.renames
= false" but otherwise seems to be using stock Git.

[1] https://github.com/microsoft/scalar/blob/1d7938d2df6921f7a3b4f3f1cce56a00929adc40/Scalar.Common/Maintenance/ConfigStep.cs#L100-L127

I'm CC'ing Jeff Hostetler to see if he knows anything about a custom
merge algorithm in microsoft/git.

> It does NOT exist in upstream git.  It's
> also one that is nearly incompatible with rename detection; it turns
> out you can only do that optimization in the face of rename detection
> if you do a HUGE amount of specialized work and tracking in order to
> determine when it's safe _despite_ needing to detect renames. 

Perhaps merge.renames=false is enough to trigger this logic already?

> I
> thought that optimization was totally incompatible with rename
> detection for a long time; I tried it a couple times while working on
> ort and watched it break all kinds of rename tests...but I eventually
> discovered some tricks involving a lot of work to be able to run that
> optimization.

I will try to keep this in mind.

> So, you aren't comparing upstream "recursive" to "ort", you're
> comparing a tweaked version of recursive, and one that is incompatible
> with how recursive's rename detection works.  In fact, just to be clear
> in case you go looking, I suspect that this tweak is to be found
> within unpack_trees.c (which recursive relies on heavily).
> 
> Further, you've set it up so there are only a few files changed after
> unpack_trees returns.
> 
> In total, you have: (1) turned off rename detection (most of my
> optimizations are for improving this factor, meaning I can't show an
> advantage), (2) you took advantage of no rename detection to implement
> trivial-tree merges (thus killing the main second advantage my
> algorithm has), and (3) you are looking at a case with a tiny number
> of changes for the merge algorithm to process (thus killing a third
> optimization that removes quadratic performance).  Those are my three
> big optimizations, and you've made them all irrelevant.  In fact,
> you're in an area I would have been worried that ort would do _worse_
> than recursive.  I track an awful lot of things and there is overhead
> in checking and filling all that information in; if there are only a
> few entries to merge, then all that information was a waste to collect
> and ort might be slower than recursive.  But then again, that should
> be a case where both algorithms are "nearly instantaneous" (or would
> be if it weren't for your 3M+ index entry repo causing run_builtin()'s
> call to setup_git_directory() in git.c to take a huge amount of time
> before the builtin is even called).

Thanks for your time isolating this case. I appreciate knowing exactly
which portions of the merge algorithm are being touched and which are
not.

> 5 seconds.  I do have to hand it to Ben and anyone else involved,
> though.  From 1 hour down to 5 seconds is pretty good, even if it was
> done by hacks (turning off rename detection, and then implementing
> trivial-tree merging that would have broken rename detection).  I
> suspect that whoever did that work might have found the unconditional
> discarding and re-reading of the index and fixed it as well?

As you can probably tell from my general confusion, I had nothing
to do with it. ;)

> Heh, yeah 0.002 seconds for ..label:incore_recursive.  Only 2
> milliseconds to create the actual merge tree.  That does suggest you
> might have fun with 'git log -p --remerge-diff'; if you can redo
> merges in 2 milliseconds, showing them in git log output is very
> reasonable.  :-)

Yeah, 'git merge-tree' is very fast for these cases, so I assumed
that something else was going on for that command.

> Could we have some fun, though?  What if you have some merge or rebase
> involving lots of changes, and you turn rename detection back on, and
> you disable that trivial-tree resolution optimization that breaks
> recursive's rename detection handling...and then compare recursive and
> ort?  (It might be easiest to just compare upstream recursive rather
> than the one with all the Microsoft changes to make sure you undid
> whatever trivial tree handling work exists.)

I can try these kinds of cases, but it won't be today. I'm on kid duty
today, and answering emails in between running around with them.

> For example, my testcase in the linux kernel was finding a series of a
> few dozen patches I could rebase back to an older version, but
> tweaking the "older" version by renaming drivers/ -> pilots/ (with
> about 26K files under that directory, that meant about 26K renames).
> So, I got to see rebasing of dozens of real changes across a massive
> rename boundary -- and the massive rename boundary also guaranteed
> there were lots of entries for the merge algorithm to deal with.
> 
> In the end, though, 4 milliseconds for the rebase and 2 milliseconds
> for the merge, with the rest all being overhead of interfacing to the
> index and working tree actually seems pretty good to me.  I'm just
> curious if we can check how things work for more involved cases.

I'm definitely interested in identifying how your algorithm improves
over the previous cases, and perhaps re-enabling rename detection for
merges is enough of a benefit to justify the new one.

Eventually, I hope to actually engage with your patches in the form
of review. Just trying to build a mental model for what's going on
first.

Thanks,
-Stolee

* Re: [PATCH v2 01/20] merge-ort: setup basic internal data structures
  2020-11-06 22:45     ` Elijah Newren
@ 2020-11-09 20:55       ` Jonathan Tan
  0 siblings, 0 replies; 84+ messages in thread
From: Jonathan Tan @ 2020-11-09 20:55 UTC (permalink / raw)
  To: newren; +Cc: jonathantanmy, git

> > There seem to be two ways of referring to something that we couldn't
> > merge - "conflicted" (or "having a conflict") and "unmerged". Should we
> > stick to one of them?
> 
> Uhm...perhaps, but it feels like I'm going to miss some while looking
> over it.  Also, there are some semantic differences at play.  Since
> the whole algorithm is divided into multiple stages --
> collect_merge_info(), detect_and_process_renames(), process_entries() --
> as of a given early stage we might just know that we couldn't merge it
> *yet*.  For example, both sides modified the entry, or one side
> modified and the other side is missing ("did they delete it or rename
> it?").  After rename detection and three-way content merge, something
> that had not been automatically mergeable as of an earlier step might
> end up being so.  But we need names for stuff in the interim state.
> But it's also possible for us to know at an early stage that things are
> definitely going to be a conflict regardless of what later stages do
> (e.g. both sides rename a path, but rename it differently).

In that case, maybe "possibly conflicted" and "unconflicted"? Or maybe
someone else will come up with a better name.

> > Also, looking ahead, I see that current_dir_name is used as a temporary
> > variable in the recursive calls to collect_merge_info_callback(). I
> > would prefer if current_dir_name went in the cbdata to that function
> > instead, but if that's not possible, maybe document here that
> > current_dir_name is only used in collect_merge_info_callback(), and
> > temporarily at that.
> 
> Yeah, not possible.  collect_merge_info_callback() has to be of
> traverse_callback_t type (from tree-walk.h), which provides no extra
> parameters for extra callback data.  I can add a documentation
> comment.

But traverse_callback_t provides a data field, which you use to store a
struct merge_options in patch 6 - in theory, you could instead create
another struct that contains a pointer to struct merge_options and store
that in data instead. That does sound like it will make the code more
complicated, though, so maybe current_dir_name here is the way.
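
A rough sketch of that wrapper-struct idea, in case it helps the
comparison.  All type and function names below are stand-ins invented
for illustration; the real struct traverse_info and traverse_callback_t
in tree-walk.h have more members and parameters, and this is not what
the patch does.

```c
/* Stand-in for git's struct merge_options; only its identity matters here. */
struct merge_options {
	int verbosity;
};

/* Simplified stand-in for struct traverse_info: just the opaque data slot. */
struct traverse_info_sketch {
	void *data;
};

/* Hypothetical wrapper bundling the options with per-traversal state. */
struct collect_cbdata {
	struct merge_options *opt;
	const char *current_dir_name;
};

/* What a callback would do first: recover the wrapper from info->data. */
static struct merge_options *callback_opt(const struct traverse_info_sketch *info)
{
	const struct collect_cbdata *cb = info->data;

	return cb->opt;
}

static const char *callback_dir_name(const struct traverse_info_sketch *info)
{
	const struct collect_cbdata *cb = info->data;

	return cb->current_dir_name;
}
```

The cost is the extra indirection on every access to the options, which
is the added complexity mentioned above.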

> > I wonder if this needs to be documented that the least significant bit
> > corresponds to stages[0], and so forth.
> 
> Maybe I should just put a comment to look at tree-walk.h?  The struct
> traverse_info has a "fn" member with a big comment above it describing
> mask & dirmask; filemask is just mask & ~dirmask.

That makes sense. I understood a lot more once I saw that it was
iterating over a variable number of trees - hence the arrays.
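
For anyone else reading along, a small sketch of how I understand the
bit layout.  The struct and helper names are hypothetical; the tree
order (bit 0 = merge base, bit 1 = side1, bit 2 = side2) is taken from
the !(mask & 1), !(mask & 2), !(mask & 4) tests in patch 6, and the
filemask expression is the one quoted above.

```c
/* Bit i of each mask refers to tree i: 0 = merge base, 1 = side1, 2 = side2. */
struct path_masks {
	unsigned long mask;	/* trees in which the path exists at all */
	unsigned long dirmask;	/* trees in which the path is a directory */
};

/* Trees in which the path is a file: mask & ~dirmask. */
static unsigned long filemask_of(struct path_masks m)
{
	return m.mask & ~m.dirmask;
}

/* 1 if the path is missing from tree i, in the style of !(mask & 1). */
static int missing_from_tree(struct path_masks m, int i)
{
	return !(m.mask & (1UL << i));
}
```

For example, a file present in the merge base and side2 but missing
from side1 would have mask 5 and dirmask 0, so filemask_of() gives 5
and missing_from_tree(m, 1) gives 1.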

* Re: [PATCH v2 06/20] merge-ort: implement a very basic collect_merge_info()
  2020-11-06 23:10     ` Elijah Newren
@ 2020-11-09 20:59       ` Jonathan Tan
  0 siblings, 0 replies; 84+ messages in thread
From: Jonathan Tan @ 2020-11-09 20:59 UTC (permalink / raw)
  To: newren; +Cc: jonathantanmy, git

> > > +     unsigned mbase_null = !(mask & 1);
> > > +     unsigned side1_null = !(mask & 2);
> > > +     unsigned side2_null = !(mask & 4);
> >
> > Should these be "int"?
> 
> Does the type matter, particularly since "boolean" isn't available?

It doesn't, which is why I would expect the most generic type - if I see
something else, I would be led to think that there was a specific reason
for choosing that. But if I'm in the minority, that's fine.

> > I thought that this was written like this (instead of inlining the 2
> > double-quotes) to ensure that the string-equality-is-pointer-equality
> > characteristic holds, but I see that that characteristic is for
> > directory_name in struct merged_info, not current_dir_name in struct
> > merge_options_internal. Any reason for not inlining ""?
> 
> You're really digging in; I love it.  From setup_path_info(), the
> directory_name is set from the current_dir_name:
>         path_info->merged.directory_name = current_dir_name;
> (and if you follow where the current_dir_name parameter gets its value
> from, you find that it came indirectly from
> opt->priv->current_dir_name), so current_dir_name must meet all the
> requirements on merged_info's directory_name field.
> 
> Perhaps there's still some kind of additional simplification possible
> here, but directory rename detection is an area that has to take some
> special care around this requirement.  I simplified the code a little
> bit in this area as I was trying to break off a good first 20 patches
> to submit, but even if we can simplify it more, the structure is just
> going to come back later.

Ah, I see.
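
For readers unfamiliar with the string-equality-is-pointer-equality
property: one generic way to get it is to intern strings, sketched
below.  This is not a claim about how merge-ort obtains it (the thread
only says that directory_name is copied from current_dir_name);
intern_dir_name and the fixed-size table are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_INTERNED 64

static const char *interned[MAX_INTERNED];
static size_t nr_interned;

/*
 * Return the canonical pointer for name, storing a copy on first use.
 * Returns NULL if the table is full or allocation fails.
 */
static const char *intern_dir_name(const char *name)
{
	size_t i, len;
	char *copy;

	for (i = 0; i < nr_interned; i++)
		if (!strcmp(interned[i], name))
			return interned[i];

	if (nr_interned == MAX_INTERNED)
		return NULL;

	len = strlen(name);
	copy = malloc(len + 1);
	if (!copy)
		return NULL;
	memcpy(copy, name, len + 1);	/* includes the trailing NUL */
	interned[nr_interned++] = copy;
	return copy;
}
```

Once every directory name goes through such a function, comparing two
names is a pointer comparison.  Whether two separate "" literals share
an address is left to the compiler, which is presumably why a single
named empty string is safer than inlining "".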

* Re: [PATCH v2 08/20] merge-ort: compute a few more useful fields for collect_merge_info
  2020-11-06 23:41     ` Elijah Newren
@ 2020-11-09 22:04       ` Jonathan Tan
  2020-11-09 23:05         ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Jonathan Tan @ 2020-11-09 22:04 UTC (permalink / raw)
  To: newren; +Cc: jonathantanmy, git

> On Fri, Nov 6, 2020 at 2:52 PM Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > > +     /*
> > > +      * Note: We only label files with df_conflict, not directories.
> > > +      * Since directories stay where they are, and files move out of the
> > > +      * way to make room for a directory, we don't care if there was a
> > > +      * directory/file conflict for a parent directory of the current path.
> > > +      */
> > > +     unsigned df_conflict = (filemask != 0) && (dirmask != 0);
> >
> > Suppose you have:
> >
> >  [ours]
> >   foo/
> >     bar/
> >       baz
> >     quux
> >  [theirs]
> >   foo
> >
> > By "we only label files with df_conflict, not directories", are you
> > referring to not labelling "foo/" in [ours], or to "bar/", "baz", and
> > "quux" (so, the files and directories within a directory)? At first I
> > thought you were referring to the former, but perhaps you are referring
> > to the latter.
> 
> The former.  I was drawing a distinction between how this code
> operates, and how unpack_trees() operates, which probably only matters
> to those familiar with unpack_trees() or who have been reading through
> it recently.

Just for clarification: do you mean "the latter"? (The "not" in my
question might be confusing.)

To be more illustrative in what I meant, at first I thought that you
were NOT labelling "foo/" in [ours], hence:

 [ours]
  foo/  <- unlabeled
 [theirs]
  foo   <- labeled

In this way, in a sense, you are indeed labelling only the file, not the
directory.

But instead what you seem to be doing is this:

 [ours]
  foo/     <- labeled
    bar/   <- unlabeled
      baz  <- unlabeled
    quux   <- unlabeled
 [theirs]
  foo      <- labeled

which is what I meant by NOT labelling "bar/", "baz", and "quux".

> unpack_trees() will note when there is a directory/file
> conflict, and propagates that information to all subtrees, with every
> path specially checking for the o->df_conflict_entry and then handling
> it specially (e.g. keeping higher order stages instead of using an
> aggressive or trivial resolution).

And here it seems like you're describing that unpack_trees() would label
it in this way:

 [ours]
  foo/     <- labeled
    bar/   <- labeled
      baz  <- labeled
    quux   <- labeled
 [theirs]
  foo      <- labeled

(and you're emphasizing by contrast that merge-ort is NOT doing this).

> However, leaving both a file and
> a directory at the same path, while allowed in the index, makes for
> ugliness and difficulty for users to resolve.   Plus it isn't allowed
> in the working tree anyway.  We decided a while ago that it'd be
> better to represent these conflicts differently[1], [2].
> 
> Also, at the time you are unpacking or traversing trees, you only know
> if one side had a directory where the other side had a file.  You
> don't know if the final merge result will actually have a
> directory/file conflict.  If the file existed in both the base version
> and unmodified on one side, for example, then the file will be removed
> as part of the merge.  It is similarly possible that the entire
> directory of files all need to be deleted or are all renamed
> elsewhere.  So, you have to keep track of a df_conflict bit, but you
> can't act on it until you've processed several other things first.
> 
> Since I already know I'm not going to move a whole directory of files
> out of the way so that a file can be placed in the working tree
> instead of that whole directory, the directory doesn't need to be
> tweaked.  I'm not going to propagate any information about a
> directory/file conflict at some path down to all subpaths of the
> directory.  I only track it for the file that immediately conflicts,
> and then only take action on it after resolving all the paths under
> the corresponding directory to see if the directory/file conflict
> remains.
> 
> [1] https://lore.kernel.org/git/xmqqbmabcuhf.fsf@gitster-ct.c.googlers.com/
> and the thread surrounding it
> [2] https://lore.kernel.org/git/f27f12e8e50e56c010c29caa00296475d4de205b.1603731704.git.gitgitgadget@gmail.com/,
> which is now commit ef52778708 ("merge tests: expect improved
> directory/file conflict handling in ort", 2020-10-26)

Makes sense.

> > > @@ -161,6 +179,13 @@ static int collect_merge_info_callback(int n,
> > >               newinfo.name = p->path;
> > >               newinfo.namelen = p->pathlen;
> > >               newinfo.pathlen = st_add3(newinfo.pathlen, p->pathlen, 1);
> > > +             /*
> > > +              * If we did care about parent directories having a D/F
> > > +              * conflict, then we'd include
> > > +              *    newinfo.df_conflicts |= (mask & ~dirmask);
> > > +              * here.  But we don't.  (See comment near setting of local
> > > +              * df_conflict variable near the beginning of this function).
> > > +              */
> >
> > I'm not sure how "mask" and "dirmask" contains information about parent
> > directories. "mask" represents the available entries, and "dirmask"
> > represents which of them are directories, as far as I know. So we can
> > notice when something is missing, but I don't see how this distinguishes
> > between the case that something is missing because it was in a parent
> > directory that got deleted, vs something is missing because it itself
> > got deleted.
> 
> Yeah, this is more comparisons to unpack_trees.  This code is about to
> set up a recursive call into subdirectories.  newinfo is set based on
> the mask and dirmask of the current entry, and then subdirectories can
> consult newinfo.df_conflicts to see if that path is within a directory
> that was involved in a directory/file conflict.  For example:
> 
> Tree in base version:
>     foo/
>         bar
>     stuff.txt
> Tree on side 1: (adds foo/baz)
>     foo/
>         bar
>         baz
>     stuff.txt
> Tree on side 2: (deletes foo/, adds new file foo)
>    foo
>    stuff.txt
> 
> When processing 'foo', we have mask=7, dirmask = 3.  So, here
> unpack_trees() would have set newinfo.df_conflicts = (mask & ~dirmask)
> = 4.  Then when we process foo/bar or foo/baz, we have mask=2,
> dirmask=0, which looks like there are no directory/file conflicts.
> However, we can note that these paths are under a directory involved
> in a directory/file conflict via info.df_conflicts whose value is 4.
> unpack_trees() cared about paths under a directory that was involved
> in a directory/file conflict, and someone familiar with that code
> might ask why I don't also track the same information.  The answer is
> that I don't track it, even though I thought about it, because it's
> useless overhead since I'm going to leave the directory alone and move
> the file out of the way.
> 
> Does that make sense?

Ah...yes, that makes sense. I think I didn't notice the "newinfo", so I
didn't realize that we were setting the info of our children based on
ourselves. Perhaps I would have noticed it sooner if the comment had
read "If this file/directory cared about its parent directory (the
current directory) having a D/F conflict, then we'd propagate the masks
in this way:" instead of "If we did care about parent directories having
a D/F conflict", but perhaps the point is already obvious enough.
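
For readers following the mask arithmetic, here is a small sketch of
the bit manipulation discussed above. The struct and function names are
hypothetical, and the bit assignment (bit 0 for the base, bit 1 for
side 1, bit 2 for side 2) is inferred from the mask=7, dirmask=3
example rather than copied from the patch.

```c
#include <assert.h>

/* One bit per tree: bit 0 = merge base, bit 1 = side 1, bit 2 = side 2. */
struct entry_masks {
	unsigned long mask;    /* trees in which the path exists */
	unsigned long dirmask; /* subset of mask where the path is a directory */
};

/* Trees in which the path exists as something other than a directory. */
unsigned long filemask_of(struct entry_masks m)
{
	return m.mask & ~m.dirmask;
}

/* The path itself is a file on one side and a directory on another. */
int has_df_conflict(struct entry_masks m)
{
	return filemask_of(m) != 0 && m.dirmask != 0;
}

/*
 * What a traversal that did care about parent directories would hand
 * down to its children (the assignment that the quoted comment says
 * merge-ort deliberately leaves out).
 */
unsigned long propagated_df_conflicts(unsigned long parent_df_conflicts,
				      struct entry_masks m)
{
	return parent_df_conflicts | (m.mask & ~m.dirmask);
}
```

For 'foo' (mask=7, dirmask=3) the file mask is 4 and has_df_conflict()
is true. For foo/baz (mask=2, dirmask=0) has_df_conflict() is false, and
only the value 4 handed down from 'foo' would reveal that the parent was
involved in a conflict.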

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path
  2020-11-07  0:26     ` Elijah Newren
@ 2020-11-09 22:09       ` Jonathan Tan
  2020-11-09 23:08         ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Jonathan Tan @ 2020-11-09 22:09 UTC (permalink / raw)
  To: newren; +Cc: jonathantanmy, git

> > So these are placed in paths but not unmerged. I'm starting to wonder if
> > struct merge_options_internal should be called merge_options_state or
> > something, and each field having documentation about when they're used
> > (or better yet, have functions like collect_merge_info() return their
> > calculations in return values (which may be "out" parameters) instead of
> > in this struct).
> 
> Right, unmerged is only those paths that remain unmerged after all
> steps.  record_unmerged_index_entries() could simply walk over all
> entries in paths and pick out the ones that were unmerged, but
> process_entries() has to walk over all paths, determine whether they
> can be merged, and determine what to record for the resulting tree for
> each path.  So, having it stash away the unmerged stuff is a simple
> optimization.
> 
> Renaming to merge_options_state or even just merge_state would be fine
> -- but any renaming done here will also affect merge-recursive.[ch].
> See the definition of merge_options in merge-recursive.  (For history,
> merge-recursive.h stuffed state into merge_options, which risked funny
> misusage patterns and made the API unnecessarily complex...and made it
> suggest that alternative algorithms needed to have the same state.
> So, the state was moved to a merge_options_internal struct.  That's
> not to say we can't rename, but it does need to be done in
> merge-recursive as well.)

Ah, I see.
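
The "stash away the unmerged stuff" optimization can be pictured with
the following sketch. It is not the code from the series: struct
path_entry, resolve_trivially() and process_entries_sketch() are
hypothetical, plain ints stand in for object ids, and the resolution
rule is only the textbook trivial three-way merge. The point is that the
single pass that resolves every path also records the paths that stay
unmerged, so a later step does not have to scan all paths again.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for one path; ints play the role of object ids. */
struct path_entry {
	const char *path;
	int base, side1, side2;
	int result; /* merged value, meaningful only when clean is set */
	int clean;
};

/* Trivial three-way merge: return 1 and set *out on a clean result. */
int resolve_trivially(int base, int side1, int side2, int *out)
{
	assert(out != NULL);
	if (side1 == side2) {
		*out = side1;
		return 1;
	}
	if (base == side1) {
		*out = side2;
		return 1;
	}
	if (base == side2) {
		*out = side1;
		return 1;
	}
	return 0;
}

/*
 * Resolve every path once; remember the ones that stay unmerged.
 * Returns the number of pointers stored in `unmerged`.
 */
size_t process_entries_sketch(struct path_entry *paths, size_t nr,
			      struct path_entry **unmerged,
			      size_t unmerged_alloc)
{
	size_t i, nr_unmerged = 0;

	for (i = 0; i < nr; i++) {
		paths[i].clean = resolve_trivially(paths[i].base,
						   paths[i].side1,
						   paths[i].side2,
						   &paths[i].result);
		if (!paths[i].clean && nr_unmerged < unmerged_alloc)
			unmerged[nr_unmerged++] = &paths[i];
	}
	return nr_unmerged;
}
```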

> As for having collect_merge_info() return their calculations in return
> values, would that just end with me returning a struct
> merge_options_internal?  Or did you want each return value added to
> the function signature?  Each return value in the function signature
> makes sense right now for this super-simplified initial 20 patches,
> but what about when this data structure gains all kinds of
> rename-related state that is collected, updated, and passed between
> these areas?  I'd have a huge number of "out" and "in" fields to every
> function.  Eventually, merge_options_internal (or whatever it might be
> renamed to) expands to the following, where I have to first define an
> extra enum and two extra structs so that you know the definitions of
> new types that show up in merge_options_internal:

[snip enums and structs]

Good point. I should have realized that there would be much more to
track.

> > > +     result->string = fullpath;
> > > +     result->util = path_info;
> > > +}
> > > +
> > >  static int collect_merge_info_callback(int n,
> > >                                      unsigned long mask,
> > >                                      unsigned long dirmask,
> > > @@ -91,10 +136,12 @@ static int collect_merge_info_callback(int n,
> > >        */
> > >       struct merge_options *opt = info->data;
> > >       struct merge_options_internal *opti = opt->priv;
> > > -     struct conflict_info *ci;
> > > +     struct string_list_item pi;  /* Path Info */
> > > +     struct conflict_info *ci; /* pi.util when there's a conflict */
> >
> > Looking ahead to patch 10, this seems more like "pi.util unless we know
> > for sure that there's no conflict".
> 
> That's too long for the line to remain at 80 characters; it's 16
> characters over the limit.  ;-)

Well, you could move the description onto its own line :-)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/20] fundamentals of merge-ort implementation
  2020-11-09 19:51               ` Derrick Stolee
@ 2020-11-09 22:44                 ` Elijah Newren
  0 siblings, 0 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-09 22:44 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List, Jeff Hostetler

Hi Derrick,

On Mon, Nov 9, 2020 at 11:51 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/9/20 12:13 PM, Elijah Newren wrote:
> > Actually, this was pretty enlightening.  I think I know about what's
> > happening...
> >
> > First, a few years ago, Ben said that merges in the Microsoft repos
> > took about an hour[1]:
> > "For the repro that I have been using this drops the merge time from ~1 hour to
> > ~5 minutes and the unmerged entries goes down from ~40,000 to 1."
> > The change he made to drop it that far was to turn off rename detection.
> >
> > [1] https://lore.kernel.org/git/20180426205202.23056-1-benpeart@microsoft.com/
> >
> > Keep that in mind, especially since your times are actually
> > significantly less than 5 minutes...
>
> Yes, the other thing to keep in mind is that this is
> a Scalar repo with the default cone-mode sparse-checkout
> of only the files at root. For this repo, that means that
> there are only ~10 files actually present.
>
> I wanted to remove as many working directory updates/checks
> from the performance check as possible.

Ah, that explains how you got under 20s.  I remember elsewhere on the
list someone (I think it was Ben again) mentioned that a "git checkout
-b <newbranch>" took 20s, despite no need to update the working tree
or index.

I have only done one cursory test of merge-ort with sparse-checkouts;
I should do more.  There might be a bug somewhere, though it does at
least pass the regression tests and I think for the most part it's
actually better: there are cases where merge-recursive will vivify
files outside the sparse-checkout which were not conflicted (see e.g.
https://lore.kernel.org/git/xmqqbmb1a7ga.fsf@gitster-ct.c.googlers.com/);
in contrast, merge-ort shouldn't have any such cases -- it'll only add
files to the working copy if they match the sparsity patterns or the
path has conflicts.

> >> $ /_git/git/summarize-perf git rebase --onto to from test
> >> Successfully rebased and updated refs/heads/test.
> >> Accumulated times:
> >>     8.511 : <unmeasured> (74.9%)
> >
> > Wild guess: This is setup_git_directory() loading your ~3 million entry index.
>
> I think there is also some commit walking happening, but
> it shouldn't be too much. 'from' and 'to' are not very
> far away.

Makes sense.  I suspect that with your commit-graphs this ends up
being fast enough that you might have difficulty even measuring it,
though.

> > Did you include two runs of recursive and two runs of ort just to show
> > that the timings were stable and thus there weren't warm or cold disk
> > cache issues affecting things?  If so, good plan.  (If there was
> > another reason, let me know; I missed it.)
>
> For the rebase, I did "--onto to from test" and "--onto from to test"
> to show both directions of the rebase. The merge I did twice for the
> cache issues ;)

Oh, good call.  Thanks for pointing it out, I missed that on first reading.

> > .004s on label:incore_nonrecursive -- that's the actual merge
> > operation.  This was a trivial rebase, and the merging took just 4
> > milliseconds.  But the overall run took 11.442 seconds because working
> > with 3M+ entries in the index just takes forever, and my code didn't
> > touch any on-disk formats, certainly not the index format.
> >
> > _All_ of my optimization work was on the merging piece, not the stuff
> > outside.  But for what you're testing here, it appears to be
> > irrelevant compared to the overhead.
>
> OK, so since we already disable rename detection through config,
> the machinery that you are changing is already fast with the old
> algorithm in these trivial cases.
>
> To actually show any benefits, we would need to disable rename
> detection or use a larger change.

...or both.  :-)

> >> And here are timings for a simple merge. Two files at root were changed in the
> >> commits I made, but there are also some larger changes from the commit history.
> >> These should all be seen as "this tree updated in one of the two, so take that
> >> tree".
> >
> > Ahah!  That's a microsoft-specific optimization you guys made in the
> > recursive strategy, yes?
>
> I'm not aware of any logic we have that's different from core Git.
> The config we use [1] includes "merge.stat = false" and "merge.renames
> = false" but otherwise seems to be using stock Git.
>
> [1] https://github.com/microsoft/scalar/blob/1d7938d2df6921f7a3b4f3f1cce56a00929adc40/Scalar.Common/Maintenance/ConfigStep.cs#L100-L127
>
> I'm CC'ing Jeff Hostetler to see if he knows anything about a custom
> merge algorithm in microsoft/git.

Oh, I took your wording that 'These should all be seen as "this tree
updated in one of the two, so take that tree"' as an implication that
you had a special merge tweak and wanted to verify it didn't regress.
I think I read too much into your wording.

Also, thinking over it more, I remember now that Ben also turned on
unpack_opts.aggressive when rename detection was turned off -- see
commit 6f10a09e0a ("merge: pass aggressive when rename detection is
turned off", 2018-05-02).  That isn't quite as advantageous as doing a
trivial tree merge, but if the algorithm that does the trivial tree
merge has to end up updating a complete index later anyway via the
checkout logic of unpack_trees, then the differences are basically a
wash.

> > It does NOT exist in upstream git.  It's
> > also one that is nearly incompatible with rename detection; it turns
> > out you can only do that optimization in the face of rename detection
> > if you do a HUGE amount of specialized work and tracking in order to
> > determine when it's safe _despite_ needing to detect renames.
>
> Perhaps merge.renames=false is enough to trigger this logic already?

Yeah.  I read too much into what you wrote, and now that I remember the
"if (no_renames) o.aggressive = 1" bit, this would indeed be enough.

> > I
> > thought that optimization was totally incompatible with rename
> > detection for a long time; I tried it a couple times while working on
> > ort and watched it break all kinds of rename tests...but I eventually
> > discovered some tricks involving a lot of work to be able to run that
> > optimization.
>
> I will try to keep this in mind.
>
> > So, you aren't comparing upstream "recursive" to "ort", you're
> > comparing a tweaked version of recursive, and one that is incompatible
> > with how recursive's rename detection works.  In fact, just to be clear
> > in case you go looking, I suspect that this tweak is to be found
> > within unpack_trees.c (which recursive relies on heavily).
> >
> > Further, you've set it up so there are only a few files changed after
> > unpack_trees returns.
> >
> > In total, you have: (1) turned off rename detection (most of my
> > optimizations are for improving this factor, meaning I can't show an
> > advantage), (2) you took advantage of no rename detection to implement
> > trivial-tree merges (thus killing the main second advantage my
> > algorithm has), and (3) you are looking at a case with a tiny number
> > of changes for the merge algorithm to process (thus killing a third
> > optimization that removes quadratic performance).  Those are my three
> > big optimizations, and you've made them all irrelevant.  In fact,
> > you're in an area I would have been worried that ort would do _worse_
> > than recursive.  I track an awful lot of things and there is overhead
> > in checking and filling all that information in; if there are only a
> > few entries to merge, then all that information was a waste to collect
> > and ort might be slower than recursive.  But then again, that should
> > be a case where both algorithms are "nearly instantaneous" (or would
> > be if it weren't for your 3M+ index entry repo causing run_builtin()'s
> > call to setup_git_directory() in git.c to take a huge amount of time
> > before the builtin is even called.)
>
> Thanks for your time isolating this case. I appreciate knowing exactly
> which portions of the merge algorithm are being touched and which are
> not.
>
> > 5 seconds.  I do have to hand it to Ben and anyone else involved,
> > though.  From 1 hour down to 5 seconds is pretty good, even if it was
> > done by hacks (turning off rename detection, and then implementing
> > trivial-tree merging that would have broken rename detection).  I
> > suspect that whoever did that work might have found the unconditional
> > discarding and re-reading of the index and fixed it as well?
>
> As you can probably tell from my general confusion, I had nothing
> to do with it. ;)
>
> > Heh, yeah 0.002 seconds for ..label:incore_recursive.  Only 2
> > milliseconds to create the actual merge tree.  That does suggest you
> > might have fun with 'git log -p --remerge-diff'; if you can redo
> > merges in 2 milliseconds, showing them in git log output is very
> > reasonable.  :-)
>
> Yeah, 'git merge-tree' is very fast for these cases, so I assumed
> that something else was going on for that command.

Oh, interesting.  I forgot about merge-tree.  Maybe I should make a
version based on merge-ort (and then it'd handle rename detection too,
something it doesn't currently do)?  However, that wouldn't be
comparing merge algorithms, because builtin/merge-tree.c doesn't use
merge-recursive.[ch].  (It would be easy to get confused into thinking
it does, since merge-recursive.[ch] defines a function called
merge_trees(), but builtin/merge-tree.c doesn't use it despite the
name similarity.)

> > Could we have some fun, though?  What if you have some merge or rebase
> > involving lots of changes, and you turn rename detection back on, and
> > you disable that trivial-tree resolution optimization that breaks
> > recursive's rename detection handling...and then compare recursive and
> > ort?  (It might be easiest to just compare upstream recursive rather
> > than the one with all the microsoft changes to make sure you undid
> > whatever trivial tree handling work exists.)
>
> I can try these kinds of cases, but it won't be today. I'm on kid duty
> today, and answering emails in between running around with them.

One word of caution: merge.renameLimit may get in your way.  The
default of 1000 means that you're likely to hit that limit on your
first run, and get a warning message like the following printed out:

warning: inexact rename detection was skipped due to too many files.
warning: you may want to set your merge.renamelimit variable to at
least 27328 and retry the command.

You then need to undo your rebase or merge, bump the limit, and
re-run.  Also, you will need a higher limit for merge-recursive than
you do for merge-ort.  The default of 1000 is enough for merge-ort to
detect all the renames in my 26K-files-in-a-directory rename testcase
of the linux kernel, but the value needs to be bumped to 27328 for
merge-recursive.  And if you don't have the limit high enough, then
one algorithm is doing the work to detect renames and the other is
bailing and skipping it, so it's not an apples-to-apples comparison.
If that warning doesn't appear for either backend, then you have an
apples-to-apples comparison.
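
Why the limit matters can be shown with a rough sketch. It assumes that
the limit bounds the rename-candidate matrix (sources times
destinations) to at most a limit-by-limit square; that is a
simplification for illustration, and rename_matrix_too_large() is a
hypothetical name, not a function from Git.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed rule: inexact rename detection is skipped when the matrix of
 * candidate sources and destinations is larger than limit * limit.
 */
int rename_matrix_too_large(uint64_t num_sources,
			    uint64_t num_destinations,
			    uint64_t rename_limit)
{
	assert(rename_limit > 0);
	return num_sources * num_destinations > rename_limit * rename_limit;
}
```

Under that assumption, 27328 sources against 27328 destinations is too
large for a limit of 1000 but fits once the limit is raised to 27328. A
backend that feeds fewer candidates into the matrix stays under a lower
limit, which is why the two backends can need different settings.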

> > For example, my testcase in the linux kernel was finding a series of a
> > few dozen patches I could rebase back to an older version, but
> > tweaking the "older" version by renaming drivers/ -> pilots/ (with
> > about 26K files under that directory, that meant about 26K renames).
> > So, I got to see rebasing of dozens of real changes across a massive
> > rename boundary -- and the massive rename boundary also guaranteed
> > there were lots of entries for the merge algorithm to deal with.
> >
> > In the end, though, 4 milliseconds for the rebase and 2 milliseconds
> > for the merge, with the rest all being overhead of interfacing to the
> > index and working tree actually seems pretty good to me.  I'm just
> > curious if we can check how things work for more involved cases.
>
> I'm definitely interested in identifying how your algorithm improves
> over the previous cases, and perhaps re-enabling rename detection for
> merges is enough of a benefit to justify the new one.
>
> Eventually, I hope to actually engage with your patches in the form
> of review. Just trying to build a mental model for what's going on
> first.

Ooh, I can help with that; here's what's going on:  *** Magic ***

(Black, evil magic in the case of merge-recursive.  Good magic in the
case of merge-ort.)

Glad I could help clear things up for you.  :-)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 08/20] merge-ort: compute a few more useful fields for collect_merge_info
  2020-11-09 22:04       ` Jonathan Tan
@ 2020-11-09 23:05         ` Elijah Newren
  0 siblings, 0 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-09 23:05 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Git Mailing List

On Mon, Nov 9, 2020 at 2:04 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > On Fri, Nov 6, 2020 at 2:52 PM Jonathan Tan <jonathantanmy@google.com> wrote:
> > >
> > > > +     /*
> > > > +      * Note: We only label files with df_conflict, not directories.
> > > > +      * Since directories stay where they are, and files move out of the
> > > > +      * way to make room for a directory, we don't care if there was a
> > > > +      * directory/file conflict for a parent directory of the current path.
> > > > +      */
> > > > +     unsigned df_conflict = (filemask != 0) && (dirmask != 0);
> > >
> > > Suppose you have:
> > >
> > >  [ours]
> > >   foo/
> > >     bar/
> > >       baz
> > >     quux
> > >  [theirs]
> > >   foo
> > >
> > > By "we only label files with df_conflict, not directories", are you
> > > referring to not labelling "foo/" in [ours], or to "bar/", "baz", and
> > > "quux" (so, the files and directories within a directory)? At first I
> > > thought you were referring to the former, but perhaps you are referring
> > > to the latter.
> >
> > The former.  I was drawing a distinction between how this code
> > operates, and how unpack_trees() operates, which probably only matters
> > to those familiar with unpack_trees() or who have been reading through
> > it recently.
>
> Just for clarification: do you mean "the latter"? (The "not" in my
> question might be confusing.)

Yeah, that probably was confusing, so let me just point out below where
you are almost right.

> To be more illustrative in what I meant, at first I thought that you
> were NOT labelling "foo/" in [ours], hence:
>
>  [ours]
>   foo/  <- unlabeled
>  [theirs]
>   foo   <- labeled
>
> In this way, in a sense, you are indeed labelling only the file, not the
> directory.
>
> But instead what you seem to be doing is this:
>
>  [ours]
>   foo/     <- labeled
>     bar/   <- unlabeled
>       baz  <- unlabeled
>     quux   <- unlabeled
>  [theirs]
>   foo      <- labeled
>
> which is what I meant by NOT labelling "bar/", "baz", and "quux".

I'm doing something /really/ close to this, yes.  However, just to be
pedantic, there is no "foo/".  '/' is an illegal character in a
filename to record in a tree.  One side has a "foo" whose mode and
object_id happen to reflect a tree rather than a blob.  But I only
have one conflict_info per pathname, not 3 (can't have three since
strmaps don't allow duplicate keys, and wouldn't want it if I could).
That one conflict_info stores 3 (mode, object_id) pairs, and also has
a single df_conflict bit.  So, I label "foo" by setting that
df_conflict bit.  But I only pay attention to it for the pairs
representing a blob, not the ones representing a tree.  And I don't
propagate the information down to paths below the foo directory.
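
As a rough illustration of that layout (hypothetical names throughout;
the real struct conflict_info in the series has more fields), one
record per pathname could look like the sketch below: three (mode,
object id) pairs, one df_conflict bit, and a helper that consults the
bit only for the pairs that represent a blob.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_IFMT  0170000 /* file-type bits of a tree entry mode */
#define SKETCH_IFDIR 0040000 /* mode of a tree (directory) entry */

struct version_sketch {
	unsigned int mode; /* 0 means the path is absent on this side */
	const char *oid;   /* stand-in for an object id */
};

struct conflict_info_sketch {
	struct version_sketch stages[3]; /* base, side 1, side 2 */
	unsigned df_conflict:1;          /* one bit for the whole path */
};

/* A stage counts as a file if it is present and is not a tree. */
int stage_is_file(const struct version_sketch *v)
{
	return v->mode != 0 && (v->mode & SKETCH_IFMT) != SKETCH_IFDIR;
}

/*
 * Bitmask of the stages that hold a file at a path with a
 * directory/file conflict; stages holding a tree are left alone.
 */
unsigned files_affected_by_df_conflict(const struct conflict_info_sketch *ci)
{
	unsigned result = 0;
	int i;

	assert(ci != NULL);
	if (!ci->df_conflict)
		return 0;
	for (i = 0; i < 3; i++)
		if (stage_is_file(&ci->stages[i]))
			result |= 1u << i;
	return result;
}
```

For "foo" in the example (a tree in the base and on side 1, a blob on
side 2), only the side 2 bit comes back set, and nothing is recorded for
the paths underneath the directory.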

> > unpack_trees() will note when there is a directory/file
> > conflict, and propagates that information to all subtrees, with every
> > path specially checking for the o->df_conflict_entry and then handling
> > it specially (e.g. keeping higher order stages instead of using an
> > aggressive or trivial resolution).
>
> And here it seems like you're describing that unpack_trees() would label
> it in this way:
>
>  [ours]
>   foo/     <- labeled
>     bar/   <- labeled
>       baz  <- labeled
>     quux   <- labeled
>  [theirs]
>   foo      <- labeled
>
> (and you're emphasizing by contrast that merge-ort is NOT doing this).

Correct.

> > However, leaving both a file and
> > a directory at the same path, while allowed in the index, makes for
> > ugliness and difficulty for users to resolve.   Plus it isn't allowed
> > in the working tree anyway.  We decided a while ago that it'd be
> > better to represent these conflicts differently[1], [2].
> >
> > Also, at the time you are unpacking or traversing trees, you only know
> > if one side had a directory where the other side had a file.  You
> > don't know if the final merge result will actually have a
> > directory/file conflict.  If the file existed in both the base version
> > and unmodified on one side, for example, then the file will be removed
> > as part of the merge.  It is similarly possible that the entire
> > directory of files all need to be deleted or are all renamed
> > elsewhere.  So, you have to keep track of a df_conflict bit, but you
> > can't act on it until you've processed several other things first.
> >
> > Since I already know I'm not going to move a whole directory of files
> > out of the way so that a file can be placed in the working tree
> > instead of that whole directory, the directory doesn't need to be
> > tweaked.  I'm not going to propagate any information about a
> > directory/file conflict at some path down to all subpaths of the
> > directory.  I only track it for the file that immediately conflicts,
> > and then only take action on it after resolving all the paths under
> > the corresponding directory to see if the directory/file conflict
> > remains.
> >
> > [1] https://lore.kernel.org/git/xmqqbmabcuhf.fsf@gitster-ct.c.googlers.com/
> > and the thread surrounding it
> > [2] https://lore.kernel.org/git/f27f12e8e50e56c010c29caa00296475d4de205b.1603731704.git.gitgitgadget@gmail.com/,
> > which is now commit ef52778708 ("merge tests: expect improved
> > directory/file conflict handling in ort", 2020-10-26)
>
> Makes sense.
>
> > > > @@ -161,6 +179,13 @@ static int collect_merge_info_callback(int n,
> > > >               newinfo.name = p->path;
> > > >               newinfo.namelen = p->pathlen;
> > > >               newinfo.pathlen = st_add3(newinfo.pathlen, p->pathlen, 1);
> > > > +             /*
> > > > +              * If we did care about parent directories having a D/F
> > > > +              * conflict, then we'd include
> > > > +              *    newinfo.df_conflicts |= (mask & ~dirmask);
> > > > +              * here.  But we don't.  (See comment near setting of local
> > > > +              * df_conflict variable near the beginning of this function).
> > > > +              */
> > >
> > > I'm not sure how "mask" and "dirmask" contains information about parent
> > > directories. "mask" represents the available entries, and "dirmask"
> > > represents which of them are directories, as far as I know. So we can
> > > notice when something is missing, but I don't see how this distinguishes
> > > between the case that something is missing because it was in a parent
> > > directory that got deleted, vs something is missing because it itself
> > > got deleted.
> >
> > Yeah, this is more comparisons to unpack_trees.  This code is about to
> > set up a recursive call into subdirectories.  newinfo is set based on
> > the mask and dirmask of the current entry, and then subdirectories can
> > consult newinfo.df_conflicts to see if that path is within a directory
> > that was involved in a directory/file conflict.  For example:
> >
> > Tree in base version:
> >     foo/
> >         bar
> >     stuff.txt
> > Tree on side 1: (adds foo/baz)
> >     foo/
> >         bar
> >         baz
> >     stuff.txt
> > Tree on side 2: (deletes foo/, adds new file foo)
> >    foo
> >    stuff.txt
> >
> > When processing 'foo', we have mask=7, dirmask = 3.  So, here
> > unpack_trees() would have set newinfo.df_conflicts = (mask & ~dirmask)
> > = 4.  Then when we process foo/bar or foo/baz, we have mask=2,
> > dirmask=0, which looks like there are no directory/file conflicts.
> > However, we can note that these paths are under a directory involved
> > in a directory/file conflict via info.df_conflicts whose value is 4.
> > unpack_trees() cared about paths under a directory that was involved
> > in a directory/file conflict, and someone familiar with that code
> > might ask why I don't also track the same information.  The answer is
> > that I don't track it, even though I thought about it, because it's
> > useless overhead since I'm going to leave the directory alone and move
> > the file out of the way.
> >
> > Does that make sense?
>
> Ah...yes, that makes sense. I think I didn't notice the "newinfo", so I
> didn't realize that we were setting the info of our children based on
> ourselves. Perhaps I would have noticed it sooner if the comment had
> read "If this file/directory cared about its parent directory (the
> current directory) having a D/F conflict, then we'd propagate the masks
> in this way:" instead of "If we did care about parent directories having
> a D/F conflict", but perhaps the point is already obvious enough.

I'm happy to reword it if that makes it clearer.  Thanks for the suggestion.
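
For anyone following along, the mask arithmetic from the example above
can be checked with a small standalone sketch.  This is not code from
unpack-trees.c or merge-ort.c: the helper names filemask_of() and
is_df_conflict() are made up for illustration, and I'm assuming bit 0
is the merge base, bit 1 is side 1, and bit 2 is side 2, as in the
callback.

```c
#include <assert.h>

/* Assumed bit layout: bit 0 = merge base, bit 1 = side 1, bit 2 = side 2. */

/* Hypothetical helper: sides where the entry exists but is not a directory. */
static unsigned long filemask_of(unsigned long mask, unsigned long dirmask)
{
	return mask & ~dirmask;
}

/*
 * Hypothetical helper: an entry is itself a D/F conflict if it is a file
 * on some side and a directory on another.
 */
static int is_df_conflict(unsigned long mask, unsigned long dirmask)
{
	return filemask_of(mask, dirmask) != 0 && dirmask != 0;
}

static void df_conflicts_example(void)
{
	/* 'foo': on all three sides (mask=7), a directory on base and side 1 */
	unsigned long inherited = filemask_of(7, 3);

	assert(inherited == 4);
	assert(is_df_conflict(7, 3));

	/* 'foo/baz': only on side 1 (mask=2), a plain file (dirmask=0) */
	assert(!is_df_conflict(2, 0));

	/* ...yet the value handed down from 'foo' still flags side 2 */
	assert(inherited & 4);
}
```

So looking only at the masks of foo/baz says "no conflict here"; only
the value inherited from the parent remembers that side 2 has a file
where the directory used to be.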

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path
  2020-11-09 22:09       ` Jonathan Tan
@ 2020-11-09 23:08         ` Elijah Newren
  0 siblings, 0 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-09 23:08 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Git Mailing List

On Mon, Nov 9, 2020 at 2:09 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > > So these are placed in paths but not unmerged. I'm starting to wonder if
> > > struct merge_options_internal should be called merge_options_state or
> > > something, and each field having documentation about when they're used
> > > (or better yet, have functions like collect_merge_info() return their
> > > calculations in return values (which may be "out" parameters) instead of
> > > in this struct).
> >
> > Right, unmerged is only those paths that remain unmerged after all
> > steps.  record_unmerged_index_entries() could simply walk over all
> > entries in paths and pick out the ones that were unmerged, but
> > process_entries() has to walk over all paths, determine whether they
> > can be merged, and determine what to record for the resulting tree for
> > each path.  So, having it stash away the unmerged stuff is a simple
> > optimization.
> >
> > Renaming to merge_options_state or even just merge_state would be fine
> > -- but any renaming done here will also affect merge-recursive.[ch].
> > See the definition of merge_options in merge-recursive.  (For history,
> > merge-recursive.h stuffed state into merge_options, which risked funny
> > misusage patterns and made the API unnecessarily complex...and made it
> > suggest that alternative algorithms needed to have the same state.
> > So, the state was moved to a merge_options_internal struct.  That's
> > not to say we can't rename, but it does need to be done in
> > merge-recursive as well.)
>
> Ah, I see.
>
> > As for having collect_merge_info() return their calculations in return
> > values, would that just end with me returning a struct
> > merge_options_internal?  Or did you want each return value added to
> > the function signature?  Each return value in the function signature
> > makes sense right now for this super-simplified initial 20 patches,
> > but what about when this data structure gains all kind of
> > rename-related state that is collected, updated, and passed between
> > these areas?  I'd have a huge number of "out" and "in" fields to every
> > function.  Eventually, merge_options_internal (or whatever it might be
> > renamed to) expands to the following, where I have to first define an
> > extra enum and two extra structs so that you know the definitions of
> > new types that show up in merge_options_internal:
>
> [snip enums and structs]
>
> Good point. I should have realized that there would be much more to
> track.
>
> > > > +     result->string = fullpath;
> > > > +     result->util = path_info;
> > > > +}
> > > > +
> > > >  static int collect_merge_info_callback(int n,
> > > >                                      unsigned long mask,
> > > >                                      unsigned long dirmask,
> > > > @@ -91,10 +136,12 @@ static int collect_merge_info_callback(int n,
> > > >        */
> > > >       struct merge_options *opt = info->data;
> > > >       struct merge_options_internal *opti = opt->priv;
> > > > -     struct conflict_info *ci;
> > > > +     struct string_list_item pi;  /* Path Info */
> > > > +     struct conflict_info *ci; /* pi.util when there's a conflict */
> > >
> > > Looking ahead to patch 10, this seems more like "pi.util unless we know
> > > for sure that there's no conflict".
> >
> > That's too long for the line to remain at 80 characters; it's 16
> > characters over the limit.  ;-)
>
> Well, you could move the description onto its own line :-)

:-)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 03/20] merge-ort: port merge_start() from merge-recursive
  2020-11-02 20:43 ` [PATCH v2 03/20] merge-ort: port merge_start() from merge-recursive Elijah Newren
@ 2020-11-11 13:52   ` Derrick Stolee
  2020-11-11 16:22     ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Derrick Stolee @ 2020-11-11 13:52 UTC (permalink / raw)
  To: Elijah Newren, git

On 11/2/2020 3:43 PM, Elijah Newren wrote:
> merge_start() basically does a bunch of sanity checks, then allocates
> and initializes opt->priv -- a struct merge_options_internal.
> 
> Most the sanity checks are usable as-is.  The allocation/intialization

s/Most the/Most of the/

> The weirdest part here is that merge-ort and merge-recursive use the
> same struct merge_options, even though merge_options has a number of
> fields that are oddly specific to merge-recursive's internal
> implementation and don't even make sense with merge-ort's high-level
> design (e.g. buffer_output, which merge-ort has to always do).  I reused
> the same data structure because:
>   * most the fields made sense to both merge algorithms
>   * making a new struct would have required making new enums or somehow
>     externalizing them, and that was getting messy.
>   * it simplifies converting the existing callers by not having to
>     have different code paths for merge_options setup.

I think this is appropriate. The other option would be to split the
struct into "common options" and "specific options" but that starts
to get messy if we add yet another merge strategy that changes what
should be "common". Hopefully we can group options within the struct
merge_options definition to assist with this?

For now, the assertions are a good approach.

> I also marked detect_renames as ignored.  We can revisit that later, but
> in short: merge-recursive allowed turning off rename detection because
> it was sometimes glacially slow.  When you speed something up by a few
> orders of magnitude, it's worth revisiting whether that justification is
> still relevant.  Besides, if folks find it's still too slow, perhaps
> they have a better scaling case than I could find and maybe it turns up
> some more optimizations we can add.  If it still is needed as an option,
> it is easy to add later.

As long as it is easy to add later, I don't see much of a problem. Usually
adding a knob to disable a feature is necessary to mitigate risk, and here
we can simply adjust config to use the non-ort algorithm if we notice a data
shape where rename detection makes the algorithm slow/unusable.

>  static void merge_start(struct merge_options *opt, struct merge_result *result)
>  {
> -	die("Not yet implemented.");
> +	/* Sanity checks on opt */
> +	assert(opt->repo);
> +
> +	assert(opt->branch1 && opt->branch2);
> +
> +	assert(opt->detect_directory_renames >= MERGE_DIRECTORY_RENAMES_NONE &&
> +	       opt->detect_directory_renames <= MERGE_DIRECTORY_RENAMES_TRUE);
> +	assert(opt->rename_limit >= -1);
> +	assert(opt->rename_score >= 0 && opt->rename_score <= MAX_SCORE);
> +	assert(opt->show_rename_progress >= 0 && opt->show_rename_progress <= 1);
> +
> +	assert(opt->xdl_opts >= 0);
> +	assert(opt->recursive_variant >= MERGE_VARIANT_NORMAL &&
> +	       opt->recursive_variant <= MERGE_VARIANT_THEIRS);
> +
> +	/*
> +	 * detect_renames, verbosity, buffer_output, and obuf are ignored
> +	 * fields that were used by "recursive" rather than "ort" -- but
> +	 * sanity check them anyway.
> +	 */
> +	assert(opt->detect_renames >= -1 &&
> +	       opt->detect_renames <= DIFF_DETECT_COPY);
> +	assert(opt->verbosity >= 0 && opt->verbosity <= 5);
> +	assert(opt->buffer_output <= 2);
> +	assert(opt->obuf.len == 0);
> +
> +	assert(opt->priv == NULL);
> +
> +	/* Initialization of opt->priv, our internal merge data */
> +	opt->priv = xcalloc(1, sizeof(*opt->priv));

nit: I would insert an empty line between this code and the
multi-line comment below.

> +	/*
> +	 * Although we initialize opt->priv->paths with strdup_strings=0,
> +	 * that's just to avoid making yet another copy of an allocated
> +	 * string.  Putting the entry into paths means we are taking
> +	 * ownership, so we will later free it.
> +	 *
> +	 * In contrast, unmerged just has a subset of keys from paths, so
> +	 * we don't want to free those (it'd be a duplicate free).
> +	 */
> +	strmap_init_with_options(&opt->priv->paths, NULL, 0);
> +	strmap_init_with_options(&opt->priv->unmerged, NULL, 0);
>  }

This approach looks fine to me.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 04/20] merge-ort: use histogram diff
  2020-11-02 20:43 ` [PATCH v2 04/20] merge-ort: use histogram diff Elijah Newren
@ 2020-11-11 13:54   ` Derrick Stolee
  2020-11-11 16:47     ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Derrick Stolee @ 2020-11-11 13:54 UTC (permalink / raw)
  To: Elijah Newren, git

On 11/2/2020 3:43 PM, Elijah Newren wrote:
> I have some ideas for using a histogram diff to improve content merges,
> which fundamentally relies on the idea of a histogram.  Since the diffs
> are never displayed to the user but just used internally for merging,
> the typical user preference shouldn't matter anyway, and I want to make
> sure that all my testing works with this algorithm.
> 
> Granted, I don't yet know if those ideas will pan out and I haven't even
> tried any of them out yet, but it's easy to change the diff algorithm in
> the future if needed or wanted.  For now, just set it to histogram.

If you are not making use of the histogram yet, then could you set this
patch aside until you _do_ use it? Or are there performance implications
that are also a side benefit?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/20] merge-ort: add an err() function similar to one from merge-recursive
  2020-11-02 20:43 ` [PATCH v2 05/20] merge-ort: add an err() function similar to one from merge-recursive Elijah Newren
@ 2020-11-11 13:58   ` Derrick Stolee
  2020-11-11 17:07     ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Derrick Stolee @ 2020-11-11 13:58 UTC (permalink / raw)
  To: Elijah Newren, git; +Cc: Jeff Hostetler

On 11/2/2020 3:43 PM, Elijah Newren wrote:
> Various places in merge-recursive used an err() function when it hit
> some kind of unrecoverable error.  That code was from the reusable bits
> of merge-recursive.c that we liked, such as merge_3way, writing object
> files to the object store, reading blobs from the object store, etc.  So
> create a similar function to allow us to port that code over, and use it
> for when we detect problems returned from collect_merge_info()'s
> traverse_trees() call, which we will be adding next.
> 
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  merge-ort.c | 27 ++++++++++++++++++++++++++-
>  1 file changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/merge-ort.c b/merge-ort.c
> index df97a54773..537da9f6df 100644
> --- a/merge-ort.c
> +++ b/merge-ort.c
> @@ -61,11 +61,28 @@ struct conflict_info {
>  	unsigned match_mask:3;
>  };
>  
> +static int err(struct merge_options *opt, const char *err, ...)
> +{
> +	va_list params;
> +	struct strbuf sb = STRBUF_INIT;
> +
> +	strbuf_addstr(&sb, "error: ");
> +	va_start(params, err);
> +	strbuf_vaddf(&sb, err, params);
> +	va_end(params);
> +
> +	error("%s", sb.buf);
> +	strbuf_release(&sb);
> +
> +	return -1;
> +}
> +

This seems simple enough to have a duplicate copy lying
around. Do you anticipate that all common code will live
in the same file eventually? Or will these two static err()
methods always be duplicated?
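
To make the question concrete, here is a rough sketch of what a single
shared helper could look like if it were written in plain C without
strbuf or error().  Everything here is an assumption for illustration:
report_err() is a made-up name, and it formats into a caller-supplied
buffer only so the result is easy to inspect.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical shared helper: formats "error: <message>" into buf and
 * always returns -1, so callers can write "return report_err(...);".
 */
static int report_err(char *buf, size_t size, const char *fmt, ...)
{
	va_list params;
	int n;

	n = snprintf(buf, size, "error: ");
	if (n >= 0 && (size_t)n < size) {
		va_start(params, fmt);
		vsnprintf(buf + n, size - n, fmt, params);
		va_end(params);
	}
	return -1;
}

static void report_err_example(void)
{
	char buf[64];
	int ret = report_err(buf, sizeof(buf), "bad tree %s", "abc123");

	assert(ret == -1);
	assert(!strcmp(buf, "error: bad tree abc123"));
}
```

The "always return -1" convention is the part that matters to callers;
where the formatted text ends up is a separate decision.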

Aside: I wonder if these errors could be logged using trace2
primitives, to assist diagnosing problems with merges. CC'ing
Jeff Hostetler if he has an opinion.

>  static int collect_merge_info(struct merge_options *opt,
>  			      struct tree *merge_base,
>  			      struct tree *side1,
>  			      struct tree *side2)
>  {
> +	/* TODO: Implement this using traverse_trees() */
>  	die("Not yet implemented.");
>  }

This comment looks to be applied to the wrong patch.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 06/20] merge-ort: implement a very basic collect_merge_info()
  2020-11-02 20:43 ` [PATCH v2 06/20] merge-ort: implement a very basic collect_merge_info() Elijah Newren
  2020-11-06 22:19   ` Jonathan Tan
@ 2020-11-11 14:38   ` Derrick Stolee
  2020-11-11 17:02     ` Elijah Newren
  1 sibling, 1 reply; 84+ messages in thread
From: Derrick Stolee @ 2020-11-11 14:38 UTC (permalink / raw)
  To: Elijah Newren, git

On 11/2/2020 3:43 PM, Elijah Newren wrote:
> +	/* +1 in both of the following lines to include the NUL byte */
> +	fullpath = xmalloc(len+1);
> +	make_traverse_path(fullpath, len+1, info, p->path, p->pathlen);

nit: s/len+1/len + 1/g

> +		void *buf[3] = {NULL,};

This "{NULL,}" seems odd to me. I suppose there is a reason why it
isn't "{ NULL, NULL, NULL }"?
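
For reference, the two spellings should behave identically: in C, array
elements without an explicit initializer are initialized as if they had
static storage duration, which means null pointers here.  A minimal
check, with all_null() being a helper name I made up:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the first n pointers in p are all NULL. */
static int all_null(void *const *p, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (p[i] != NULL)
			return 0;
	return 1;
}

static void partial_initializer_example(void)
{
	void *terse[3] = {NULL,};		/* remaining elements are zeroed */
	void *spelled_out[3] = { NULL, NULL, NULL };

	assert(all_null(terse, 3));
	assert(all_null(spelled_out, 3));
}
```

So the difference is purely one of style.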

> +		const char *original_dir_name;
> +		int i, ret;
> +
> +		ci->match_mask &= filemask;
> +		newinfo = *info;
> +		newinfo.prev = info;
> +		newinfo.name = p->path;
> +		newinfo.namelen = p->pathlen;
> +		newinfo.pathlen = st_add3(newinfo.pathlen, p->pathlen, 1);
> +
> +		for (i = 0; i < 3; i++, dirmask >>= 1) {

This multi-action iterator borders on "too clever". It seems like
placing "dirmask >>= 1;" or "dirmask = dirmask >> 1;" at the end
of the block would be equivalent and less jarring to a reader.

I was thinking it doesn't really matter, except that dirmask is not
in the initializer or sentinel of the for(), so having it here does
not immediately make sense.

(This has been too much writing for such an inconsequential line
of code. Sorry.)
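
A small sketch of the two forms, under the assumption that the loop
body has no "continue" statement.  The function names are hypothetical;
each one just rebuilds the low three bits of dirmask so the two can be
compared.

```c
#include <assert.h>

/* Shift performed in the for() header, as in the patch. */
static unsigned dirs_shift_in_header(unsigned long dirmask)
{
	unsigned seen = 0;
	int i;

	for (i = 0; i < 3; i++, dirmask >>= 1)
		if (dirmask & 1)
			seen |= 1u << i;
	return seen;
}

/* Shift performed at the end of the loop body instead. */
static unsigned dirs_shift_in_body(unsigned long dirmask)
{
	unsigned seen = 0;
	int i;

	for (i = 0; i < 3; i++) {
		if (dirmask & 1)
			seen |= 1u << i;
		dirmask >>= 1;	/* would be skipped by a "continue" above */
	}
	return seen;
}

static void shift_forms_example(void)
{
	unsigned long d;

	for (d = 0; d < 8; d++) {
		assert(dirs_shift_in_header(d) == d);
		assert(dirs_shift_in_body(d) == d);
	}
}
```

The one behavioral difference is the "continue" case noted in the
comment: the header form keeps shifting, the body form would not.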

> +			const struct object_id *oid = NULL;
> +			if (dirmask & 1)
> +				oid = &names[i].oid;
> +			buf[i] = fill_tree_descriptor(opt->repo, t + i, oid);
> +		}


>  static int collect_merge_info(struct merge_options *opt,
>  			      struct tree *merge_base,
>  			      struct tree *side1,
>  			      struct tree *side2)
>  {
> -	/* TODO: Implement this using traverse_trees() */
> -	die("Not yet implemented.");
> +	int ret;
> +	struct tree_desc t[3];
> +	struct traverse_info info;
> +	char *toplevel_dir_placeholder = "";

It seems like this should be "const char *"

> +	init_tree_desc(t+0, merge_base->buffer, merge_base->size);
> +	init_tree_desc(t+1, side1->buffer, side1->size);
> +	init_tree_desc(t+2, side2->buffer, side2->size);

More space issues: s/t+/t + /g

I'm only really able to engage in this at a surface level, it
seems, but maybe I'll have more to say as the implementation
grows.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 07/20] merge-ort: avoid repeating fill_tree_descriptor() on the same tree
  2020-11-02 20:43 ` [PATCH v2 07/20] merge-ort: avoid repeating fill_tree_descriptor() on the same tree Elijah Newren
@ 2020-11-11 14:51   ` Derrick Stolee
  2020-11-11 17:13     ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Derrick Stolee @ 2020-11-11 14:51 UTC (permalink / raw)
  To: Elijah Newren, git

On 11/2/2020 3:43 PM, Elijah Newren wrote:
> @@ -99,6 +99,15 @@ static int collect_merge_info_callback(int n,
>  	unsigned mbase_null = !(mask & 1);
>  	unsigned side1_null = !(mask & 2);
>  	unsigned side2_null = !(mask & 4);
> +	unsigned side1_matches_mbase = (!side1_null && !mbase_null &&
> +					names[0].mode == names[1].mode &&
> +					oideq(&names[0].oid, &names[1].oid));
> +	unsigned side2_matches_mbase = (!side2_null && !mbase_null &&
> +					names[0].mode == names[2].mode &&
> +					oideq(&names[0].oid, &names[2].oid));
> +	unsigned sides_match = (!side1_null && !side2_null &&
> +				names[1].mode == names[2].mode &&
> +				oideq(&names[1].oid, &names[2].oid));

If the *_null values were in an array, instead, then all of these
lines could be grouped as a macro:

	unsigned null_oid[3] = {
		!(mask & 1),
		!(mask & 2),
		!(mask & 4)
	};

	#define trivial_merge(i,j) (!null_oid[i] && !null_oid[j] && \
				    names[i].mode == names[j].mode && \
				    oideq(&names[i].oid, &names[j].oid))

	unsigned side1_matches_mbase = trivial_merge(0, 1);
	unsigned side2_matches_mbase = trivial_merge(0, 2);
	unsigned sides_match = trivial_merge(1, 2);

I briefly considered making these last three an array, as well,
except the loop below doesn't use 'i' in a symmetrical way:

> +			if (i == 1 && side1_matches_mbase)
> +				t[1] = t[0];
> +			else if (i == 2 && side2_matches_mbase)
> +				t[2] = t[0];
> +			else if (i == 2 && sides_match)
> +				t[2] = t[1];

Since the 'i == 2' case has two possible options, it wouldn't be
possible to just have 'side_matches[i]' here.
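
Here is the macro idea as a standalone sketch that compiles outside of
git.  The fake_oid and fake_entry types, fake_oideq(), and
compute_matches() are stand-ins I invented for struct object_id,
struct name_entry, and oideq(); only the shape of the macro is taken
from the suggestion above.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for git's struct object_id and struct name_entry. */
struct fake_oid { unsigned char hash[20]; };
struct fake_entry { unsigned mode; struct fake_oid oid; };

struct match_flags {
	unsigned side1_matches_mbase;
	unsigned side2_matches_mbase;
	unsigned sides_match;
};

static int fake_oideq(const struct fake_oid *a, const struct fake_oid *b)
{
	return !memcmp(a->hash, b->hash, sizeof(a->hash));
}

static struct match_flags compute_matches(unsigned long mask,
					  const struct fake_entry *names)
{
	unsigned null_oid[3] = { !(mask & 1), !(mask & 2), !(mask & 4) };
	struct match_flags m;

#define trivial_merge(i, j) (!null_oid[i] && !null_oid[j] && \
			     names[i].mode == names[j].mode && \
			     fake_oideq(&names[i].oid, &names[j].oid))
	m.side1_matches_mbase = trivial_merge(0, 1);
	m.side2_matches_mbase = trivial_merge(0, 2);
	m.sides_match = trivial_merge(1, 2);
#undef trivial_merge
	return m;
}

static void compute_matches_example(void)
{
	struct fake_entry names[3];
	struct match_flags m;

	memset(names, 0, sizeof(names));
	names[0].mode = names[1].mode = names[2].mode = 0100644;
	names[2].oid.hash[0] = 1;	/* side 2 changed the content */

	m = compute_matches(7, names);
	assert(m.side1_matches_mbase);
	assert(!m.side2_matches_mbase && !m.sides_match);

	/* Without a base entry (mask=6), nothing can match the base. */
	m = compute_matches(6, names);
	assert(!m.side1_matches_mbase && !m.side2_matches_mbase);
}
```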

> +			else {
> +				const struct object_id *oid = NULL;
> +				if (dirmask & 1)
> +					oid = &names[i].oid;
> +				buf[i] = fill_tree_descriptor(opt->repo,
> +							      t + i, oid);
> +			}

I do appreciate the reduced recursion here!

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path
  2020-11-02 20:43 ` [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path Elijah Newren
  2020-11-06 22:58   ` Jonathan Tan
@ 2020-11-11 15:26   ` Derrick Stolee
  2020-11-11 18:16     ` Elijah Newren
  1 sibling, 1 reply; 84+ messages in thread
From: Derrick Stolee @ 2020-11-11 15:26 UTC (permalink / raw)
  To: Elijah Newren, git

On 11/2/2020 3:43 PM, Elijah Newren wrote:
> +static void setup_path_info(struct merge_options *opt,
> +			    struct string_list_item *result,
> +			    const char *current_dir_name,
> +			    int current_dir_name_len,
> +			    char *fullpath, /* we'll take over ownership */
> +			    struct name_entry *names,
> +			    struct name_entry *merged_version,
> +			    unsigned is_null,     /* boolean */
> +			    unsigned df_conflict, /* boolean */
> +			    unsigned filemask,
> +			    unsigned dirmask,
> +			    int resolved          /* boolean */)
> +{
> +	struct conflict_info *path_info;

In addition to my concerns below about 'conflict_info' versus
'merged_info', I was doubly confused that 'result' in the parameter
list is given a variable named 'pi' for "path info" and result->util
eventually is equal to this path_info. What if we renamed 'result'
to 'pi' for "path info" here, then operated on 'pi->util' in this
method?

> +	path_info = xcalloc(1, resolved ? sizeof(struct merged_info) :
> +					  sizeof(struct conflict_info));

Hm. I'm happy to have a `struct merged_info *` pointing to a
`struct conflict_info`, but the opposite seems very dangerous.
Perhaps we should always use sizeof(struct conflict_info)?

We can use path_info->merged.clean to detect whether the rest of
the data is worth looking at. (Or, in your case, whether or not
it is allocated.)

I imagine that in a large repo we will need many of these structs,
but very few of them will actually need to be conflicts, so using
'struct conflict_info' always will lead to memory bloat. But in
that case, would we not be better off with an array instead of a
scattering of data across the heap?

Perhaps 'struct conflict_info' shouldn't contain a 'struct merged_info'
and instead be just the "extra" data. Then we could have a contiguous
array of 'struct merged_info' values for most of the paths, but heap
pointers for 'struct conflict_info' as necessary.

It's also true that I haven't fully formed a mental model for how these
are used in your algorithm, so I'll keep reading.
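
To illustrate the "safe direction" I have in mind, here is a minimal
sketch.  The *_sketch structs are cut-down inventions, not the real
merged_info/conflict_info, and alloc_path_info() and as_conflict() are
hypothetical names.  The idea is to hand out the smaller type and only
convert to the larger one after checking the clean bit.

```c
#include <assert.h>
#include <stdlib.h>

/* Cut-down stand-ins; the real structs have more fields. */
struct merged_info_sketch {
	unsigned result_mode;
	unsigned clean:1;
};

struct conflict_info_sketch {
	struct merged_info_sketch merged;	/* must stay the first member */
	unsigned filemask:3;
	unsigned dirmask:3;
};

/* Allocate only as much as the path needs; always return the small type. */
static struct merged_info_sketch *alloc_path_info(int resolved)
{
	size_t size = resolved ? sizeof(struct merged_info_sketch)
			       : sizeof(struct conflict_info_sketch);
	struct merged_info_sketch *mi = calloc(1, size);

	if (!mi)
		abort();
	mi->clean = !!resolved;
	return mi;
}

/* Checked conversion: NULL when the conflict-only fields don't exist. */
static struct conflict_info_sketch *as_conflict(struct merged_info_sketch *mi)
{
	return mi->clean ? NULL : (struct conflict_info_sketch *)mi;
}

static void path_info_example(void)
{
	struct merged_info_sketch *clean = alloc_path_info(1);
	struct merged_info_sketch *conflicted = alloc_path_info(0);

	assert(as_conflict(clean) == NULL);
	assert(as_conflict(conflicted) != NULL);
	as_conflict(conflicted)->filemask = 6;
	assert(as_conflict(conflicted)->filemask == 6);

	free(clean);
	free(conflicted);
}
```

This keeps the memory savings of the smaller allocation while making
it harder to touch fields that were never allocated.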

> +	path_info->merged.directory_name = current_dir_name;
> +	path_info->merged.basename_offset = current_dir_name_len;
> +	path_info->merged.clean = !!resolved;
> +	if (resolved) {
> +		path_info->merged.result.mode = merged_version->mode;
> +		oidcpy(&path_info->merged.result.oid, &merged_version->oid);
> +		path_info->merged.is_null = !!is_null;
> +	} else {
> +		int i;
> +
> +		for (i = 0; i < 3; i++) {
> +			path_info->pathnames[i] = fullpath;
> +			path_info->stages[i].mode = names[i].mode;
> +			oidcpy(&path_info->stages[i].oid, &names[i].oid);
> +		}
> +		path_info->filemask = filemask;
> +		path_info->dirmask = dirmask;
> +		path_info->df_conflict = !!df_conflict;
> +	}
> +	strmap_put(&opt->priv->paths, fullpath, path_info);
> +	result->string = fullpath;
> +	result->util = path_info;

This is set in all cases, so should we use it everywhere? Naturally,
there might be a cost to the extra pointer indirection, so maybe we
create a 'struct conflict_info *util' to operate on during this
method, but set 'result->util = util' right after allocating so we
know how it should behave?

> @@ -91,10 +136,12 @@ static int collect_merge_info_callback(int n,
>  	 */
>  	struct merge_options *opt = info->data;
>  	struct merge_options_internal *opti = opt->priv;
> -	struct conflict_info *ci;
> +	struct string_list_item pi;  /* Path Info */
> +	struct conflict_info *ci; /* pi.util when there's a conflict */

...

> +	setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
> +			names, NULL, 0, df_conflict, filemask, dirmask, 0);
> +	ci = pi.util;

Here is the use of 'pi' that I was talking about earlier.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/20] merge-ort: avoid recursing into identical trees
  2020-11-02 20:43 ` [PATCH v2 10/20] merge-ort: avoid recursing into identical trees Elijah Newren
@ 2020-11-11 15:31   ` Derrick Stolee
  0 siblings, 0 replies; 84+ messages in thread
From: Derrick Stolee @ 2020-11-11 15:31 UTC (permalink / raw)
  To: Elijah Newren, git

On 11/2/2020 3:43 PM, Elijah Newren wrote:
> +	/*
> +	 * If mbase, side1, and side2 all match, we can resolve early.  Even
> +	 * if these are trees, there will be no renames or anything
> +	 * underneath.
> +	 */
> +	if (side1_matches_mbase && side2_matches_mbase) {

Here is a case where if you were not caring about renames you could prevent
recursion here when "!renames && sides_match". Something to think about.

> +		/* mbase, side1, & side2 all match; use mbase as resolution */
> +		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
> +				names, names+0, mbase_null, 0,
> +				filemask, dirmask, 1);
> +		return mask;
> +	}
> +

-Stolee


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 12/20] merge-ort: have process_entries operate in a defined order
  2020-11-02 20:43 ` [PATCH v2 12/20] merge-ort: have process_entries operate in a defined order Elijah Newren
@ 2020-11-11 16:09   ` Derrick Stolee
  2020-11-11 18:58     ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Derrick Stolee @ 2020-11-11 16:09 UTC (permalink / raw)
  To: Elijah Newren, git

On 11/2/2020 3:43 PM, Elijah Newren wrote:
> We want to handle paths below a directory before needing to handle the
> directory itself.  Also, we want to handle the directory immediately
> after the paths below it, so we can't use simple lexicographic ordering
> from strcmp (which would insert foo.txt between foo and foo/file.c).
> Copy string_list_df_name_compare() from merge-recursive.c, and set up a
> string list of paths sorted by that function so that we can iterate in
> the desired order.

This is at least the second time we've copied something from
merge-recursive.c. Should we be starting a merge-utils.[c|h] to group
these together under a common implementation?
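
As an aside, the ordering problem from the commit message is easy to
reproduce in isolation.  The sketch below is not
string_list_df_name_compare(); df_style_compare() is my own simplified
comparator that sorts every path as if it had a trailing '/', which is
enough to keep a directory next to the paths below it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Plain strcmp() ordering, for comparison. */
static int plain_compare(const void *pa, const void *pb)
{
	return strcmp(*(const char *const *)pa, *(const char *const *)pb);
}

/* Hypothetical comparator: compare as if each path ended in '/'. */
static int df_style_compare(const void *pa, const void *pb)
{
	const char *a = *(const char *const *)pa;
	const char *b = *(const char *const *)pb;
	size_t alen = strlen(a), blen = strlen(b);
	size_t len = alen < blen ? alen : blen;
	int cmp = memcmp(a, b, len);
	unsigned char c1, c2;

	if (cmp)
		return cmp;
	c1 = a[len] ? (unsigned char)a[len] : '/';
	c2 = b[len] ? (unsigned char)b[len] : '/';
	if (c1 != c2)
		return c1 < c2 ? -1 : 1;
	/* One path is the directory of the other (or they are equal). */
	return (alen > blen) - (alen < blen);
}

static void df_order_example(void)
{
	const char *paths[] = { "foo", "foo.txt", "foo/file.c" };

	qsort(paths, 3, sizeof(paths[0]), plain_compare);
	assert(!strcmp(paths[1], "foo.txt"));	/* splits foo from foo/file.c */

	qsort(paths, 3, sizeof(paths[0]), df_style_compare);
	assert(!strcmp(paths[0], "foo.txt"));
	assert(!strcmp(paths[1], "foo"));
	assert(!strcmp(paths[2], "foo/file.c"));
}
```

Walking the second result from the end visits foo/file.c, then foo,
then foo.txt, so the directory comes immediately after the paths
below it.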

> +	/* Put every entry from paths into plist, then sort */
>  	strmap_for_each_entry(&opt->priv->paths, &iter, e) {
> +		string_list_append(&plist, e->key)->util = e->value;
> +	}

nit: are braces required here?

-Stolee

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 03/20] merge-ort: port merge_start() from merge-recursive
  2020-11-11 13:52   ` Derrick Stolee
@ 2020-11-11 16:22     ` Elijah Newren
  0 siblings, 0 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-11 16:22 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List

Thanks for all the reviews and suggestions!

In this reply, I'll skip the simple fixes that I'm just going to apply
and concentrate on the bigger questions you raised.

On Wed, Nov 11, 2020 at 5:53 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/2/2020 3:43 PM, Elijah Newren wrote:
> > The weirdest part here is that merge-ort and merge-recursive use the
> > same struct merge_options, even though merge_options has a number of
> > fields that are oddly specific to merge-recursive's internal
> > implementation and don't even make sense with merge-ort's high-level
> > design (e.g. buffer_output, which merge-ort has to always do).  I reused
> > the same data structure because:
> >   * most the fields made sense to both merge algorithms
> >   * making a new struct would have required making new enums or somehow
> >     externalizing them, and that was getting messy.
> >   * it simplifies converting the existing callers by not having to
> >     have different code paths for merge_options setup.
>
> I think this is appropriate. The other option would be to split the
> struct into "common options" and "specific options" but that starts
> to get messy if we add yet another merge strategy that changes what
> should be "common". Hopefully we can group options within the struct
> merge_options definition to assist with this?

I think we should plan on merge-recursive.[ch] being deleted before a
third merge strategy comes along, so the common options might make
sense.  But then again, it sounds like work towards simultaneously
supporting two backends in perpetuity, which isn't at all the current
plan[1].

As far as grouping and other cleanups, see the series included with
merge commit 280bd44551 ("Merge branch 'en/merge-recursive-cleanup'",
2019-10-15), particularly commits ff1bfa2cd5 ("merge-recursive: use
common name for ancestors/common/base_list", 2019-08-17), a779fb829b
("merge-recursive: comment and reorder the merge_options fields",
2019-08-17), and 8599ab4574 ("merge-recursive: consolidate unnecessary
fields in merge_options", 2019-08-17).  I guess we could change from
grouping by option similarity and instead group by which are in use by
which merge backends, but since the plan is to [eventually] kill
merge-recursive and then to just drop the unused or ignored fields, I
think helping users understand the purpose of the options (which
grouping-by-similarity aids with) is more important than grouping for
the purpose of reminding myself which ones to remove later.

[1] https://lore.kernel.org/git/xmqqk1ydkbx0.fsf@gitster.mtv.corp.google.com/

> For now, the assertions are a good approach.
>
> > I also marked detect_renames as ignored.  We can revisit that later, but
> > in short: merge-recursive allowed turning off rename detection because
> > it was sometimes glacially slow.  When you speed something up by a few
> > orders of magnitude, it's worth revisiting whether that justification is
> > still relevant.  Besides, if folks find it's still too slow, perhaps
> > they have a better scaling case than I could find and maybe it turns up
> > some more optimizations we can add.  If it still is needed as an option,
> > it is easy to add later.
>
> As long as it is easy to add later, I don't see much of a problem. Usually
> adding a knob to disable a feature is necessary to mitigate risk, and here
> we can simply adjust config to use the non-ort algorithm if we notice a data
> shape where rename detection makes the algorithm slow/unusable.

Yes, it should be pretty easy to add later.

> >  static void merge_start(struct merge_options *opt, struct merge_result *result)
> >  {
> > -     die("Not yet implemented.");
> > +     /* Sanity checks on opt */
> > +     assert(opt->repo);
> > +
> > +     assert(opt->branch1 && opt->branch2);
> > +
> > +     assert(opt->detect_directory_renames >= MERGE_DIRECTORY_RENAMES_NONE &&
> > +            opt->detect_directory_renames <= MERGE_DIRECTORY_RENAMES_TRUE);
> > +     assert(opt->rename_limit >= -1);
> > +     assert(opt->rename_score >= 0 && opt->rename_score <= MAX_SCORE);
> > +     assert(opt->show_rename_progress >= 0 && opt->show_rename_progress <= 1);
> > +
> > +     assert(opt->xdl_opts >= 0);
> > +     assert(opt->recursive_variant >= MERGE_VARIANT_NORMAL &&
> > +            opt->recursive_variant <= MERGE_VARIANT_THEIRS);
> > +
> > +     /*
> > +      * detect_renames, verbosity, buffer_output, and obuf are ignored
> > +      * fields that were used by "recursive" rather than "ort" -- but
> > +      * sanity check them anyway.
> > +      */
> > +     assert(opt->detect_renames >= -1 &&
> > +            opt->detect_renames <= DIFF_DETECT_COPY);
> > +     assert(opt->verbosity >= 0 && opt->verbosity <= 5);
> > +     assert(opt->buffer_output <= 2);
> > +     assert(opt->obuf.len == 0);
> > +
> > +     assert(opt->priv == NULL);
> > +
> > +     /* Initialization of opt->priv, our internal merge data */
> > +     opt->priv = xcalloc(1, sizeof(*opt->priv));
>
> nit: I would insert an empty line between this code and the
> multi-line comment below.
>
> > +     /*
> > +      * Although we initialize opt->priv->paths with strdup_strings=0,
> > +      * that's just to avoid making yet another copy of an allocated
> > +      * string.  Putting the entry into paths means we are taking
> > +      * ownership, so we will later free it.
> > +      *
> > +      * In contrast, unmerged just has a subset of keys from paths, so
> > +      * we don't want to free those (it'd be a duplicate free).
> > +      */
> > +     strmap_init_with_options(&opt->priv->paths, NULL, 0);
> > +     strmap_init_with_options(&opt->priv->unmerged, NULL, 0);
> >  }
>
> This approach looks fine to me.
>
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 04/20] merge-ort: use histogram diff
  2020-11-11 13:54   ` Derrick Stolee
@ 2020-11-11 16:47     ` Elijah Newren
  2020-11-11 16:51       ` Derrick Stolee
  0 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-11 16:47 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List

On Wed, Nov 11, 2020 at 5:54 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/2/2020 3:43 PM, Elijah Newren wrote:
> > I have some ideas for using a histogram diff to improve content merges,
> > which fundamentally relies on the idea of a histogram.  Since the diffs
> > are never displayed to the user but just used internally for merging,
> > the typical user preference shouldn't matter anyway, and I want to make
> > sure that all my testing works with this algorithm.
> >
> > Granted, I don't yet know if those ideas will pan out and I haven't even
> > tried any of them out yet, but it's easy to change the diff algorithm in
> > the future if needed or wanted.  For now, just set it to histogram.
>
> If you are not making use of the histogram yet, then could you set this
> patch aside until you _do_ use it? Or are there performance implications
> that are also a side benefit?

Long story...

git folks tend to value performance pretty strongly -- including
sometimes valuing it OVER correctness.  For example, if fast-export
completely munges some merge commits and you send your first ever
patch in to the list to fix it (by turning on topo_order), you might
run into folks asking if it should be made an option so that we don't
slow down exports except for people who happen to know they need it
(and thus risk breaking exports for people who happen to be unaware
that they need it...).  Luckily, Peff bailed me out in that situation
by doing some timings and finding that topo_order actually made
fast-export _faster_, to everyone's surprise at the time.  Or, to take
another example, perhaps someone will introduce some commit-date
cutoff on the revision walking machinery that sometimes breaks the
answer but makes things faster...and then causes a bunch of headaches
years down the road when someone tries to introduce commit-graphs to
get us always correct _and_ fast answers.

In this case, histogram diffs in my cursory investigation are about 2%
slower than Myers diffs.  I think others may have done even more
detailed benchmarks.  They've been around for years, but haven't been
made the default, despite giving obviously better looking diffs to
users in a number of cases where Myers diffs are unintelligible.  But,
far more importantly, there are real merge bugs we know about that are
even affecting git.git and linux.git that I don't have a clue how to
address without the additional information that I believe is provided
by histogram diffs.  See the following:

https://lore.kernel.org/git/20190816184051.GB13894@sigill.intra.peff.net/
https://lore.kernel.org/git/CABPp-BHvJHpSJT7sdFwfNcPn_sOXwJi3=o14qjZS3M8Rzcxe2A@mail.gmail.com/
https://lore.kernel.org/git/CABPp-BGtez4qjbtFT1hQoREfcJPmk9MzjhY5eEq1QhXT23tFOw@mail.gmail.com/

I don't like mismerges.  I really don't like silent mismerges.  While
I am sometimes willing to make performance and correctness tradeoffs,
I'm much more interested in correctness in general.  I want to fix the
above bugs.  I have not yet started doing so, but I believe histogram
diff at least gives me an angle.  But I can't rely on using the
information from histogram diff unless it's in use.  And it hasn't
been used because of a performance hit of a few percent.

But, since I happen to be speeding up typical non-trivial
merges/rebases/cherry-picks by factors of 10 or more (at least,
anywhere that read-and-update-the-index is a trivial percentage of
overall time instead of 99.9% of overall time like it is for you), now
is a golden opportunity to switch out the underlying diff algorithm so
that I can get the data I need to fix the bugs I know are there.
Whether I can actually fix them is yet to be seen; I won't even start
until merge-ort is complete and merged.

Does that help?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 04/20] merge-ort: use histogram diff
  2020-11-11 16:47     ` Elijah Newren
@ 2020-11-11 16:51       ` Derrick Stolee
  2020-11-11 17:03         ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Derrick Stolee @ 2020-11-11 16:51 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List

On 11/11/2020 11:47 AM, Elijah Newren wrote:
> On Wed, Nov 11, 2020 at 5:54 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 11/2/2020 3:43 PM, Elijah Newren wrote:
>>> I have some ideas for using a histogram diff to improve content merges,
>>> which fundamentally relies on the idea of a histogram.  Since the diffs
>>> are never displayed to the user but just used internally for merging,
>>> the typical user preference shouldn't matter anyway, and I want to make
>>> sure that all my testing works with this algorithm.
>>>
>>> Granted, I don't yet know if those ideas will pan out and I haven't even
>>> tried any of them out yet, but it's easy to change the diff algorithm in
>>> the future if needed or wanted.  For now, just set it to histogram.
>>
>> If you are not making use of the histogram yet, then could you set this
>> patch aside until you _do_ use it? Or are there performance implications
>> that are also a side benefit?
> 
> Long story...

...

> Does that help?

In summary, you have some concrete reasons to prefer the histogram
diff other than just "I have some ideas that might pan out later" so
this code change is a good one but could be better justified in the
commit message. Does that sound correct?

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 06/20] merge-ort: implement a very basic collect_merge_info()
  2020-11-11 14:38   ` Derrick Stolee
@ 2020-11-11 17:02     ` Elijah Newren
  0 siblings, 0 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-11 17:02 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List

On Wed, Nov 11, 2020 at 6:38 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/2/2020 3:43 PM, Elijah Newren wrote:
> > +     /* +1 in both of the following lines to include the NUL byte */
> > +     fullpath = xmalloc(len+1);
> > +     make_traverse_path(fullpath, len+1, info, p->path, p->pathlen);
>
> nit: s/len+1/len + 1/g
>
> > +             void *buf[3] = {NULL,};
>
> This "{NULL,}" seems odd to me. I suppose there is a reason why it
> isn't "{ NULL, NULL, NULL }"?

Probably because I was copying from unpack-trees.c, which deals with a
variable number of trees instead of always exactly 3.  But yeah, it'd
probably be more straightforward as { NULL, NULL, NULL }.
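
For what it's worth, the two spellings behave the same: C sets any
array elements not named in a brace initializer to zero (NULL for
pointers).  A minimal standalone sketch of that, where
count_null_slots() and the two wrappers are made-up helpers for
illustration and not anything in merge-ort.c:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts how many of the three slots are NULL. */
static int count_null_slots(void *buf[3])
{
	int i, n = 0;

	assert(buf != NULL);
	for (i = 0; i < 3; i++)
		if (!buf[i])
			n++;
	return n;
}

/* Partial initializer: elements 1 and 2 are implicitly set to NULL. */
static int null_slots_partial(void)
{
	void *buf[3] = {NULL,};

	return count_null_slots(buf);
}

/* Fully spelled-out initializer, as suggested in review. */
static int null_slots_explicit(void)
{
	void *buf[3] = { NULL, NULL, NULL };

	return count_null_slots(buf);
}
```

So the choice between the two is purely about readability.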

> > +             const char *original_dir_name;
> > +             int i, ret;
> > +
> > +             ci->match_mask &= filemask;
> > +             newinfo = *info;
> > +             newinfo.prev = info;
> > +             newinfo.name = p->path;
> > +             newinfo.namelen = p->pathlen;
> > +             newinfo.pathlen = st_add3(newinfo.pathlen, p->pathlen, 1);
> > +
> > +             for (i = 0; i < 3; i++, dirmask >>= 1) {
>
> This multi-action iterator borders on "too clever". It seems like
> placing "dirmask >>= 1;" or "dirmask = dirmask >> 1;" at the end
> of the block would be equivalent and less jarring to a reader.
>
> I was thinking it doesn't really matter, except that dirmask is not
> in the initializer or sentinel of the for(), so having it here does
> not immediately make sense.
>
> (This has been too much writing for such an inconsequential line
> of code. Sorry.)

Yeah, copied from unpack-trees.c:traverse_trees_recursive().  The
newinfo variable name and a bunch of the surrounding lines were copied
from there too.  I can switch it, though, if it makes it easier.
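
As a sanity check on the "equivalent" claim, here is a small
standalone sketch comparing the two loop shapes.  The function names
and the "seen" bitmask are invented for the example; only the
for() header mirrors the patch:

```c
#include <assert.h>

/* Visit three sides; the mask shift lives in the for() header. */
static unsigned collect_header_shift(unsigned dirmask)
{
	unsigned seen = 0;
	int i;

	for (i = 0; i < 3; i++, dirmask >>= 1)
		if (dirmask & 1)
			seen |= 1u << i;
	return seen;
}

/* Same loop with the shift moved to the end of the body. */
static unsigned collect_body_shift(unsigned dirmask)
{
	unsigned seen = 0;
	int i;

	for (i = 0; i < 3; i++) {
		if (dirmask & 1)
			seen |= 1u << i;
		dirmask >>= 1;
	}
	return seen;
}

/* Both variants should agree for every possible 3-bit mask. */
static void check_loops_agree(void)
{
	unsigned mask;

	for (mask = 0; mask < 8; mask++)
		assert(collect_header_shift(mask) == collect_body_shift(mask));
}
```

One caveat with the second form: it only stays equivalent as long as
nobody adds a "continue" to the loop body, since that would skip the
shift.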

> > +                     const struct object_id *oid = NULL;
> > +                     if (dirmask & 1)
> > +                             oid = &names[i].oid;
> > +                     buf[i] = fill_tree_descriptor(opt->repo, t + i, oid);
> > +             }
>
>
> >  static int collect_merge_info(struct merge_options *opt,
> >                             struct tree *merge_base,
> >                             struct tree *side1,
> >                             struct tree *side2)
> >  {
> > -     /* TODO: Implement this using traverse_trees() */
> > -     die("Not yet implemented.");
> > +     int ret;
> > +     struct tree_desc t[3];
> > +     struct traverse_info info;
> > +     char *toplevel_dir_placeholder = "";
>
> It seems like this should be "const char *"
>
> > +     init_tree_desc(t+0, merge_base->buffer, merge_base->size);
> > +     init_tree_desc(t+1, side1->buffer, side1->size);
> > +     init_tree_desc(t+2, side2->buffer, side2->size);
>
> More space issues: s/t+/t + /g

In my defense:

$ git grep init_tree_desc.*t.*\+ | grep -v merge-ort
builtin/merge.c: init_tree_desc(t+i, trees[i]->buffer, trees[i]->size);
builtin/read-tree.c: init_tree_desc(t+i, tree->buffer, tree->size);
merge-recursive.c: init_tree_desc_from_tree(t+0, common);
merge-recursive.c: init_tree_desc_from_tree(t+1, head);
merge-recursive.c: init_tree_desc_from_tree(t+2, merge);
merge.c: init_tree_desc(t+i, trees[i]->buffer, trees[i]->size);

None of which blames to me.  :-)

I can fix it up, though...at least the merge-ort one.  Someone else
can go through existing code if they so desire.

> I'm only really able to engage in this at a surface level, it
> seems, but maybe I'll have more to say as the implementation
> grows.

It _might_ be helpful to compare to unpack-trees.c's unpack_callback()
and traverse_trees_recursive(), but there's so much unrelated stuff
there that it's possible that just gets in the way more than it helps.
Regardless, thanks for taking a look and spotting little fixes; every
bit helps.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 04/20] merge-ort: use histogram diff
  2020-11-11 16:51       ` Derrick Stolee
@ 2020-11-11 17:03         ` Elijah Newren
  0 siblings, 0 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-11 17:03 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List

On Wed, Nov 11, 2020 at 8:51 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/11/2020 11:47 AM, Elijah Newren wrote:
> > On Wed, Nov 11, 2020 at 5:54 AM Derrick Stolee <stolee@gmail.com> wrote:
> >>
> >> On 11/2/2020 3:43 PM, Elijah Newren wrote:
> >>> I have some ideas for using a histogram diff to improve content merges,
> >>> which fundamentally relies on the idea of a histogram.  Since the diffs
> >>> are never displayed to the user but just used internally for merging,
> >>> the typical user preference shouldn't matter anyway, and I want to make
> >>> sure that all my testing works with this algorithm.
> >>>
> >>> Granted, I don't yet know if those ideas will pan out and I haven't even
> >>> tried any of them out yet, but it's easy to change the diff algorithm in
> >>> the future if needed or wanted.  For now, just set it to histogram.
> >>
> >> If you are not making use of the histogram yet, then could you set this
> >> patch aside until you _do_ use it? Or are there performance implications
> >> that are also a side benefit?
> >
> > Long story...
>
> ...
>
> > Does that help?
>
> In summary, you have some concrete reasons to prefer the histogram
> diff other than just "I have some ideas that might pan out later" so
> this code change is a good one but could be better justified in the
> commit message. Does that sound correct?

Sure, I can add some additional wording about those ideas and the
concrete issues I want to fix.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/20] merge-ort: add an err() function similar to one from merge-recursive
  2020-11-11 13:58   ` Derrick Stolee
@ 2020-11-11 17:07     ` Elijah Newren
  2020-11-11 17:10       ` Derrick Stolee
  0 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-11 17:07 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List, Jeff Hostetler

On Wed, Nov 11, 2020 at 5:58 AM Derrick Stolee <stolee@gmail.com> wrote:

> > +static int err(struct merge_options *opt, const char *err, ...)
> > +{
> > +     va_list params;
> > +     struct strbuf sb = STRBUF_INIT;
> > +
> > +     strbuf_addstr(&sb, "error: ");
> > +     va_start(params, err);
> > +     strbuf_vaddf(&sb, err, params);
> > +     va_end(params);
> > +
> > +     error("%s", sb.buf);
> > +     strbuf_release(&sb);
> > +
> > +     return -1;
> > +}
> > +
>
> This seems simple enough to have a duplicate copy lying
> around. Do you anticipate that all common code will live
> in the same file eventually? Or will these two static err()
> methods always be duplicated?

I anticipate that merge-recursive.[ch] will be deleted.

merge-ort was what I wanted to modify merge-recursive.[ch] to be, but
Junio suggested doing it as a different merge backend since it had
invasive changes, so that we could have an easy way to try it out and
fall back to the known good algorithm until we had sufficient comfort
with the new algorithm to switch over to it.

> Aside: I wonder if these errors could be logged using trace2
> primitives, to assist diagnosing problems with merges. CC'ing
> Jeff Hostetler if he has an opinion.
>
> >  static int collect_merge_info(struct merge_options *opt,
> >                             struct tree *merge_base,
> >                             struct tree *side1,
> >                             struct tree *side2)
> >  {
> > +     /* TODO: Implement this using traverse_trees() */
> >       die("Not yet implemented.");
> >  }
>
> This comment looks to be applied to the wrong patch.

Oops!  You are correct; will fix.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/20] fundamentals of merge-ort implementation
  2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
                   ` (20 preceding siblings ...)
  2020-11-03 14:49 ` [PATCH v2 00/20] fundamentals of merge-ort implementation Derrick Stolee
@ 2020-11-11 17:08 ` Derrick Stolee
  2020-11-11 18:35   ` Elijah Newren
  21 siblings, 1 reply; 84+ messages in thread
From: Derrick Stolee @ 2020-11-11 17:08 UTC (permalink / raw)
  To: Elijah Newren, git

On 11/2/2020 3:43 PM, Elijah Newren wrote:
> Elijah Newren (20):
>   merge-ort: setup basic internal data structures
>   merge-ort: add some high-level algorithm structure
>   merge-ort: port merge_start() from merge-recursive
>   merge-ort: use histogram diff
>   merge-ort: add an err() function similar to one from merge-recursive
>   merge-ort: implement a very basic collect_merge_info()
>   merge-ort: avoid repeating fill_tree_descriptor() on the same tree
>   merge-ort: compute a few more useful fields for collect_merge_info
>   merge-ort: record stage and auxiliary info for every path
>   merge-ort: avoid recursing into identical trees
>   merge-ort: add a preliminary simple process_entries() implementation
>   merge-ort: have process_entries operate in a defined order

I got this far before my attention to detail really started slipping.

>   merge-ort: step 1 of tree writing -- record basenames, modes, and oids
>   merge-ort: step 2 of tree writing -- function to create tree object
>   merge-ort: step 3 of tree writing -- handling subdirectories as we go
>   merge-ort: basic outline for merge_switch_to_result()
>   merge-ort: add implementation of checkout()
>   tree: enable cmp_cache_name_compare() to be used elsewhere
>   merge-ort: add implementation of record_unmerged_index_entries()
>   merge-ort: free data structures in merge_finalize()

I'll try to take another pass on these commits tomorrow.

For the series as a whole I'd love to see at least one test that
demonstrates that this code does something, if even only for a very
narrow case.

There's a lot of code being moved here, and it would be nice to have
even a very simple test case that can check that we didn't leave any
important die("not implemented") calls lying around or worse accessing
an uninitialized pointer or something.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/20] merge-ort: add an err() function similar to one from merge-recursive
  2020-11-11 17:07     ` Elijah Newren
@ 2020-11-11 17:10       ` Derrick Stolee
  0 siblings, 0 replies; 84+ messages in thread
From: Derrick Stolee @ 2020-11-11 17:10 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List, Jeff Hostetler

On 11/11/2020 12:07 PM, Elijah Newren wrote:
> On Wed, Nov 11, 2020 at 5:58 AM Derrick Stolee <stolee@gmail.com> wrote:
>> This seems simple enough to have a duplicate copy lying
>> around. Do you anticipate that all common code will live
>> in the same file eventually? Or will these two static err()
>> methods always be duplicated?
> 
> I anticipate that merge-recursive.[ch] will be deleted.
> 
> merge-ort was what I wanted to modify merge-recursive.[ch] to be, but
> Junio suggested doing it as a different merge backend since it had
> invasive changes, so that we could have an easy way to try it out and
> fall back to the known good algorithm until we had sufficient comfort
> with the new algorithm to switch over to it.

OK, I missed that context. Your approach is fine as long as these
are not going to both exist forever.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 07/20] merge-ort: avoid repeating fill_tree_descriptor() on the same tree
  2020-11-11 14:51   ` Derrick Stolee
@ 2020-11-11 17:13     ` Elijah Newren
  2020-11-11 17:21       ` Eric Sunshine
  0 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-11 17:13 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List

On Wed, Nov 11, 2020 at 6:51 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/2/2020 3:43 PM, Elijah Newren wrote:
> > @@ -99,6 +99,15 @@ static int collect_merge_info_callback(int n,
> >       unsigned mbase_null = !(mask & 1);
> >       unsigned side1_null = !(mask & 2);
> >       unsigned side2_null = !(mask & 4);
> > +     unsigned side1_matches_mbase = (!side1_null && !mbase_null &&
> > +                                     names[0].mode == names[1].mode &&
> > +                                     oideq(&names[0].oid, &names[1].oid));
> > +     unsigned side2_matches_mbase = (!side2_null && !mbase_null &&
> > +                                     names[0].mode == names[2].mode &&
> > +                                     oideq(&names[0].oid, &names[2].oid));
> > +     unsigned sides_match = (!side1_null && !side2_null &&
> > +                             names[1].mode == names[2].mode &&
> > +                             oideq(&names[1].oid, &names[2].oid));
>
> If the *_null values were in an array, instead, then all of these
> lines could be grouped as a macro:
>
>         unsigned null_oid[3] = {
>                 !(mask & 1),
>                 !(mask & 2),
>                 !(mask & 4)
>         };
>
>         #define trivial_merge(i,j) (!null_oid[i] && !null_oid[j] && \
>                                     names[i].mode == names[j].mode && \
>                                     oideq(&names[i].oid, &names[j].oid))
>
>         unsigned side1_matches_mbase = trivial_merge(0, 1);
>         unsigned side2_matches_mbase = trivial_merge(0, 2);
>         unsigned sides_match = trivial_merge(1, 2);

Hmm, I like it.  I think I'll rename trivial_merge() to
non_null_match() (trivial merge suggests it can immediately be
resolved which is not necessarily true if rename detection is on), but
otherwise I'll use this.
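
Roughly, the renamed version could look like the following standalone
sketch.  Assumptions: struct toy_entry and toy_oideq() are simplified
stand-ins for git's struct name_entry and oideq(), and
compute_matches() with its MATCH_* bits only exists to make the
example self-contained.  The null_oid array is filled element by
element here, which also works with compilers that reject
non-constant aggregate initializers:

```c
#include <assert.h>
#include <string.h>

#define TOY_OID_LEN 20

/* Simplified stand-in for git's struct name_entry (assumption). */
struct toy_entry {
	unsigned mode;
	unsigned char oid[TOY_OID_LEN];
};

/* Stand-in for oideq(): true if both object ids are byte-identical. */
static int toy_oideq(const unsigned char *a, const unsigned char *b)
{
	return !memcmp(a, b, TOY_OID_LEN);
}

enum { MATCH_SIDE1_MBASE = 1, MATCH_SIDE2_MBASE = 2, MATCH_SIDES = 4 };

/* names[0] is the merge base, names[1] and names[2] the two sides. */
static unsigned compute_matches(const struct toy_entry names[3],
				unsigned mask)
{
	unsigned null_oid[3];
	unsigned result = 0;

	assert(mask < 8);
	null_oid[0] = !(mask & 1);
	null_oid[1] = !(mask & 2);
	null_oid[2] = !(mask & 4);

#define non_null_match(i, j) (!null_oid[i] && !null_oid[j] && \
			      names[i].mode == names[j].mode && \
			      toy_oideq(names[i].oid, names[j].oid))
	if (non_null_match(0, 1))
		result |= MATCH_SIDE1_MBASE;
	if (non_null_match(0, 2))
		result |= MATCH_SIDE2_MBASE;
	if (non_null_match(1, 2))
		result |= MATCH_SIDES;
#undef non_null_match

	return result;
}
```

Note that a match only says the two entries are identical and present;
it does not by itself say the path can be resolved right away.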

> I briefly considered making these last three an array, as well,
> except the loop below doesn't use 'i' in a symmetrical way:
>
> > +                     if (i == 1 && side1_matches_mbase)
> > +                             t[1] = t[0];
> > +                     else if (i == 2 && side2_matches_mbase)
> > +                             t[2] = t[0];
> > +                     else if (i == 2 && sides_match)
> > +                             t[2] = t[1];
>
> Since the 'i == 2' case has two possible options, it wouldn't be
> possible to just have 'side_matches[i]' here.
>
> > +                     else {
> > +                             const struct object_id *oid = NULL;
> > +                             if (dirmask & 1)
> > +                                     oid = &names[i].oid;
> > +                             buf[i] = fill_tree_descriptor(opt->repo,
> > +                                                           t + i, oid);
> > +                     }
>
> I do appreciate the reduced recursion here!

Technically, not my own optimization; I just copied from
unpack-trees.c:traverse_trees_recursive() -- though the code looks
slightly different because I didn't want to compare oids multiple
times (I use the side match variables earlier in the function as
well).

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 07/20] merge-ort: avoid repeating fill_tree_descriptor() on the same tree
  2020-11-11 17:13     ` Elijah Newren
@ 2020-11-11 17:21       ` Eric Sunshine
  0 siblings, 0 replies; 84+ messages in thread
From: Eric Sunshine @ 2020-11-11 17:21 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Derrick Stolee, Git Mailing List

On Wed, Nov 11, 2020 at 12:13 PM Elijah Newren <newren@gmail.com> wrote:
> On Wed, Nov 11, 2020 at 6:51 AM Derrick Stolee <stolee@gmail.com> wrote:
> > If the *_null values were in an array, instead, then all of these
> > lines could be grouped as a macro:
> >
> >         unsigned null_oid[3] = {
> >                 !(mask & 1),
> >                 !(mask & 2),
> >                 !(mask & 4)
> >         };
>
> Hmm, I like it.  I think I'll rename trivial_merge() to
> non_null_match() (trivial merge suggests it can immediately be
> resolved which is not necessarily true if rename detection is on), but
> otherwise I'll use this.

Are we allowing non-constant array initializers in the codebase these
days? I don't see anything in CodingGuidelines suggesting the use of
them.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path
  2020-11-11 15:26   ` Derrick Stolee
@ 2020-11-11 18:16     ` Elijah Newren
  2020-11-11 22:06       ` Elijah Newren
  2020-11-12 18:39       ` Derrick Stolee
  0 siblings, 2 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-11 18:16 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List

On Wed, Nov 11, 2020 at 7:26 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/2/2020 3:43 PM, Elijah Newren wrote:
> > +static void setup_path_info(struct merge_options *opt,
> > +                         struct string_list_item *result,
> > +                         const char *current_dir_name,
> > +                         int current_dir_name_len,
> > +                         char *fullpath, /* we'll take over ownership */
> > +                         struct name_entry *names,
> > +                         struct name_entry *merged_version,
> > +                         unsigned is_null,     /* boolean */
> > +                         unsigned df_conflict, /* boolean */
> > +                         unsigned filemask,
> > +                         unsigned dirmask,
> > +                         int resolved          /* boolean */)
> > +{
> > +     struct conflict_info *path_info;
>
> In addition to my concerns below about 'conflict_info' versus
> 'merged_info', I was doubly confused that 'result' in the parameter
> list is given a variable named 'pi' for "path info" and result->util
> eventually is equal to this path_info. What if we renamed 'result'
> to 'pi' for "path info" here, then operated on 'pi->util' in this
> method?

result->util (or pi->util if you rename) is void *, making it hard to
operate on; you'd have to typecast at every usage.  Since it is used a
*lot*, it makes sense to have a typed pointer, and then just set
result->util to a copy of that value at the end.  That is what
path_info is for.

>
> > +     path_info = xcalloc(1, resolved ? sizeof(struct merged_info) :
> > +                                       sizeof(struct conflict_info));
>
> Hm. I'm happy to have a `struct merged_info *` pointing to a
> `struct conflict_info`, but the opposite seems very dangerous.

Yeah, this is perhaps the scariest bit, and if it were a side data
structure rather than the fundamental main one that was central to the
algorithm, then safety would trump performance concerns.  But since it
is the main data structure and likely the biggest (once you count the
various copies for each relevant path), then it might be worth the
extra care needed to shave off the extra memory.  Maybe we can still
tweak things to get some safety back without killing performance so
let me consider each of your suggestions/questions.

If I define it as a merged_info*, the compiler will only let me modify
fields within the merged_info portion of the struct.  Should I
typecast every line that touches the bits in the resolved==0 path
where I need to set fields within the conflict_info portion?
Alternatively, would a code flow like the following make you happier?

    struct conflict_info *ci = NULL;
    struct merged_info *mi = xcalloc(...);
    result->util = mi;
    /* Operate on mi */
    ...
    if (resolved)
        return;
    ci = (struct conflict_info *)mi;
    /* Operate on ci */
    ...

In either case, the returned item has potentially different sizes, so
the caller will still have to take care so I'm not sure how much extra
this structure within setup_path_info() buys us.

> Perhaps we should always use sizeof(struct conflict_info)?

We could do that; it'd certainly waste memory as I expect many entries
to be unmodified (on one or both sides of history).  But I'd way
rather go this route than splitting or re-arranging this data
structure.

> We can use path_info->merged.clean to detect whether the rest of
> the data is worth looking at. (Or, in your case, whether or not
> it is allocated.)

ci->merged.clean is used to determine whether to look at the rest of
the data, yes -- and that's an enforced assumption throughout the code
(as alluded to by the comment in the merge_options_internal data
structure that "paths" maps pathnames to merged_info and conflict_info
types).  However, that is not quite the same as using the clean bit to
determine if more data is allocated; something can be allocated as a
conflict_info rather than a merged_info due to both sides
modifying the same path, but then a threeway content merge comes back
clean and ci->merged.clean is updated from 0 to 1.  The extra data
remains allocated, but nothing in the algorithm ever needs to use
anything outside the merged bits for that path again.  (Actually, let
me state that more forcefully: nothing is *allowed* to look outside
the merged bits for that path once the clean bit is updated to 1).
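
To make the layout rule concrete, here is a minimal standalone sketch.
Everything in it is a toy: the toy_* structs have made-up field lists,
and toy_alloc_path_info() / toy_conflict_view() are hypothetical
helpers, not functions from the series.  The only parts taken from
the discussion are that the conflict struct embeds the merged struct
as its first member, that the allocation size depends on "resolved",
and that nothing may look past the merged part once clean is set:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy versions of the two structs; the field lists are assumptions. */
struct toy_merged_info {
	unsigned mode;
	unsigned clean:1;
};

struct toy_conflict_info {
	struct toy_merged_info merged; /* must stay the first member */
	unsigned stage_modes[3];
	unsigned filemask:3;
};

/* Allocate only as much as the path needs.  Returns NULL on OOM. */
static void *toy_alloc_path_info(int resolved)
{
	struct toy_merged_info *mi;

	mi = calloc(1, resolved ? sizeof(struct toy_merged_info)
				: sizeof(struct toy_conflict_info));
	if (mi)
		mi->clean = !!resolved;
	return mi;
}

/*
 * Typed view of an untyped "util" pointer.  Returns NULL once the
 * clean bit is set, even if the entry was originally allocated as a
 * conflict, so callers cannot read past the merged portion.
 */
static struct toy_conflict_info *toy_conflict_view(void *util)
{
	struct toy_merged_info *mi = util;

	assert(mi);
	return mi->clean ? NULL : (struct toy_conflict_info *)mi;
}
```

Reading the clean bit through a merged-info pointer is fine in both
cases, because a pointer to a struct may be converted to a pointer to
its first member.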

> I imagine that in a large repo we will need many of these structs,
> but very few of them will actually need to be conflicts, so using
> 'struct conflict_info' always will lead to memory bloat. But in
> that case, would we not be better off with an array instead of a
> scattering of data across the heap?

Not sure what you're trying to solve here.  Putting them in an array
would mean copying every single one of them every time the array is
resized.  It would also make insertion or deletion very expensive.
And it'd prevent O(1) lookup.  It'd be a horrible data structure all
around.  Maybe you're assuming you know exactly how many entries you
need and what they are before the merge algorithm starts?  I don't.
In fact, I can't even give a good magnitude approximation of how many
it'll be before a merge starts.  (Even if you assume it's a case where
you have an index loaded and that index is related to the merge being
done, the number can be and often is much smaller than the number of
entries in the index.  And just to cover the extremes, in unusual
cases the number might be much larger than the number of index entries
if the merge base and side being merged in have far more paths).

This was the whole point of the strmap API[1] I recently added --
provide a hashmap specialized for the case where the key is a string.
That way I get fast lookup, and relatively fast resize as the hash
only contains pointers to the values, not a copy of the values.

Is your concern that allocating many small structs is more expensive
than allocating a huge block of them?  If so, yes that matters, but
see the mem_pool related patches of the strmap API[1].

[1] https://lore.kernel.org/git/pull.835.v5.git.git.1604622298.gitgitgadget@gmail.com/

> Perhaps 'struct conflict_info' shouldn't contain a 'struct merged_info'
> and instead be just the "extra" data. Then we could have a contiguous
> array of 'struct merged_info' values for most of the paths, but heap
> pointers for 'struct conflict_info' as necessary.
>
> It's also true that I haven't fully formed a mental model for how these
> are used in your algorithm, so I'll keep reading.

I don't understand how contiguous arrays are practical or desirable
(I'm close to saying they're not possible, but one could employ some
extremes to get them, as mentioned above).

I could possibly have two strmaps; one mapping paths to a merged_info,
and another (with fewer entries) mapping paths to a conflict_info.
Seems like a royal pain, and would make for some pretty ugly code (I
have other places that had to use two strmaps and I've hated it every
time -- but those were cases of strmaps that were used much, much less
than the "paths" one).  Might also slightly hurt perf.

> > +     path_info->merged.directory_name = current_dir_name;
> > +     path_info->merged.basename_offset = current_dir_name_len;
> > +     path_info->merged.clean = !!resolved;
> > +     if (resolved) {
> > +             path_info->merged.result.mode = merged_version->mode;
> > +             oidcpy(&path_info->merged.result.oid, &merged_version->oid);
> > +             path_info->merged.is_null = !!is_null;
> > +     } else {
> > +             int i;
> > +
> > +             for (i = 0; i < 3; i++) {
> > +                     path_info->pathnames[i] = fullpath;
> > +                     path_info->stages[i].mode = names[i].mode;
> > +                     oidcpy(&path_info->stages[i].oid, &names[i].oid);
> > +             }
> > +             path_info->filemask = filemask;
> > +             path_info->dirmask = dirmask;
> > +             path_info->df_conflict = !!df_conflict;
> > +     }
> > +     strmap_put(&opt->priv->paths, fullpath, path_info);
> > +     result->string = fullpath;
> > +     result->util = path_info;
>
> This is set in all cases, so should we use it everywhere? Naturally,
> there might be a cost to the extra pointer indirection, so maybe we
> create a 'struct conflict_info *util' to operate on during this
> method, but set 'result->util = util' right after allocating so we
> know how it should behave?

result->util is void*, so it's not just an extra pointer indirection,
it's also the need to cast it to the appropriate type every time you
want to use it.  It's easier to have that done via another copy of the
pointer with the correct type, which is the reason for path_info.  So,
essentially, I did use util everywhere, it's just that I spelled it as
"path_info".  If I had named "path_info" "util" as you suggest,
wouldn't everyone be annoyed that I used a lame name that didn't name
the variable's purpose?

Perhaps I should just add a comment saying that path_info is a typed
alias/copy of result->util when I define it?

> > @@ -91,10 +136,12 @@ static int collect_merge_info_callback(int n,
> >        */
> >       struct merge_options *opt = info->data;
> >       struct merge_options_internal *opti = opt->priv;
> > -     struct conflict_info *ci;
> > +     struct string_list_item pi;  /* Path Info */
> > +     struct conflict_info *ci; /* pi.util when there's a conflict */

Perhaps here I should mention that ci is just a typed copy of pi.util
(since pi.util is a void*).

> ...
>
> > +     setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
> > +                     names, NULL, 0, df_conflict, filemask, dirmask, 0);
> > +     ci = pi.util;
>
> Here is the use of 'pi' that I was talking about earlier.

...although, to be fair, I don't actually have all that many uses of
ci (at least not anymore) in this function.  So maybe typecasting
pi.util each of the three-or-so times it is used isn't so bad?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/20] fundamentals of merge-ort implementation
  2020-11-11 17:08 ` Derrick Stolee
@ 2020-11-11 18:35   ` Elijah Newren
  2020-11-11 20:48     ` Derrick Stolee
  0 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-11 18:35 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List

On Wed, Nov 11, 2020 at 9:09 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/2/2020 3:43 PM, Elijah Newren wrote:
> > Elijah Newren (20):
> >   merge-ort: setup basic internal data structures
> >   merge-ort: add some high-level algorithm structure
> >   merge-ort: port merge_start() from merge-recursive
> >   merge-ort: use histogram diff
> >   merge-ort: add an err() function similar to one from merge-recursive
> >   merge-ort: implement a very basic collect_merge_info()
> >   merge-ort: avoid repeating fill_tree_descriptor() on the same tree
> >   merge-ort: compute a few more useful fields for collect_merge_info
> >   merge-ort: record stage and auxiliary info for every path
> >   merge-ort: avoid recursing into identical trees
> >   merge-ort: add a preliminary simple process_entries() implementation
> >   merge-ort: have process_entries operate in a defined order
>
> I got this far before my attention to detail really started slipping.
>
> >   merge-ort: step 1 of tree writing -- record basenames, modes, and oids
> >   merge-ort: step 2 of tree writing -- function to create tree object
> >   merge-ort: step 3 of tree writing -- handling subdirectories as we go
> >   merge-ort: basic outline for merge_switch_to_result()
> >   merge-ort: add implementation of checkout()
> >   tree: enable cmp_cache_name_compare() to be used elsewhere
> >   merge-ort: add implementation of record_unmerged_index_entries()
> >   merge-ort: free data structures in merge_finalize()
>
> I'll try to take another pass on these commits tomorrow.
>
> For the series as a whole I'd love to see at least one test that
> demonstrates that this code does something, if even only for a very
> narrow case.
>
> There's a lot of code being moved here, and it would be nice to have
> even a very simple test case that can check that we didn't leave any
> important die("not implemented") calls lying around or worse accessing
> an uninitialized pointer or something.

We absolutely left several die("not implemented") calls lying around.
The series was long enough at 20 patches; reviewers lose steam at 10
(at least both you and Jonathan have), so maybe I should have left
even more in there as an attempt to split up this series more.

However, if you run the testsuite with GIT_TEST_MERGE_ALGORITHM=ort,
then this series drops the number of failures in the testsuite from
around 2200, down to 1500.  So, there's about 700 testcases for you.

Also, there were several preparatory series all designed for getting
the testsuite in order for this new merge algorithm.  See the
following currently cooking topics:
  * en/merge-tests topic
  * en/dir-rename-tests
and the following topics that were previously merged:
  * 36d225c7d4 ("Merge branch 'en/merge-tests'", 2020-08-19)
  * cf372dc815 ("Merge branch 'en/test-cleanup'", 2020-03-09)
  * ac193e0e0a ("Merge branch 'en/merge-path-collision'", 2019-01-04)
  * c99033060f ("Merge branch 'en/t7405-recursive-submodule-conflicts'", 2018-08-02)
  * e6da45c7cd ("Merge branch 'en/t6036-merge-recursive-tests'", 2018-08-02)
  * 84e74c6403 ("Merge branch 'en/t6042-insane-merge-rename-testcases'", 2018-08-02)
  * bba1a5559c ("Merge branch 'en/t6036-recursive-corner-cases'", 2018-08-02)
  * 93b74a7cfa ("Merge branch 'en/merge-recursive-tests'", 2018-06-25)
and maybe others I missed.

* Re: [PATCH v2 12/20] merge-ort: have process_entries operate in a defined order
  2020-11-11 16:09   ` Derrick Stolee
@ 2020-11-11 18:58     ` Elijah Newren
  0 siblings, 0 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-11 18:58 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List

On Wed, Nov 11, 2020 at 8:09 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/2/2020 3:43 PM, Elijah Newren wrote:
> > We want to handle paths below a directory before needing to handle the
> > directory itself.  Also, we want to handle the directory immediately
> > after the paths below it, so we can't use simple lexicographic ordering
> > from strcmp (which would insert foo.txt between foo and foo/file.c).
> > Copy string_list_df_name_compare() from merge-recursive.c, and set up a
> > string list of paths sorted by that function so that we can iterate in
> > the desired order.
>
> This is at least the second time we've copied something from
> merge-recursive.c. Should we be starting a merge-utils.[c|h] to group
> these together under a common implementation?

I'm actually going to replace the function later for performance
reasons, but trying to make the series as simple as possible prompted
me to "just copy something for a starting point".

There will be more functions that I copy, yes, but since I sometimes
also tweak and since the goal is to delete merge-recursive.[ch], I
didn't really want to set up an infrastructure to share stuff.
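
To illustrate the ordering problem from the commit message, here is a
rough sketch of a directory/file-aware comparison.  It is not the function
copied from merge-recursive.c; df_name_cmp_sketch is a hypothetical name,
and the simplification (treat every name as if it ended in '/') is an
assumption made for the example:

```c
#include <assert.h>
#include <string.h>

/*
 * Compare two paths as though each ended in '/', breaking ties by
 * length so that a directory sorts immediately before the paths
 * beneath it.  Illustrative only; not the merge-recursive.c code.
 */
int df_name_cmp_sketch(const char *one, const char *two)
{
	size_t onelen = strlen(one), twolen = strlen(two);
	size_t len = onelen < twolen ? onelen : twolen;
	unsigned char c1, c2;
	int cmp = memcmp(one, two, len);

	if (cmp)
		return cmp;
	if (onelen == twolen)
		return 0;
	/* Past the common prefix, an exhausted name acts like "name/". */
	c1 = one[len] ? (unsigned char)one[len] : '/';
	c2 = two[len] ? (unsigned char)two[len] : '/';
	if (c1 != c2)
		return c1 - c2;
	/* "foo" vs "foo/...": the shorter one (the directory) goes first. */
	return onelen < twolen ? -1 : 1;
}
```

With strcmp the order is foo, foo.txt, foo/file.c; with the sketch it is
foo.txt, foo, foo/file.c.  Walking the second list from the end visits
foo/file.c and then foo right after it, with nothing in between.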

> > +     /* Put every entry from paths into plist, then sort */
> >       strmap_for_each_entry(&opt->priv->paths, &iter, e) {
> > +             string_list_append(&plist, e->key)->util = e->value;
> > +     }
>
> nit: are braces required here?

It might not be with the current macro definition of
strmap_for_each_entry(), but I think at one point it was (the macro
has undergone some changes over time).  Given the difficulty of
digging through the layers of macros (and the possible risk of it
changing in the future with hashmap or strmap changes), I wonder if
it's simpler for readers to just keep them?

* Re: [PATCH v2 11/20] merge-ort: add a preliminary simple process_entries() implementation
  2020-11-02 20:43 ` [PATCH v2 11/20] merge-ort: add a preliminary simple process_entries() implementation Elijah Newren
@ 2020-11-11 19:51   ` Jonathan Tan
  2020-11-12  1:48     ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Jonathan Tan @ 2020-11-11 19:51 UTC (permalink / raw)
  To: newren; +Cc: git, Jonathan Tan

Okay...let me review patches 11-15. (Patches 16-20 deal with checkout
and might be better reviewed by someone who is already familiar with how
the existing merge performs checkout. If no one reviews it, I might come
back to it if I have time.)

> +/* Per entry merge function */
> +static void process_entry(struct merge_options *opt,
> +			  const char *path,
> +			  struct conflict_info *ci)
> +{
> +	assert(!ci->merged.clean);
> +	assert(ci->filemask >= 0 && ci->filemask <= 7);

I see below that this function doesn't handle ci->match_mask == 7 (and
it doesn't need to because, I believe, there is a function in one of the
earlier patches that optimizes the case wherein all 3 match with each
other). Maybe add an assert here for that too.

> +
> +	if (ci->filemask == 0) {
> +		/*
> +		 * This is a placeholder for directories that were recursed
> +		 * into; nothing to do in this case.
> +		 */
> +		return;
> +	}
> +
> +	if (ci->df_conflict) {
> +		die("Not yet implemented.");
> +	}
> +
> +	/*
> +	 * NOTE: Below there is a long switch-like if-elseif-elseif... block
> +	 *       which the code goes through even for the df_conflict cases
> +	 *       above.  Well, it will once we don't die-not-implemented above.
> +	 */
> +	if (ci->match_mask) {
> +		ci->merged.clean = 1;

OK, looks straightforward so far. It's a clean merge if 2 match. (As I
said earlier, at this point in the code, it is not possible for 3 to
match.)

> +		if (ci->match_mask == 6) {
> +			/* stages[1] == stages[2] */
> +			ci->merged.result.mode = ci->stages[1].mode;
> +			oidcpy(&ci->merged.result.oid, &ci->stages[1].oid);

If OURS and THEIRS match, use one of them arbitrarily (because they are
the same anyway). OK.

> +		} else {
> +			/* determine the mask of the side that didn't match */
> +			unsigned int othermask = 7 & ~ci->match_mask;
> +			int side = (othermask == 4) ? 2 : 1;

BASE matches with either OURS or THEIRS, so use the side that doesn't
match. OK.

> +
> +			ci->merged.is_null = (ci->filemask == ci->match_mask);

This works (if the non-matching bit in filemask is set, the file exists;
the comparison will be false and therefore is_null is false - and
correctly false because the file exists), but seems unnecessarily
clever. Couldn't you just check nullness of the OID (or through the
mode, like the line below it) and set it here?

Admittedly, the way you wrote it also verifies that filemask is what we
expect. I don't think it is important to verify it, but if you think it
is important, I think it is this verification that should go in the
assert statement.
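
For reference, a small sketch of the mask arithmetic in this branch,
pulled out of the patch into standalone helpers.  The names struct pick,
pick_changed_side() and side_is_absent() are hypothetical; the bit layout
(1 = BASE, 2 = stages[1], 4 = stages[2]) is taken from the match_mask == 6
comment in the patch.  side_is_absent() is one possible more direct check,
testing the filemask bit rather than the OID or mode:

```c
#include <assert.h>

struct pick {
	int side;	/* index into stages[]: 1 or 2 */
	int is_null;	/* 1 if the chosen side has no file */
};

/* Mirrors the else-branch: BASE matches exactly one side (mask 3 or 5). */
struct pick pick_changed_side(unsigned filemask, unsigned match_mask)
{
	struct pick p;
	unsigned othermask = 7 & ~match_mask;

	assert(match_mask == 3 || match_mask == 5);
	p.side = (othermask == 4) ? 2 : 1;
	/* If no bit beyond the matching ones is set, the other side is gone. */
	p.is_null = (filemask == match_mask);
	return p;
}

/* A more direct check: is the chosen side's bit missing from filemask? */
int side_is_absent(unsigned filemask, int side)
{
	return !(filemask & (1u << side));
}
```

For the inputs this branch can see (filemask contains every bit of
match_mask), the two checks agree; the direct one just does not depend on
that precondition.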

> +			ci->merged.result.mode = ci->stages[side].mode;
> +			oidcpy(&ci->merged.result.oid, &ci->stages[side].oid);
> +
> +			assert(othermask == 2 || othermask == 4);
> +			assert(ci->merged.is_null == !ci->merged.result.mode);
> +		}
> +	} else if (ci->filemask >= 6 &&
> +		   (S_IFMT & ci->stages[1].mode) !=
> +		   (S_IFMT & ci->stages[2].mode)) {
> +		/*
> +		 * Two different items from (file/submodule/symlink)
> +		 */
> +		die("Not yet implemented.");

There are no matches, and OURS and THEIRS have different types. OK.

> +	} else if (ci->filemask >= 6) {
> +		/*
> +		 * TODO: Needs a two-way or three-way content merge, but we're
> +		 * just being lazy and copying the version from HEAD and
> +		 * leaving it as conflicted.
> +		 */
> +		ci->merged.clean = 0;
> +		ci->merged.result.mode = ci->stages[1].mode;
> +		oidcpy(&ci->merged.result.oid, &ci->stages[1].oid);

OK.

> +	} else if (ci->filemask == 3 || ci->filemask == 5) {
> +		/* Modify/delete */
> +		die("Not yet implemented.");
> +	} else if (ci->filemask == 2 || ci->filemask == 4) {
> +		/* Added on one side */
> +		int side = (ci->filemask == 4) ? 2 : 1;
> +		ci->merged.result.mode = ci->stages[side].mode;
> +		oidcpy(&ci->merged.result.oid, &ci->stages[side].oid);
> +		ci->merged.clean = !ci->df_conflict && !ci->path_conflict;
> +	} else if (ci->filemask == 1) {
> +		/* Deleted on both sides */
> +		ci->merged.is_null = 1;
> +		ci->merged.result.mode = 0;
> +		oidcpy(&ci->merged.result.oid, &null_oid);
> +		ci->merged.clean = !ci->path_conflict;
> +	}

The rest is OK.

> +
> +	/*
> +	 * If still unmerged, record it separately.  This allows us to later
> +	 * iterate over just unmerged entries when updating the index instead
> +	 * of iterating over all entries.
> +	 */
> +	if (!ci->merged.clean)
> +		strmap_put(&opt->priv->unmerged, path, ci);
> +}
> +
>  static void process_entries(struct merge_options *opt,
>  			    struct object_id *result_oid)
>  {
> -	die("Not yet implemented.");
> +	struct hashmap_iter iter;
> +	struct strmap_entry *e;
> +
> +	if (strmap_empty(&opt->priv->paths)) {
> +		oidcpy(result_oid, opt->repo->hash_algo->empty_tree);
> +		return;
> +	}
> +
> +	strmap_for_each_entry(&opt->priv->paths, &iter, e) {
> +		/*
> +		 * WARNING: If ci->merged.clean is true, then ci does not
> +		 * actually point to a conflict_info but a struct merge_info.
> +		 */
> +		struct conflict_info *ci = e->value;
> +
> +		if (!ci->merged.clean)
> +			process_entry(opt, e->key, e->value);
> +	}
> +
> +	die("Tree creation not yet implemented");

The rest looks straightforward.

* Re: [PATCH v2 13/20] merge-ort: step 1 of tree writing -- record basenames, modes, and oids
  2020-11-02 20:43 ` [PATCH v2 13/20] merge-ort: step 1 of tree writing -- record basenames, modes, and oids Elijah Newren
@ 2020-11-11 20:01   ` Jonathan Tan
  2020-11-11 20:24     ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Jonathan Tan @ 2020-11-11 20:01 UTC (permalink / raw)
  To: newren; +Cc: git, Jonathan Tan

> +struct directory_versions {
> +	struct string_list versions;

Maybe comment that this is an unordered list of basenames to <whatever
the type of ci->merged.result is>.

> @@ -442,6 +464,7 @@ static void process_entries(struct merge_options *opt,
>  	struct strmap_entry *e;
>  	struct string_list plist = STRING_LIST_INIT_NODUP;
>  	struct string_list_item *entry;
> +	struct directory_versions dir_metadata;
>  
>  	if (strmap_empty(&opt->priv->paths)) {
>  		oidcpy(result_oid, opt->repo->hash_algo->empty_tree);
> @@ -458,6 +481,9 @@ static void process_entries(struct merge_options *opt,
>  	plist.cmp = string_list_df_name_compare;
>  	string_list_sort(&plist);
>  
> +	/* other setup */
> +	string_list_init(&dir_metadata.versions, 0);
> +

Might be clearer to just initialize dir_metadata as {
STRING_LIST_INIT_NODUP }.

The rest makes sense.

* Re: [PATCH v2 13/20] merge-ort: step 1 of tree writing -- record basenames, modes, and oids
  2020-11-11 20:01   ` Jonathan Tan
@ 2020-11-11 20:24     ` Elijah Newren
  2020-11-12 20:39       ` Jonathan Tan
  0 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-11 20:24 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Git Mailing List

On Wed, Nov 11, 2020 at 12:01 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > +struct directory_versions {
> > +     struct string_list versions;
>
> Maybe comment that this is an unordered list of basenames to <whatever
> the type of ci->merged.result is>.

There actually is an order, and it's important.  It's reverse
lexicographic order of full pathnames (the ordering comes from the
fact that process_entries() iterates paths in that order).  The
reasons for that ordering are (1) all the basenames within a directory
are adjacent so that I can write out a tree for a directory as soon as
it is done, and (2) paths within a directory are listed before the
directory itself so that I get the necessary info for subtrees before
trying to write out their parent trees.

It's not until later patches that I take advantage of this ordering
(and when I do I have a very long commit message to describe it all),
but I can add a comment that this is a list of basenames to
merged_info.

>
> > @@ -442,6 +464,7 @@ static void process_entries(struct merge_options *opt,
> >       struct strmap_entry *e;
> >       struct string_list plist = STRING_LIST_INIT_NODUP;
> >       struct string_list_item *entry;
> > +     struct directory_versions dir_metadata;
> >
> >       if (strmap_empty(&opt->priv->paths)) {
> >               oidcpy(result_oid, opt->repo->hash_algo->empty_tree);
> > @@ -458,6 +481,9 @@ static void process_entries(struct merge_options *opt,
> >       plist.cmp = string_list_df_name_compare;
> >       string_list_sort(&plist);
> >
> > +     /* other setup */
> > +     string_list_init(&dir_metadata.versions, 0);
> > +
>
> Might be clearer to just initialize dir_metadata as {
> STRING_LIST_INIT_NODUP }.

It'll eventually grow to { STRING_LIST_INIT_NODUP,
STRING_LIST_INIT_NODUP, NULL, 0 }, which is a tad long, but if the
initializer is clearer I'm happy to switch over to it.

> The rest makes sense.

* Re: [PATCH v2 14/20] merge-ort: step 2 of tree writing -- function to create tree object
  2020-11-02 20:43 ` [PATCH v2 14/20] merge-ort: step 2 of tree writing -- function to create tree object Elijah Newren
@ 2020-11-11 20:47   ` Jonathan Tan
  2020-11-11 21:21     ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Jonathan Tan @ 2020-11-11 20:47 UTC (permalink / raw)
  To: newren; +Cc: git, Jonathan Tan

> +static void write_tree(struct object_id *result_oid,
> +		       struct string_list *versions,
> +		       unsigned int offset)
> +{
> +	size_t maxlen = 0;
> +	unsigned int nr = versions->nr - offset;
> +	struct strbuf buf = STRBUF_INIT;
> +	struct string_list relevant_entries = STRING_LIST_INIT_NODUP;
> +	int i;
> +
> +	/*
> +	 * We want to sort the last (versions->nr-offset) entries in versions.
> +	 * Do so by abusing the string_list API a bit: make another string_list
> +	 * that contains just those entries and then sort them.
> +	 *
> +	 * We won't use relevant_entries again and will let it just pop off the
> +	 * stack, so there won't be allocation worries or anything.
> +	 */
> +	relevant_entries.items = versions->items + offset;
> +	relevant_entries.nr = versions->nr - offset;
> +	string_list_sort(&relevant_entries);
> +
> +	/* Pre-allocate some space in buf */
> +	for (i = 0; i < nr; i++) {
> +		maxlen += strlen(versions->items[offset+i].string) + 34;

Probably should include the_hash_algo->rawsz instead of hardcoding 34.

The rest looks straightforward.

* Re: [PATCH v2 00/20] fundamentals of merge-ort implementation
  2020-11-11 18:35   ` Elijah Newren
@ 2020-11-11 20:48     ` Derrick Stolee
  2020-11-11 21:18       ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Derrick Stolee @ 2020-11-11 20:48 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List

On 11/11/2020 1:35 PM, Elijah Newren wrote:
> On Wed, Nov 11, 2020 at 9:09 AM Derrick Stolee <stolee@gmail.com> wrote:
>> For the series as a whole I'd love to see at least one test that
>> demonstrates that this code does something, if even only for a very
>> narrow case.
>>
>> There's a lot of code being moved here, and it would be nice to have
>> even a very simple test case that can check that we didn't leave any
>> important die("not implemented") calls lying around or worse accessing
>> an uninitialized pointer or something.
> 
> We absolutely left several die("not implemented") calls lying around.
> The series was long enough at 20 patches; reviewers lose steam at 10
> (at least both you and Jonathan have), so maybe I should have left
> even more in there as an attempt to split up this series more.
> 
> However, if you run the testsuite with GIT_TEST_MERGE_ALGORITHM=ort,
> then this series drops the number of failures in the testsuite from
> around 2200, down to 1500.  So, there's about 700 testcases for you.

Sorry that I'm jumping in to the series-of-series in the middle, so
I am unfamiliar with the previous progress and testing strategy. This
"number of test failures" metric is sufficient to demonstrate the
progress provided in this series. Perhaps it was even in your v1 cover
letter.

Thanks,
-Stolee


* Re: [PATCH v2 00/20] fundamentals of merge-ort implementation
  2020-11-11 20:48     ` Derrick Stolee
@ 2020-11-11 21:18       ` Elijah Newren
  0 siblings, 0 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-11 21:18 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List

Hi Derrick,

On Wed, Nov 11, 2020 at 12:48 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/11/2020 1:35 PM, Elijah Newren wrote:
> > On Wed, Nov 11, 2020 at 9:09 AM Derrick Stolee <stolee@gmail.com> wrote:
> >> For the series as a whole I'd love to see at least one test that
> >> demonstrates that this code does something, if even only for a very
> >> narrow case.
> >>
> >> There's a lot of code being moved here, and it would be nice to have
> >> even a very simple test case that can check that we didn't leave any
> >> important die("not implemented") calls lying around or worse accessing
> >> an uninitialized pointer or something.
> >
> > We absolutely left several die("not implemented") calls lying around.
> > The series was long enough at 20 patches; reviewers lose steam at 10
> > (at least both you and Jonathan have), so maybe I should have left
> > even more in there as an attempt to split up this series more.
> >
> > However, if you run the testsuite with GIT_TEST_MERGE_ALGORITHM=ort,
> > then this series drops the number of failures in the testsuite from
> > around 2200, down to 1500.  So, there's about 700 testcases for you.
>
> Sorry that I'm jumping in to the series-of-series in the middle, so
> I am unfamiliar with the previous progress and testing strategy. This

Not a problem at all.  Thanks much for jumping in and taking a look!
You always provide some good feedback and suggestions.

(Besides, those testcase changes have been spread over two and a half
years...hard to stay on top of all of them.)

> "number of test failures" metric is sufficient to demonstrate the
> progress provided in this series. Perhaps it was even in your v1 cover
> letter.

Um, oops; it's not.  I did mention there were still some "not
implemented" messages left, but didn't mention the testcase counts.
But even that mention is apparently in the v1 cover letter rather than
v2, and v2 wasn't sent in-reply-to v1, so it's harder to catch that.
Sorry about that; I'll include the testcase counts in the v3 cover
letter.

* Re: [PATCH v2 14/20] merge-ort: step 2 of tree writing -- function to create tree object
  2020-11-11 20:47   ` Jonathan Tan
@ 2020-11-11 21:21     ` Elijah Newren
  0 siblings, 0 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-11 21:21 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Git Mailing List

On Wed, Nov 11, 2020 at 12:47 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > +static void write_tree(struct object_id *result_oid,
> > +                    struct string_list *versions,
> > +                    unsigned int offset)
> > +{
> > +     size_t maxlen = 0;
> > +     unsigned int nr = versions->nr - offset;
> > +     struct strbuf buf = STRBUF_INIT;
> > +     struct string_list relevant_entries = STRING_LIST_INIT_NODUP;
> > +     int i;
> > +
> > +     /*
> > +      * We want to sort the last (versions->nr-offset) entries in versions.
> > +      * Do so by abusing the string_list API a bit: make another string_list
> > +      * that contains just those entries and then sort them.
> > +      *
> > +      * We won't use relevant_entries again and will let it just pop off the
> > +      * stack, so there won't be allocation worries or anything.
> > +      */
> > +     relevant_entries.items = versions->items + offset;
> > +     relevant_entries.nr = versions->nr - offset;
> > +     string_list_sort(&relevant_entries);
> > +
> > +     /* Pre-allocate some space in buf */
> > +     for (i = 0; i < nr; i++) {
> > +             maxlen += strlen(versions->items[offset+i].string) + 34;
>
> Probably should include the_hash_algo->rawsz instead of hardcoding 34.

Ah, indeed.  And I should submit a patch for fast-import.c to update
it to not hardcode 34 either (though I'll submit the fast-import
change separate from this series).
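
As a rough sketch of what the non-hardcoded version could look like: a
tree entry is serialized as "<octal mode> <name>\0<raw hash>", so the
per-entry overhead is the hash size plus a few bytes.  The helper names
below are hypothetical, and rawsz is passed in as a parameter (standing in
for the_hash_algo->rawsz) so that the example is self-contained:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Upper bound on the bytes one tree entry occupies:
 *   "<octal mode> <name>\0<raw hash>"
 * The mode is at most 6 octal digits (e.g. "100644").
 */
size_t tree_entry_bound(const char *name, size_t rawsz)
{
	const size_t extra = 6 /* mode */ + 1 /* space */ + 1 /* NUL */;

	assert(name != NULL);
	return strlen(name) + extra + rawsz;
}

/* Sum of the per-entry bounds, usable as a pre-allocation hint. */
size_t tree_buffer_bound(const char *const *names, size_t nr, size_t rawsz)
{
	size_t i, maxlen = 0;

	for (i = 0; i < nr; i++)
		maxlen += tree_entry_bound(names[i], rawsz);
	return maxlen;
}
```

With a 20-byte hash this comes to 28 bytes of overhead per entry, so 34 is
a slight overestimate; with a 32-byte hash it is 40, so 34 would
under-reserve.  Since this only sizes a growable buffer, that would cost
extra reallocations rather than correctness.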

* Re: [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path
  2020-11-11 18:16     ` Elijah Newren
@ 2020-11-11 22:06       ` Elijah Newren
  2020-11-12 18:23         ` Derrick Stolee
  2020-11-12 18:39       ` Derrick Stolee
  1 sibling, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-11 22:06 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List

On Wed, Nov 11, 2020 at 10:16 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Wed, Nov 11, 2020 at 7:26 AM Derrick Stolee <stolee@gmail.com> wrote:
> >
> > On 11/2/2020 3:43 PM, Elijah Newren wrote:
> > > +static void setup_path_info(struct merge_options *opt,
> > > +                         struct string_list_item *result,
> > > +                         const char *current_dir_name,
> > > +                         int current_dir_name_len,
> > > +                         char *fullpath, /* we'll take over ownership */
> > > +                         struct name_entry *names,
> > > +                         struct name_entry *merged_version,
> > > +                         unsigned is_null,     /* boolean */
> > > +                         unsigned df_conflict, /* boolean */
> > > +                         unsigned filemask,
> > > +                         unsigned dirmask,
> > > +                         int resolved          /* boolean */)
> > > +{
> > > +     struct conflict_info *path_info;
> >
> > In addition to my concerns below about 'conflict_info' versus
> > 'merged_info', I was doubly confused that 'result' in the parameter
> > list is given a variable named 'pi' for "path info" and result->util
> > eventually is equal to this path_info. What if we renamed 'result'
> > to 'pi' for "path info" here, then operated on 'pi->util' in this
> > method?
>
> result->util (or pi->util if you rename) is void *, making it hard to
> operate on; you'd have to typecast at every usage.  Since it is used a
> *lot*, it makes sense to have a typed pointer, and then just set
> result->util to a copy of that value at the end.  That is what
> path_info is for.
>
> >
> > > +     path_info = xcalloc(1, resolved ? sizeof(struct merged_info) :
> > > +                                       sizeof(struct conflict_info));
> >
> > Hm. I'm happy to have a `struct merged_info *` pointing to a
> > `struct conflict_info`, but the opposite seems very dangerous.
>
> Yeah, this is perhaps the scariest bit, and if it were a side data
> structure rather than the fundamental main one that was central to the
> algorithm, then safety would trump performance concerns.  But since it
> is the main data structure and likely the biggest (once you count the
> various copies for each relevant path), then it might be worth the
> extra care needed to shave off the extra memory.  Maybe we can still
> tweak things to get some safety back without killing performance so
> let me consider each of your suggestions/questions.
>
> If I define it as a merged_info*, the compiler will only let me modify
> fields within the merged_info portion of the struct.  Should I
> typecast every line that touches the bits in the resolved==0 path
> where I need to set fields within the conflict_info portion?
> Alternatively, would a code flow like the following make you happier?
>
>     struct conflict_info *ci = NULL;
>     struct merged_info *mi = xcalloc(...);
>     result->util = mi;
>     /* Operate on mi */
>     ...
>     if (resolved)
>         return;
>     ci = (struct conflict_info *)mi;
>     /* Operate on ci */
>     ...
>
> In either case, the returned item has potentially different sizes, so
> the caller will still have to take care so I'm not sure how much extra
> this structure within setup_path_info() buys us.
>
> > Perhaps we should always use sizeof(struct conflict_info)?
>
> We could do that; it'd certainly waste memory as I expect many entries
> to be unmodified (on one or both sides of history).  But I'd way
> rather go this route than splitting or re-arranging this data
> structure.
>
> > We can use path_info->merged.clean to detect whether the rest of
> > the data is worth looking at. (Or, in your case, whether or not
> > it is allocated.)
>
> ci->merged.clean is used to determine whether to look at the rest of
> the data, yes -- and that's an enforced assumption throughout the code
> (as alluded to by the comment in the merge_options_internal data
> structure that "paths" maps pathnames to merged_info and conflict_info
> types).  However, that is not quite the same as using the clean bit to
> determine if more data is allocated; something can be allocated as a
> conflict_info rather than a merged_info due to both sides modifying
> the same path, but then a threeway content merge comes back
> clean and ci->merged.clean is updated from 0 to 1.  The extra data
> remains allocated, but nothing in the algorithm ever needs to use
> anything outside the merged bits for that path again.  (Actually, let
> me state that more forcefully: nothing is *allowed* to look outside
> the merged bits for that path once the clean bit is updated to 1).
>
> > I imagine that in a large repo we will need many of these structs,
> > but very few of them will actually need to be conflicts, so using
> > 'struct conflict_info' always will lead to memory bloat. But in
> > that case, would we not be better off with an array instead of a
> > scattering of data across the heap?
>
> Not sure what you're trying to solve here.  Putting them in an array
> would mean copying every single one of them every time the array is
> resized.  It would also make insertion or deletion very expensive.
> And it'd prevent O(1) lookup.  It'd be a horrible data structure all
> around.  Maybe you're assuming you know exactly how many entries you
> need and what they are before the merge algorithm starts?  I don't.
> In fact, I can't even give a good magnitude approximation of how many
> it'll be before a merge starts.  (Even if you assume it's a case where
> you have an index loaded and that index is related to the merge being
> done, the number can be and often is much smaller than the number of
> entries in the index.  And just to cover the extremes, in unusual
> cases the number might be much larger than the number of index entries
> if the merge base and side being merged in have far more paths).
>
> This was the whole point of the strmap API[1] I recently added --
> provide a hashmap specialized for the case where the key is a string.
> That way I get fast lookup, and relatively fast resize as the hash
> only contains pointers to the values, not a copy of the values.
>
> Is your concern that allocating many small structs is more expensive
> than allocating a huge block of them?  If so, yes that matters, but
> see the mem_pool related patches of the strmap API[1].
>
> [1] https://lore.kernel.org/git/pull.835.v5.git.git.1604622298.gitgitgadget@gmail.com/


I just re-read what I wrote, here and below...and I need to apologize.
I tend to write, edit, revise, and repeat while composing emails and
the end result of my emails doesn't tend to reflect the path to get
there; I looped through that cycle more times than most on this email.
But, even worse, I added in a sentence or two that just shouldn't be
included regardless.  I think in particular this one sounds extremely
aggressive and dismissive which was not at all my intent.

I find your reviews to be very helpful, and I don't want to discourage
them.  Hopefully my comments didn't come across anywhere near as
strongly as they did to me on a second reading, but if they did, I'm
sorry.

> > Perhaps 'struct conflict_info' shouldn't contain a 'struct merged_info'
> > and instead be just the "extra" data. Then we could have a contiguous
> > array of 'struct merged_info' values for most of the paths, but heap
> > pointers for 'struct conflict_info' as necessary.
> >
> > It's also true that I haven't fully formed a mental model for how these
> > are used in your algorithm, so I'll keep reading.
>
> I don't understand how contiguous arrays are practical or desirable
> (I'm close to saying they're not possible, but one could employ some
> extremes to get them, as mentioned above).
>
> I could possibly have two strmaps; one mapping paths to a merged_info,
> and another (with fewer entries) mapping paths to a conflict_info.
> Seems like a royal pain, and would make for some pretty ugly code (I
> have other places that had to use two strmaps and I've hated it every
> time -- but those were cases of strmaps that were used much, much less
> than the "paths" one).  Might also slightly hurt performance.
>
> > > +     path_info->merged.directory_name = current_dir_name;
> > > +     path_info->merged.basename_offset = current_dir_name_len;
> > > +     path_info->merged.clean = !!resolved;
> > > +     if (resolved) {
> > > +             path_info->merged.result.mode = merged_version->mode;
> > > +             oidcpy(&path_info->merged.result.oid, &merged_version->oid);
> > > +             path_info->merged.is_null = !!is_null;
> > > +     } else {
> > > +             int i;
> > > +
> > > +             for (i = 0; i < 3; i++) {
> > > +                     path_info->pathnames[i] = fullpath;
> > > +                     path_info->stages[i].mode = names[i].mode;
> > > +                     oidcpy(&path_info->stages[i].oid, &names[i].oid);
> > > +             }
> > > +             path_info->filemask = filemask;
> > > +             path_info->dirmask = dirmask;
> > > +             path_info->df_conflict = !!df_conflict;
> > > +     }
> > > +     strmap_put(&opt->priv->paths, fullpath, path_info);
> > > +     result->string = fullpath;
> > > +     result->util = path_info;
> >
> > This is set in all cases, so should we use it everywhere? Naturally,
> > there might be a cost to the extra pointer indirection, so maybe we
> > create a 'struct conflict_info *util' to operate on during this
> > method, but set 'result->util = util' right after allocating so we
> > know how it should behave?
>
> result->util is void*, so it's not just an extra pointer indirection,
> it's also the need to cast it to the appropriate type every time you
> want to use it.  It's easier to have that done via another copy of the
> pointer with the correct type, which is the reason for path_info.  So,
> essentially, I did use util everywhere, it's just that I spelled it as
> "path_info".  If I had named "path_info" "util" as you suggest,
> wouldn't everyone be annoyed that I used a lame name that didn't name
> the variable's purpose?
>
> Perhaps I should just add a comment saying that path_info is a typed
> alias/copy of result->util when I define it?
>
> > > @@ -91,10 +136,12 @@ static int collect_merge_info_callback(int n,
> > >        */
> > >       struct merge_options *opt = info->data;
> > >       struct merge_options_internal *opti = opt->priv;
> > > -     struct conflict_info *ci;
> > > +     struct string_list_item pi;  /* Path Info */
> > > +     struct conflict_info *ci; /* pi.util when there's a conflict */
>
> Perhaps here I should mention that ci is just a typed copy of pi.util
> (since pi.util is a void*).
>
> > ...
> >
> > > +     setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
> > > +                     names, NULL, 0, df_conflict, filemask, dirmask, 0);
> > > +     ci = pi.util;
> >
> > Here is the use of 'pi' that I was talking about earlier.
>
> ...although, to be fair, I don't actually have all that many uses of
> ci (at least not anymore) in this function.  So maybe typecasting
> pi.util each of the three-or-so times it is used isn't so bad?

* Re: [PATCH v2 11/20] merge-ort: add a preliminary simple process_entries() implementation
  2020-11-11 19:51   ` Jonathan Tan
@ 2020-11-12  1:48     ` Elijah Newren
  0 siblings, 0 replies; 84+ messages in thread
From: Elijah Newren @ 2020-11-12  1:48 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Git Mailing List

On Wed, Nov 11, 2020 at 11:51 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> Okay...let me review patches 11-15. (Patches 16-20 deal with checkout
> and might be better reviewed by someone who is already familiar with how
> the existing merge performs checkout. If no one reviews it, I might come
> back to it if I have time.)

Thanks for the reviews!  I was hoping to see some comments on patch
15, as it's possibly the gnarliest.  It's a relatively straightforward
algorithm, just lots of bookkeeping.

And I think you took on the harder part of the remaining reviews.  :-)
 The checkout stuff is much easier, IMO -- and knowledge of how the
existing merge performs checkout wouldn't help at all with reviewing
that; it's just too different.

If you do find the time to look at the last five patches, or parts of
them, here are some tips for reviewing:
  * Patches 16, 18, and 20 are very straightforward; patches 17 and 19
are the ones that would benefit more from review.
  * Patch 17 is basically the twoway_merge subset of
merge_working_tree() from builtin/checkout.c.  Find that bit of code
and it's a direct comparison.
  * Patch 19 amounts to "how do I remove stage 0 entries in the index
and replace them with 1-3 higher order stages?".

> > +/* Per entry merge function */
> > +static void process_entry(struct merge_options *opt,
> > +                       const char *path,
> > +                       struct conflict_info *ci)
> > +{
> > +     assert(!ci->merged.clean);
> > +     assert(ci->filemask >= 0 && ci->filemask <= 7);
>
> I see below that this function doesn't handle ci->match_mask == 7 (and
> it doesn't need to because, I believe, there is a function in one of the
> earlier patches that optimizes the case wherein all 3 match with each
> other). Maybe add an assert here for that too.
>
> > +
> > +     if (ci->filemask == 0) {
> > +             /*
> > +              * This is a placeholder for directories that were recursed
> > +              * into; nothing to do in this case.
> > +              */
> > +             return;
> > +     }
> > +
> > +     if (ci->df_conflict) {
> > +             die("Not yet implemented.");
> > +     }
> > +
> > +     /*
> > +      * NOTE: Below there is a long switch-like if-elseif-elseif... block
> > +      *       which the code goes through even for the df_conflict cases
> > +      *       above.  Well, it will once we don't die-not-implemented above.
> > +      */
> > +     if (ci->match_mask) {
> > +             ci->merged.clean = 1;
>
> OK, looks straightforward so far. It's a clean merge if 2 match. (As I
> said earlier, at this point in the code, it is not possible for 3 to
> match.)
>
> > +             if (ci->match_mask == 6) {
> > +                     /* stages[1] == stages[2] */
> > +                     ci->merged.result.mode = ci->stages[1].mode;
> > +                     oidcpy(&ci->merged.result.oid, &ci->stages[1].oid);
>
> If OURS and THEIRS match, use one of them arbitrarily (because they are
> the same anyway). OK.
>
> > +             } else {
> > +                     /* determine the mask of the side that didn't match */
> > +                     unsigned int othermask = 7 & ~ci->match_mask;
> > +                     int side = (othermask == 4) ? 2 : 1;
>
> BASE matches with either OURS or THEIRS, so use the side that doesn't
> match. OK.
>
> > +
> > +                     ci->merged.is_null = (ci->filemask == ci->match_mask);
>
> This works (if the non-matching bit in filemask is set, the file exists;
> the comparison will be false and therefore is_null is false - and
> correctly false because the file exists), but seems unnecessarily
> clever. Couldn't you just check nullness of the OID (or through the
> mode, like the line below it) and set it here?
>
> Admittedly, the way you wrote it also verifies that filemask is what we
> expect. I don't think it is important to verify it, but if you think it
> is important, I think it is this verification that should go in the
> assert statement.

These points and the others earlier in this file and other points I
didn't comment on are all good points; thanks for all the suggestions.
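
The mask arithmetic under discussion can be checked in isolation. The following is a minimal sketch, assuming (as the quoted code suggests) that bit 0 of filemask and match_mask stands for the merge base, bit 1 for side 1 and bit 2 for side 2. The `two_way_result` struct and `resolve_two_way_match()` helper are hypothetical names used only for illustration; they are not part of the patch.

```c
#include <assert.h>

/* Hypothetical result type: which side to take, and whether it is a deletion. */
struct two_way_result {
	int side;	/* 1 or 2: the side that did NOT match the base */
	int is_null;	/* 1 if that side has no file at this path */
};

/*
 * Assumes the base matches exactly one side, i.e. match_mask is
 * 3 (base == side 1) or 5 (base == side 2).
 */
static struct two_way_result resolve_two_way_match(unsigned filemask,
						   unsigned match_mask)
{
	struct two_way_result r;
	unsigned othermask = 7 & ~match_mask;

	assert(match_mask == 3 || match_mask == 5);
	r.side = (othermask == 4) ? 2 : 1;
	/*
	 * If the only paths that exist are the ones that matched, the
	 * non-matching side must have deleted the file.
	 */
	r.is_null = (filemask == match_mask);
	return r;
}
```

With filemask 7 and match_mask 3 the sketch picks side 2 and reports is_null as 0; with filemask 3 and the same match_mask, side 2 has no file, so is_null is 1.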

> > +                     ci->merged.result.mode = ci->stages[side].mode;
> > +                     oidcpy(&ci->merged.result.oid, &ci->stages[side].oid);
> > +
> > +                     assert(othermask == 2 || othermask == 4);
> > +                     assert(ci->merged.is_null == !ci->merged.result.mode);
> > +             }
> > +     } else if (ci->filemask >= 6 &&
> > +                (S_IFMT & ci->stages[1].mode) !=
> > +                (S_IFMT & ci->stages[2].mode)) {
> > +             /*
> > +              * Two different items from (file/submodule/symlink)
> > +              */
> > +             die("Not yet implemented.");
>
> There are no matches, and OURS and THEIRS have different types. OK.
>
> > +     } else if (ci->filemask >= 6) {
> > +             /*
> > +              * TODO: Needs a two-way or three-way content merge, but we're
> > +              * just being lazy and copying the version from HEAD and
> > +              * leaving it as conflicted.
> > +              */
> > +             ci->merged.clean = 0;
> > +             ci->merged.result.mode = ci->stages[1].mode;
> > +             oidcpy(&ci->merged.result.oid, &ci->stages[1].oid);
>
> OK.
>
> > +     } else if (ci->filemask == 3 || ci->filemask == 5) {
> > +             /* Modify/delete */
> > +             die("Not yet implemented.");
> > +     } else if (ci->filemask == 2 || ci->filemask == 4) {
> > +             /* Added on one side */
> > +             int side = (ci->filemask == 4) ? 2 : 1;
> > +             ci->merged.result.mode = ci->stages[side].mode;
> > +             oidcpy(&ci->merged.result.oid, &ci->stages[side].oid);
> > +             ci->merged.clean = !ci->df_conflict && !ci->path_conflict;
> > +     } else if (ci->filemask == 1) {
> > +             /* Deleted on both sides */
> > +             ci->merged.is_null = 1;
> > +             ci->merged.result.mode = 0;
> > +             oidcpy(&ci->merged.result.oid, &null_oid);
> > +             ci->merged.clean = !ci->path_conflict;
> > +     }
>
> The rest is OK.
>
> > +
> > +     /*
> > +      * If still unmerged, record it separately.  This allows us to later
> > +      * iterate over just unmerged entries when updating the index instead
> > +      * of iterating over all entries.
> > +      */
> > +     if (!ci->merged.clean)
> > +             strmap_put(&opt->priv->unmerged, path, ci);
> > +}
> > +
> >  static void process_entries(struct merge_options *opt,
> >                           struct object_id *result_oid)
> >  {
> > -     die("Not yet implemented.");
> > +     struct hashmap_iter iter;
> > +     struct strmap_entry *e;
> > +
> > +     if (strmap_empty(&opt->priv->paths)) {
> > +             oidcpy(result_oid, opt->repo->hash_algo->empty_tree);
> > +             return;
> > +     }
> > +
> > +     strmap_for_each_entry(&opt->priv->paths, &iter, e) {
> > +             /*
> > +              * WARNING: If ci->merged.clean is true, then ci does not
> > +              * actually point to a conflict_info but a struct merged_info.
> > +              */
> > +             struct conflict_info *ci = e->value;
> > +
> > +             if (!ci->merged.clean)
> > +                     process_entry(opt, e->key, e->value);
> > +     }
> > +
> > +     die("Tree creation not yet implemented");
>
> The rest looks straightforward.

* Re: [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path
  2020-11-11 22:06       ` Elijah Newren
@ 2020-11-12 18:23         ` Derrick Stolee
  0 siblings, 0 replies; 84+ messages in thread
From: Derrick Stolee @ 2020-11-12 18:23 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List

On 11/11/2020 5:06 PM, Elijah Newren wrote:
> On Wed, Nov 11, 2020 at 10:16 AM Elijah Newren <newren@gmail.com> wrote:
>> This was the whole point of the strmap API[1] I recently added --
>> provide a hashmap specialized for the case where the key is a string.
>> That way I get fast lookup, and relatively fast resize as the hash
>> only contains pointers to the values, not a copy of the values.
>>
>> Is your concern that allocating many small structs is more expensive
>> than allocating a huge block of them?  If so, yes that matters, but
>> see the mem_pool related patches of the strmap API[1].
>>
>> [1] https://lore.kernel.org/git/pull.835.v5.git.git.1604622298.gitgitgadget@gmail.com/
> 
> 
> I just re-read what I wrote, here and below...and I need to apologize.
> I tend to write, edit, revise, and repeat while composing emails and
> the end result of my emails doesn't tend to reflect the path to get
> there; I looped through that cycle more times than most on this email.
> But, even worse, I added in a sentence or two that just shouldn't be
> included regardless.  I think in particular this one sounds extremely
> aggressive and dismissive which was not at all my intent.
> 
> I find your reviews to be very helpful, and I don't want to discourage
> them.  Hopefully my comments didn't come across anywhere near as
> strongly as they did to me on a second reading, but if they did, I'm
> sorry.

I did not feel any animosity. Your response was just so involved that
I didn't have time to respond in kind. I hope to do so later today.

The short version of my reply is "You know more about this than me,
and I appreciate the additional detail."

Thanks,
-Stolee

* Re: [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path
  2020-11-11 18:16     ` Elijah Newren
  2020-11-11 22:06       ` Elijah Newren
@ 2020-11-12 18:39       ` Derrick Stolee
  1 sibling, 0 replies; 84+ messages in thread
From: Derrick Stolee @ 2020-11-12 18:39 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List

On 11/11/2020 1:16 PM, Elijah Newren wrote:
> On Wed, Nov 11, 2020 at 7:26 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 11/2/2020 3:43 PM, Elijah Newren wrote:
>>> +static void setup_path_info(struct merge_options *opt,
>>> +                         struct string_list_item *result,
>>> +                         const char *current_dir_name,
>>> +                         int current_dir_name_len,
>>> +                         char *fullpath, /* we'll take over ownership */
>>> +                         struct name_entry *names,
>>> +                         struct name_entry *merged_version,
>>> +                         unsigned is_null,     /* boolean */
>>> +                         unsigned df_conflict, /* boolean */
>>> +                         unsigned filemask,
>>> +                         unsigned dirmask,
>>> +                         int resolved          /* boolean */)
>>> +{
>>> +     struct conflict_info *path_info;
>>
>> In addition to my concerns below about 'conflict_info' versus
>> 'merged_info', I was doubly confused that 'result' in the parameter
>> list is given a variable named 'pi' for "path info" and result->util
>> eventually is equal to this path_info. What if we renamed 'result'
>> to 'pi' for "path info" here, then operated on 'pi->util' in this
>> method?
> 
> result->util (or pi->util if you rename) is void *, making it hard to
> operate on; you'd have to typecast at every usage.  Since it is used a
> *lot*, it makes sense to have a typed pointer, and then just set
> result->util to a copy of that value at the end.  That is what
> path_info is for.

Good point. I need to be more careful thinking about types during
review.
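
A minimal sketch of the typed-alias pattern described above, assuming a simplified stand-in for string_list_item. The names `item_sketch`, `info_sketch` and `fill_item()` are hypothetical, and plain calloc() replaces xcalloc() so the block stands alone.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for string_list_item: util is an untyped pointer. */
struct item_sketch {
	char *string;
	void *util;
};

/* Hypothetical payload; the real struct has many more fields. */
struct info_sketch {
	unsigned filemask;
	unsigned dirmask;
};

/* Returns 0 on success, -1 if allocation fails. */
static int fill_item(struct item_sketch *result, unsigned filemask,
		     unsigned dirmask)
{
	/* Typed alias: every field access below needs no cast. */
	struct info_sketch *path_info;

	assert(result);
	path_info = calloc(1, sizeof(*path_info));
	if (!path_info)
		return -1;
	path_info->filemask = filemask;
	path_info->dirmask = dirmask;

	/* The untyped copy is set once, at the end. */
	result->util = path_info;
	return 0;
}
```

Operating on result->util directly would require a cast at each of the field assignments; the typed local keeps the cast count at zero inside the function.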

>>
>>> +     path_info = xcalloc(1, resolved ? sizeof(struct merged_info) :
>>> +                                       sizeof(struct conflict_info));
>>
>> Hm. I'm happy to have a `struct merged_info *` pointing to a
>> `struct conflict_info`, but the opposite seems very dangerous.
> 
> Yeah, this is perhaps the scariest bit, and if it were a side data
> structure rather than the fundamental main one that was central to the
> algorithm, then safety would trump performance concerns.  But since it
> is the main data structure and likely the biggest (once you count the
> various copies for each relevant path), then it might be worth the
> extra care needed to shave off the extra memory.  Maybe we can still
> tweak things to get some safety back without killing performance so
> let me consider each of your suggestions/questions.
> 
> If I define it as a merged_info*, the compiler will only let me modify
> fields within the merged_info portion of the struct.  Should I
> typecast every line that touches the bits in the resolved==0 path
> where I need to set fields within the conflict_info portion?
> Alternatively, would a code flow like the following make you happier?
> 
>     struct conflict_info *ci = NULL;
>     struct merged_info *mi = xcalloc(...);
>     result->util = mi;
>     /* Operate on mi */
>     ...
>     if (resolved)
>             return;
>     ci = (struct conflict_info *)mi;
>     /* Operate on ci */
>     ...
> 
> In either case, the returned item has potentially different sizes, so
> the caller will still have to take care so I'm not sure how much extra
> this structure within setup_path_info() buys us.

There might be good reason to use this example. Specifically,
always first cast into a 'struct merged_info *mi', then check
'mi->clean' before casting into 'struct conflict_info *ci'. It
definitely helps that something within the smaller memory
footprint gives an indicator as to whether the larger struct
should exist.
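
The two-stage access could look roughly like the sketch below. It assumes the smaller struct is the first member of the larger one, as the `path_info->merged` accesses in the patch imply; the struct and function names here (`merged_sketch`, `conflict_sketch`, `alloc_path_info()`, `read_filemask()`) are hypothetical, and calloc() stands in for xcalloc().

```c
#include <assert.h>
#include <stdlib.h>

struct merged_sketch {
	unsigned clean:1;
	unsigned is_null:1;
};

struct conflict_sketch {
	struct merged_sketch merged;	/* must stay the first member */
	unsigned filemask;
};

/* Allocate only as much as the entry needs; returns NULL on failure. */
static void *alloc_path_info(int resolved)
{
	struct merged_sketch *mi =
		calloc(1, resolved ? sizeof(struct merged_sketch)
				   : sizeof(struct conflict_sketch));
	if (mi)
		mi->clean = !!resolved;
	return mi;
}

/*
 * Stage 1: view the entry as the small struct and check the clean bit.
 * Stage 2: only if it is not clean, view it as the large struct.
 * Returns 1 and stores the filemask if the entry is unmerged, else 0.
 */
static int read_filemask(void *util, unsigned *out)
{
	struct merged_sketch *mi = util;
	struct conflict_sketch *ci;

	assert(util && out);
	if (mi->clean)
		return 0;	/* nothing beyond the small struct may be read */
	ci = (struct conflict_sketch *)mi;
	*out = ci->filemask;
	return 1;
}
```

The clean bit lives inside the smaller footprint, so the check itself never touches memory that might not have been allocated.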
 
>> Perhaps we should always use sizeof(struct conflict_info)?
> 
> We could do that; it'd certainly waste memory as I expect many entries
> to be unmodified (on one or both sides of history).  But I'd way
> rather go this route than splitting or re-arranging this data
> structure.

Yes, I agree exactly on this.

>> We can use path_info->merged.clean to detect whether the rest of
>> the data is worth looking at. (Or, in your case, whether or not
>> it is allocated.)
> 
> ci->merged.clean is used to determine whether to look at the rest of
> the data, yes -- and that's an enforced assumption throughout the code
> (as alluded to by the comment in the merge_options_internal data
> structure that "paths" maps pathnames to merged_info and conflict_info
> types).  However, that is not quite the same as using the clean bit to
> determine if more data is allocated; something can be allocated as a
> conflict_info rather than a merged_info due to both sides
> modifying the same path, but then a threeway content merge comes back
> clean and ci->merged.clean is updated from 0 to 1.  The extra data
> remains allocated, but nothing in the algorithm ever needs to use
> anything outside the merged bits for that path again.  (Actually, let
> me state that more forcefully: nothing is *allowed* to look outside
> the merged bits for that path once the clean bit is updated to 1).

Ok, so the two-stage casting (merged_info then conflict_info) would
still work even after the clean bit is enabled eventually. This assumes
that the threeway content merge data is cleaned up before losing
the conflict_info pointer.

>> I imagine that in a large repo we will need many of these structs,
>> but very few of them will actually need to be conflicts, so using
>> 'struct conflict_info' always will lead to memory bloat. But in
>> that case, would we not be better off with an array instead of a
>> scattering of data across the heap?
> 
> Not sure what you're trying to solve here.  Putting them in an array
> would mean copying every single one of them every time the array is
> resized.  It would also make insertion or deletion very expensive.
> And it'd prevent O(1) lookup.  It'd be a horrible data structure all
> around.  Maybe you're assuming you know exactly how many entries you
> need and what they are before the merge algorithm starts?  I don't.
> In fact, I can't even give a good magnitude approximation of how many
> it'll be before a merge starts.  (Even if you assume it's a case where
> you have an index loaded and that index is related to the merge being
> done, the number can be and often is much smaller than the number of
> entries in the index.  And just to cover the extremes, in unusual
> cases the number might be much larger than the number of index entries
> if the merge base and side being merged in have far more paths).

Yeah, I was mostly thinking about pooling allocations to reduce memory
fragmentation. But it's likely that we don't need to do that, or rather
you are already doing some of that in the strmap structure.

> This was the whole point of the strmap API[1] I recently added --
> provide a hashmap specialized for the case where the key is a string.
> That way I get fast lookup, and relatively fast resize as the hash
> only contains pointers to the values, not a copy of the values.
> 
> Is your concern that allocating many small structs is more expensive
> than allocating a huge block of them?  If so, yes that matters, but
> see the mem_pool related patches of the strmap API[1].
> 
> [1] https://lore.kernel.org/git/pull.835.v5.git.git.1604622298.gitgitgadget@gmail.com/

It appears that you already do some mempool stuff there [2], so I'm sure
you know more about how to optimize memory here. My initial reaction of
"that's a lot of calloc()s" could easily be punted to a later improvement
_if_ it is valuable at all.

[2] https://lore.kernel.org/git/3926c4c97bd08aac93d3f521273db9d76b4d5cd3.1605124942.git.gitgitgadget@gmail.com/

>> Perhaps 'struct conflict_info' shouldn't contain a 'struct merged_info'
>> and instead be just the "extra" data. Then we could have a contiguous
>> array of 'struct merged_info' values for most of the paths, but heap
>> pointers for 'struct conflict_info' as necessary.
>>
>> It's also true that I haven't fully formed a mental model for how these
>> are used in your algorithm, so I'll keep reading.
> 
> I don't understand how contiguous arrays are practical or desirable
> (I'm close to saying they're not possible, but one could employ some
> extremes to get them, as mentioned above).
> 
> I could possibly have two strmaps; one mapping paths to a merged_info,
> and another (with fewer entries) mapping paths to a conflict_info.
> Seems like a royal pain, and would make for some pretty ugly code (I
> have other places that had to use two strmaps and I've hated it every
> time -- but those were cases of strmaps that were used much, much less
> than the "paths" one).  Might also slightly hurt perf

I am convinced that the alternatives are harder to implement with no
clear benefit.

>>> +     path_info->merged.directory_name = current_dir_name;
>>> +     path_info->merged.basename_offset = current_dir_name_len;
>>> +     path_info->merged.clean = !!resolved;
>>> +     if (resolved) {
>>> +             path_info->merged.result.mode = merged_version->mode;
>>> +             oidcpy(&path_info->merged.result.oid, &merged_version->oid);
>>> +             path_info->merged.is_null = !!is_null;
>>> +     } else {
>>> +             int i;
>>> +
>>> +             for (i = 0; i < 3; i++) {
>>> +                     path_info->pathnames[i] = fullpath;
>>> +                     path_info->stages[i].mode = names[i].mode;
>>> +                     oidcpy(&path_info->stages[i].oid, &names[i].oid);
>>> +             }
>>> +             path_info->filemask = filemask;
>>> +             path_info->dirmask = dirmask;
>>> +             path_info->df_conflict = !!df_conflict;
>>> +     }
>>> +     strmap_put(&opt->priv->paths, fullpath, path_info);
>>> +     result->string = fullpath;
>>> +     result->util = path_info;
>>
>> This is set in all cases, so should we use it everywhere? Naturally,
>> there might be a cost to the extra pointer indirection, so maybe we
>> create a 'struct conflict_info *util' to operate on during this
>> method, but set 'result->util = util' right after allocating so we
>> know how it should behave?
> 
> result->util is void*, so it's not just an extra pointer indirection,
> it's also the need to cast it to the appropriate type every time you
> want to use it.  It's easier to have that done via another copy of the
> pointer with the correct type, which is the reason for path_info.  So,
> essentially, I did use util everywhere, it's just that I spelled it as
> "path_info".  If I had named "path_info" "util" as you suggest,
> wouldn't everyone be annoyed that I used a lame name that didn't name
> the variable's purpose?
> 
> Perhaps I should just add a comment saying that path_info is a typed
> alias/copy of result->util when I define it?

A comment wouldn't hurt.

>>> @@ -91,10 +136,12 @@ static int collect_merge_info_callback(int n,
>>>        */
>>>       struct merge_options *opt = info->data;
>>>       struct merge_options_internal *opti = opt->priv;
>>> -     struct conflict_info *ci;
>>> +     struct string_list_item pi;  /* Path Info */
>>> +     struct conflict_info *ci; /* pi.util when there's a conflict */
> 
> Perhaps here I should mention that ci is just a typed copy of pi.util
> (since pi.util is a void*).
> 
>> ...
>>
>>> +     setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
>>> +                     names, NULL, 0, df_conflict, filemask, dirmask, 0);
>>> +     ci = pi.util;
>>
>> Here is the use of 'pi' that I was talking about earlier.
> 
> ...although, to be fair, I don't actually have all that many uses of
> ci (at least not anymore) in this function.  So maybe typecasting
> pi.util each of the three-or-so times it is used isn't so bad?
 
While I usually use "three is many" for justifying extracting
duplicated code, for casting issues I usually think "two is
too many" is appropriate. Keep 'ci'.

Thanks,
-Stolee

* Re: [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go
  2020-11-02 20:43 ` [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go Elijah Newren
@ 2020-11-12 20:15   ` Jonathan Tan
  2020-11-12 22:30     ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Jonathan Tan @ 2020-11-12 20:15 UTC (permalink / raw)
  To: newren; +Cc: git, Jonathan Tan

Firstly, from [1]:

> Thanks for the reviews!  I was hoping to see some comments on patch
> 15, as it's possibly the gnarliest.  It's a relatively straightforward
> algorithm, just lots of bookkeeping.

I was planning to send this out yesterday, but couldn't finish it. :-P
Indeed, a lot of things to think about.

[1] https://lore.kernel.org/git/CABPp-BFgQX6Ash03x7z+RfE3ytbw3x0DzDSBrGddgMr_soODoA@mail.gmail.com/

[snip commit message]

Thanks for the thorough explanation.

> @@ -353,6 +353,9 @@ static int string_list_df_name_compare(const char *one, const char *two)
>  
>  struct directory_versions {
>  	struct string_list versions;
> +	struct string_list offsets;

Looking below (and at the explanation in the commit message), this is a
mapping from full paths to integers cast to a pointer type.
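
Storing an integer in a void * slot, as described above, can be sketched as follows. `offset_item`, `set_offset()` and `get_offset()` are hypothetical names; the round trip through uintptr_t relies on implementation-defined integer/pointer conversion, which is assumed here to preserve small values as it does on common platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a string_list_item: a name plus an opaque payload. */
struct offset_item {
	const char *string;
	void *util;
};

/* Store an index in the pointer-sized slot; no memory is allocated. */
static void set_offset(struct offset_item *item, size_t offset)
{
	assert(item);
	item->util = (void *)(uintptr_t)offset;
}

/* Recover the index; the pointer is never dereferenced. */
static size_t get_offset(const struct offset_item *item)
{
	assert(item);
	return (size_t)(uintptr_t)item->util;
}
```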

> +	const char *last_directory;
> +	unsigned last_directory_len;

Is this just the last entry in "versions"?

>  static void write_tree(struct object_id *result_oid,
> @@ -409,12 +412,100 @@ static void record_entry_for_tree(struct directory_versions *dir_metadata,
>  		/* nothing to record */
>  		return;
>  
> +	/*
> +	 * Note: write_completed_directories() already added
> +	 * entries for directories to dir_metadata->versions,
> +	 * so no need to handle ci->filemask == 0 again.
> +	 */
> +	if (!ci->merged.clean && !ci->filemask)
> +		return;
> +
>  	basename = path + ci->merged.basename_offset;
>  	assert(strchr(basename, '/') == NULL);
>  	string_list_append(&dir_metadata->versions,
>  			   basename)->util = &ci->merged.result;
>  }

Conceptually, I can see how the algorithm below inserts directories, but
I don't understand the significance of "!ci->merged.clean" in the change
above.

> +static void write_completed_directories(struct merge_options *opt,
> +					const char *new_directory_name,
> +					struct directory_versions *info)
> +{
> +	const char *prev_dir;
> +	struct merged_info *dir_info = NULL;
> +	unsigned int offset;
> +	int wrote_a_new_tree = 0;
> +
> +	if (new_directory_name == info->last_directory)
> +		return;

Pointer equality is OK here presumably because of the string interning
of directory names.

I'm starting to think that it might be too difficult to keep track of
where strings are interned. Maybe there should be a data structure
containing all interned strings, and make the path a struct or something
like that (to clearly indicate that the string inside comes from the
interned string data structure).
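
One way to make the interning explicit is sketched below. This is only an illustration of the idea, not what the patch does: `intern_table`, `intern()`, `interned_path` and `same_dir()` are hypothetical names, and a fixed-size linear table stands in for whatever map a real implementation would use.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_INTERNED 64

struct intern_table {
	char *strings[MAX_INTERNED];
	size_t nr;
};

/* Wrapper type: a value of this type always holds an interned pointer. */
struct interned_path {
	const char *dir;
};

/* Returns the canonical copy of s, or NULL if the table is full or OOM. */
static const char *intern(struct intern_table *t, const char *s)
{
	size_t i;
	char *copy;

	assert(t && s);
	for (i = 0; i < t->nr; i++)
		if (!strcmp(t->strings[i], s))
			return t->strings[i];
	if (t->nr == MAX_INTERNED)
		return NULL;
	copy = malloc(strlen(s) + 1);
	if (!copy)
		return NULL;
	strcpy(copy, s);
	t->strings[t->nr++] = copy;
	return copy;
}

/* Pointer equality is valid because both sides came from intern(). */
static int same_dir(struct interned_path a, struct interned_path b)
{
	return a.dir == b.dir;
}
```

The wrapper struct documents in the type system which strings may be compared by pointer.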

> +	/*
> +	 * If we are just starting (last_directory is NULL), or last_directory
> +	 * is a prefix of the current directory, then we can just update
> +	 * last_directory and record the offset where we started this directory.
> +	 */
> +	if (info->last_directory == NULL ||
> +	    !strncmp(new_directory_name, info->last_directory,
> +		     info->last_directory_len)) {

Git has starts_with() for prefix checking. (May not be as optimized as
this one, though.)
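
For comparison, the two prefix checks can be sketched side by side. `has_prefix()` below is a standalone approximation of what a starts_with()-style helper does, and `has_prefix_len()` mirrors the strncmp() form in the patch; both names are hypothetical and neither is copied from Git's source.

```c
#include <assert.h>
#include <string.h>

/* Walks both strings; stops at the first mismatch or end of prefix. */
static int has_prefix(const char *str, const char *prefix)
{
	assert(str && prefix);
	for (; *prefix; str++, prefix++)
		if (*str != *prefix)
			return 0;
	return 1;
}

/* Same test when the caller has already cached strlen(prefix). */
static int has_prefix_len(const char *str, const char *prefix,
			  size_t prefix_len)
{
	assert(str && prefix);
	return !strncmp(str, prefix, prefix_len);
}
```

The second form avoids re-scanning the prefix for its terminator, which is presumably why the patch caches last_directory_len.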

> +		uintptr_t offset = info->versions.nr;
> +
> +		info->last_directory = new_directory_name;
> +		info->last_directory_len = strlen(info->last_directory);
> +		string_list_append(&info->offsets,
> +				   info->last_directory)->util = (void*)offset;
> +		return;
> +	}

Due to the way this is sorted, there might be a jump of 2 or more
directories. (The commit message also provides such an example - from ""
to "src/moduleB", without going through "src".)

> +	/*
> +	 * At this point, ne (next entry) is within a different directory
> +	 * than the last entry, so we need to create a tree object for all
> +	 * the entries in info->versions that are under info->last_directory.
> +	 */

There's no "ne" below.

> +	dir_info = strmap_get(&opt->priv->paths, info->last_directory);
> +	assert(dir_info);
> +	offset = (uintptr_t)info->offsets.items[info->offsets.nr-1].util;
> +	if (offset == info->versions.nr) {
> +		dir_info->is_null = 1;
> +	} else {
> +		dir_info->result.mode = S_IFDIR;
> +		write_tree(&dir_info->result.oid, &info->versions, offset);
> +		wrote_a_new_tree = 1;
> +	}

I was trying to figure out the cases in which offset would be
info->versions.nr - if such a case existed, and if yes, would it be
incorrect to skip creating such a tree because presumably this offset
exists in info->offsets for a reason. Do you know in which situation
offset would equal info->versions.nr?

> +	/*
> +	 * We've now used several entries from info->versions and one entry
> +	 * from info->offsets, so we get rid of those values.
> +	 */
> +	info->offsets.nr--;
> +	info->versions.nr = offset;

OK.

> +	/*
> +	 * Now we've got an OID for last_directory in dir_info.  We need to
> +	 * add it to info->versions for it to be part of the computation of
> +	 * its parent directories' OID.  But first, we have to find out what
> +	 * its parent name was and whether that matches the previous
> +	 * info->offsets or we need to set up a new one.
> +	 */
> +	prev_dir = info->offsets.nr == 0 ? NULL :
> +		   info->offsets.items[info->offsets.nr-1].string;
> +	if (new_directory_name != prev_dir) {
> +		uintptr_t c = info->versions.nr;
> +		string_list_append(&info->offsets,
> +				   new_directory_name)->util = (void*)c;
> +	}

Because of the possible jump of 2 or more directories that I mentioned
earlier, there may be gaps in the offsets. So it makes sense that we
sometimes need to insert an intermediate one.

I wonder if the code would be clearer if we had explicit "begin tree"
and "end tree" steps just like in list-objects-filter.c (LOFS_BEGIN_TREE
and LOFS_END_TREE). Here we have "end tree" (because of the way the
entries were sorted) but not "begin tree". If we had "begin tree", we
probably would be able to create the necessary offsets in a loop at that
stage, and the reasoning about the contents of the offsets would not be
so complicated.

If we really only want one side (i.e. you don't want to introduce a
synthetic entry just to mark the end or the beginning), then my personal
experience is that having the "begin" side is easier to understand, as
the state is more natural and easier to reason about. (Unlike here,
where there could be gaps in the offsets and the reader has to
understand that the gaps will be filled just in time.) But that may just
be my own experience.
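
To make the begin/end idea concrete, the sketch below computes how many "end tree" and "begin tree" steps a move from one directory to the next would generate. It is a hypothetical illustration of the suggestion, not code from the patch or from list-objects-filter.c; `dir_depth()`, `common_depth()` and `tree_transition()` are invented names, and directories are assumed to be '/'-separated with "" as the root.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of path components: 0 for "", 2 for "src/moduleB". */
static size_t dir_depth(const char *dir)
{
	size_t n;

	if (!*dir)
		return 0;
	for (n = 1; *dir; dir++)
		if (*dir == '/')
			n++;
	return n;
}

/* Number of leading components that a and b share. */
static size_t common_depth(const char *a, const char *b)
{
	size_t shared = 0;

	while (*a && *b) {
		size_t la = strcspn(a, "/"), lb = strcspn(b, "/");

		if (la != lb || strncmp(a, b, la))
			break;
		shared++;
		a += la;
		b += lb;
		if (*a == '/')
			a++;
		if (*b == '/')
			b++;
	}
	return shared;
}

/* Trees to close, then trees to open, when moving from prev to next. */
static void tree_transition(const char *prev, const char *next,
			    size_t *ends, size_t *begins)
{
	size_t shared;

	assert(prev && next && ends && begins);
	shared = common_depth(prev, next);
	*ends = dir_depth(prev) - shared;
	*begins = dir_depth(next) - shared;
}
```

A jump from "" to "src/moduleB" yields zero ends and two begins, so the offset for "src" would be created in the same step instead of being filled in later.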

> +	/*
> +	 * Okay, finally record OID for last_directory in info->versions,
> +	 * and update last_directory.
> +	 */
> +	if (wrote_a_new_tree) {
> +		const char *dir_name = strrchr(info->last_directory, '/');
> +		dir_name = dir_name ? dir_name+1 : info->last_directory;
> +		string_list_append(&info->versions, dir_name)->util = dir_info;
> +	}
> +	info->last_directory = new_directory_name;
> +	info->last_directory_len = strlen(info->last_directory);
> +}

OK - several entries in info->versions were deleted earlier (through
info->versions.nr = offset), and we add one here to represent the tree
containing all those deleted versions.
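
The truncate-then-append step can be sketched on its own. `versions_sketch` and `collapse_directory()` are hypothetical names, a fixed array replaces the string_list, and the actual tree writing is omitted; as in the quoted code, nothing is appended when offset already equals the entry count.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VERSIONS 32

struct versions_sketch {
	const char *names[MAX_VERSIONS];
	size_t nr;
};

/*
 * Replace the entries at [offset, nr) with one entry for their directory.
 * Returns 1 if a directory entry was recorded, 0 if there was nothing
 * under the directory (offset == nr).
 */
static int collapse_directory(struct versions_sketch *versions,
			      size_t offset, const char *dir_basename)
{
	int wrote_a_new_tree;

	assert(versions && offset <= versions->nr);
	wrote_a_new_tree = (offset != versions->nr);

	/* Drop the entries that belonged to the finished directory. */
	versions->nr = offset;

	/* Record the directory itself for its parent's tree. */
	if (wrote_a_new_tree)
		versions->names[versions->nr++] = dir_basename;
	return wrote_a_new_tree;
}
```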

> @@ -541,22 +635,27 @@ static void process_entries(struct merge_options *opt,
>  		 */
>  		struct conflict_info *ci = entry->util;
>  
> +		write_completed_directories(opt, ci->merged.directory_name,
> +					    &dir_metadata);
>  		if (ci->merged.clean)
>  			record_entry_for_tree(&dir_metadata, path, ci);
>  		else
>  			process_entry(opt, path, ci, &dir_metadata);
>  	}

Trying to make sense of this: we pass in the directory name of the
current entry so that if the last directory is *not* a prefix of the
current directory (so we either went up a directory or went sideways),
then we write a tree (unless offset == info->versions.nr, which as I
stated above, I still don't fully understand - I thought we would always
have to write a tree). So maybe the name of the function should be
"write_completed_directory" (and document it as "write a tree if
???"), since we write at most one.

In this kind of algorithm (where intermediate accumulated results are
being written), there needs to be a last write after the loop that
writes whatever's left in the accumulation buffer. I do see it below
("write_tree"), so that's good.
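
The general shape of that pattern, with the final flush after the loop, is sketched below on a toy problem: counting runs of adjacent identical directory names. `count_flushed_groups()` is a hypothetical helper unrelated to the patch's data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts how many groups get "written" for a list of adjacent names. */
static size_t count_flushed_groups(const char **dirs, size_t n)
{
	size_t groups = 0, pending = 0, i;
	const char *current = NULL;

	assert(dirs || !n);
	for (i = 0; i < n; i++) {
		if (current && strcmp(current, dirs[i])) {
			groups++;	/* flush the group that just ended */
			pending = 0;
		}
		current = dirs[i];
		pending++;
	}
	/* Final flush: the last group is never followed by a change. */
	if (pending)
		groups++;
	return groups;
}
```

Dropping the check after the loop would lose the last group, which is the failure mode the final write_tree() call guards against.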

> -	/*
> -	 * TODO: We can't actually write a tree yet, because dir_metadata just
> -	 * contains all basenames of all files throughout the tree with their
> -	 * mode and hash.  Not only is that a nonsensical tree, it will have
> -	 * lots of duplicates for paths such as "Makefile" or ".gitignore".
> -	 */
> -	die("Not yet implemented; need to process subtrees separately");
> +	if (dir_metadata.offsets.nr != 1 ||
> +	    (uintptr_t)dir_metadata.offsets.items[0].util != 0) {
> +		printf("dir_metadata.offsets.nr = %d (should be 1)\n",
> +		       dir_metadata.offsets.nr);
> +		printf("dir_metadata.offsets.items[0].util = %u (should be 0)\n",
> +		       (unsigned)(uintptr_t)dir_metadata.offsets.items[0].util);
> +		fflush(stdout);
> +		BUG("dir_metadata accounting completely off; shouldn't happen");
> +	}

Sanity check, OK.

[snip rest]

In summary, I think that the code would be easier to understand (for
everyone) if there were both BEGIN_TREE and END_TREE entries. And for me
personally, once the offset == info->versions.nr part is clarified
(perhaps there is something obvious that I'm missing).

* Re: [PATCH v2 13/20] merge-ort: step 1 of tree writing -- record basenames, modes, and oids
  2020-11-11 20:24     ` Elijah Newren
@ 2020-11-12 20:39       ` Jonathan Tan
  0 siblings, 0 replies; 84+ messages in thread
From: Jonathan Tan @ 2020-11-12 20:39 UTC (permalink / raw)
  To: newren; +Cc: jonathantanmy, git

> On Wed, Nov 11, 2020 at 12:01 PM Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > > +struct directory_versions {
> > > +     struct string_list versions;
> >
> > Maybe comment that this is an unordered list of basenames to <whatever
> > the type of ci->merged.result is>.
> 
> There actually is an order, and it's important.  It's reverse
> lexicographic order of full pathnames (the ordering comes from the
> fact that process_entries() iterates paths in that order).  The
> reasons for that ordering are (1) all the basenames within a directory
> are adjacent so that I can write out a tree for a directory as soon as
> it is done, and (2) paths within a directory are listed before the
> directory itself so that I get the necessary info for subtrees before
> trying to write out their parent trees.
> 
> It's not until later patches that I take advantage of this ordering
> (and when I do I have a very long commit message to describe it all),
> but I can add a comment that this is a list mapping basenames to
> merged_info.

Ah, yes you're right. I'm not sure what I was thinking of.

* Re: [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go
  2020-11-12 20:15   ` Jonathan Tan
@ 2020-11-12 22:30     ` Elijah Newren
  2020-11-24 20:19       ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-12 22:30 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Git Mailing List

On Thu, Nov 12, 2020 at 12:15 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> Firstly, from [1]:
>
> > Thanks for the reviews!  I was hoping to see some comments on patch
> > 15, as it's possibly the gnarliest.  It's a relatively straightforward
> > algorithm, just lots of bookkeeping.
>
> I was planning to send this out yesterday, but couldn't finish it. :-P
> Indeed, a lot of things to think about.
>
> [1] https://lore.kernel.org/git/CABPp-BFgQX6Ash03x7z+RfE3ytbw3x0DzDSBrGddgMr_soODoA@mail.gmail.com/
>
> [snip commit message]
>
> Thanks for the thorough explanation.
>
> > @@ -353,6 +353,9 @@ static int string_list_df_name_compare(const char *one, const char *two)
> >
> >  struct directory_versions {
> >       struct string_list versions;
> > +     struct string_list offsets;
>
> Looking below (and at the explanation in the commit message), this is a
> mapping from full paths to integers cast to a pointer type.
>
> > +     const char *last_directory;
> > +     unsigned last_directory_len;
>
> Is this just the last entry in "versions"?

No, it's a simple cache of strlen(info->last_directory), so I don't
have to recompute that length multiple times.  Perhaps I should add a
comment to that effect.

> >  static void write_tree(struct object_id *result_oid,
> > @@ -409,12 +412,100 @@ static void record_entry_for_tree(struct directory_versions *dir_metadata,
> >               /* nothing to record */
> >               return;
> >
> > +     /*
> > +      * Note: write_completed_directories() already added
> > +      * entries for directories to dir_metadata->versions,
> > +      * so no need to handle ci->filemask == 0 again.
> > +      */
> > +     if (!ci->merged.clean && !ci->filemask)
> > +             return;
> > +
> >       basename = path + ci->merged.basename_offset;
> >       assert(strchr(basename, '/') == NULL);
> >       string_list_append(&dir_metadata->versions,
> >                          basename)->util = &ci->merged.result;
> >  }
>
> Conceptually, I can see how the algorithm below inserts directories, but
> I don't understand the significance of "!ci->merged.clean" in the change
> above.

Checking ci->filemask is likely an out-of-bounds memory read if
ci->merged.clean is true.  (ci may point to something that was
allocated with the size of a merged_info or a conflict_info.)  Perhaps
I could extend the comment to say that conflicted directories, i.e.
paths that are unclean with ci->filemask == 0, can be skipped because
they were already handled.
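
To illustrate why the clean check has to come first, here is a minimal
sketch.  The struct layouts are simplified stand-ins rather than the real
merge-ort definitions, and alloc_path_info() and is_conflicted_directory()
are hypothetical helper names used only for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins; the real structs carry many more fields. */
struct merged_info {
	unsigned clean:1;
	unsigned is_null:1;
};

struct conflict_info {
	struct merged_info merged;	/* first member, so pointers can alias */
	unsigned filemask;
};

/* Clean paths only get the smaller allocation (hypothetical helper). */
static struct merged_info *alloc_path_info(int clean)
{
	size_t sz = clean ? sizeof(struct merged_info)
			  : sizeof(struct conflict_info);
	struct merged_info *mi = calloc(1, sz);

	if (!mi)
		abort();
	mi->clean = clean;
	return mi;
}

/* Returns 1 for an unclean entry with filemask == 0, else 0. */
static int is_conflicted_directory(const struct merged_info *mi)
{
	const struct conflict_info *ci;

	assert(mi);
	if (mi->clean)
		return 0;	/* only sizeof(struct merged_info) bytes exist */
	ci = (const struct conflict_info *)mi;
	return ci->filemask == 0;
}
```

The early return plays the same role as the short-circuiting && in the
patch: filemask is only read once we know the larger struct was allocated.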

> > +static void write_completed_directories(struct merge_options *opt,
> > +                                     const char *new_directory_name,
> > +                                     struct directory_versions *info)
> > +{
> > +     const char *prev_dir;
> > +     struct merged_info *dir_info = NULL;
> > +     unsigned int offset;
> > +     int wrote_a_new_tree = 0;
> > +
> > +     if (new_directory_name == info->last_directory)
> > +             return;
>
> Pointer equality is OK here presumably because of the string interning
> of directory names.

Yes, precisely.

> I'm starting to think that it might be too difficult to keep track of
> where strings are interned. Maybe there should be a data structure
> containing all interned strings, and make the path a struct or something
> like that (to clearly indicate that the string inside comes from the
> interned string data structure).

Good news: Interned strings already are stored as the keys of our
primary data structure -- the strmap known as opt->priv->paths.  All
relative paths from the root of the repository that are relevant to
the merge at all -- both for files and directories -- are interned by
collect_merge_info() inside that "paths" strmap.

I guess I should note that it eventually gets _slightly_ more
complicated.  Due to renames and directory renames, I might need to
remove a path from opt->priv->paths.  In such a case there will be an
auxiliary string_list named "paths_to_free" that stores the interned
strings which are no longer part of opt->priv->paths.
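
As a rough sketch of why pointer equality works for interned strings, the
toy table below returns one canonical pointer per distinct string.  This is
only an illustration: merge-ort uses the keys of the opt->priv->paths strmap
for this, and struct intern_table, intern() and same_directory() are
hypothetical names, not git APIs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy intern table with linear lookup; purely illustrative. */
struct intern_table {
	char **items;
	size_t nr, alloc;
};

/* Return the canonical copy of 's', adding one if it is not present yet. */
static const char *intern(struct intern_table *t, const char *s)
{
	size_t i, len;
	char *copy;

	assert(s);
	for (i = 0; i < t->nr; i++)
		if (!strcmp(t->items[i], s))
			return t->items[i];

	if (t->nr == t->alloc) {
		size_t n = t->alloc ? 2 * t->alloc : 8;
		char **p = realloc(t->items, n * sizeof(*p));

		if (!p)
			abort();
		t->items = p;
		t->alloc = n;
	}
	len = strlen(s) + 1;
	copy = malloc(len);
	if (!copy)
		abort();
	memcpy(copy, s, len);
	t->items[t->nr++] = copy;
	return copy;
}

/* Valid only when both arguments came from the same intern table. */
static int same_directory(const char *a, const char *b)
{
	return a == b;
}
```

The comparison is only meaningful if every string involved really came out
of the same table, which is the bookkeeping concern raised above.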

> > +     /*
> > +      * If we are just starting (last_directory is NULL), or last_directory
> > +      * is a prefix of the current directory, then we can just update
> > +      * last_directory and record the offset where we started this directory.
> > +      */
> > +     if (info->last_directory == NULL ||
> > +         !strncmp(new_directory_name, info->last_directory,
> > +                  info->last_directory_len)) {
>
> Git has starts_with() for prefix checking. (May not be as optimized as
> this one, though.)
>
> > +             uintptr_t offset = info->versions.nr;
> > +
> > +             info->last_directory = new_directory_name;
> > +             info->last_directory_len = strlen(info->last_directory);
> > +             string_list_append(&info->offsets,
> > +                                info->last_directory)->util = (void*)offset;
> > +             return;
> > +     }
>
> Due to the way this is sorted, there might be a jump of 2 or more
> directories. (The commit message also provides such an example - from ""
> to "src/moduleB", without going through "src".)
>
> > +     /*
> > +      * At this point, ne (next entry) is within a different directory
> > +      * than the last entry, so we need to create a tree object for all
> > +      * the entries in info->versions that are under info->last_directory.
> > +      */
>
> There's no "ne" below.

Oops, that code has been heavily refactored since that comment.
Something like this would be more up-to-date:
    /*
     * The next entry will be within new_directory_name.  Since at this
     * point we know that new_directory_name is within a different
     * directory than info->last_directory, we have all entries for
     * info->last_directory in info->versions and we need to create a
     * tree object for them.
     */

> > +     dir_info = strmap_get(&opt->priv->paths, info->last_directory);
> > +     assert(dir_info);
> > +     offset = (uintptr_t)info->offsets.items[info->offsets.nr-1].util;
> > +     if (offset == info->versions.nr) {
> > +             dir_info->is_null = 1;
> > +     } else {
> > +             dir_info->result.mode = S_IFDIR;
> > +             write_tree(&dir_info->result.oid, &info->versions, offset);
> > +             wrote_a_new_tree = 1;
> > +     }
>
> I was trying to figure out the cases in which offset would be
> info->versions.nr - if such a case existed, and if yes, would it be
> incorrect to skip creating such a tree because presumably this offset
> exists in info->offsets for a reason. Do you know in which situation
> offset would equal info->versions.nr?

Yes, it is possible that all files within the directory are removed
as a result of merging[1], and in such cases this line of logic will
trigger (note that record_entry_for_tree(), which is what adds more
items to info->versions, returns early if ci->merged.is_null).  We do
not want to write out an empty tree for the directory or record the
tree's hash for its parent directory; we simply want to omit it
entirely.  Omitting it entirely is handled by the line
"dir_info->is_null = 1".

[1] The simplest example is when one side doesn't touch anything
within a directory but the other side deletes the whole directory.
Files can also disappear in a merge for other reasons, such as being
deleted on both sides, or being renamed.  If _all_ files within the
directory are removed by the merge logic, the directory has no
entries.
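
A toy model of that accounting, as a sketch only: it tracks just the entry
count, and struct toy_versions, record_entry() and finish_directory() are
hypothetical names that stand in for info->versions,
record_entry_for_tree() and the tree-writing step.

```c
#include <assert.h>
#include <stddef.h>

/* Only the number of recorded entries matters for this illustration. */
struct toy_versions {
	size_t nr;
};

/* Paths that resolve to "deleted" contribute nothing to the tree. */
static void record_entry(struct toy_versions *v, int is_null)
{
	if (is_null)
		return;
	v->nr++;
}

/*
 * 'dir_start' is the value of v->nr saved when the directory was entered.
 * Returns 1 if a tree would be written, 0 if the directory is omitted.
 */
static int finish_directory(struct toy_versions *v, size_t dir_start)
{
	int has_entries;

	assert(dir_start <= v->nr);
	has_entries = v->nr != dir_start;
	v->nr = dir_start;	/* drop the directory's own entries */
	if (has_entries)
		v->nr++;	/* the directory becomes one entry of its parent */
	return has_entries;
}
```

When every path under the directory is null, nothing was recorded, the
saved offset still equals nr, and the parent never sees the directory.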

>
> > +     /*
> > +      * We've now used several entries from info->versions and one entry
> > +      * from info->offsets, so we get rid of those values.
> > +      */
> > +     info->offsets.nr--;
> > +     info->versions.nr = offset;
>
> OK.
>
> > +     /*
> > +      * Now we've got an OID for last_directory in dir_info.  We need to
> > +      * add it to info->versions for it to be part of the computation of
> > +      * its parent directories' OID.  But first, we have to find out what
> > +      * its parent name was and whether that matches the previous
> > +      * info->offsets or we need to set up a new one.
> > +      */
> > +     prev_dir = info->offsets.nr == 0 ? NULL :
> > +                info->offsets.items[info->offsets.nr-1].string;
> > +     if (new_directory_name != prev_dir) {
> > +             uintptr_t c = info->versions.nr;
> > +             string_list_append(&info->offsets,
> > +                                new_directory_name)->util = (void*)c;
> > +     }
>
> Because of the possible jump of 2 or more directories that I mentioned
> earlier, there may be gaps in the offsets. So it makes sense that we
> sometimes need to insert an intermediate one.
>
> I wonder if the code would be clearer if we had explicit "begin tree"
> and "end tree" steps just like in list-objects-filter.c (LOFS_BEGIN_TREE
> and LOFS_END_TREE). Here we have "end tree" (because of the way the
> entries were sorted) but not "begin tree". If we had "begin tree", we
> probably would be able to create the necessary offsets in a loop at that
> stage, and the reasoning about the contents of the offsets would not be
> so complicated.
>
> If we really only want one side (i.e. you don't want to introduce a
> synthetic entry just to mark the end or the beginning), then my personal
> experience is that having the "begin" side is easier to understand, as
> the state is more natural and easier to reason about. (Unlike here,
> where there could be gaps in the offsets and the reader has to
> understand that the gaps will be filled just in time.) But that may just
> be my own experience.

Interesting, I'll take a look into it.

>
> > +     /*
> > +      * Okay, finally record OID for last_directory in info->versions,
> > +      * and update last_directory.
> > +      */
> > +     if (wrote_a_new_tree) {
> > +             const char *dir_name = strrchr(info->last_directory, '/');
> > +             dir_name = dir_name ? dir_name+1 : info->last_directory;
> > +             string_list_append(&info->versions, dir_name)->util = dir_info;
> > +     }
> > +     info->last_directory = new_directory_name;
> > +     info->last_directory_len = strlen(info->last_directory);
> > +}
>
> OK - several entries in info->versions were deleted earlier (through
> info->versions.nr = offset), and we add one here to represent the tree
> containing all those deleted versions.
>
> > @@ -541,22 +635,27 @@ static void process_entries(struct merge_options *opt,
> >                */
> >               struct conflict_info *ci = entry->util;
> >
> > +             write_completed_directories(opt, ci->merged.directory_name,
> > +                                         &dir_metadata);
> >               if (ci->merged.clean)
> >                       record_entry_for_tree(&dir_metadata, path, ci);
> >               else
> >                       process_entry(opt, path, ci, &dir_metadata);
> >       }
>
> Trying to make sense of this: we pass in the directory name of the
> current entry so that if the last directory is *not* a prefix of the
> current directory (so we either went up a directory or went sideways),
> then we write a tree (unless offset == info->versions.nr, which as I
> stated above, I still don't fully understand - I thought we would always
> have to write a tree). So maybe the name of the function should be
> "write_completed_directory" (and document it as "write a tree if
> ???"), since we write at most one.

Yeah, write_completed_directory() would be better.  And just to
reiterate on the offset == info->versions.nr thing, we do not want to
write a tree if it turns out that the merged result of all files
within the directory is to delete them all.

> In this kind of algorithm (where intermediate accumulated results are
> being written), there needs to be a last write after the loop that
> writes whatever's left in the accumulation buffer. I do see it below
> ("write_tree"), so that's good.
>
> > -     /*
> > -      * TODO: We can't actually write a tree yet, because dir_metadata just
> > -      * contains all basenames of all files throughout the tree with their
> > -      * mode and hash.  Not only is that a nonsensical tree, it will have
> > -      * lots of duplicates for paths such as "Makefile" or ".gitignore".
> > -      */
> > -     die("Not yet implemented; need to process subtrees separately");
> > +     if (dir_metadata.offsets.nr != 1 ||
> > +         (uintptr_t)dir_metadata.offsets.items[0].util != 0) {
> > +             printf("dir_metadata.offsets.nr = %d (should be 1)\n",
> > +                    dir_metadata.offsets.nr);
> > +             printf("dir_metadata.offsets.items[0].util = %u (should be 0)\n",
> > +                    (unsigned)(uintptr_t)dir_metadata.offsets.items[0].util);
> > +             fflush(stdout);
> > +             BUG("dir_metadata accounting completely off; shouldn't happen");
> > +     }
>
> Sanity check, OK.
>
> [snip rest]
>
> In summary, I think that the code would be easier to understand (for
> everyone) if there were both BEGIN_TREE and END_TREE entries. And for me
> personally, once the offset == info->versions.nr part is clarified
> (perhaps there is something obvious that I'm missing).

I'm not sure how the BEGIN_TREE/END_TREE entries would look, but I'll
investigate.

* Re: [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go
  2020-11-12 22:30     ` Elijah Newren
@ 2020-11-24 20:19       ` Elijah Newren
  2020-11-25  2:07         ` Jonathan Tan
  0 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-24 20:19 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Git Mailing List

On Thu, Nov 12, 2020 at 2:30 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Thu, Nov 12, 2020 at 12:15 PM Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > > +     /*
> > > +      * Now we've got an OID for last_directory in dir_info.  We need to
> > > +      * add it to info->versions for it to be part of the computation of
> > > +      * its parent directories' OID.  But first, we have to find out what
> > > +      * its parent name was and whether that matches the previous
> > > +      * info->offsets or we need to set up a new one.
> > > +      */
> > > +     prev_dir = info->offsets.nr == 0 ? NULL :
> > > +                info->offsets.items[info->offsets.nr-1].string;
> > > +     if (new_directory_name != prev_dir) {
> > > +             uintptr_t c = info->versions.nr;
> > > +             string_list_append(&info->offsets,
> > > +                                new_directory_name)->util = (void*)c;
> > > +     }
> >
> > Because of the possible jump of 2 or more directories that I mentioned
> > earlier, there may be gaps in the offsets. So it makes sense that we
> > sometimes need to insert an intermediate one.
> >
> > I wonder if the code would be clearer if we had explicit "begin tree"
> > and "end tree" steps just like in list-objects-filter.c (LOFS_BEGIN_TREE
> > and LOFS_END_TREE). Here we have "end tree" (because of the way the
> > entries were sorted) but not "begin tree". If we had "begin tree", we
> > probably would be able to create the necessary offsets in a loop at that
> > stage, and the reasoning about the contents of the offsets would not be
> > so complicated.
> >
> > If we really only want one side (i.e. you don't want to introduce a
> > synthetic entry just to mark the end or the beginning), then my personal
> > experience is that having the "begin" side is easier to understand, as
> > the state is more natural and easier to reason about. (Unlike here,
> > where there could be gaps in the offsets and the reader has to
> > understand that the gaps will be filled just in time.) But that may just
> > be my own experience.
>
> Interesting, I'll take a look into it.
>

So, I've been going through making all the changes you and Derrick
suggested or highlighted...but I don't see how to tackle this one.
Perhaps I'm missing something.

Using your example of LOFS_BEGIN_TREE and LOFS_END_TREE from
list-objects-filter.c, I note that you handle it as part of
traverse_trees(), and thus you have a very natural "I'm going to
process this tree" point and "I'm done processing this tree" point.
There is no equivalent mapping to merge-ort that I can figure out.

merge-ort does use traverse_trees() in collect_merge_info(), and fills
opt->priv->paths with all full pathnames (both files and directories)
found in any of the three trees.  But I cannot process
files/directories at that time; rename detection needs
traverse_trees() to be finished to have all paths so far.  Further,
the list of pathnames from traverse_trees is not necessarily complete;
additional paths could be added by any of
  * Directory/file conflicts (need to move the file to a different location)
  * Directory/submodule conflicts (need to move something to a
different location)
  * Add/add conflicts of files of different types (e.g.
symlink/regular file; can't just content merge them with conflict
markers)
  * Directory rename detection (can move new files or even directories
on one side of history into a new directory on other side)

Thus, after traverse_trees() ends, my rename detection stuff can add
paths (including new directories), then process_entries() can add
paths -- and remove some when the resolution is to delete.  And the
code here in question runs as part of the process_entries() loop.

Now, we'd still be able to create synthetic BEGIN_TREE markers if we
operated in lexicographic ordering, but process_entries() *must*
operate in _reverse_ lexicographic ordering because:
  1) subtrees need to be written out before trees are; hashes of those
     subtrees are used in the parent tree
  2) it's the only sane way to handle directory/file conflicts; I need
     to know if all entries under the directory resolve to nothing; if
     not, the directory is still in the way when it comes time to
     process the file.
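
The ordering property behind both points can be sketched as follows.  This
is a simplification that sorts with plain reversed strcmp(); the comparison
merge-ort actually uses differs (see string_list_df_name_compare() in the
patch), and the helper names here are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Reverse strcmp() order; a stand-in for the real comparison function. */
static int reverse_path_cmp(const void *a, const void *b)
{
	const char *const *pa = a;
	const char *const *pb = b;

	return strcmp(*pb, *pa);
}

static void sort_reverse(const char **paths, size_t nr)
{
	qsort(paths, nr, sizeof(*paths), reverse_path_cmp);
}

/* Returns the index of 'path' in 'paths', or nr if it is absent. */
static size_t index_of(const char **paths, size_t nr, const char *path)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!strcmp(paths[i], path))
			return i;
	return nr;
}

/* Returns 1 if 'x' is visited before 'y'; both must be present. */
static int precedes(const char **paths, size_t nr,
		    const char *x, const char *y)
{
	size_t ix = index_of(paths, nr, x);
	size_t iy = index_of(paths, nr, y);

	assert(ix < nr && iy < nr);
	return ix < iy;
}
```

With paths such as "src", "src/moduleB" and "src/moduleB/b.c", every entry
is visited before the directory containing it, so a subtree's hash is
available by the time its parent tree is written.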

Granted, I could do some tricky calculations based on the reverse
lexicographic ordering of fullpaths (and their containing directories)
to figure out where trees begin and end -- but that takes us to
exactly what I *did* do.  It was precisely this step that you thought
should be made simpler, but I'm not seeing how to avoid it.

For now, I'll keep the code as-is, but add more comments to both the
data structure and the code.  If I've missed something about how I
could make use of your BEGIN_TREE idea, let me know and I'll look at
it again.

* Re: [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go
  2020-11-24 20:19       ` Elijah Newren
@ 2020-11-25  2:07         ` Jonathan Tan
  2020-11-26 18:13           ` Elijah Newren
  0 siblings, 1 reply; 84+ messages in thread
From: Jonathan Tan @ 2020-11-25  2:07 UTC (permalink / raw)
  To: newren; +Cc: jonathantanmy, git

> For now, I'll keep the code as-is, but add more comments to both the
> data structure and the code.  If I've missed something about how I
> could make use of your BEGIN_TREE idea, let me know and I'll look at
> it again.

In collect_merge_info_callback(), you call setup_path_info() to add to
opt->priv->paths, then call traverse_trees() (which recursively calls
collect_merge_info_callback()). I was thinking that in addition to doing
that, you could call setup_path_info() a second time, but teach it to
add a synthetic path (maybe have a special bit in struct conflict_info
or something like that) that indicates "this is the end of the tree".
Subsequent code can notice that bit and not do the normal processing,
but instead do end-of-tree processing.

Having said that, maybe it will turn out that your additional comments
in v3 will be clearer, and we wouldn't need the synthetic entry.

* Re: [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go
  2020-11-25  2:07         ` Jonathan Tan
@ 2020-11-26 18:13           ` Elijah Newren
  2020-11-30 18:41             ` Jonathan Tan
  0 siblings, 1 reply; 84+ messages in thread
From: Elijah Newren @ 2020-11-26 18:13 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Git Mailing List

On Tue, Nov 24, 2020 at 6:07 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > For now, I'll keep the code as-is, but add more comments to both the
> > data structure and the code.  If I've missed something about how I
> > could make use of your BEGIN_TREE idea, let me know and I'll look at
> > it again.
>
> In collect_merge_info_callback(), you call setup_path_info() to add to
> opt->priv->paths, then call traverse_trees() (which recursively calls
> collect_merge_info_callback()). I was thinking that in addition to doing
> that, you could call setup_path_info() a second time, but teach it to
> add a synthetic path (maybe have a special bit in struct conflict_info
> or something like that) that indicates "this is the end of the tree".
> Subsequent code can notice that bit and not do the normal processing,
> but instead do end-of-tree processing.

So, I realized that I already had end-of-tree markers -- the
directories themselves.  But due to some other weirdness in how I had
built up the processing, the existence of those markers was both
obscured, and deliberately side-stepped.  So, I did a little
restructuring so we can use these as actual end-of-tree markers more
directly.

> Having said that, maybe it will turn out that your additional comments
> in v3 will be clearer, and we wouldn't need the synthetic entry.

Hopefully it's clearer now, but the entries aren't synthetic.  My big
opt->priv->paths strmap with all full relative paths contained all
files _and_ directories already, and now I just use the directory
markers more directly.  Hopefully the extra comments help too.

* Re: [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go
  2020-11-26 18:13           ` Elijah Newren
@ 2020-11-30 18:41             ` Jonathan Tan
  0 siblings, 0 replies; 84+ messages in thread
From: Jonathan Tan @ 2020-11-30 18:41 UTC (permalink / raw)
  To: newren; +Cc: jonathantanmy, git

> On Tue, Nov 24, 2020 at 6:07 PM Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > > For now, I'll keep the code as-is, but add more comments to both the
> > > data structure and the code.  If I've missed something about how I
> > > could make use of your BEGIN_TREE idea, let me know and I'll look at
> > > it again.
> >
> > In collect_merge_info_callback(), you call setup_path_info() to add to
> > opt->priv->paths, then call traverse_trees() (which recursively calls
> > collect_merge_info_callback()). I was thinking that in addition to doing
> > that, you could call setup_path_info() a second time, but teach it to
> > add a synthetic path (maybe have a special bit in struct conflict_info
> > or something like that) that indicates "this is the end of the tree".
> > Subsequent code can notice that bit and not do the normal processing,
> > but instead do end-of-tree processing.
> 
> So, I realized that I already had end-of-tree markers -- the
> directories themselves.  But due to some other weirdness in how I had
> built up the processing, the existence of those markers was both
> obscured, and deliberately side-stepped.  So, I did a little
> restructuring so we can use these as actual end-of-tree markers more
> directly.

Ah sorry...what I meant was to have both begin-of-tree and end-of-tree
elements in the path list, so one of them is real and the other
synthetic. Right now you have an end-of-tree real path in the list of
paths, yes.

> > Having said that, maybe it will turn out that your additional comments
> > in v3 will be clearer, and we wouldn't need the synthetic entry.
> 
> Hopefully it's clearer now, but the entries aren't synthetic.  My big
> opt->priv->paths strmap with all full relative paths contained all
> files _and_ directories already, and now I just use the directory
> markers more directly.  Hopefully the extra comments help too.

OK - I see that you have a new version [1] and hopefully I'll be able to
take a look soon.

[1] https://lore.kernel.org/git/pull.923.git.git.1606635803.gitgitgadget@gmail.com/

* [PATCH v2 00/20] fundamentals of merge-ort implementation
  2020-11-29  7:43 [PATCH " Elijah Newren via GitGitGadget
@ 2020-12-04 20:47 ` Elijah Newren via GitGitGadget
  0 siblings, 0 replies; 84+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-04 20:47 UTC (permalink / raw)
  To: git
  Cc: jonathantanmy, dstolee, Elijah Newren,
	Ævar Arnfjörð Bjarmason, Elijah Newren

This is actually v4 of this series (the first two rounds depended on topics
that hadn't graduated yet, so I hadn't yet used gitgitgadget for submitting
it). As a reminder, if you need to see the first two rounds before I started
submitting this series with gitgitgadget, you can see them over here: 
https://lore.kernel.org/git/20201102204344.342633-1-newren@gmail.com/

Changes since v3:

 * Made the small tweaks suggested by Ævar
 * Fixed an embarrassing tree ordering bug in commit 13; base_name_compare()
   != strcmp() is important.

(Tree ordering bug found due to the fact that merge-ort, including many
patches not yet submitted to this list, is in live use at $DAYJOB.)
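
For readers unfamiliar with the difference, here is a small sketch of the
tree-entry ordering rule, in which a directory name compares as if it ended
in '/'.  toy_tree_entry_cmp() is a hypothetical re-implementation written
for this example and takes an is_dir flag instead of a mode; it is not
git's base_name_compare().

```c
#include <assert.h>
#include <string.h>

/*
 * Compare two tree entry names.  When one name is a prefix of the other,
 * a directory is treated as though it had a trailing '/'.
 */
static int toy_tree_entry_cmp(const char *n1, int is_dir1,
			      const char *n2, int is_dir2)
{
	size_t len1, len2, len;
	unsigned char c1, c2;
	int cmp;

	assert(n1 && n2);
	len1 = strlen(n1);
	len2 = strlen(n2);
	len = len1 < len2 ? len1 : len2;

	cmp = memcmp(n1, n2, len);
	if (cmp)
		return cmp;
	c1 = n1[len];
	c2 = n2[len];
	if (!c1 && is_dir1)
		c1 = '/';
	if (!c2 && is_dir2)
		c2 = '/';
	return (c1 < c2) ? -1 : (c1 > c2) ? 1 : 0;
}
```

With a directory "foo" and a file "foo.txt", strcmp() puts "foo" first,
but the rule above puts "foo.txt" first because '.' sorts before '/'.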

Elijah Newren (20):
  merge-ort: setup basic internal data structures
  merge-ort: add some high-level algorithm structure
  merge-ort: port merge_start() from merge-recursive
  merge-ort: use histogram diff
  merge-ort: add an err() function similar to one from merge-recursive
  merge-ort: implement a very basic collect_merge_info()
  merge-ort: avoid repeating fill_tree_descriptor() on the same tree
  merge-ort: compute a few more useful fields for collect_merge_info
  merge-ort: record stage and auxiliary info for every path
  merge-ort: avoid recursing into identical trees
  merge-ort: add a preliminary simple process_entries() implementation
  merge-ort: have process_entries operate in a defined order
  merge-ort: step 1 of tree writing -- record basenames, modes, and oids
  merge-ort: step 2 of tree writing -- function to create tree object
  merge-ort: step 3 of tree writing -- handling subdirectories as we go
  merge-ort: basic outline for merge_switch_to_result()
  merge-ort: add implementation of checkout()
  tree: enable cmp_cache_name_compare() to be used elsewhere
  merge-ort: add implementation of record_conflicted_index_entries()
  merge-ort: free data structures in merge_finalize()

 merge-ort.c | 1221 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 tree.c      |    2 +-
 tree.h      |    2 +
 3 files changed, 1221 insertions(+), 4 deletions(-)


base-commit: e67fbf927dfdf13d0b21dc6ea15dc3c7ef448ea0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-923%2Fnewren%2Fort-basics-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-923/newren/ort-basics-v2
Pull-Request: https://github.com/git/git/pull/923

Range-diff vs v1:

  1:  2568ec92c6 =  1:  2568ec92c6 merge-ort: setup basic internal data structures
  2:  3a063865c3 !  2:  b658536f59 merge-ort: add some high-level algorithm structure
     @@ merge-ort.c: struct conflict_info {
      +			      struct tree *side1,
      +			      struct tree *side2)
      +{
     ++	/* TODO: Implement this using traverse_trees() */
      +	die("Not yet implemented.");
      +}
      +
  3:  5615f0eecb =  3:  acb40f5c16 merge-ort: port merge_start() from merge-recursive
  4:  564b072ac1 =  4:  22fecf6ccd merge-ort: use histogram diff
  5:  91516799e4 !  5:  6c4c0c15b3 merge-ort: add an err() function similar to one from merge-recursive
     @@ merge-ort.c: struct conflict_info {
       			      struct tree *side1,
       			      struct tree *side2)
       {
     -+	/* TODO: Implement this using traverse_trees() */
     +-	/* TODO: Implement this using traverse_trees() */
       	die("Not yet implemented.");
       }
       
     @@ merge-ort.c: static void merge_ort_nonrecursive_internal(struct merge_options *o
       
      -	collect_merge_info(opt, merge_base, side1, side2);
      +	if (collect_merge_info(opt, merge_base, side1, side2) != 0) {
     ++		/*
     ++		 * TRANSLATORS: The %s arguments are: 1) tree hash of a merge
     ++		 * base, and 2-3) the trees for the two trees we're merging.
     ++		 */
      +		err(opt, _("collecting merge info failed for trees %s, %s, %s"),
      +		    oid_to_hex(&merge_base->object.oid),
      +		    oid_to_hex(&side1->object.oid),
  6:  ab743967aa !  6:  27268ef8a3 merge-ort: implement a very basic collect_merge_info()
     @@ merge-ort.c: static int err(struct merge_options *opt, const char *err, ...)
       			      struct tree *side1,
       			      struct tree *side2)
       {
     --	/* TODO: Implement this using traverse_trees() */
      -	die("Not yet implemented.");
      +	int ret;
      +	struct tree_desc t[3];
  7:  bff758c5dd =  7:  c6e5621c21 merge-ort: avoid repeating fill_tree_descriptor() on the same tree
  8:  61b3d66fdc =  8:  93fd69fa3c merge-ort: compute a few more useful fields for collect_merge_info
  9:  4e4298fa70 =  9:  decff4b375 merge-ort: record stage and auxiliary info for every path
 10:  3ec087eb68 = 10:  86c661fe1e merge-ort: avoid recursing into identical trees
 11:  0c89cee34e = 11:  aa3b13ffd8 merge-ort: add a preliminary simple process_entries() implementation
 12:  605cbc19d2 = 12:  b54306fd0e merge-ort: have process_entries operate in a defined order
 13:  242c3cab13 = 13:  8ee8561d7a merge-ort: step 1 of tree writing -- record basenames, modes, and oids
 14:  33a5d23c85 ! 14:  6ff56824c3 merge-ort: step 2 of tree writing -- function to create tree object
     @@ merge-ort.c: struct directory_versions {
       	struct string_list versions;
       };
       
     ++static int tree_entry_order(const void *a_, const void *b_)
     ++{
     ++	const struct string_list_item *a = a_;
     ++	const struct string_list_item *b = b_;
     ++
     ++	const struct merged_info *ami = a->util;
     ++	const struct merged_info *bmi = b->util;
     ++	return base_name_compare(a->string, strlen(a->string), ami->result.mode,
     ++				 b->string, strlen(b->string), bmi->result.mode);
     ++}
     ++
      +static void write_tree(struct object_id *result_oid,
      +		       struct string_list *versions,
      +		       unsigned int offset,
     @@ merge-ort.c: struct directory_versions {
      +	 */
      +	relevant_entries.items = versions->items + offset;
      +	relevant_entries.nr = versions->nr - offset;
     -+	string_list_sort(&relevant_entries);
     ++	QSORT(relevant_entries.items, relevant_entries.nr, tree_entry_order);
      +
      +	/* Pre-allocate some space in buf */
      +	extra = hash_size + 8; /* 8: 6 for mode, 1 for space, 1 for NUL char */
 15:  29615c366f ! 15:  da4fe90049 merge-ort: step 3 of tree writing -- handling subdirectories as we go
     @@ merge-ort.c: static int string_list_df_name_compare(const char *one, const char
      +	unsigned last_directory_len;
       };
       
     - static void write_tree(struct object_id *result_oid,
     + static int tree_entry_order(const void *a_, const void *b_)
      @@ merge-ort.c: static void record_entry_for_tree(struct directory_versions *dir_metadata,
       			   basename)->util = &mi->result;
       }
 16:  da54fa454a = 16:  8e90d211c5 merge-ort: basic outline for merge_switch_to_result()
 17:  68307f1b67 = 17:  61fada146c merge-ort: add implementation of checkout()
 18:  a3cd563621 = 18:  f5a13a0b08 tree: enable cmp_cache_name_compare() to be used elsewhere
 19:  56b162c609 ! 19:  4efac38116 merge-ort: add implementation of record_conflicted_index_entries()
     @@ merge-ort.c: static int record_conflicted_index_entries(struct merge_options *op
      +		pos = index_name_pos(index, path, strlen(path));
      +		SWAP(index->cache_nr, original_cache_nr);
      +		if (pos < 0) {
     -+			if (ci->filemask == 1)
     -+				cache_tree_invalidate_path(index, path);
     -+			else
     ++			if (ci->filemask != 1)
      +				BUG("Conflicted %s but nothing in basic working tree or index; this shouldn't happen", path);
     ++			cache_tree_invalidate_path(index, path);
      +		} else {
      +			ce = index->cache[pos];
      +
 20:  a4f722a46e = 20:  fbeb527d67 merge-ort: free data structures in merge_finalize()
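
For readers skimming the range-diff for patch 14 above: the switch from string_list_sort() to QSORT() with tree_entry_order() matters because tree objects are not ordered by plain byte-wise name comparison. A directory name sorts as if it had a trailing '/'. The sketch below illustrates that rule in isolation. It is not code from this series: struct entry, tree_order_compare(), sort_entries() and the IS_TREE() macro are hypothetical stand-ins, and the comparison is assumed to follow the behaviour of git's base_name_compare().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: true if a git-style mode denotes a directory (tree). */
#define IS_TREE(mode) (((mode) & 0170000) == 0040000)

/* Hypothetical stand-in for a (name, mode) pair; not a struct from git. */
struct entry {
	const char *name;
	unsigned int mode;
};

/*
 * Compare two NUL-terminated names in tree order: byte-wise, except that
 * a directory name behaves as if it ended in '/'.
 */
static int tree_order_compare(const char *name1, size_t len1, unsigned int mode1,
			      const char *name2, size_t len2, unsigned int mode2)
{
	size_t len = len1 < len2 ? len1 : len2;
	unsigned char c1, c2;
	int cmp = memcmp(name1, name2, len);

	if (cmp)
		return cmp;
	/* One name is a prefix of the other; look at the next byte. */
	c1 = name1[len];
	c2 = name2[len];
	if (!c1 && IS_TREE(mode1))
		c1 = '/';
	if (!c2 && IS_TREE(mode2))
		c2 = '/';
	return (c1 < c2) ? -1 : (c1 > c2) ? 1 : 0;
}

/* qsort() comparator using tree order. */
static int entry_cmp_tree(const void *a_, const void *b_)
{
	const struct entry *a = a_;
	const struct entry *b = b_;

	return tree_order_compare(a->name, strlen(a->name), a->mode,
				  b->name, strlen(b->name), b->mode);
}

/* qsort() comparator using plain byte-wise name order, for contrast. */
static int entry_cmp_plain(const void *a_, const void *b_)
{
	const struct entry *a = a_;
	const struct entry *b = b_;

	return strcmp(a->name, b->name);
}

/* Sort entries in place; tree_order selects which comparator is used. */
static void sort_entries(struct entry *entries, size_t nr, int tree_order)
{
	assert(entries || !nr);
	if (nr)
		qsort(entries, nr, sizeof(*entries),
		      tree_order ? entry_cmp_tree : entry_cmp_plain);
}
```

Given a directory "foo" and files "foo.txt" and "foo-bar", plain name order yields foo, foo-bar, foo.txt, while tree order yields foo-bar, foo.txt, foo, because '/' sorts after both '-' and '.'.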

-- 
gitgitgadget

end of thread, other threads:[~2020-12-04 20:48 UTC | newest]

Thread overview: 84+ messages
2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
2020-11-02 20:43 ` [PATCH v2 01/20] merge-ort: setup basic internal data structures Elijah Newren
2020-11-06 22:05   ` Jonathan Tan
2020-11-06 22:45     ` Elijah Newren
2020-11-09 20:55       ` Jonathan Tan
2020-11-02 20:43 ` [PATCH v2 02/20] merge-ort: add some high-level algorithm structure Elijah Newren
2020-11-02 20:43 ` [PATCH v2 03/20] merge-ort: port merge_start() from merge-recursive Elijah Newren
2020-11-11 13:52   ` Derrick Stolee
2020-11-11 16:22     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 04/20] merge-ort: use histogram diff Elijah Newren
2020-11-11 13:54   ` Derrick Stolee
2020-11-11 16:47     ` Elijah Newren
2020-11-11 16:51       ` Derrick Stolee
2020-11-11 17:03         ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 05/20] merge-ort: add an err() function similar to one from merge-recursive Elijah Newren
2020-11-11 13:58   ` Derrick Stolee
2020-11-11 17:07     ` Elijah Newren
2020-11-11 17:10       ` Derrick Stolee
2020-11-02 20:43 ` [PATCH v2 06/20] merge-ort: implement a very basic collect_merge_info() Elijah Newren
2020-11-06 22:19   ` Jonathan Tan
2020-11-06 23:10     ` Elijah Newren
2020-11-09 20:59       ` Jonathan Tan
2020-11-11 14:38   ` Derrick Stolee
2020-11-11 17:02     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 07/20] merge-ort: avoid repeating fill_tree_descriptor() on the same tree Elijah Newren
2020-11-11 14:51   ` Derrick Stolee
2020-11-11 17:13     ` Elijah Newren
2020-11-11 17:21       ` Eric Sunshine
2020-11-02 20:43 ` [PATCH v2 08/20] merge-ort: compute a few more useful fields for collect_merge_info Elijah Newren
2020-11-06 22:52   ` Jonathan Tan
2020-11-06 23:41     ` Elijah Newren
2020-11-09 22:04       ` Jonathan Tan
2020-11-09 23:05         ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path Elijah Newren
2020-11-06 22:58   ` Jonathan Tan
2020-11-07  0:26     ` Elijah Newren
2020-11-09 22:09       ` Jonathan Tan
2020-11-09 23:08         ` Elijah Newren
2020-11-11 15:26   ` Derrick Stolee
2020-11-11 18:16     ` Elijah Newren
2020-11-11 22:06       ` Elijah Newren
2020-11-12 18:23         ` Derrick Stolee
2020-11-12 18:39       ` Derrick Stolee
2020-11-02 20:43 ` [PATCH v2 10/20] merge-ort: avoid recursing into identical trees Elijah Newren
2020-11-11 15:31   ` Derrick Stolee
2020-11-02 20:43 ` [PATCH v2 11/20] merge-ort: add a preliminary simple process_entries() implementation Elijah Newren
2020-11-11 19:51   ` Jonathan Tan
2020-11-12  1:48     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 12/20] merge-ort: have process_entries operate in a defined order Elijah Newren
2020-11-11 16:09   ` Derrick Stolee
2020-11-11 18:58     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 13/20] merge-ort: step 1 of tree writing -- record basenames, modes, and oids Elijah Newren
2020-11-11 20:01   ` Jonathan Tan
2020-11-11 20:24     ` Elijah Newren
2020-11-12 20:39       ` Jonathan Tan
2020-11-02 20:43 ` [PATCH v2 14/20] merge-ort: step 2 of tree writing -- function to create tree object Elijah Newren
2020-11-11 20:47   ` Jonathan Tan
2020-11-11 21:21     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go Elijah Newren
2020-11-12 20:15   ` Jonathan Tan
2020-11-12 22:30     ` Elijah Newren
2020-11-24 20:19       ` Elijah Newren
2020-11-25  2:07         ` Jonathan Tan
2020-11-26 18:13           ` Elijah Newren
2020-11-30 18:41             ` Jonathan Tan
2020-11-02 20:43 ` [PATCH v2 16/20] merge-ort: basic outline for merge_switch_to_result() Elijah Newren
2020-11-02 20:43 ` [PATCH v2 17/20] merge-ort: add implementation of checkout() Elijah Newren
2020-11-02 20:43 ` [PATCH v2 18/20] tree: enable cmp_cache_name_compare() to be used elsewhere Elijah Newren
2020-11-02 20:43 ` [PATCH v2 19/20] merge-ort: add implementation of record_unmerged_index_entries() Elijah Newren
2020-11-02 20:43 ` [PATCH v2 20/20] merge-ort: free data structures in merge_finalize() Elijah Newren
2020-11-03 14:49 ` [PATCH v2 00/20] fundamentals of merge-ort implementation Derrick Stolee
2020-11-03 16:36   ` Elijah Newren
2020-11-07  6:06     ` Elijah Newren
2020-11-07 15:02       ` Derrick Stolee
2020-11-07 19:39         ` Elijah Newren
2020-11-09 12:30           ` Derrick Stolee
2020-11-09 17:13             ` Elijah Newren
2020-11-09 19:51               ` Derrick Stolee
2020-11-09 22:44                 ` Elijah Newren
2020-11-11 17:08 ` Derrick Stolee
2020-11-11 18:35   ` Elijah Newren
2020-11-11 20:48     ` Derrick Stolee
2020-11-11 21:18       ` Elijah Newren
2020-11-29  7:43 [PATCH " Elijah Newren via GitGitGadget
2020-12-04 20:47 ` [PATCH v2 " Elijah Newren via GitGitGadget
