All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/17] Remove assumptions about refname lifetimes
@ 2013-05-19 20:26 Michael Haggerty
  2013-05-19 20:26 ` [PATCH 01/17] describe: make own copy of refname Michael Haggerty
                   ` (17 more replies)
  0 siblings, 18 replies; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty

Prior to this patch series, the refs API said nothing about the
lifetime of the refname parameter passed to each_ref_fn callbacks by
the for_each_ref()-style iteration functions.  De facto, the refnames
usually had long lives because they were pointers into the ref_cache
data structures, and those are only invalidated under rare
circumstances.  And some callers were assuming a long lifetime, for
example storing references to the refname string instead of copying
it.

But it has long been the case that ref caches could be invalidated,
for example when a packed ref is deleted.  AFAIK there was never much
clarity about what that might mean for callers.

Recently a number of race conditions related to references have been
discovered.  There is likely to be a two-pronged solution to the
races:

* For traditional, filesystem-based references, there will have to be
  more checks that the ref caches are still up-to-date at the time of
  their use (see, for example, [1]).  If not, the ref cache will have
  to be invalidated and reloaded.  Assuming that the invalidation of
  the old cache includes freeing its memory, such an invalidation will
  cause lots of refname strings to be freed even though callers might
  still hold references to them.

* For server-class installations, filesystem-based references might
  not be robust enough for 100% reliable operation, because the
  reading of the complete set of references is not an atomic
  operation.  If another reference storage mechanism is developed,
  there is no reason to expect the refnames strings to have long
  lifetimes.

A prerequisite for either of these approaches is to harmonize what
callers assume and what the API guarantees.

The purpose of this patch series is to track down callers who assume
that the refnames that they receive via a each_ref_fn callback have
lifetimes beyond the duration of the callback invocation and to
rewrite them to work without that assumption.  The final patch
documents explicitly that callers should not retain references to the
refnames.

A word about how I audited the code:

To find callers making unwarranted assumptions, I (temporarily)
changed do_one_ref() to do a xstrdup() of the refname, pass the copy
to the callback function, then free() the copy.  This caused
ill-behaved callers to access freed memory, which could be detected by
running the testsuite under valgrind.  There were indeed a number of
such errors.  All of them are fixed by this patch series, and the test
just described now runs without errors.

I plan to do a second audit by hand to see if the test suite and/or
valgrind missed anything.

The last two patches are RFCs.  I would like some input on the second
to last because I am not very familiar with how the object array entry
names are used, how many might be created, etc.  The last patch is an
illustration of how I would like to change the API docs, but it will
only be ready when all of the code has been audited and adapted.
Please see especially my comments on these two patches for more
information.

[1] http://thread.gmane.org/gmane.comp.version-control.git/223299

Michael Haggerty (17):
  describe: make own copy of refname
  fetch: make own copies of refnames
  add_rev_cmdline(): make a copy of the name argument
  builtin_diff_tree(): make it obvious that function wants two entries
  cmd_diff(): use an object_array for holding trees
  cmd_diff(): rename local variable "list" -> "entry"
  cmd_diff(): make it obvious which cases are exclusive of each other
  revision: split some overly-long lines
  gc_boundary(): move the check "alloc <= nr" to caller
  get_revision_internal(): make check less mysterious
  object_array: add function object_array_filter()
  object_array_remove_duplicates(): rewrite to reduce copying
  fsck: don't put a void*-shaped peg in a char*-shaped hole
  find_first_merges(): initialize merges variable using initializer
  find_first_merges(): remove unnecessary code
  object_array_entry: copy name before storing in name field
  refs: document the lifetime of the refname passed to each_ref_fn

 builtin/describe.c |  6 +++--
 builtin/diff.c     | 68 ++++++++++++++++++++++++++----------------------------
 builtin/fetch.c    |  4 ++--
 builtin/fsck.c     |  2 +-
 object.c           | 50 +++++++++++++++++++++++++++++++--------
 object.h           | 23 ++++++++++++++++--
 refs.h             | 22 +++++++++++++-----
 revision.c         | 61 +++++++++++++++++++++++++-----------------------
 revision.h         | 32 ++++++++++++++++---------
 submodule.c        |  6 ++---
 10 files changed, 172 insertions(+), 102 deletions(-)

-- 
1.8.2.3

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 01/17] describe: make own copy of refname
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
@ 2013-05-19 20:26 ` Michael Haggerty
  2013-05-19 20:26 ` [PATCH 02/17] fetch: make own copies of refnames Michael Haggerty
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty

Do not retain a reference to the refname passed to the each_ref_fn
callback get_name(), because there is no guarantee of the lifetimes of
these names.  Instead, make a local copy when needed.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 builtin/describe.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/builtin/describe.c b/builtin/describe.c
index 6636a68..3dc09eb 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -42,7 +42,7 @@ struct commit_name {
 	unsigned prio:2; /* annotated tag = 2, tag = 1, head = 0 */
 	unsigned name_checked:1;
 	unsigned char sha1[20];
-	const char *path;
+	char *path;
 };
 static const char *prio_names[] = {
 	"head", "lightweight", "annotated",
@@ -126,12 +126,14 @@ static void add_to_known_names(const char *path,
 			} else {
 				e->next = NULL;
 			}
+			e->path = NULL;
 		}
 		e->tag = tag;
 		e->prio = prio;
 		e->name_checked = 0;
 		hashcpy(e->sha1, sha1);
-		e->path = path;
+		free(e->path);
+		e->path = xstrdup(path);
 	}
 }
 
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 02/17] fetch: make own copies of refnames
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
  2013-05-19 20:26 ` [PATCH 01/17] describe: make own copy of refname Michael Haggerty
@ 2013-05-19 20:26 ` Michael Haggerty
  2013-05-19 20:26 ` [PATCH 03/17] add_rev_cmdline(): make a copy of the name argument Michael Haggerty
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty

Do not retain references to refnames passed to the each_ref_fn
callback add_existing(), because their lifetime is not guaranteed.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 builtin/fetch.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 4b6b1df..f949115 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -590,7 +590,7 @@ static void find_non_local_tags(struct transport *transport,
 			struct ref **head,
 			struct ref ***tail)
 {
-	struct string_list existing_refs = STRING_LIST_INIT_NODUP;
+	struct string_list existing_refs = STRING_LIST_INIT_DUP;
 	struct string_list remote_refs = STRING_LIST_INIT_NODUP;
 	const struct ref *ref;
 	struct string_list_item *item = NULL;
@@ -693,7 +693,7 @@ static int truncate_fetch_head(void)
 static int do_fetch(struct transport *transport,
 		    struct refspec *refs, int ref_count)
 {
-	struct string_list existing_refs = STRING_LIST_INIT_NODUP;
+	struct string_list existing_refs = STRING_LIST_INIT_DUP;
 	struct string_list_item *peer_item = NULL;
 	struct ref *ref_map;
 	struct ref *rm;
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 03/17] add_rev_cmdline(): make a copy of the name argument
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
  2013-05-19 20:26 ` [PATCH 01/17] describe: make own copy of refname Michael Haggerty
  2013-05-19 20:26 ` [PATCH 02/17] fetch: make own copies of refnames Michael Haggerty
@ 2013-05-19 20:26 ` Michael Haggerty
  2013-05-19 20:26 ` [PATCH 04/17] builtin_diff_tree(): make it obvious that function wants two entries Michael Haggerty
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty

Instead of assuming that the memory pointed to by the name argument
will live forever, make a local copy of it before storing it in the
ref_cmdline_info.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 revision.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/revision.c b/revision.c
index a67b615..25e424c 100644
--- a/revision.c
+++ b/revision.c
@@ -898,6 +898,10 @@ static int limit_list(struct rev_info *revs)
 	return 0;
 }
 
+/*
+ * Add an entry to refs->cmdline with the specified information.
+ * *name is copied.
+ */
 static void add_rev_cmdline(struct rev_info *revs,
 			    struct object *item,
 			    const char *name,
@@ -909,7 +913,7 @@ static void add_rev_cmdline(struct rev_info *revs,
 
 	ALLOC_GROW(info->rev, nr + 1, info->alloc);
 	info->rev[nr].item = item;
-	info->rev[nr].name = name;
+	info->rev[nr].name = xstrdup(name);
 	info->rev[nr].whence = whence;
 	info->rev[nr].flags = flags;
 	info->nr++;
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 04/17] builtin_diff_tree(): make it obvious that function wants two entries
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
                   ` (2 preceding siblings ...)
  2013-05-19 20:26 ` [PATCH 03/17] add_rev_cmdline(): make a copy of the name argument Michael Haggerty
@ 2013-05-19 20:26 ` Michael Haggerty
  2013-05-21 17:27   ` Junio C Hamano
  2013-05-19 20:27 ` [PATCH 05/17] cmd_diff(): use an object_array for holding trees Michael Haggerty
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty

Instead of accepting an array and using exactly two elements from the
array, take two single (struct object_array_entry *) arguments.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 builtin/diff.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/builtin/diff.c b/builtin/diff.c
index 8c2af6c..ba68c6c 100644
--- a/builtin/diff.c
+++ b/builtin/diff.c
@@ -153,7 +153,8 @@ static int builtin_diff_index(struct rev_info *revs,
 
 static int builtin_diff_tree(struct rev_info *revs,
 			     int argc, const char **argv,
-			     struct object_array_entry *ent)
+			     struct object_array_entry *ent0,
+			     struct object_array_entry *ent1)
 {
 	const unsigned char *(sha1[2]);
 	int swap = 0;
@@ -161,13 +162,13 @@ static int builtin_diff_tree(struct rev_info *revs,
 	if (argc > 1)
 		usage(builtin_diff_usage);
 
-	/* We saw two trees, ent[0] and ent[1].
-	 * if ent[1] is uninteresting, they are swapped
+	/* We saw two trees, ent0 and ent1.
+	 * if ent1 is uninteresting, they are swapped
 	 */
-	if (ent[1].item->flags & UNINTERESTING)
+	if (ent1->item->flags & UNINTERESTING)
 		swap = 1;
-	sha1[swap] = ent[0].item->sha1;
-	sha1[1-swap] = ent[1].item->sha1;
+	sha1[swap] = ent0->item->sha1;
+	sha1[1-swap] = ent1->item->sha1;
 	diff_tree_sha1(sha1[0], sha1[1], "", &revs->diffopt);
 	log_tree_diff_flush(revs);
 	return 0;
@@ -403,7 +404,7 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
 	else if (ents == 1)
 		result = builtin_diff_index(&rev, argc, argv);
 	else if (ents == 2)
-		result = builtin_diff_tree(&rev, argc, argv, ent);
+		result = builtin_diff_tree(&rev, argc, argv, &ent[0], &ent[1]);
 	else if (ent[0].item->flags & UNINTERESTING) {
 		/*
 		 * diff A...B where there is at least one merge base
@@ -412,8 +413,7 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
 		 * between the base and B.  Note that we pick one
 		 * merge base at random if there are more than one.
 		 */
-		ent[1] = ent[ents-1];
-		result = builtin_diff_tree(&rev, argc, argv, ent);
+		result = builtin_diff_tree(&rev, argc, argv, &ent[0], &ent[ents-1]);
 	} else
 		result = builtin_diff_combined(&rev, argc, argv,
 					       ent, ents);
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 05/17] cmd_diff(): use an object_array for holding trees
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
                   ` (3 preceding siblings ...)
  2013-05-19 20:26 ` [PATCH 04/17] builtin_diff_tree(): make it obvious that function wants two entries Michael Haggerty
@ 2013-05-19 20:27 ` Michael Haggerty
  2013-05-21 17:30   ` Junio C Hamano
  2013-05-19 20:27 ` [PATCH 06/17] cmd_diff(): rename local variable "list" -> "entry" Michael Haggerty
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty

Change cmd_diff() to use a (struct object_array) for holding the trees
that it accumulates, rather than rolling its own equivalent.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 builtin/diff.c | 37 ++++++++++++++++++-------------------
 1 file changed, 18 insertions(+), 19 deletions(-)

diff --git a/builtin/diff.c b/builtin/diff.c
index ba68c6c..72d99c0 100644
--- a/builtin/diff.c
+++ b/builtin/diff.c
@@ -252,8 +252,8 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
 {
 	int i;
 	struct rev_info rev;
-	struct object_array_entry ent[100];
-	int ents = 0, blobs = 0, paths = 0;
+	struct object_array ent = OBJECT_ARRAY_INIT;
+	int blobs = 0, paths = 0;
 	const char *path = NULL;
 	struct blobinfo blob[2];
 	int nongit;
@@ -350,13 +350,8 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
 		if (obj->type == OBJ_COMMIT)
 			obj = &((struct commit *)obj)->tree->object;
 		if (obj->type == OBJ_TREE) {
-			if (ARRAY_SIZE(ent) <= ents)
-				die(_("more than %d trees given: '%s'"),
-				    (int) ARRAY_SIZE(ent), name);
 			obj->flags |= flags;
-			ent[ents].item = obj;
-			ent[ents].name = name;
-			ents++;
+			add_object_array(obj, name, &ent);
 			continue;
 		}
 		if (obj->type == OBJ_BLOB) {
@@ -380,7 +375,7 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
 	/*
 	 * Now, do the arguments look reasonable?
 	 */
-	if (!ents) {
+	if (!ent.nr) {
 		switch (blobs) {
 		case 0:
 			result = builtin_diff_files(&rev, argc, argv);
@@ -401,22 +396,26 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
 	}
 	else if (blobs)
 		usage(builtin_diff_usage);
-	else if (ents == 1)
+	else if (ent.nr == 1)
 		result = builtin_diff_index(&rev, argc, argv);
-	else if (ents == 2)
-		result = builtin_diff_tree(&rev, argc, argv, &ent[0], &ent[1]);
-	else if (ent[0].item->flags & UNINTERESTING) {
+	else if (ent.nr == 2)
+		result = builtin_diff_tree(&rev, argc, argv,
+					   &ent.objects[0], &ent.objects[1]);
+	else if (ent.objects[0].item->flags & UNINTERESTING) {
 		/*
 		 * diff A...B where there is at least one merge base
-		 * between A and B.  We have ent[0] == merge-base,
-		 * ent[ents-2] == A, and ent[ents-1] == B.  Show diff
-		 * between the base and B.  Note that we pick one
-		 * merge base at random if there are more than one.
+		 * between A and B.  We have ent.objects[0] ==
+		 * merge-base, ent.objects[ents-2] == A, and
+		 * ent.objects[ents-1] == B.  Show diff between the
+		 * base and B.  Note that we pick one merge base at
+		 * random if there are more than one.
 		 */
-		result = builtin_diff_tree(&rev, argc, argv, &ent[0], &ent[ents-1]);
+		result = builtin_diff_tree(&rev, argc, argv,
+					   &ent.objects[0],
+					   &ent.objects[ent.nr-1]);
 	} else
 		result = builtin_diff_combined(&rev, argc, argv,
-					       ent, ents);
+					       ent.objects, ent.nr);
 	result = diff_result_code(&rev.diffopt, result);
 	if (1 < rev.diffopt.skip_stat_unmatch)
 		refresh_index_quietly();
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 06/17] cmd_diff(): rename local variable "list" -> "entry"
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
                   ` (4 preceding siblings ...)
  2013-05-19 20:27 ` [PATCH 05/17] cmd_diff(): use an object_array for holding trees Michael Haggerty
@ 2013-05-19 20:27 ` Michael Haggerty
  2013-05-19 20:27 ` [PATCH 07/17] cmd_diff(): make it obvious which cases are exclusive of each other Michael Haggerty
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty

It's not a list, it's an array entry.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 builtin/diff.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/builtin/diff.c b/builtin/diff.c
index 72d99c0..7cac6de 100644
--- a/builtin/diff.c
+++ b/builtin/diff.c
@@ -338,9 +338,9 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
 	}
 
 	for (i = 0; i < rev.pending.nr; i++) {
-		struct object_array_entry *list = rev.pending.objects+i;
-		struct object *obj = list->item;
-		const char *name = list->name;
+		struct object_array_entry *entry = &rev.pending.objects[i];
+		struct object *obj = entry->item;
+		const char *name = entry->name;
 		int flags = (obj->flags & UNINTERESTING);
 		if (!obj->parsed)
 			obj = parse_object(obj->sha1);
@@ -359,7 +359,7 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
 				die(_("more than two blobs given: '%s'"), name);
 			hashcpy(blob[blobs].sha1, obj->sha1);
 			blob[blobs].name = name;
-			blob[blobs].mode = list->mode;
+			blob[blobs].mode = entry->mode;
 			blobs++;
 			continue;
 
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 07/17] cmd_diff(): make it obvious which cases are exclusive of each other
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
                   ` (5 preceding siblings ...)
  2013-05-19 20:27 ` [PATCH 06/17] cmd_diff(): rename local variable "list" -> "entry" Michael Haggerty
@ 2013-05-19 20:27 ` Michael Haggerty
  2013-05-19 20:27 ` [PATCH 08/17] revision: split some overly-long lines Michael Haggerty
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty

At first glance the OBJ_COMMIT, OBJ_TREE, and OBJ_BLOB cases look like
they might be mutually exclusive.  But the OBJ_COMMIT case doesn't end
the loop iteration with "continue" like the other two cases, but
rather falls through.  So use if...else if...else construct to make it
more obvious that only the last two cases are mutually exclusive.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 builtin/diff.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/builtin/diff.c b/builtin/diff.c
index 7cac6de..3b90ee6 100644
--- a/builtin/diff.c
+++ b/builtin/diff.c
@@ -349,22 +349,21 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
 			die(_("invalid object '%s' given."), name);
 		if (obj->type == OBJ_COMMIT)
 			obj = &((struct commit *)obj)->tree->object;
+
 		if (obj->type == OBJ_TREE) {
 			obj->flags |= flags;
 			add_object_array(obj, name, &ent);
-			continue;
-		}
-		if (obj->type == OBJ_BLOB) {
+		} else if (obj->type == OBJ_BLOB) {
 			if (2 <= blobs)
 				die(_("more than two blobs given: '%s'"), name);
 			hashcpy(blob[blobs].sha1, obj->sha1);
 			blob[blobs].name = name;
 			blob[blobs].mode = entry->mode;
 			blobs++;
-			continue;
 
+		} else {
+			die(_("unhandled object '%s' given."), name);
 		}
-		die(_("unhandled object '%s' given."), name);
 	}
 	if (rev.prune_data.nr) {
 		if (!path)
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 08/17] revision: split some overly-long lines
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
                   ` (6 preceding siblings ...)
  2013-05-19 20:27 ` [PATCH 07/17] cmd_diff(): make it obvious which cases are exclusive of each other Michael Haggerty
@ 2013-05-19 20:27 ` Michael Haggerty
  2013-05-21 17:34   ` Junio C Hamano
  2013-05-19 20:27 ` [PATCH 09/17] gc_boundary(): move the check "alloc <= nr" to caller Michael Haggerty
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty


Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 revision.c | 20 ++++++++++++++------
 revision.h | 32 +++++++++++++++++++++-----------
 2 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/revision.c b/revision.c
index 25e424c..8ac88d6 100644
--- a/revision.c
+++ b/revision.c
@@ -70,7 +70,8 @@ static int show_path_truncated(FILE *out, const struct name_path *path)
 	return ours || emitted;
 }
 
-void show_object_with_name(FILE *out, struct object *obj, const struct name_path *path, const char *component)
+void show_object_with_name(FILE *out, struct object *obj,
+			   const struct name_path *path, const char *component)
 {
 	struct name_path leaf;
 	leaf.up = (struct name_path *)path;
@@ -186,7 +187,9 @@ void mark_parents_uninteresting(struct commit *commit)
 	}
 }
 
-static void add_pending_object_with_mode(struct rev_info *revs, struct object *obj, const char *name, unsigned mode)
+static void add_pending_object_with_mode(struct rev_info *revs,
+					 struct object *obj,
+					 const char *name, unsigned mode)
 {
 	if (!obj)
 		return;
@@ -209,7 +212,8 @@ static void add_pending_object_with_mode(struct rev_info *revs, struct object *o
 	add_object_array_with_mode(obj, name, &revs->pending, mode);
 }
 
-void add_pending_object(struct rev_info *revs, struct object *obj, const char *name)
+void add_pending_object(struct rev_info *revs,
+			struct object *obj, const char *name)
 {
 	add_pending_object_with_mode(revs, obj, name, S_IFINVALID);
 }
@@ -226,7 +230,9 @@ void add_head_to_pending(struct rev_info *revs)
 	add_pending_object(revs, obj, "HEAD");
 }
 
-static struct object *get_reference(struct rev_info *revs, const char *name, const unsigned char *sha1, unsigned int flags)
+static struct object *get_reference(struct rev_info *revs, const char *name,
+				    const unsigned char *sha1,
+				    unsigned int flags)
 {
 	struct object *object;
 
@@ -247,7 +253,8 @@ void add_pending_sha1(struct rev_info *revs, const char *name,
 	add_pending_object(revs, object, name);
 }
 
-static struct commit *handle_commit(struct rev_info *revs, struct object *object, const char *name)
+static struct commit *handle_commit(struct rev_info *revs,
+				    struct object *object, const char *name)
 {
 	unsigned long flags = object->flags;
 
@@ -368,7 +375,8 @@ static void file_change(struct diff_options *options,
 	DIFF_OPT_SET(options, HAS_CHANGES);
 }
 
-static int rev_compare_tree(struct rev_info *revs, struct commit *parent, struct commit *commit)
+static int rev_compare_tree(struct rev_info *revs,
+			    struct commit *parent, struct commit *commit)
 {
 	struct tree *t1 = parent->tree;
 	struct tree *t2 = commit->tree;
diff --git a/revision.h b/revision.h
index 01bd2b7..9628465 100644
--- a/revision.h
+++ b/revision.h
@@ -195,19 +195,23 @@ struct setup_revision_opt {
 };
 
 extern void init_revisions(struct rev_info *revs, const char *prefix);
-extern int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct setup_revision_opt *);
+extern int setup_revisions(int argc, const char **argv, struct rev_info *revs,
+			   struct setup_revision_opt *);
 extern void parse_revision_opt(struct rev_info *revs, struct parse_opt_ctx_t *ctx,
-				 const struct option *options,
-				 const char * const usagestr[]);
+			       const struct option *options,
+			       const char * const usagestr[]);
 #define REVARG_CANNOT_BE_FILENAME 01
 #define REVARG_COMMITTISH 02
-extern int handle_revision_arg(const char *arg, struct rev_info *revs, int flags, unsigned revarg_opt);
+extern int handle_revision_arg(const char *arg, struct rev_info *revs,
+			       int flags, unsigned revarg_opt);
 
 extern void reset_revision_walk(void);
 extern int prepare_revision_walk(struct rev_info *revs);
 extern struct commit *get_revision(struct rev_info *revs);
-extern char *get_revision_mark(const struct rev_info *revs, const struct commit *commit);
-extern void put_revision_mark(const struct rev_info *revs, const struct commit *commit);
+extern char *get_revision_mark(const struct rev_info *revs,
+			       const struct commit *commit);
+extern void put_revision_mark(const struct rev_info *revs,
+			      const struct commit *commit);
 
 extern void mark_parents_uninteresting(struct commit *commit);
 extern void mark_tree_uninteresting(struct tree *tree);
@@ -220,15 +224,19 @@ struct name_path {
 
 char *path_name(const struct name_path *path, const char *name);
 
-extern void show_object_with_name(FILE *, struct object *, const struct name_path *, const char *);
+extern void show_object_with_name(FILE *, struct object *,
+				  const struct name_path *, const char *);
 
 extern void add_object(struct object *obj,
 		       struct object_array *p,
 		       struct name_path *path,
 		       const char *name);
 
-extern void add_pending_object(struct rev_info *revs, struct object *obj, const char *name);
-extern void add_pending_sha1(struct rev_info *revs, const char *name, const unsigned char *sha1, unsigned int flags);
+extern void add_pending_object(struct rev_info *revs,
+			       struct object *obj, const char *name);
+extern void add_pending_sha1(struct rev_info *revs,
+			     const char *name, const unsigned char *sha1,
+			     unsigned int flags);
 
 extern void add_head_to_pending(struct rev_info *);
 
@@ -238,7 +246,9 @@ enum commit_action {
 	commit_error
 };
 
-extern enum commit_action get_commit_action(struct rev_info *revs, struct commit *commit);
-extern enum commit_action simplify_commit(struct rev_info *revs, struct commit *commit);
+extern enum commit_action get_commit_action(struct rev_info *revs,
+					    struct commit *commit);
+extern enum commit_action simplify_commit(struct rev_info *revs,
+					  struct commit *commit);
 
 #endif
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 09/17] gc_boundary(): move the check "alloc <= nr" to caller
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
                   ` (7 preceding siblings ...)
  2013-05-19 20:27 ` [PATCH 08/17] revision: split some overly-long lines Michael Haggerty
@ 2013-05-19 20:27 ` Michael Haggerty
  2013-05-21 17:49   ` Junio C Hamano
  2013-05-19 20:27 ` [PATCH 10/17] get_revision_internal(): make check less mysterious Michael Haggerty
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty

There is no logical reason for this test to be here.  At the caller we
might be able to figure out its meaning.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 revision.c | 27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/revision.c b/revision.c
index 8ac88d6..2e0992b 100644
--- a/revision.c
+++ b/revision.c
@@ -2437,23 +2437,19 @@ static struct commit *get_revision_1(struct rev_info *revs)
 
 static void gc_boundary(struct object_array *array)
 {
-	unsigned nr = array->nr;
-	unsigned alloc = array->alloc;
+	unsigned nr = array->nr, i, j;
 	struct object_array_entry *objects = array->objects;
 
-	if (alloc <= nr) {
-		unsigned i, j;
-		for (i = j = 0; i < nr; i++) {
-			if (objects[i].item->flags & SHOWN)
-				continue;
-			if (i != j)
-				objects[j] = objects[i];
-			j++;
-		}
-		for (i = j; i < nr; i++)
-			objects[i].item = NULL;
-		array->nr = j;
+	for (i = j = 0; i < nr; i++) {
+		if (objects[i].item->flags & SHOWN)
+			continue;
+		if (i != j)
+			objects[j] = objects[i];
+		j++;
 	}
+	for (i = j; i < nr; i++)
+		objects[i].item = NULL;
+	array->nr = j;
 }
 
 static void create_boundary_commit_list(struct rev_info *revs)
@@ -2577,7 +2573,8 @@ static struct commit *get_revision_internal(struct rev_info *revs)
 		if (p->flags & (CHILD_SHOWN | SHOWN))
 			continue;
 		p->flags |= CHILD_SHOWN;
-		gc_boundary(&revs->boundary_commits);
+		if (revs->boundary_commits.alloc <= revs->boundary_commits.nr)
+			gc_boundary(&revs->boundary_commits);
 		add_object_array(p, NULL, &revs->boundary_commits);
 	}
 
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 10/17] get_revision_internal(): make check less mysterious
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
                   ` (8 preceding siblings ...)
  2013-05-19 20:27 ` [PATCH 09/17] gc_boundary(): move the check "alloc <= nr" to caller Michael Haggerty
@ 2013-05-19 20:27 ` Michael Haggerty
  2013-05-21 17:38   ` Junio C Hamano
  2013-05-19 20:27 ` [PATCH 11/17] object_array: add function object_array_filter() Michael Haggerty
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty

The condition under which gc_boundary() is called was previously

    if (alloc <= nr)

.  But by construction, nr can never exceed alloc, so the check looks
unnecessarily mysterious.  In fact, the purpose of the check is to try
to avoid a realloc() call by shrinking the array if possible if it is
at its allocation limit when a new element is about to be added.  So
change the check to

    if (nr == alloc)

and add a comment to explain what's going on.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
Please check that I have properly described the purpose of this check.

The way the code is written, it looks like a bad pattern of growth and
shrinkage of the array (namely, just under the resize limit) could
cause gc_boundary() to be called over and over again with (most of)
the same data.  I hope that the author had some reason to believe that
such a pattern is unlikely.

 revision.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/revision.c b/revision.c
index 2e0992b..19c59f4 100644
--- a/revision.c
+++ b/revision.c
@@ -2573,8 +2573,10 @@ static struct commit *get_revision_internal(struct rev_info *revs)
 		if (p->flags & (CHILD_SHOWN | SHOWN))
 			continue;
 		p->flags |= CHILD_SHOWN;
-		if (revs->boundary_commits.alloc <= revs->boundary_commits.nr)
+		if (revs->boundary_commits.nr == revs->boundary_commits.alloc) {
+			/* Try to make space and thereby avoid a realloc(): */
 			gc_boundary(&revs->boundary_commits);
+		}
 		add_object_array(p, NULL, &revs->boundary_commits);
 	}
 
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 11/17] object_array: add function object_array_filter()
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
                   ` (9 preceding siblings ...)
  2013-05-19 20:27 ` [PATCH 10/17] get_revision_internal(): make check less mysterious Michael Haggerty
@ 2013-05-19 20:27 ` Michael Haggerty
  2013-05-19 20:27 ` [PATCH 12/17] object_array_remove_duplicates(): rewrite to reduce copying Michael Haggerty
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty

Add a function that allows unwanted entries in an object_array to be
removed.  This encapsulation is a step towards giving object_array
ownership of its entries' name memory.

Use the new function to replace revision.c:gc_boundary().

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 object.c   | 16 ++++++++++++++++
 object.h   | 11 +++++++++++
 revision.c | 28 ++++++++++------------------
 3 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/object.c b/object.c
index 20703f5..fcd4a82 100644
--- a/object.c
+++ b/object.c
@@ -278,6 +278,22 @@ void add_object_array_with_mode(struct object *obj, const char *name, struct obj
 	array->nr = ++nr;
 }
 
+void object_array_filter(struct object_array *array,
+			 object_array_each_func_t want, void *cb_data)
+{
+	unsigned nr = array->nr, src, dst;
+	struct object_array_entry *objects = array->objects;
+
+	for (src = dst = 0; src < nr; src++) {
+		if (want(&objects[src], cb_data)) {
+			if (src != dst)
+				objects[dst] = objects[src];
+			dst++;
+		}
+	}
+	array->nr = dst;
+}
+
 void object_array_remove_duplicates(struct object_array *array)
 {
 	unsigned int ref, src, dst;
diff --git a/object.h b/object.h
index 97d384b..0d39ff4 100644
--- a/object.h
+++ b/object.h
@@ -85,6 +85,17 @@ int object_list_contains(struct object_list *list, struct object *obj);
 /* Object array handling .. */
 void add_object_array(struct object *obj, const char *name, struct object_array *array);
 void add_object_array_with_mode(struct object *obj, const char *name, struct object_array *array, unsigned mode);
+
+typedef int (*object_array_each_func_t)(struct object_array_entry *, void *);
+
+/*
+ * Apply want to each entry in array, retaining only the entries for
+ * which the function returns true.  Preserve the order of the entries
+ * that are retained.
+ */
+void object_array_filter(struct object_array *array,
+			 object_array_each_func_t want, void *cb_data);
+
 void object_array_remove_duplicates(struct object_array *);
 
 void clear_object_flags(unsigned flags);
diff --git a/revision.c b/revision.c
index 19c59f4..ddc6d7c 100644
--- a/revision.c
+++ b/revision.c
@@ -2435,23 +2435,6 @@ static struct commit *get_revision_1(struct rev_info *revs)
 	return NULL;
 }
 
-static void gc_boundary(struct object_array *array)
-{
-	unsigned nr = array->nr, i, j;
-	struct object_array_entry *objects = array->objects;
-
-	for (i = j = 0; i < nr; i++) {
-		if (objects[i].item->flags & SHOWN)
-			continue;
-		if (i != j)
-			objects[j] = objects[i];
-		j++;
-	}
-	for (i = j; i < nr; i++)
-		objects[i].item = NULL;
-	array->nr = j;
-}
-
 static void create_boundary_commit_list(struct rev_info *revs)
 {
 	unsigned i;
@@ -2493,6 +2476,15 @@ static void create_boundary_commit_list(struct rev_info *revs)
 	sort_in_topological_order(&revs->commits, revs->lifo);
 }
 
+/*
+ * Return true for entries that have not yet been shown.  (This is an
+ * object_array_each_func_t.)
+ */
+static int entry_unshown(struct object_array_entry *entry, void *cb_data_unused)
+{
+	return !(entry->item->flags & SHOWN);
+}
+
 static struct commit *get_revision_internal(struct rev_info *revs)
 {
 	struct commit *c = NULL;
@@ -2575,7 +2567,7 @@ static struct commit *get_revision_internal(struct rev_info *revs)
 		p->flags |= CHILD_SHOWN;
 		if (revs->boundary_commits.nr == revs->boundary_commits.alloc) {
 			/* Try to make space and thereby avoid a realloc(): */
-			gc_boundary(&revs->boundary_commits);
+			object_array_filter(&revs->boundary_commits, entry_unshown, NULL);
 		}
 		add_object_array(p, NULL, &revs->boundary_commits);
 	}
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 12/17] object_array_remove_duplicates(): rewrite to reduce copying
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
                   ` (10 preceding siblings ...)
  2013-05-19 20:27 ` [PATCH 11/17] object_array: add function object_array_filter() Michael Haggerty
@ 2013-05-19 20:27 ` Michael Haggerty
  2013-05-19 20:27 ` [PATCH 13/17] fsck: don't put a void*-shaped peg in a char*-shaped hole Michael Haggerty
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty

The old version copied one entry to its destination position, then
deleted any matching entries from the tail of the array.  This
required the tail of the array to be copied multiple times.  It didn't
affect the complexity of the algorithm because the whole tail has to
be searched through anyway.  But all the copying was unnecessary.

Instead, check for the existence of an entry with the same name in the
*head* of the list before copying an entry to its final position.
This way each entry has to be copied at most one time.

Extract a helper function contains_name() to do a bit of the work.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 object.c | 32 +++++++++++++++++++++-----------
 object.h |  6 +++++-
 2 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/object.c b/object.c
index fcd4a82..10b5349 100644
--- a/object.c
+++ b/object.c
@@ -294,22 +294,32 @@ void object_array_filter(struct object_array *array,
 	array->nr = dst;
 }
 
+/*
+ * Return true iff array already contains an entry with name.
+ */
+static int contains_name(struct object_array *array, const char *name)
+{
+	unsigned nr = array->nr, i;
+	struct object_array_entry *object = array->objects;
+
+	for (i = 0; i < nr; i++, object++)
+		if (!strcmp(object->name, name))
+			return 1;
+	return 0;
+}
+
 void object_array_remove_duplicates(struct object_array *array)
 {
-	unsigned int ref, src, dst;
+	unsigned nr = array->nr, src;
 	struct object_array_entry *objects = array->objects;
 
-	for (ref = 0; ref + 1 < array->nr; ref++) {
-		for (src = ref + 1, dst = src;
-		     src < array->nr;
-		     src++) {
-			if (!strcmp(objects[ref].name, objects[src].name))
-				continue;
-			if (src != dst)
-				objects[dst] = objects[src];
-			dst++;
+	array->nr = 0;
+	for (src = 0; src < nr; src++) {
+		if (!contains_name(array, objects[src].name)) {
+			if (src != array->nr)
+				objects[array->nr] = objects[src];
+			array->nr++;
 		}
-		array->nr = dst;
 	}
 }
 
diff --git a/object.h b/object.h
index 0d39ff4..6c1c27f 100644
--- a/object.h
+++ b/object.h
@@ -96,7 +96,11 @@ typedef int (*object_array_each_func_t)(struct object_array_entry *, void *);
 void object_array_filter(struct object_array *array,
 			 object_array_each_func_t want, void *cb_data);
 
-void object_array_remove_duplicates(struct object_array *);
+/*
+ * Remove from array all but the first entry with a given name.
+ * Warning: this function uses an O(N^2) algorithm.
+ */
+void object_array_remove_duplicates(struct object_array *array);
 
 void clear_object_flags(unsigned flags);
 
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 13/17] fsck: don't put a void*-shaped peg in a char*-shaped hole
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
                   ` (11 preceding siblings ...)
  2013-05-19 20:27 ` [PATCH 12/17] object_array_remove_duplicates(): rewrite to reduce copying Michael Haggerty
@ 2013-05-19 20:27 ` Michael Haggerty
  2013-05-19 20:27 ` [PATCH 14/17] find_first_merges(): initialize merges variable using initializer Michael Haggerty
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty

The source of this nonsense was

    04d3975937 fsck: reduce stack footprint

, which wedged a pointer to parent into the object_array_entry's name
field.  The parent pointer was passed to traverse_one_object(), even
though that function *didn't use it*.

The useless code has been deleted over time.  Commit

    a1cdc25172 fsck: drop unused parameter from traverse_one_object()

removed the parent pointer from traverse_one_object()'s
signature. Commit

    c0aa335c95 Remove unused variables

removed the code that read the parent pointer back out of the name
field.

This commit takes the last step: don't write the parent pointer into
the name field in the first place.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
I thought that this misuse of the name field was going to be a
showstopper for changing how the name's memory is managed, but then I
noticed that the value stored here is never used.

 builtin/fsck.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index bb9a2cd..9909b6d 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -112,7 +112,7 @@ static int mark_object(struct object *obj, int type, void *data)
 		return 1;
 	}
 
-	add_object_array(obj, (void *) parent, &pending);
+	add_object_array(obj, NULL, &pending);
 	return 0;
 }
 
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 14/17] find_first_merges(): initialize merges variable using initializer
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
                   ` (12 preceding siblings ...)
  2013-05-19 20:27 ` [PATCH 13/17] fsck: don't put a void*-shaped peg in a char*-shaped hole Michael Haggerty
@ 2013-05-19 20:27 ` Michael Haggerty
  2013-05-19 20:27 ` [PATCH 15/17] find_first_merges(): remove unnecessary code Michael Haggerty
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty


Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 submodule.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/submodule.c b/submodule.c
index e728025..b837c04 100644
--- a/submodule.c
+++ b/submodule.c
@@ -846,7 +846,7 @@ static int find_first_merges(struct object_array *result, const char *path,
 		struct commit *a, struct commit *b)
 {
 	int i, j;
-	struct object_array merges;
+	struct object_array merges = OBJECT_ARRAY_INIT;
 	struct commit *commit;
 	int contains_another;
 
@@ -856,7 +856,6 @@ static int find_first_merges(struct object_array *result, const char *path,
 	struct rev_info revs;
 	struct setup_revision_opt rev_opts;
 
-	memset(&merges, 0, sizeof(merges));
 	memset(result, 0, sizeof(struct object_array));
 	memset(&rev_opts, 0, sizeof(rev_opts));
 
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 15/17] find_first_merges(): remove unnecessary code
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
                   ` (13 preceding siblings ...)
  2013-05-19 20:27 ` [PATCH 14/17] find_first_merges(): initialize merges variable using initializer Michael Haggerty
@ 2013-05-19 20:27 ` Michael Haggerty
  2013-05-19 20:27 ` [RFC 16/17] object_array_entry: copy name before storing in name field Michael Haggerty
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty

No names are ever set for the object_array_entries in merges, so there
is no need to pretend to copy them to the result array.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 submodule.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/submodule.c b/submodule.c
index b837c04..ad476ce 100644
--- a/submodule.c
+++ b/submodule.c
@@ -893,8 +893,7 @@ static int find_first_merges(struct object_array *result, const char *path,
 		}
 
 		if (!contains_another)
-			add_object_array(merges.objects[i].item,
-					 merges.objects[i].name, result);
+			add_object_array(merges.objects[i].item, NULL, result);
 	}
 
 	free(merges.objects);
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC 16/17] object_array_entry: copy name before storing in name field
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
                   ` (14 preceding siblings ...)
  2013-05-19 20:27 ` [PATCH 15/17] find_first_merges(): remove unnecessary code Michael Haggerty
@ 2013-05-19 20:27 ` Michael Haggerty
  2013-05-20 10:33   ` Johan Herland
  2013-05-19 20:27 ` [RFC 17/17] refs: document the lifetime of the refname passed to each_ref_fn Michael Haggerty
  2013-05-20 10:28 ` [PATCH 00/17] Remove assumptions about refname lifetimes Johan Herland
  17 siblings, 1 reply; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty

Change object_array and object_array_entry to copy the name before
storing it in the name field, and free it when an entry is deleted
from the array.  This is useful because some of the name strings
passed to add_object_array() or add_object_array_with_mode() are
refnames whose lifetime is not defined by the refs API (and which we
want to shorten).

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
This is the culmination of the last few commits.  Since some callers
want to store refnames in the name field of object_array elements, but
we don't want those callers to assume that the refnames that they got
from for_each_ref() have infinite lifetime, the easiest thing to do is
have object_array make a copy of the names before writing them in the
entries, and to free the names for entries that are no longer in use.
This change fixes the problem, but has some disadvantages:

* It requires extra copies to be made of strings that are already
  copies, for example when the results of path_name(path, name) are
  used as a name in revision.c:add_object().  This might be rare
  enough that it can be ignored (though the original result of
  path_name() would have to be freed, which this patch doesn't do so
  there is a memory leak).

* Many callers store the empty string ("") as the name; for example,
  most of the entries created during a run of rev-list have "" as
  their name.  This means that lots of needless copies of "" are being
  made.  I think that the best solution to this problem would be to
  store NULL rather than "" for such entries, but I haven't figured
  out all of the places where the name is used.

The alternative would be to have callers make the copies if necessary
*before* passing the names into add_object_array(), and themselves
ensure that those copies get freed sometime.  This would be more work:
effectively each object_array would have to have its own memory
ownership policy and we would have to figure out exactly where in the
code entries are added and removed from particular lists.

Since I'm not too clear on what these names are used for, how many
object_array entries are created in different scenarios, etc., I
decided to submit this patch as an RFC to get some feedback before I
work on a final solution.

 object.c | 6 +++++-
 object.h | 6 +++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/object.c b/object.c
index 10b5349..a678c1b 100644
--- a/object.c
+++ b/object.c
@@ -273,7 +273,7 @@ void add_object_array_with_mode(struct object *obj, const char *name, struct obj
 		array->objects = objects;
 	}
 	objects[nr].item = obj;
-	objects[nr].name = name;
+	objects[nr].name = name ? xstrdup(name) : NULL;
 	objects[nr].mode = mode;
 	array->nr = ++nr;
 }
@@ -289,6 +289,8 @@ void object_array_filter(struct object_array *array,
 			if (src != dst)
 				objects[dst] = objects[src];
 			dst++;
+		} else {
+			free(objects[src].name);
 		}
 	}
 	array->nr = dst;
@@ -319,6 +321,8 @@ void object_array_remove_duplicates(struct object_array *array)
 			if (src != array->nr)
 				objects[array->nr] = objects[src];
 			array->nr++;
+		} else {
+			free(objects[src].name);
 		}
 	}
 }
diff --git a/object.h b/object.h
index 6c1c27f..f2c503a 100644
--- a/object.h
+++ b/object.h
@@ -11,7 +11,11 @@ struct object_array {
 	unsigned int alloc;
 	struct object_array_entry {
 		struct object *item;
-		const char *name;
+		/*
+		 * name or NULL.  If non-NULL, the memory pointed to
+		 * is owned by this object.
+		 */
+		char *name;
 		unsigned mode;
 	} *objects;
 };
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC 17/17] refs: document the lifetime of the refname passed to each_ref_fn
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
                   ` (15 preceding siblings ...)
  2013-05-19 20:27 ` [RFC 16/17] object_array_entry: copy name before storing in name field Michael Haggerty
@ 2013-05-19 20:27 ` Michael Haggerty
  2013-05-20 10:28 ` [PATCH 00/17] Remove assumptions about refname lifetimes Johan Herland
  17 siblings, 0 replies; 42+ messages in thread
From: Michael Haggerty @ 2013-05-19 20:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Johan Herland, git, Michael Haggerty

The lifetime of the refname was never documented, but some callers
used to assume that its lifetime was essentially permanent.  The
commits leading up to this have disabused such callers of that notion.

The new status quo is that the API explicitly does *not* guarantee
that the refname string lives beyond a single callback invocation.
Document that fact.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
This patch is the ultimate goal of the series, and I include it for
illustration, but it obviously shouldn't be committed before the
object_array questions have been answered and the rest of the code has
been audited more carefully.

 refs.h | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/refs.h b/refs.h
index a35eafc..e67fb07 100644
--- a/refs.h
+++ b/refs.h
@@ -15,13 +15,23 @@ struct ref_lock {
 #define REF_ISBROKEN 0x04
 
 /*
- * Calls the specified function for each ref file until it returns
- * nonzero, and returns the value.  Please note that it is not safe to
- * modify references while an iteration is in progress, unless the
- * same callback function invocation that modifies the reference also
- * returns a nonzero value to immediately stop the iteration.
+ * The callback functions for the for_each_*() functions below must
+ * have this signature.  The memory pointed to by the refname argument
+ * is only guaranteed to be valid for the duration of a single
+ * callback invocation.
+ */
+typedef int each_ref_fn(const char *refname,
+			const unsigned char *sha1, int flags, void *cb_data);
+
+/*
+ * The following functions invoke the specified callback function for
+ * each reference indicated.  If the function ever returns a nonzero
+ * value, stop the iteration and return that value.  Please note that
+ * it is not safe to modify references while an iteration is in
+ * progress, unless the same callback function invocation that
+ * modifies the reference also returns a nonzero value to immediately
+ * stop the iteration.
  */
-typedef int each_ref_fn(const char *refname, const unsigned char *sha1, int flags, void *cb_data);
 extern int head_ref(each_ref_fn, void *);
 extern int for_each_ref(each_ref_fn, void *);
 extern int for_each_ref_in(const char *, each_ref_fn, void *);
-- 
1.8.2.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/17] Remove assumptions about refname lifetimes
  2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
                   ` (16 preceding siblings ...)
  2013-05-19 20:27 ` [RFC 17/17] refs: document the lifetime of the refname passed to each_ref_fn Michael Haggerty
@ 2013-05-20 10:28 ` Johan Herland
  2013-05-20 12:15   ` Michael Haggerty
  2013-05-20 16:37   ` Junio C Hamano
  17 siblings, 2 replies; 42+ messages in thread
From: Johan Herland @ 2013-05-20 10:28 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, Junio C Hamano, git

On Sun, May 19, 2013 at 10:26 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> Recently a number of race conditions related to references have been
> discovered.  There is likely to be a two-pronged solution to the
> races:
>
> * For traditional, filesystem-based references, there will have to be
>   more checks that the ref caches are still up-to-date at the time of
>   their use (see, for example, [1]).  If not, the ref cache will have
>   to be invalidated and reloaded.  Assuming that the invalidation of
>   the old cache includes freeing its memory, such an invalidation will
>   cause lots of refname strings to be freed even though callers might
>   still hold references to them.
>
> * For server-class installations, filesystem-based references might
>   not be robust enough for 100% reliable operation, because the
>   reading of the complete set of references is not an atomic
>   operation.  If another reference storage mechanism is developed,
>   there is no reason to expect the refnames strings to have long
>   lifetimes.

(Sorry for going slightly off-topic and returning to the general
discussion on how to resolve the race conditions...)

For server-class installations we need ref storage that can be read
(and updated?) atomically, and the current system of loose + packed
files won't work since reading (and updating) more than a single file
is not an atomic operation. Trivially, one could resolve this by
dropping loose refs, and always using a single packed-refs file, but
that would make it prohibitively expensive to update refs (the entire
packed-refs file must be rewritten for every update).

Now, observe that we don't have these race conditions in the object
database, because it is an add-only immutable data store.

What if we stored the refs as a tree object in the object database,
referenced by a single (loose) ref? There would be a _single_ (albeit
highly contentious) file outside the object database that represent
the current state of the refs, but hopefully we can guarantee
atomicity when reading (and updating?) that one file. Transactions can
be done by:
 1. Recording the tree id holding the refs before starting manipulation.
 2. Creating a new tree object holding the manipulated state.
 3. Re-checking the tree id before replacing the loose ref. If
unchanged: commit, else: rollback/error out.

All readers would trivially have access to a consistent refs view,
since the state of the entire refs hierarchy is held in the tree id
read from that single loose ref.

It seems to me this should be somewhat less prohibitively expensive
than maintaining all refs in a single packed-refs file. That said, we
do end up producing a few new objects for every single ref update,
most of which would be thrown away by a future "gc". This might bog
things down, but I'm not sure how much.

I'm sure someone must have had this idea before (although I don't
remember this alternative being raised at the Git Merge conference),
so please enlighten me as to why this won't work... ;)

...Johan


PS: Keeping reflogs is just a matter of wrapping the ref tree in a
commit object using the previous state of the ref tree as its parent.

-- 
Johan Herland, <johan@herland.net>
www.herland.net

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC 16/17] object_array_entry: copy name before storing in name field
  2013-05-19 20:27 ` [RFC 16/17] object_array_entry: copy name before storing in name field Michael Haggerty
@ 2013-05-20 10:33   ` Johan Herland
  2013-05-20 14:42     ` Michael Haggerty
  0 siblings, 1 reply; 42+ messages in thread
From: Johan Herland @ 2013-05-20 10:33 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, Junio C Hamano, git

On Sun, May 19, 2013 at 10:27 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> This is the culmination of the last few commits.  Since some callers
> want to store refnames in the name field of object_array elements, but
> we don't want those callers to assume that the refnames that they got
> from for_each_ref() have infinite lifetime, the easiest thing to do is
> have object_array make a copy of the names before writing them in the
> entries, and to free the names for entries that are no longer in use.
> This change fixes the problem, but has some disadvantages:
>
> * It requires extra copies to be made of strings that are already
>   copies, for example when the results of path_name(path, name) are
>   used as a name in revision.c:add_object().  This might be rare
>   enough that it can be ignored (though the original result of
>   path_name() would have to be freed, which this patch doesn't do so
>   there is a memory leak).
>
> * Many callers store the empty string ("") as the name; for example,
>   most of the entries created during a run of rev-list have "" as
>   their name.  This means that lots of needless copies of "" are being
>   made.  I think that the best solution to this problem would be to
>   store NULL rather than "" for such entries, but I haven't figured
>   out all of the places where the name is used.

Use strbufs?

No allocation (except for the strbuf object itself) is needed for
empty strings, and string ownership and be transferred to and from it
to prevent extra copies.

...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/17] Remove assumptions about refname lifetimes
  2013-05-20 10:28 ` [PATCH 00/17] Remove assumptions about refname lifetimes Johan Herland
@ 2013-05-20 12:15   ` Michael Haggerty
  2013-05-20 16:37   ` Junio C Hamano
  1 sibling, 0 replies; 42+ messages in thread
From: Michael Haggerty @ 2013-05-20 12:15 UTC (permalink / raw)
  To: Johan Herland; +Cc: Jeff King, Junio C Hamano, git

This is a very interesting idea.  "It's turtles all the way down."

On 05/20/2013 12:28 PM, Johan Herland wrote:
> (Sorry for going slightly off-topic and returning to the general
> discussion on how to resolve the race conditions...)
> 
> For server-class installations we need ref storage that can be read
> (and updated?) atomically, and the current system of loose + packed
> files won't work since reading (and updating) more than a single file
> is not an atomic operation. Trivially, one could resolve this by
> dropping loose refs, and always using a single packed-refs file, but
> that would make it prohibitively expensive to update refs (the entire
> packed-refs file must be rewritten for every update).

Correct, or the "packed-refs" file would have to be updated in place
using some database-style approach for locking/transactions/whatever.

> Now, observe that we don't have these race conditions in the object
> database, because it is an add-only immutable data store.

Except for prune, of course, which can cause race conditions WRT to writers.

> What if we stored the refs as a tree object in the object database,
> referenced by a single (loose) ref? There would be a _single_ (albeit
> highly contentious) file outside the object database that represent
> the current state of the refs, but hopefully we can guarantee
> atomicity when reading (and updating?) that one file. Transactions can
> be done by:
>  1. Recording the tree id holding the refs before starting manipulation.
>  2. Creating a new tree object holding the manipulated state.
>  3. Re-checking the tree id before replacing the loose ref. If
> unchanged: commit, else: rollback/error out.

There are two closely related possibilities and I'm not sure which one
you mean:

* Effectively treat all of the refs as loose refs, but stored not in the
filesystem but rather in a hierarchical tree structure in the object
database.  E.g., all of the refs directly under "refs/heads" would be in
one tree object, those in refs/remotes/foo in a second, those for
refs/remotes/bar in another etc. and all of them linked up together in a
tree object representing "refs".

* Effectively treat all of the refs as packed refs, but store the single
"packed-refs" file as a single object in the object database.

(The first alternative sounds more practical to me.  I also guess that's
what you mean, since down below you say that each change would require
producing "a few objects".)

Of course in either case we couldn't use a tree object directly, because
these new "reference tree" objects would refer not only to blobs and
other trees but also to commits and tags.

> All readers would trivially have access to a consistent refs view,
> since the state of the entire refs hierarchy is held in the tree id
> read from that single loose ref.
> 
> It seems to me this should be somewhat less prohibitively expensive
> than maintaining all refs in a single packed-refs file. That said, we
> do end up producing a few new objects for every single ref update,
> most of which would be thrown away by a future "gc". This might bog
> things down, but I'm not sure how much.
> 
> I'm sure someone must have had this idea before (although I don't
> remember this alternative being raised at the Git Merge conference),
> so please enlighten me as to why this won't work... ;)

[I know this is not what you are suggesting, but I am reminded of
Subversion, which stores trunk, branches, and tags in the same "tree"
space as the contents of the working trees.  A Subversion commit
references a gigantic tree encompassing all branches of development and
all files on all of those branches (with cheap copies to reduce the
redundancy):

    /
    /trunk/
    /trunk/Makefile
    /trunk/src/
    /trunk/src/foo.c
    /branches/
    /branches/next/
    /branches/next/Makefile
    /branches/next/src/
    /branches/next/src/foo.c
    /branches/pu/
    /branches/pu/Makefile
    /branches/pu/src/
    /branches/pu/src/foo.c
    /tags/
    /tags/v1.8.2/
    /tags/v1.8.2/Makefile
    /tags/v1.8.2/src/
    /tags/v1.8.2/src/foo.c
    etc...

A Subversion commit thus describes the state of *every* branch and tag
at that moment in time.  The model is conceptually very simple (in fact,
too simple, and I believe the Subversion developers regret not having
distinguished between the branch namespace and the file namespace).]

The main difficulty with this idea will be the extreme contention on
that "last loose reference file" pointing at the root of the reference
tree.  Essentially *every* change to the repository will have to create
a new reference tree and point this file at the new version.  I doubt
that would be a problem for short-lived operations, but I fear that a
long-lived operation would *never* get done.  By the time it had
finished constructing its new reference tree, some other short-lived
operation will have changed it, and the long-lived process will have to
choose between

* Restart from the beginning.

* Die with a kind of "concurrent modification error".

* Resolve the difference between the reference tree at the start of its
operation and the reference tree as it exists when it is done with the
changes that they want to make.  In some cases this might be able to be
done automatically as a kind of "reference tree merge" but the logic
might have to vary from case to case.

> PS: Keeping reflogs is just a matter of wrapping the ref tree in a
> commit object using the previous state of the ref tree as its parent.

Yes, there are a lot of nice aspects to this idea in that it reuses
concepts with which we are already familiar.  For example, fetching from
a remote would approximately hook the remote's entire reference tree
into a subtree of the local "refs/remotes" reference subtree.  But with
things like reflogs we would have to be careful not to keep obsolete
objects around *forever*--there would have to be some mechanism to prune
the old reference history.

Altogether a very interesting idea.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC 16/17] object_array_entry: copy name before storing in name field
  2013-05-20 10:33   ` Johan Herland
@ 2013-05-20 14:42     ` Michael Haggerty
  2013-05-20 16:44       ` Jeff King
  0 siblings, 1 reply; 42+ messages in thread
From: Michael Haggerty @ 2013-05-20 14:42 UTC (permalink / raw)
  To: Johan Herland; +Cc: Jeff King, Junio C Hamano, git

On 05/20/2013 12:33 PM, Johan Herland wrote:
> On Sun, May 19, 2013 at 10:27 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> This is the culmination of the last few commits.  Since some callers
>> want to store refnames in the name field of object_array elements, but
>> we don't want those callers to assume that the refnames that they got
>> from for_each_ref() have infinite lifetime, the easiest thing to do is
>> have object_array make a copy of the names before writing them in the
>> entries, and to free the names for entries that are no longer in use.
>> This change fixes the problem, but has some disadvantages:
>>
>> * It requires extra copies to be made of strings that are already
>>   copies, for example when the results of path_name(path, name) are
>>   used as a name in revision.c:add_object().  This might be rare
>>   enough that it can be ignored (though the original result of
>>   path_name() would have to be freed, which this patch doesn't do so
>>   there is a memory leak).
>>
>> * Many callers store the empty string ("") as the name; for example,
>>   most of the entries created during a run of rev-list have "" as
>>   their name.  This means that lots of needless copies of "" are being
>>   made.  I think that the best solution to this problem would be to
>>   store NULL rather than "" for such entries, but I haven't figured
>>   out all of the places where the name is used.
> 
> Use strbufs?
> 
> No allocation (except for the strbuf object itself) is needed for
> empty strings, and string ownership and be transferred to and from it
> to prevent extra copies.

That would cost two extra size_t per object_array_entry.  I have the
feeling that this structure is used often enough that the extra overhead
would be a disadvantage, but I'm not sure.

The obvious alternative would be to teach users to deal with NULL and
either add another constructor alternative that transfers string
ownership or *always* transfer string ownership and change the callers
to call xstrdup() if they don't already own the name string.  I think I
will try that approach first.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/17] Remove assumptions about refname lifetimes
  2013-05-20 10:28 ` [PATCH 00/17] Remove assumptions about refname lifetimes Johan Herland
  2013-05-20 12:15   ` Michael Haggerty
@ 2013-05-20 16:37   ` Junio C Hamano
  2013-05-20 16:59     ` Jeff King
  2013-05-20 17:03     ` Johan Herland
  1 sibling, 2 replies; 42+ messages in thread
From: Junio C Hamano @ 2013-05-20 16:37 UTC (permalink / raw)
  To: Johan Herland; +Cc: Michael Haggerty, Jeff King, git

Johan Herland <johan@herland.net> writes:

> For server-class installations we need ref storage that can be read
> (and updated?) atomically, and the current system of loose + packed
> files won't work since reading (and updating) more than a single file
> is not an atomic operation. Trivially, one could resolve this by
> dropping loose refs, and always using a single packed-refs file, but
> that would make it prohibitively expensive to update refs (the entire
> packed-refs file must be rewritten for every update).
>
> Now, observe that we don't have these race conditions in the object
> database, because it is an add-only immutable data store.
>
> What if we stored the refs as a tree object in the object database,
> referenced by a single (loose) ref?

What is the cost of updating a single branch with that scheme?

Doesn't it end up recording roughly the same amount of information
as updating a single packed-refs file that is flat, but with the
need to open a few tree objects (top-level, refs/, and refs/heads/),
writing out a blob that stores the object name at the tip, computing
the updated trees (refs/heads/, refs/ and top-level), and then
finally doing the compare-and-swap of that single loose ref?

You may guarantee atomicity but it is the same granularity of
atomicity as a single packed-refs file.  When you are updating a
branch while somebody else is updating a tag, of course you do not
have to look at refs/tags/ in your operation and you can write out
your final packed-refs equivalent tree to the object database
without racing with the other process, but the top-level you come up
with and the top-level the other process comes up with (which has
an up-to-date refs/tags/ part, but has a stale refs/heads/ part from
your point of view) have to race to update that single loose ref,
and one of you have to back out.

That "backing out" can be made more intelligently than just dying
with "compare and swap failed--please retry" message, e.g. you at
that point notice what you started with, what the other party did
while you were working on (i.e. updating refs/tags/), and three-way
merge the refs tree, and in cases where "all refs recorded as loose
refs" scheme wouldn't have resulted in problematic conflict, such a
three-way merge would resolve trivially (you updated refs/heads/ and
the update by the other process to refs/tags/ would not conflict
with what you did).  But the same three-way merge scheme can be
employed with the current flat single packed-refs scheme, can't it?

Even worse, what is the cost of looking up the value of a single
branch?  You would need to open a few tree objects and the leaf blob
that records the object name the ref points at, wouldn't you?

Right now, such a look-up is either opening a single small file and
reading the first 41 bytes off of it, and falling back (when the ref
in question is packed) to read a single packed-refs file and finding
the ref you want from it.

So...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC 16/17] object_array_entry: copy name before storing in name field
  2013-05-20 14:42     ` Michael Haggerty
@ 2013-05-20 16:44       ` Jeff King
  2013-05-20 21:34         ` Michael Haggerty
  0 siblings, 1 reply; 42+ messages in thread
From: Jeff King @ 2013-05-20 16:44 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Johan Herland, Junio C Hamano, git

On Mon, May 20, 2013 at 04:42:38PM +0200, Michael Haggerty wrote:

> >> * Many callers store the empty string ("") as the name; for example,
> >>   most of the entries created during a run of rev-list have "" as
> >>   their name.  This means that lots of needless copies of "" are being
> >>   made.  I think that the best solution to this problem would be to
> >>   store NULL rather than "" for such entries, but I haven't figured
> >>   out all of the places where the name is used.
> > 
> > Use strbufs?
> > 
> > No allocation (except for the strbuf object itself) is needed for
> > empty strings, and string ownership and be transferred to and from it
> > to prevent extra copies.
> 
> That would cost two extra size_t per object_array_entry.  I have the
> feeling that this structure is used often enough that the extra overhead
> would be a disadvantage, but I'm not sure.
> 
> The obvious alternative would be to teach users to deal with NULL and
> either add another constructor alternative that transfers string
> ownership or *always* transfer string ownership and change the callers
> to call xstrdup() if they don't already own the name string.  I think I
> will try that approach first.

You could use the same trick that strbuf does: instead of NULL, point to
a well-known empty string literal. Readers do not have to care about
this optimization at all; only writers need to recognize the well-known
pointer value. And since we do not update in place but only eventually
free, it really is just that anyone calling free() would do "if (name !=
well_known_empty_string)".

-Peff

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/17] Remove assumptions about refname lifetimes
  2013-05-20 16:37   ` Junio C Hamano
@ 2013-05-20 16:59     ` Jeff King
  2013-05-20 17:08       ` Johan Herland
  2013-05-20 18:03       ` Junio C Hamano
  2013-05-20 17:03     ` Johan Herland
  1 sibling, 2 replies; 42+ messages in thread
From: Jeff King @ 2013-05-20 16:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johan Herland, Michael Haggerty, git

On Mon, May 20, 2013 at 09:37:30AM -0700, Junio C Hamano wrote:

> Johan Herland <johan@herland.net> writes:
> 
> > For server-class installations we need ref storage that can be read
> > (and updated?) atomically, and the current system of loose + packed
> > files won't work since reading (and updating) more than a single file
> > is not an atomic operation. Trivially, one could resolve this by
> > dropping loose refs, and always using a single packed-refs file, but
> > that would make it prohibitively expensive to update refs (the entire
> > packed-refs file must be rewritten for every update).
> >
> > Now, observe that we don't have these race conditions in the object
> > database, because it is an add-only immutable data store.
> >
> > What if we stored the refs as a tree object in the object database,
> > referenced by a single (loose) ref?
> 
> What is the cost of updating a single branch with that scheme?

Yeah, I had the same thoughts as you here; I don't think this is really
much better than updating a single packed-refs file. You may only
have to touch log(N) of the entries to update the trees along the path
to your ref (assuming you have a bushy refs/ tree; in practice, of
course, you have a lot of entries in refs/heads/, so you do not quite
get log behavior). So algorithmically it may be slightly better, but you
pay a much higher constant, in that you are creating many tree objects
along the path (and reading them on lookup).

But more importantly, it introduces contention between two unrelated
refs that are being updated. Even if we reconcile the differences
automatically (e.g., with a "merge-and-retry" strategy), that is going
to be a serious performance regression for a busy repository, as we
repeatedly try to reconcile the serialized updates to the refs/ root
tree.

Any transactional solution needs to have the equivalent of ref-level
locking (i.e., row-level locking in a relational setting). It is OK for
two simultaneous updates to different rows to happen in random order; we
don't care about that level of atomicity. The important thing is that
the _readers_ see transactions in consistent order; if we know that
update B started after update A finished, then we must never see update
A without update B reflected. And I think that is more or less a solved
problem in the database world.

That is what prevents:

  git update-ref A B
  git update-ref -d B

from creating a view where the reader sees neither A nor B (which is
what can happen with the current filesystem view).

-Peff

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/17] Remove assumptions about refname lifetimes
  2013-05-20 16:37   ` Junio C Hamano
  2013-05-20 16:59     ` Jeff King
@ 2013-05-20 17:03     ` Johan Herland
  2013-05-21 18:39       ` Junio C Hamano
  1 sibling, 1 reply; 42+ messages in thread
From: Johan Herland @ 2013-05-20 17:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Michael Haggerty, Jeff King, git

On Mon, May 20, 2013 at 6:37 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Johan Herland <johan@herland.net> writes:
>> For server-class installations we need ref storage that can be read
>> (and updated?) atomically, and the current system of loose + packed
>> files won't work since reading (and updating) more than a single file
>> is not an atomic operation. Trivially, one could resolve this by
>> dropping loose refs, and always using a single packed-refs file, but
>> that would make it prohibitively expensive to update refs (the entire
>> packed-refs file must be rewritten for every update).
>>
>> Now, observe that we don't have these race conditions in the object
>> database, because it is an add-only immutable data store.
>>
>> What if we stored the refs as a tree object in the object database,
>> referenced by a single (loose) ref?
>
> What is the cost of updating a single branch with that scheme?
>
> Doesn't it end up recording roughly the same amount of information
> as updating a single packed-refs file that is flat, but with the
> need to open a few tree objects (top-level, refs/, and refs/heads/),
> writing out a blob that stores the object name at the tip, computing
> the updated trees (refs/heads/, refs/ and top-level), and then
> finally doing the compare-and-swap of that single loose ref?

Yes, except that when you update packed-refs, you have to write out
the _entire_ file, whereas with this scheme you only have to write out
the part of the refs hierarchy you actually touched (e.g. rewriting
refs/heads/foo would not have to write out anything inside
refs/tags/*). If you have small number of branches, and a huge number
of tags, this scheme might end up being cheaper than writing the
entire packed-refs. But in general, it's probably much more expensive
to go via the odb.

> You may guarantee atomicity but it is the same granularity of
> atomicity as a single packed-refs file.

Yes, as I argued elsewhere in this thread: It seems that _any_
filesystem-based solution must resort to having all updates depend on
a single file in order to guarantee atomicity.

> When you are updating a
> branch while somebody else is updating a tag, of course you do not
> have to look at refs/tags/ in your operation and you can write out
> your final packed-refs equivalent tree to the object database
> without racing with the other process, but the top-level you come up
> with and the top-level the other process comes up with (which has
> an up-to-date refs/tags/ part, but has a stale refs/heads/ part from
> your point of view) have to race to update that single loose ref,
> and one of you have to back out.

True.

> That "backing out" can be made more intelligently than just dying
> with "compare and swap failed--please retry" message, e.g. you at
> that point notice what you started with, what the other party did
> while you were working on (i.e. updating refs/tags/), and three-way
> merge the refs tree, and in cases where "all refs recorded as loose
> refs" scheme wouldn't have resulted in problematic conflict, such a
> three-way merge would resolve trivially (you updated refs/heads/ and
> the update by the other process to refs/tags/ would not conflict
> with what you did).  But the same three-way merge scheme can be
> employed with the current flat single packed-refs scheme, can't it?

Yes. (albeit without reusing the machinery we already have for doing
three-way merges)

> Even worse, what is the cost of looking up the value of a single
> branch?  You would need to open a few tree objects and the leaf blob
> that records the object name the ref points at, wouldn't you?

Yes. Probably a showstopper, although single branch/ref lookup might
not be so common on server side, as it is on the user side.

> Right now, such a look-up is either opening a single small file and
> reading the first 41 bytes off of it, and falling back (when the ref
> in question is packed) to read a single packed-refs file and finding
> the ref you want from it.
>
> So...

Yep...


...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/17] Remove assumptions about refname lifetimes
  2013-05-20 16:59     ` Jeff King
@ 2013-05-20 17:08       ` Johan Herland
  2013-05-20 18:03       ` Junio C Hamano
  1 sibling, 0 replies; 42+ messages in thread
From: Johan Herland @ 2013-05-20 17:08 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Michael Haggerty, git

On Mon, May 20, 2013 at 6:59 PM, Jeff King <peff@peff.net> wrote:
> On Mon, May 20, 2013 at 09:37:30AM -0700, Junio C Hamano wrote:
>> Johan Herland <johan@herland.net> writes:
>>
>> > For server-class installations we need ref storage that can be read
>> > (and updated?) atomically, and the current system of loose + packed
>> > files won't work since reading (and updating) more than a single file
>> > is not an atomic operation. Trivially, one could resolve this by
>> > dropping loose refs, and always using a single packed-refs file, but
>> > that would make it prohibitively expensive to update refs (the entire
>> > packed-refs file must be rewritten for every update).
>> >
>> > Now, observe that we don't have these race conditions in the object
>> > database, because it is an add-only immutable data store.
>> >
>> > What if we stored the refs as a tree object in the object database,
>> > referenced by a single (loose) ref?
>>
>> What is the cost of updating a single branch with that scheme?
>
> Yeah, I had the same thoughts as you here; I don't think this is really
> much better than updating a single packed-refs file. You may only
> have to touch log(N) of the entries to update the trees along the path
> to your ref (assuming you have a bushy refs/ tree; in practice, of
> course, you have a lot of entries in refs/heads/, so you do not quite
> get log behavior). So algorithmically it may be slightly better, but you
> pay a much higher constant, in that you are creating many tree objects
> along the path (and reading them on lookup).

True.

> But more importantly, it introduces contention between two unrelated
> refs that are being updated. Even if we reconcile the differences
> automatically (e.g., with a "merge-and-retry" strategy), that is going
> to be a serious performance regression for a busy repository, as we
> repeatedly try to reconcile the serialized updates to the refs/ root
> tree.
>
> Any transactional solution needs to have the equivalent of ref-level
> locking (i.e., row-level locking in a relational setting). It is OK for
> two simultaneous updates to different rows to happen in random order; we
> don't care about that level of atomicity. The important thing is that
> the _readers_ see transactions in consistent order; if we know that
> update B started after update A finished, then we must never see update
> A without update B reflected. And I think that is more or less a solved
> problem in the database world.
>
> That is what prevents:
>
>   git update-ref A B
>   git update-ref -d B
>
> from creating a view where the reader sees neither A nor B (which is
> what can happen with the current filesystem view).

All true. Just a last spasm from the dying notion that we could solve
this in the filesystem, without using "proper" database...

...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/17] Remove assumptions about refname lifetimes
  2013-05-20 16:59     ` Jeff King
  2013-05-20 17:08       ` Johan Herland
@ 2013-05-20 18:03       ` Junio C Hamano
  1 sibling, 0 replies; 42+ messages in thread
From: Junio C Hamano @ 2013-05-20 18:03 UTC (permalink / raw)
  To: Jeff King; +Cc: Johan Herland, Michael Haggerty, git

Jeff King <peff@peff.net> writes:

> But more importantly, it introduces contention between two unrelated
> refs that are being updated. Even if we reconcile the differences
> automatically (e.g., with a "merge-and-retry" strategy), that is going
> to be a serious performance regression for a busy repository, as we
> repeatedly try to reconcile the serialized updates to the refs/ root
> tree.

I think we are on the same page.

> Any transactional solution needs to have the equivalent of ref-level
> locking (i.e., row-level locking in a relational setting).

Not necessarily if we can exploit assumptions such as deletion is
far rare compared to update and creation which in turn are far rare
compared to lookup.  I've been wondering if we can find a cheap
reader-writer lock mechanism, use a single instance of it to govern
the whole repository, and have everybody but ref deleters and "git
pack-ref" take the read side of the lock.

Then only while ref deletion or ref packing is going on, everybody
need to stall, but otherwise the most common "read or update
recently touched refs (aka loose refs)" will be taking the reader
side of that cheap read-write lock and reading a single loose ref
file.

Perhaps such a "cheap on reader side" reader-write lock is hard to
come by?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC 16/17] object_array_entry: copy name before storing in name field
  2013-05-20 16:44       ` Jeff King
@ 2013-05-20 21:34         ` Michael Haggerty
  0 siblings, 0 replies; 42+ messages in thread
From: Michael Haggerty @ 2013-05-20 21:34 UTC (permalink / raw)
  To: Jeff King; +Cc: Johan Herland, Junio C Hamano, git

On 05/20/2013 06:44 PM, Jeff King wrote:
> On Mon, May 20, 2013 at 04:42:38PM +0200, Michael Haggerty wrote:
> 
>>>> * Many callers store the empty string ("") as the name; for example,
>>>>   most of the entries created during a run of rev-list have "" as
>>>>   their name.  This means that lots of needless copies of "" are being
>>>>   made.  I think that the best solution to this problem would be to
>>>>   store NULL rather than "" for such entries, but I haven't figured
>>>>   out all of the places where the name is used.
>>>
>>> Use strbufs?
>>>
>>> No allocation (except for the strbuf object itself) is needed for
>>> empty strings, and string ownership and be transferred to and from it
>>> to prevent extra copies.
>>
>> That would cost two extra size_t per object_array_entry.  I have the
>> feeling that this structure is used often enough that the extra overhead
>> would be a disadvantage, but I'm not sure.
>>
>> The obvious alternative would be to teach users to deal with NULL and
>> either add another constructor alternative that transfers string
>> ownership or *always* transfer string ownership and change the callers
>> to call xstrdup() if they don't already own the name string.  I think I
>> will try that approach first.
> 
> You could use the same trick that strbuf does: instead of NULL, point to
> a well-known empty string literal. Readers do not have to care about
> this optimization at all; only writers need to recognize the well-known
> pointer value. And since we do not update in place but only eventually
> free, it really is just that anyone calling free() would do "if (name !=
> well_known_empty_string)".

Yes, that sounds like the best solution.  Ultimately there is only one
writer, add_object_array_with_mode(), and it can do

    if (!name)
        entry->name = NULL;
    else if (!*name)
        entry->name = well_known_empty_string;
    else
        entry->name = xstrdup(name);

This should be a lot less intrusive than what I was trying: to change
callers who wrote name="" to write name=NULL instead.  But it is a
nightmare to find all of the code that reads name and decide whether
they need to do

    entry->name ? entry->name : ""

because that in turn depends on whether the code that wrote into the
same object_array always/never/sometimes wrote strings vs. NULL to the
name field.  Blech, encapsulation is tough in C.

While I was chasing down callers, I came across other gems like

builtin/checkout.c:
	add_pending_object(&revs, object, sha1_to_hex(object->sha1));

revision.c:
	add_pending_object(revs, object, sha1_to_hex(object->sha1));

submodule.c:
	add_pending_object(rev, &list->item->object,
		sha1_to_hex(list->item->object.sha1));

so apparently I wasn't the only one befuddled by the lifetime and
ownership of the name field of object_array_entry.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 04/17] builtin_diff_tree(): make it obvious that function wants two entries
  2013-05-19 20:26 ` [PATCH 04/17] builtin_diff_tree(): make it obvious that function wants two entries Michael Haggerty
@ 2013-05-21 17:27   ` Junio C Hamano
  2013-05-23  7:19     ` Michael Haggerty
  0 siblings, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2013-05-21 17:27 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, Johan Herland, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> Instead of accepting an array and using exactly two elements from the
> array, take two single (struct object_array_entry *) arguments.
>
> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
> ---
>  builtin/diff.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/builtin/diff.c b/builtin/diff.c
> index 8c2af6c..ba68c6c 100644
> --- a/builtin/diff.c
> +++ b/builtin/diff.c
> @@ -153,7 +153,8 @@ static int builtin_diff_index(struct rev_info *revs,
>  
>  static int builtin_diff_tree(struct rev_info *revs,
>  			     int argc, const char **argv,
> -			     struct object_array_entry *ent)
> +			     struct object_array_entry *ent0,
> +			     struct object_array_entry *ent1)
>  {
>  	const unsigned char *(sha1[2]);
>  	int swap = 0;
> @@ -161,13 +162,13 @@ static int builtin_diff_tree(struct rev_info *revs,
>  	if (argc > 1)
>  		usage(builtin_diff_usage);
>  
> -	/* We saw two trees, ent[0] and ent[1].
> -	 * if ent[1] is uninteresting, they are swapped
> +	/* We saw two trees, ent0 and ent1.
> +	 * if ent1 is uninteresting, they are swapped
>  	 */

"While you are touching this comment" is a perfect time to fix the
existing style violation.

Other than that, looks good to me.

> -	if (ent[1].item->flags & UNINTERESTING)
> +	if (ent1->item->flags & UNINTERESTING)
>  		swap = 1;
> -	sha1[swap] = ent[0].item->sha1;
> -	sha1[1-swap] = ent[1].item->sha1;
> +	sha1[swap] = ent0->item->sha1;
> +	sha1[1-swap] = ent1->item->sha1;
>  	diff_tree_sha1(sha1[0], sha1[1], "", &revs->diffopt);
>  	log_tree_diff_flush(revs);
>  	return 0;
> @@ -403,7 +404,7 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
>  	else if (ents == 1)
>  		result = builtin_diff_index(&rev, argc, argv);
>  	else if (ents == 2)
> -		result = builtin_diff_tree(&rev, argc, argv, ent);
> +		result = builtin_diff_tree(&rev, argc, argv, &ent[0], &ent[1]);
>  	else if (ent[0].item->flags & UNINTERESTING) {
>  		/*
>  		 * diff A...B where there is at least one merge base
> @@ -412,8 +413,7 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
>  		 * between the base and B.  Note that we pick one
>  		 * merge base at random if there are more than one.
>  		 */
> -		ent[1] = ent[ents-1];
> -		result = builtin_diff_tree(&rev, argc, argv, ent);
> +		result = builtin_diff_tree(&rev, argc, argv, &ent[0], &ent[ents-1]);
>  	} else
>  		result = builtin_diff_combined(&rev, argc, argv,
>  					       ent, ents);

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 05/17] cmd_diff(): use an object_array for holding trees
  2013-05-19 20:27 ` [PATCH 05/17] cmd_diff(): use an object_array for holding trees Michael Haggerty
@ 2013-05-21 17:30   ` Junio C Hamano
  2013-05-23  7:21     ` Michael Haggerty
  0 siblings, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2013-05-21 17:30 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, Johan Herland, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> Change cmd_diff() to use a (struct object_array) for holding the trees
> that it accumulates, rather than rolling its own equivalent.
>

A significant detail missing here is that this lifts the hardcoded
100 tree limit in combined diff but that does not matter in
practice, I would suppose ;-).

> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
> ---
>  builtin/diff.c | 37 ++++++++++++++++++-------------------
>  1 file changed, 18 insertions(+), 19 deletions(-)
>
> diff --git a/builtin/diff.c b/builtin/diff.c
> index ba68c6c..72d99c0 100644
> --- a/builtin/diff.c
> +++ b/builtin/diff.c
> @@ -252,8 +252,8 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
>  {
>  	int i;
>  	struct rev_info rev;
> -	struct object_array_entry ent[100];
> -	int ents = 0, blobs = 0, paths = 0;
> +	struct object_array ent = OBJECT_ARRAY_INIT;
> +	int blobs = 0, paths = 0;
>  	const char *path = NULL;
>  	struct blobinfo blob[2];
>  	int nongit;
> @@ -350,13 +350,8 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
>  		if (obj->type == OBJ_COMMIT)
>  			obj = &((struct commit *)obj)->tree->object;
>  		if (obj->type == OBJ_TREE) {
> -			if (ARRAY_SIZE(ent) <= ents)
> -				die(_("more than %d trees given: '%s'"),
> -				    (int) ARRAY_SIZE(ent), name);
>  			obj->flags |= flags;
> -			ent[ents].item = obj;
> -			ent[ents].name = name;
> -			ents++;
> +			add_object_array(obj, name, &ent);
>  			continue;
>  		}
>  		if (obj->type == OBJ_BLOB) {
> @@ -380,7 +375,7 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
>  	/*
>  	 * Now, do the arguments look reasonable?
>  	 */
> -	if (!ents) {
> +	if (!ent.nr) {
>  		switch (blobs) {
>  		case 0:
>  			result = builtin_diff_files(&rev, argc, argv);
> @@ -401,22 +396,26 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
>  	}
>  	else if (blobs)
>  		usage(builtin_diff_usage);
> -	else if (ents == 1)
> +	else if (ent.nr == 1)
>  		result = builtin_diff_index(&rev, argc, argv);
> -	else if (ents == 2)
> -		result = builtin_diff_tree(&rev, argc, argv, &ent[0], &ent[1]);
> -	else if (ent[0].item->flags & UNINTERESTING) {
> +	else if (ent.nr == 2)
> +		result = builtin_diff_tree(&rev, argc, argv,
> +					   &ent.objects[0], &ent.objects[1]);
> +	else if (ent.objects[0].item->flags & UNINTERESTING) {
>  		/*
>  		 * diff A...B where there is at least one merge base
> -		 * between A and B.  We have ent[0] == merge-base,
> -		 * ent[ents-2] == A, and ent[ents-1] == B.  Show diff
> -		 * between the base and B.  Note that we pick one
> -		 * merge base at random if there are more than one.
> +		 * between A and B.  We have ent.objects[0] ==
> +		 * merge-base, ent.objects[ents-2] == A, and
> +		 * ent.objects[ents-1] == B.  Show diff between the
> +		 * base and B.  Note that we pick one merge base at
> +		 * random if there are more than one.
>  		 */
> -		result = builtin_diff_tree(&rev, argc, argv, &ent[0], &ent[ents-1]);
> +		result = builtin_diff_tree(&rev, argc, argv,
> +					   &ent.objects[0],
> +					   &ent.objects[ent.nr-1]);
>  	} else
>  		result = builtin_diff_combined(&rev, argc, argv,
> -					       ent, ents);
> +					       ent.objects, ent.nr);
>  	result = diff_result_code(&rev.diffopt, result);
>  	if (1 < rev.diffopt.skip_stat_unmatch)
>  		refresh_index_quietly();

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 08/17] revision: split some overly-long lines
  2013-05-19 20:27 ` [PATCH 08/17] revision: split some overly-long lines Michael Haggerty
@ 2013-05-21 17:34   ` Junio C Hamano
  2013-05-23  6:27     ` Michael Haggerty
  0 siblings, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2013-05-21 17:34 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, Johan Herland, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
> ---
>  revision.c | 20 ++++++++++++++------
>  revision.h | 32 +++++++++++++++++++++-----------
>  2 files changed, 35 insertions(+), 17 deletions(-)

Looks obviously good for *.c file, but I am on the fence for *.h
one, as the reason we kept these long single lines in *.h files was
to help those who want to grep in *.h files to let them view the
full function signature.  It probably is OK to tell them to use
"git grep -A$n" instead, though.

> diff --git a/revision.c b/revision.c
> index 25e424c..8ac88d6 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -70,7 +70,8 @@ static int show_path_truncated(FILE *out, const struct name_path *path)
>  	return ours || emitted;
>  }
>  
> -void show_object_with_name(FILE *out, struct object *obj, const struct name_path *path, const char *component)
> +void show_object_with_name(FILE *out, struct object *obj,
> +			   const struct name_path *path, const char *component)
>  {
>  	struct name_path leaf;
>  	leaf.up = (struct name_path *)path;
> @@ -186,7 +187,9 @@ void mark_parents_uninteresting(struct commit *commit)
>  	}
>  }
>  
> -static void add_pending_object_with_mode(struct rev_info *revs, struct object *obj, const char *name, unsigned mode)
> +static void add_pending_object_with_mode(struct rev_info *revs,
> +					 struct object *obj,
> +					 const char *name, unsigned mode)
>  {
>  	if (!obj)
>  		return;
> @@ -209,7 +212,8 @@ static void add_pending_object_with_mode(struct rev_info *revs, struct object *o
>  	add_object_array_with_mode(obj, name, &revs->pending, mode);
>  }
>  
> -void add_pending_object(struct rev_info *revs, struct object *obj, const char *name)
> +void add_pending_object(struct rev_info *revs,
> +			struct object *obj, const char *name)
>  {
>  	add_pending_object_with_mode(revs, obj, name, S_IFINVALID);
>  }
> @@ -226,7 +230,9 @@ void add_head_to_pending(struct rev_info *revs)
>  	add_pending_object(revs, obj, "HEAD");
>  }
>  
> -static struct object *get_reference(struct rev_info *revs, const char *name, const unsigned char *sha1, unsigned int flags)
> +static struct object *get_reference(struct rev_info *revs, const char *name,
> +				    const unsigned char *sha1,
> +				    unsigned int flags)
>  {
>  	struct object *object;
>  
> @@ -247,7 +253,8 @@ void add_pending_sha1(struct rev_info *revs, const char *name,
>  	add_pending_object(revs, object, name);
>  }
>  
> -static struct commit *handle_commit(struct rev_info *revs, struct object *object, const char *name)
> +static struct commit *handle_commit(struct rev_info *revs,
> +				    struct object *object, const char *name)
>  {
>  	unsigned long flags = object->flags;
>  
> @@ -368,7 +375,8 @@ static void file_change(struct diff_options *options,
>  	DIFF_OPT_SET(options, HAS_CHANGES);
>  }
>  
> -static int rev_compare_tree(struct rev_info *revs, struct commit *parent, struct commit *commit)
> +static int rev_compare_tree(struct rev_info *revs,
> +			    struct commit *parent, struct commit *commit)
>  {
>  	struct tree *t1 = parent->tree;
>  	struct tree *t2 = commit->tree;
> diff --git a/revision.h b/revision.h
> index 01bd2b7..9628465 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -195,19 +195,23 @@ struct setup_revision_opt {
>  };
>  
>  extern void init_revisions(struct rev_info *revs, const char *prefix);
> -extern int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct setup_revision_opt *);
> +extern int setup_revisions(int argc, const char **argv, struct rev_info *revs,
> +			   struct setup_revision_opt *);
>  extern void parse_revision_opt(struct rev_info *revs, struct parse_opt_ctx_t *ctx,
> -				 const struct option *options,
> -				 const char * const usagestr[]);
> +			       const struct option *options,
> +			       const char * const usagestr[]);
>  #define REVARG_CANNOT_BE_FILENAME 01
>  #define REVARG_COMMITTISH 02
> -extern int handle_revision_arg(const char *arg, struct rev_info *revs, int flags, unsigned revarg_opt);
> +extern int handle_revision_arg(const char *arg, struct rev_info *revs,
> +			       int flags, unsigned revarg_opt);
>  
>  extern void reset_revision_walk(void);
>  extern int prepare_revision_walk(struct rev_info *revs);
>  extern struct commit *get_revision(struct rev_info *revs);
> -extern char *get_revision_mark(const struct rev_info *revs, const struct commit *commit);
> -extern void put_revision_mark(const struct rev_info *revs, const struct commit *commit);
> +extern char *get_revision_mark(const struct rev_info *revs,
> +			       const struct commit *commit);
> +extern void put_revision_mark(const struct rev_info *revs,
> +			      const struct commit *commit);
>  
>  extern void mark_parents_uninteresting(struct commit *commit);
>  extern void mark_tree_uninteresting(struct tree *tree);
> @@ -220,15 +224,19 @@ struct name_path {
>  
>  char *path_name(const struct name_path *path, const char *name);
>  
> -extern void show_object_with_name(FILE *, struct object *, const struct name_path *, const char *);
> +extern void show_object_with_name(FILE *, struct object *,
> +				  const struct name_path *, const char *);
>  
>  extern void add_object(struct object *obj,
>  		       struct object_array *p,
>  		       struct name_path *path,
>  		       const char *name);
>  
> -extern void add_pending_object(struct rev_info *revs, struct object *obj, const char *name);
> -extern void add_pending_sha1(struct rev_info *revs, const char *name, const unsigned char *sha1, unsigned int flags);
> +extern void add_pending_object(struct rev_info *revs,
> +			       struct object *obj, const char *name);
> +extern void add_pending_sha1(struct rev_info *revs,
> +			     const char *name, const unsigned char *sha1,
> +			     unsigned int flags);
>  
>  extern void add_head_to_pending(struct rev_info *);
>  
> @@ -238,7 +246,9 @@ enum commit_action {
>  	commit_error
>  };
>  
> -extern enum commit_action get_commit_action(struct rev_info *revs, struct commit *commit);
> -extern enum commit_action simplify_commit(struct rev_info *revs, struct commit *commit);
> +extern enum commit_action get_commit_action(struct rev_info *revs,
> +					    struct commit *commit);
> +extern enum commit_action simplify_commit(struct rev_info *revs,
> +					  struct commit *commit);
>  
>  #endif

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/17] get_revision_internal(): make check less mysterious
  2013-05-19 20:27 ` [PATCH 10/17] get_revision_internal(): make check less mysterious Michael Haggerty
@ 2013-05-21 17:38   ` Junio C Hamano
  2013-05-23  6:39     ` Michael Haggerty
  0 siblings, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2013-05-21 17:38 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, Johan Herland, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> The condition under which gc_boundary() is called was previously
>
>     if (alloc <= nr)
>
> .  But by construction, nr can never exceed alloc, so the check looks
> unnecessarily mysterious.  In fact, the purpose of the check is to try
> to avoid a realloc() call by shrinking the array if possible if it is
> at its allocation limit when a new element is about to be added.  So
> change the check to
>
>     if (nr == alloc)
>
> and add a comment to explain what's going on.
>
> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
> ---
> Please check that I have properly described the purpose of this check.
>
> The way the code is written, it looks like a bad pattern of growth and
> shrinkage of the array (namely, just under the resize limit) could
> cause gc_boundary() to be called over and over again with (most of)
> the same data.  I hope that the author had some reason to believe that
> such a pattern is unlikely.

That is about comparing with "alloc", not having high and low
watermarks, right?

I do not see "alloc <= nr" is mysterious at all; it is merely being
defensive.

>
>  revision.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/revision.c b/revision.c
> index 2e0992b..19c59f4 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -2573,8 +2573,10 @@ static struct commit *get_revision_internal(struct rev_info *revs)
>  		if (p->flags & (CHILD_SHOWN | SHOWN))
>  			continue;
>  		p->flags |= CHILD_SHOWN;
> -		if (revs->boundary_commits.alloc <= revs->boundary_commits.nr)
> +		if (revs->boundary_commits.nr == revs->boundary_commits.alloc) {
> +			/* Try to make space and thereby avoid a realloc(): */
>  			gc_boundary(&revs->boundary_commits);
> +		}
>  		add_object_array(p, NULL, &revs->boundary_commits);
>  	}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 09/17] gc_boundary(): move the check "alloc <= nr" to caller
  2013-05-19 20:27 ` [PATCH 09/17] gc_boundary(): move the check "alloc <= nr" to caller Michael Haggerty
@ 2013-05-21 17:49   ` Junio C Hamano
  2013-05-23  7:09     ` Michael Haggerty
  0 siblings, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2013-05-21 17:49 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, Johan Herland, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> There is no logical reason for this test to be here.  At the caller we
> might be able to figure out its meaning.
>
> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
> ---

I do not think this change is justified, *unless* the caller later
in the series gains a better heuristics than what can be done with
information in the object_array (namely, alloc and nr) to decide
when to trigger gc.

And I was hoping to see such a cleverness added to the caller, but I
do not think I saw any.

I would have to say gc_boundary() knows better when it needs to gc
with the code at this point in the series, and that is true also in
the final code after all the patches in this series.

If we keep the "when to gc" logic inside "gc", in 11/17 this caller
can no longer call directly to object_array_filter().  It should
call gc_boundary(), but I see it as a merit, not a downside.  The
"gc function can later be taught the high/low watermark logic you
alluded to in 10/17, and the growth/shrinkage characteristic you
would take advantage of while doing "gc" is specific to this
codepath.  And the logic still does not have to have access to
anything only the caller has access to; "gc" can work on what can be
read from the object_array->{alloc,nr} that is given to it.

  revision.c | 27 ++++++++++++---------------

>  1 file changed, 12 insertions(+), 15 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index 8ac88d6..2e0992b 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -2437,23 +2437,19 @@ static struct commit *get_revision_1(struct rev_info *revs)
>  
>  static void gc_boundary(struct object_array *array)
>  {
> -	unsigned nr = array->nr;
> -	unsigned alloc = array->alloc;
> +	unsigned nr = array->nr, i, j;
>  	struct object_array_entry *objects = array->objects;
>  
> -	if (alloc <= nr) {
> -		unsigned i, j;
> -		for (i = j = 0; i < nr; i++) {
> -			if (objects[i].item->flags & SHOWN)
> -				continue;
> -			if (i != j)
> -				objects[j] = objects[i];
> -			j++;
> -		}
> -		for (i = j; i < nr; i++)
> -			objects[i].item = NULL;
> -		array->nr = j;
> +	for (i = j = 0; i < nr; i++) {
> +		if (objects[i].item->flags & SHOWN)
> +			continue;
> +		if (i != j)
> +			objects[j] = objects[i];
> +		j++;
>  	}
> +	for (i = j; i < nr; i++)
> +		objects[i].item = NULL;
> +	array->nr = j;
>  }
>  
>  static void create_boundary_commit_list(struct rev_info *revs)
> @@ -2577,7 +2573,8 @@ static struct commit *get_revision_internal(struct rev_info *revs)
>  		if (p->flags & (CHILD_SHOWN | SHOWN))
>  			continue;
>  		p->flags |= CHILD_SHOWN;
> -		gc_boundary(&revs->boundary_commits);
> +		if (revs->boundary_commits.alloc <= revs->boundary_commits.nr)
> +			gc_boundary(&revs->boundary_commits);
>  		add_object_array(p, NULL, &revs->boundary_commits);
>  	}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/17] Remove assumptions about refname lifetimes
  2013-05-20 17:03     ` Johan Herland
@ 2013-05-21 18:39       ` Junio C Hamano
  0 siblings, 0 replies; 42+ messages in thread
From: Junio C Hamano @ 2013-05-21 18:39 UTC (permalink / raw)
  To: Johan Herland; +Cc: Michael Haggerty, Jeff King, git

Johan Herland <johan@herland.net> writes:

> On Mon, May 20, 2013 at 6:37 PM, Junio C Hamano <gitster@pobox.com> wrote:
> ...
>> That "backing out" can be made more intelligently than just dying
>> with "compare and swap failed--please retry" message, e.g. you at
>> that point notice what you started with, what the other party did
>> while you were working on (i.e. updating refs/tags/), and three-way
>> merge the refs tree, and in cases where "all refs recorded as loose
>> refs" scheme wouldn't have resulted in problematic conflict, such a
>> three-way merge would resolve trivially (you updated refs/heads/ and
>> the update by the other process to refs/tags/ would not conflict
>> with what you did).  But the same three-way merge scheme can be
>> employed with the current flat single packed-refs scheme, can't it?
>
> Yes. (albeit without reusing the machinery we already have for doing
> three-way merges)

I do not think that is relevant, as you do *not* want to reuse the
usual three-way xdiff merge machinery that leaves "<=>" markers for
this usage anyway.

But that is not the primary reason why I am beating this dead horse;
it is the following.

Your version of packed-refs file, and his version of packed-refs
file, and the original you started with, are all sorted in a known
order.  You just look at lines from all and merge them line by line.

I think that logic should become a new blob-level merge driver that
can be reused for normal in-working-tree objects that are sorted.
For that kind of "sorted wordlist" file, conflicts that may
artificially arise only because two sides updated adjacent lines in
the list and you used the normal xdiff merge machinery is unwanted,
and users would benefit by having such a specialized merge driver
outside the context of merging two proposed list of refs.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 08/17] revision: split some overly-long lines
  2013-05-21 17:34   ` Junio C Hamano
@ 2013-05-23  6:27     ` Michael Haggerty
  2013-05-23 17:08       ` Junio C Hamano
  0 siblings, 1 reply; 42+ messages in thread
From: Michael Haggerty @ 2013-05-23  6:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Johan Herland, git

On 05/21/2013 07:34 PM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
>> ---
>>  revision.c | 20 ++++++++++++++------
>>  revision.h | 32 +++++++++++++++++++++-----------
>>  2 files changed, 35 insertions(+), 17 deletions(-)
> 
> Looks obviously good for *.c file, but I am on the fence for *.h
> one, as the reason we kept these long single lines in *.h files was
> to help those who want to grep in *.h files to let them view the
> full function signature.  It probably is OK to tell them to use
> "git grep -A$n" instead, though.

My goal with this patch was not to set a new policy but rather just to
make the code conform a little better to the existing policy as
described in CodingGuidelines.  *If* it is preferred that header files
list all parameters on a single line, then by all means adjust the
CodingGuidelines and I will just as happily make header files conform to
*that* policy when I touch them :-)

(That being said, my personal preference is to apply the 80-character
limit for header files too.)

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/17] get_revision_internal(): make check less mysterious
  2013-05-21 17:38   ` Junio C Hamano
@ 2013-05-23  6:39     ` Michael Haggerty
  0 siblings, 0 replies; 42+ messages in thread
From: Michael Haggerty @ 2013-05-23  6:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Johan Herland, git

On 05/21/2013 07:38 PM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>> The condition under which gc_boundary() is called was previously
>>
>>     if (alloc <= nr)
>>
>> .  But by construction, nr can never exceed alloc, so the check looks
>> unnecessarily mysterious.  In fact, the purpose of the check is to try
>> to avoid a realloc() call by shrinking the array if possible if it is
>> at its allocation limit when a new element is about to be added.  So
>> change the check to
>>
>>     if (nr == alloc)
>>
>> and add a comment to explain what's going on.
>>
>> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
>> ---
>> Please check that I have properly described the purpose of this check.
>>
>> The way the code is written, it looks like a bad pattern of growth and
>> shrinkage of the array (namely, just under the resize limit) could
>> cause gc_boundary() to be called over and over again with (most of)
>> the same data.  I hope that the author had some reason to believe that
>> such a pattern is unlikely.
> 
> That is about comparing with "alloc", not having high and low
> watermarks, right?
> 
> I do not see "alloc <= nr" is mysterious at all; it is merely being
> defensive.

If nr would ever exceed alloc, then the code would be broken and would
probably have already performed an illegal memory access.  Pretending to
support nr > alloc here is not defensive but misleading, because by that
time the ship is going down anyway.

On a more practical level, when I saw this code I thought to myself
"that's strange, I'd better look into it because it suggests that I
don't understand the meaning of nr and alloc".  I think that the
suggested change will help prevent the next reader from repeating the
same pointless investigation.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 09/17] gc_boundary(): move the check "alloc <= nr" to caller
  2013-05-21 17:49   ` Junio C Hamano
@ 2013-05-23  7:09     ` Michael Haggerty
  2013-05-23 18:02       ` Junio C Hamano
  0 siblings, 1 reply; 42+ messages in thread
From: Michael Haggerty @ 2013-05-23  7:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Johan Herland, git

On 05/21/2013 07:49 PM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>> There is no logical reason for this test to be here.  At the caller we
>> might be able to figure out its meaning.
>>
>> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
>> ---
> 
> I do not think this change is justified, *unless* the caller later
> in the series gains a better heuristics than what can be done with
> information in the object_array (namely, alloc and nr) to decide
> when to trigger gc.
> 
> And I was hoping to see such a cleverness added to the caller, but I
> do not think I saw any.

Correct.

> I would have to say gc_boundary() knows better when it needs to gc
> with the code at this point in the series, and that is true also in
> the final code after all the patches in this series.
> 
> If we keep the "when to gc" logic inside "gc", in 11/17 this caller
> can no longer call directly to object_array_filter().  It should
> call gc_boundary(), but I see it as a merit, not a downside.  The
> "gc function can later be taught the high/low watermark logic you
> alluded to in 10/17, and the growth/shrinkage characteristic you
> would take advantage of while doing "gc" is specific to this
> codepath.  And the logic still does not have to have access to
> anything only the caller has access to; "gc" can work on what can be
> read from the object_array->{alloc,nr} that is given to it.

I don't feel strongly about this patch and if you prefer to have it
dropped I will do so.  But let me explain my reasoning:

1. The function name gc_boundary() suggests that it will do a garbage
collection unconditionally.  In fact, it might or might not depending on
this test.  At the caller, this made it look like a gc was happening
each time through the loop, which made me nervous about the performance.
 The new version makes it clear at the caller that the gc is only
happening occasionally.

2. Even assuming that gc_boundary() were renamed to maybe_gc_boundary(),
the function has hopelessly little information on which to base its
decision whether or not to gc, and the choice of policy can only be
justified based on some implicit knowledge about how the array is grown
and shrunk.  But the growing/shrinking is done at the layer of the
caller, and therefore the choice of gc policy also belongs at the layer
of the caller.

3. As you point out, if the gc policy is ever to be made more
intelligent, then gc_boundary() is unlikely to have enough information
to implement the new policy (e.g., it would have no place to record
low/high water marks).  Separating concerns at the correct level would
make a change like that easier.

At the moment I am not interested in improving the gc policy of this
code.  The only reason that I am mucking about here is to change it to
use object_array_filter(), which is needed to centralize where
object_array_entries are created and destroyed so that the "name" memory
can be copied and freed consistently.  That can be done with or without
patches 09 and 10.  Please let me know what you prefer.

Michael

>   revision.c | 27 ++++++++++++---------------
> 
>>  1 file changed, 12 insertions(+), 15 deletions(-)
>>
>> diff --git a/revision.c b/revision.c
>> index 8ac88d6..2e0992b 100644
>> --- a/revision.c
>> +++ b/revision.c
>> @@ -2437,23 +2437,19 @@ static struct commit *get_revision_1(struct rev_info *revs)
>>  
>>  static void gc_boundary(struct object_array *array)
>>  {
>> -	unsigned nr = array->nr;
>> -	unsigned alloc = array->alloc;
>> +	unsigned nr = array->nr, i, j;
>>  	struct object_array_entry *objects = array->objects;
>>  
>> -	if (alloc <= nr) {
>> -		unsigned i, j;
>> -		for (i = j = 0; i < nr; i++) {
>> -			if (objects[i].item->flags & SHOWN)
>> -				continue;
>> -			if (i != j)
>> -				objects[j] = objects[i];
>> -			j++;
>> -		}
>> -		for (i = j; i < nr; i++)
>> -			objects[i].item = NULL;
>> -		array->nr = j;
>> +	for (i = j = 0; i < nr; i++) {
>> +		if (objects[i].item->flags & SHOWN)
>> +			continue;
>> +		if (i != j)
>> +			objects[j] = objects[i];
>> +		j++;
>>  	}
>> +	for (i = j; i < nr; i++)
>> +		objects[i].item = NULL;
>> +	array->nr = j;
>>  }
>>  
>>  static void create_boundary_commit_list(struct rev_info *revs)
>> @@ -2577,7 +2573,8 @@ static struct commit *get_revision_internal(struct rev_info *revs)
>>  		if (p->flags & (CHILD_SHOWN | SHOWN))
>>  			continue;
>>  		p->flags |= CHILD_SHOWN;
>> -		gc_boundary(&revs->boundary_commits);
>> +		if (revs->boundary_commits.alloc <= revs->boundary_commits.nr)
>> +			gc_boundary(&revs->boundary_commits);
>>  		add_object_array(p, NULL, &revs->boundary_commits);
>>  	}

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 04/17] builtin_diff_tree(): make it obvious that function wants two entries
  2013-05-21 17:27   ` Junio C Hamano
@ 2013-05-23  7:19     ` Michael Haggerty
  0 siblings, 0 replies; 42+ messages in thread
From: Michael Haggerty @ 2013-05-23  7:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Johan Herland, git

On 05/21/2013 07:27 PM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>> Instead of accepting an array and using exactly two elements from the
>> array, take two single (struct object_array_entry *) arguments.
>>
>> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
>> ---
>>  builtin/diff.c | 18 +++++++++---------
>>  1 file changed, 9 insertions(+), 9 deletions(-)
>>
>> diff --git a/builtin/diff.c b/builtin/diff.c
>> index 8c2af6c..ba68c6c 100644
>> --- a/builtin/diff.c
>> +++ b/builtin/diff.c
>> @@ -153,7 +153,8 @@ static int builtin_diff_index(struct rev_info *revs,
>>  
>>  static int builtin_diff_tree(struct rev_info *revs,
>>  			     int argc, const char **argv,
>> -			     struct object_array_entry *ent)
>> +			     struct object_array_entry *ent0,
>> +			     struct object_array_entry *ent1)
>>  {
>>  	const unsigned char *(sha1[2]);
>>  	int swap = 0;
>> @@ -161,13 +162,13 @@ static int builtin_diff_tree(struct rev_info *revs,
>>  	if (argc > 1)
>>  		usage(builtin_diff_usage);
>>  
>> -	/* We saw two trees, ent[0] and ent[1].
>> -	 * if ent[1] is uninteresting, they are swapped
>> +	/* We saw two trees, ent0 and ent1.
>> +	 * if ent1 is uninteresting, they are swapped
>>  	 */
> 
> "While you are touching this comment" is a perfect time to fix the
> existing style violation.

Yes.  Will fix in next version.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 05/17] cmd_diff(): use an object_array for holding trees
  2013-05-21 17:30   ` Junio C Hamano
@ 2013-05-23  7:21     ` Michael Haggerty
  0 siblings, 0 replies; 42+ messages in thread
From: Michael Haggerty @ 2013-05-23  7:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Johan Herland, git

On 05/21/2013 07:30 PM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>> Change cmd_diff() to use a (struct object_array) for holding the trees
>> that it accumulates, rather than rolling its own equivalent.
>>
> 
> A significant detail missing here is that this lifts the hardcoded
> 100 tree limit in combined diff but that does not matter in
> practice, I would suppose ;-).

I'll note it anyway in v2 of the patch series.

Thanks for all of your comments.  I will send a re-roll after I hear
back from you regarding patches 08, 09, and 10.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 08/17] revision: split some overly-long lines
  2013-05-23  6:27     ` Michael Haggerty
@ 2013-05-23 17:08       ` Junio C Hamano
  0 siblings, 0 replies; 42+ messages in thread
From: Junio C Hamano @ 2013-05-23 17:08 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, Johan Herland, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> On 05/21/2013 07:34 PM, Junio C Hamano wrote:
>> Michael Haggerty <mhagger@alum.mit.edu> writes:
>> 
>>> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
>>> ---
>>>  revision.c | 20 ++++++++++++++------
>>>  revision.h | 32 +++++++++++++++++++++-----------
>>>  2 files changed, 35 insertions(+), 17 deletions(-)
>> 
>> Looks obviously good for *.c file, but I am on the fence for *.h
>> one, as the reason we kept these long single lines in *.h files was
>> to help those who want to grep in *.h files to let them view the
>> full function signature.  It probably is OK to tell them to use
>> "git grep -A$n" instead, though.
>
> My goal with this patch was not to set a new policy but rather just to
> make the code conform a little better to the existing policy as
> described in CodingGuidelines.  *If* it is preferred that header files
> list all parameters on a single line, then by all means adjust the
> CodingGuidelines and I will just as happily make header files conform to
> *that* policy when I touch them :-)
>
> (That being said, my personal preference is to apply the 80-character
> limit for header files too.)

Yeah, that is why I said "I am on the fence but it probably is OK to
break" the unwritten but guessable rule.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 09/17] gc_boundary(): move the check "alloc <= nr" to caller
  2013-05-23  7:09     ` Michael Haggerty
@ 2013-05-23 18:02       ` Junio C Hamano
  0 siblings, 0 replies; 42+ messages in thread
From: Junio C Hamano @ 2013-05-23 18:02 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, Johan Herland, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> 1. The function name gc_boundary() suggests that it will do a garbage
> collection unconditionally.  In fact, it might or might not depending on
> this test.  At the caller, this made it look like a gc was happening
> each time through the loop, which made me nervous about the performance.
>  The new version makes it clear at the caller that the gc is only
> happening occasionally.

Perhaps.

> 2. Even assuming that gc_boundary() were renamed to maybe_gc_boundary(),
> the function has hopelessly little information on which to base its
> decision whether or not to gc, and the choice of policy can only be
> justified based on some implicit knowledge about how the array is grown
> and shrunk.  But the growing/shrinking is done at the layer of the
> caller, and therefore the choice of gc policy also belongs at the layer
> of the caller.
>
> 3. As you point out, if the gc policy is ever to be made more
> intelligent, then gc_boundary() is unlikely to have enough information
> to implement the new policy (e.g., it would have no place to record
> low/high water marks).  Separating concerns at the correct level would
> make a change like that easier.

These two depend on how you look at the API hierarchy.  You seem to
think that the ideal end result is

	get_revision_internal()
          have an open coded "when to gc" logic in this function
          call object_array_filter()

My suggestion was based on a different view, which is:

	get_revision_internal()
          call gc_boundary()

        gc_boundary()
          make decision on when and how to gc
          if decided to do so
            call object_array_fitler()

You can obviously rename gc_boundary() to auto_gc_boundary() if that
makes it easier to understand, but these two belong to the same
codepath that deals with the object array used specifically for
keeping track of boundary commits. I view "who has what information"
as secondary---if the decision to gc is not the primary thing
get_revision_internal() should be worrying about (and I do not think
it is), it would be a better code structure to have a helper specific
for doing so, i.e. gc_boundary(), and delegate that part of job to it.

Obviously, the caller needs to supply sufficient information to that
helper if the helper needs more than what it currently gets by
adding parameters to it, but that goes without saying.

> At the moment I am not interested in improving the gc policy of this
> code.

I am not either; the only effect we get from removing gc_boudnary()
and calling directly to object_array_filter() is to lose the above
abstraction and make it easy to let get_revision_internal() become
more cluttered when somebody else later decides to improve the gc
policy, it seems.

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2013-05-23 18:02 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-19 20:26 [PATCH 00/17] Remove assumptions about refname lifetimes Michael Haggerty
2013-05-19 20:26 ` [PATCH 01/17] describe: make own copy of refname Michael Haggerty
2013-05-19 20:26 ` [PATCH 02/17] fetch: make own copies of refnames Michael Haggerty
2013-05-19 20:26 ` [PATCH 03/17] add_rev_cmdline(): make a copy of the name argument Michael Haggerty
2013-05-19 20:26 ` [PATCH 04/17] builtin_diff_tree(): make it obvious that function wants two entries Michael Haggerty
2013-05-21 17:27   ` Junio C Hamano
2013-05-23  7:19     ` Michael Haggerty
2013-05-19 20:27 ` [PATCH 05/17] cmd_diff(): use an object_array for holding trees Michael Haggerty
2013-05-21 17:30   ` Junio C Hamano
2013-05-23  7:21     ` Michael Haggerty
2013-05-19 20:27 ` [PATCH 06/17] cmd_diff(): rename local variable "list" -> "entry" Michael Haggerty
2013-05-19 20:27 ` [PATCH 07/17] cmd_diff(): make it obvious which cases are exclusive of each other Michael Haggerty
2013-05-19 20:27 ` [PATCH 08/17] revision: split some overly-long lines Michael Haggerty
2013-05-21 17:34   ` Junio C Hamano
2013-05-23  6:27     ` Michael Haggerty
2013-05-23 17:08       ` Junio C Hamano
2013-05-19 20:27 ` [PATCH 09/17] gc_boundary(): move the check "alloc <= nr" to caller Michael Haggerty
2013-05-21 17:49   ` Junio C Hamano
2013-05-23  7:09     ` Michael Haggerty
2013-05-23 18:02       ` Junio C Hamano
2013-05-19 20:27 ` [PATCH 10/17] get_revision_internal(): make check less mysterious Michael Haggerty
2013-05-21 17:38   ` Junio C Hamano
2013-05-23  6:39     ` Michael Haggerty
2013-05-19 20:27 ` [PATCH 11/17] object_array: add function object_array_filter() Michael Haggerty
2013-05-19 20:27 ` [PATCH 12/17] object_array_remove_duplicates(): rewrite to reduce copying Michael Haggerty
2013-05-19 20:27 ` [PATCH 13/17] fsck: don't put a void*-shaped peg in a char*-shaped hole Michael Haggerty
2013-05-19 20:27 ` [PATCH 14/17] find_first_merges(): initialize merges variable using initializer Michael Haggerty
2013-05-19 20:27 ` [PATCH 15/17] find_first_merges(): remove unnecessary code Michael Haggerty
2013-05-19 20:27 ` [RFC 16/17] object_array_entry: copy name before storing in name field Michael Haggerty
2013-05-20 10:33   ` Johan Herland
2013-05-20 14:42     ` Michael Haggerty
2013-05-20 16:44       ` Jeff King
2013-05-20 21:34         ` Michael Haggerty
2013-05-19 20:27 ` [RFC 17/17] refs: document the lifetime of the refname passed to each_ref_fn Michael Haggerty
2013-05-20 10:28 ` [PATCH 00/17] Remove assumptions about refname lifetimes Johan Herland
2013-05-20 12:15   ` Michael Haggerty
2013-05-20 16:37   ` Junio C Hamano
2013-05-20 16:59     ` Jeff King
2013-05-20 17:08       ` Johan Herland
2013-05-20 18:03       ` Junio C Hamano
2013-05-20 17:03     ` Johan Herland
2013-05-21 18:39       ` Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.