All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] tree-diff: avoid alloca for large allocations
@ 2016-06-07 22:53 Jeff King
  2016-06-08  0:36 ` Junio C Hamano
  0 siblings, 1 reply; 3+ messages in thread
From: Jeff King @ 2016-06-07 22:53 UTC (permalink / raw)
  To: git; +Cc: Kirill Smelkov

Commit 72441af (tree-diff: rework diff_tree() to generate
diffs for multiparent cases as well, 2014-04-07) introduced
the use of alloca so that the common cases of commits with 1
or 2 parents would not be adversely affected by going
through the multi-parent code.

However, our xalloca is not ideal when the number of parents
grows very large:

  1. If the requested size is too large for our stack,
     alloca() has no way to tell us, and we simply segfault
     while trying to access the memory.

  2. It does not use our usual memory_limit_check() logic.

I measured, and alloca is indeed buying us a very small
speedup over xmalloc()/free(). So we'd want to keep
something like it.

This patch simply puts a conditional in place at each
callsite: we use alloca for common known-small numbers of
parents, and otherwise use the heap. We are technically
still vulnerable to (1), but no more so than if we simply
put a few dozen bytes on the stack, which we must do all the
time anyway. And likewise, we technically miss a memory
limit check if it is tiny, but such a limit is pointless.

An alternative to this would be implement something like:

  struct tree *tp, tp_fallback[2];
  if (nparent <= ARRAY_SIZE(tp_fallback))
          tp = tp_fallback;
  else
	  ALLOC_ARRAY(tp, nparent);
  ...
  if (tp != tp_fallback)
	  free(tp);

That would let us drop our xalloca() portability code
entirely. But in my measurements, this seemed to perform
slightly worse than the xalloca solution.

Note in the example above, and in the patch below, I've used
ALLOC_ARRAY() to replace the manual xmalloc(nr * sizeof(*x)).
Besides being shorter, this has the bonus that one cannot
accidentally overflow a size_t during that computation.

Signed-off-by: Jeff King <peff@peff.net>
---
And yes, I actually saw this in real life. The commit in question had a
million parents, which is obviously silly. I never let it run to
completion, as presumably there's some quadratic stuff in there. So we
could also have a general max-parent limit. I'm just not sure what it
would be; any limit we set would be arbitrary. Most of the rest of git
is happy to chug along using whatever RAM and CPU you are comfortable to
give it, and if you want to wait 5 minutes for your answer, go right
ahead.

 tree-diff.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/tree-diff.c b/tree-diff.c
index ff4e0d3..ebf40f4 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -14,6 +14,16 @@
  */
 #define S_IFXMIN_NEQ	S_DIFFTREE_IFXMIN_NEQ
 
+#define FAST_ARRAY_ALLOC(x, nr) do { \
+	if ((nr) <= 2) \
+		(x) = xalloca((nr) * sizeof(*(x))); \
+	else \
+		ALLOC_ARRAY((x), nr); \
+} while(0)
+#define FAST_ARRAY_FREE(x, nr) do { \
+	if ((nr) > 2) \
+		free((x)); \
+} while(0)
 
 static struct combine_diff_path *ll_diff_tree_paths(
 	struct combine_diff_path *p, const unsigned char *sha1,
@@ -265,7 +275,7 @@ static struct combine_diff_path *emit_path(struct combine_diff_path *p,
 	if (recurse) {
 		const unsigned char **parents_sha1;
 
-		parents_sha1 = xalloca(nparent * sizeof(parents_sha1[0]));
+		FAST_ARRAY_ALLOC(parents_sha1, nparent);
 		for (i = 0; i < nparent; ++i) {
 			/* same rule as in emitthis */
 			int tpi_valid = tp && !(tp[i].entry.mode & S_IFXMIN_NEQ);
@@ -277,7 +287,7 @@ static struct combine_diff_path *emit_path(struct combine_diff_path *p,
 		strbuf_add(base, path, pathlen);
 		strbuf_addch(base, '/');
 		p = ll_diff_tree_paths(p, sha1, parents_sha1, nparent, base, opt);
-		xalloca_free(parents_sha1);
+		FAST_ARRAY_FREE(parents_sha1, nparent);
 	}
 
 	strbuf_setlen(base, old_baselen);
@@ -402,8 +412,8 @@ static struct combine_diff_path *ll_diff_tree_paths(
 	void *ttree, **tptree;
 	int i;
 
-	tp     = xalloca(nparent * sizeof(tp[0]));
-	tptree = xalloca(nparent * sizeof(tptree[0]));
+	FAST_ARRAY_ALLOC(tp, nparent);
+	FAST_ARRAY_ALLOC(tptree, nparent);
 
 	/*
 	 * load parents first, as they are probably already cached.
@@ -531,8 +541,8 @@ static struct combine_diff_path *ll_diff_tree_paths(
 	free(ttree);
 	for (i = nparent-1; i >= 0; i--)
 		free(tptree[i]);
-	xalloca_free(tptree);
-	xalloca_free(tp);
+	FAST_ARRAY_FREE(tptree, nparent);
+	FAST_ARRAY_FREE(tp, nparent);
 
 	return p;
 }
-- 
2.9.0.rc1.159.g81173282e

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] tree-diff: avoid alloca for large allocations
  2016-06-07 22:53 [PATCH] tree-diff: avoid alloca for large allocations Jeff King
@ 2016-06-08  0:36 ` Junio C Hamano
  2016-06-08  1:36   ` Jeff King
  0 siblings, 1 reply; 3+ messages in thread
From: Junio C Hamano @ 2016-06-08  0:36 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Kirill Smelkov

Jeff King <peff@peff.net> writes:

> An alternative to this would be implement something like:
>
>   struct tree *tp, tp_fallback[2];
>   if (nparent <= ARRAY_SIZE(tp_fallback))
>           tp = tp_fallback;
>   else
> 	  ALLOC_ARRAY(tp, nparent);
>   ...
>   if (tp != tp_fallback)
> 	  free(tp);
>
> That would let us drop our xalloca() portability code
> entirely. But in my measurements, this seemed to perform
> slightly worse than the xalloca solution.

It indeed is curious why this "obvious" alternative performed
worse.

> +#define FAST_ARRAY_ALLOC(x, nr) do { \
> +	if ((nr) <= 2) \
> +		(x) = xalloca((nr) * sizeof(*(x))); \
> +	else \
> +		ALLOC_ARRAY((x), nr); \
> +} while(0)
> +#define FAST_ARRAY_FREE(x, nr) do { \
> +	if ((nr) > 2) \
> +		free((x)); \
> +} while(0)

An obvious and clean implementation of the idea.

The only slight worry I have is that nr must stay constant until the
time the caller calls FREE(), but for the only three callsite pairs
it is clear nparent will stay constant and this is local to the
file.

Thanks.

>  static struct combine_diff_path *ll_diff_tree_paths(
>  	struct combine_diff_path *p, const unsigned char *sha1,
> @@ -265,7 +275,7 @@ static struct combine_diff_path *emit_path(struct combine_diff_path *p,
>  	if (recurse) {
>  		const unsigned char **parents_sha1;
>  
> -		parents_sha1 = xalloca(nparent * sizeof(parents_sha1[0]));
> +		FAST_ARRAY_ALLOC(parents_sha1, nparent);
>  		for (i = 0; i < nparent; ++i) {
>  			/* same rule as in emitthis */
>  			int tpi_valid = tp && !(tp[i].entry.mode & S_IFXMIN_NEQ);
> @@ -277,7 +287,7 @@ static struct combine_diff_path *emit_path(struct combine_diff_path *p,
>  		strbuf_add(base, path, pathlen);
>  		strbuf_addch(base, '/');
>  		p = ll_diff_tree_paths(p, sha1, parents_sha1, nparent, base, opt);
> -		xalloca_free(parents_sha1);
> +		FAST_ARRAY_FREE(parents_sha1, nparent);
>  	}
>  
>  	strbuf_setlen(base, old_baselen);
> @@ -402,8 +412,8 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  	void *ttree, **tptree;
>  	int i;
>  
> -	tp     = xalloca(nparent * sizeof(tp[0]));
> -	tptree = xalloca(nparent * sizeof(tptree[0]));
> +	FAST_ARRAY_ALLOC(tp, nparent);
> +	FAST_ARRAY_ALLOC(tptree, nparent);
>  
>  	/*
>  	 * load parents first, as they are probably already cached.
> @@ -531,8 +541,8 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  	free(ttree);
>  	for (i = nparent-1; i >= 0; i--)
>  		free(tptree[i]);
> -	xalloca_free(tptree);
> -	xalloca_free(tp);
> +	FAST_ARRAY_FREE(tptree, nparent);
> +	FAST_ARRAY_FREE(tp, nparent);
>  
>  	return p;
>  }

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] tree-diff: avoid alloca for large allocations
  2016-06-08  0:36 ` Junio C Hamano
@ 2016-06-08  1:36   ` Jeff King
  0 siblings, 0 replies; 3+ messages in thread
From: Jeff King @ 2016-06-08  1:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Kirill Smelkov

On Tue, Jun 07, 2016 at 05:36:59PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > An alternative to this would be implement something like:
> >
> >   struct tree *tp, tp_fallback[2];
> >   if (nparent <= ARRAY_SIZE(tp_fallback))
> >           tp = tp_fallback;
> >   else
> > 	  ALLOC_ARRAY(tp, nparent);
> >   ...
> >   if (tp != tp_fallback)
> > 	  free(tp);
> >
> > That would let us drop our xalloca() portability code
> > entirely. But in my measurements, this seemed to perform
> > slightly worse than the xalloca solution.
> 
> It indeed is curious why this "obvious" alternative performed
> worse.

Yeah. I'd be happy if somebody wanted to prove me wrong here. It's
entirely possible I just did something stupid. I had wrapped up the
above in the FAST_ARRAY_ALLOC macro. It would waste the stack space when
we _did_ have to call malloc, but that should almost never happen in my
benchmark. It also allocates an extra slot on the stack for non-merge
commits. So I guess the wasted 24 bytes or whatever could have in
impact. It's also possible that the extra variable simply tickles the
optimizer in some way; I didn't look at the generated asm.

FWIW, here are the timings I had come up with for running "git log --raw
--no-abbrev --no-renames" on linux.git (the same benchmark from the
commit that originally added the alloca). The "time" output at the end
is best-of-five, with the "attempt" lines showing the wall-clock time
for each:

  [original, with xalloca]
  Attempt 1: 30.669
  Attempt 2: 30.782
  Attempt 3: 31.807
  Attempt 4: 31.152
  Attempt 5: 30.243

  real    0m30.243s
  user    0m30.112s
  sys     0m0.128s


  [xmalloc/free instead of xalloca]
  Attempt 1: 31.306
  Attempt 2: 31.814
  Attempt 3: 31.138
  Attempt 4: 31.787
  Attempt 5: 32.23

  real    0m31.138s
  user    0m30.976s
  sys     0m0.160s


  [local var with fallback to xmalloc]
  Attempt 1: 30.926
  Attempt 2: 31.466
  Attempt 3: 31.865
  Attempt 4: 31.345
  Attempt 5: 33.159

  real    0m30.926s
  user    0m30.804s
  sys     0m0.120s


So the improvement here over even a naive malloc/free is _really_ small.
You'll note that the best-of-five for xmalloc is actually smaller than
the worst-case with xalloca. But the mean is still 2% better. The mean
for the "fallback" case aren't really any better (which makes me wonder
if I was somehow accidentally calling malloc each time).

So I dunno. For 2%, I was tempted to simply say "screw it, let's just
forget this micro-optimization and call xmalloc". This patch is the
conservative choice in that it has the same performance profile as the
original. Or so I thought. I didn't record the old timings, but they
looked similar to the original. I just recreated them while writing this
email and got:

  [xalloca with fallback to xmalloc]
  Attempt 1: 31.356
  Attempt 2: 31.725
  Attempt 3: 31.454
  Attempt 4: 30.898
  Attempt 5: 31.937

  real    0m30.898s
  user    0m30.752s
  sys     0m0.144s

which really isn't much better than the local-var case. I wonder if just
having the conditional stack/heap is what kills us (either itself, or
because it affects the optimizer).

I dunno.


> > +#define FAST_ARRAY_ALLOC(x, nr) do { \
> > +	if ((nr) <= 2) \
> > +		(x) = xalloca((nr) * sizeof(*(x))); \
> > +	else \
> > +		ALLOC_ARRAY((x), nr); \
> > +} while(0)
> > +#define FAST_ARRAY_FREE(x, nr) do { \
> > +	if ((nr) > 2) \
> > +		free((x)); \
> > +} while(0)
> 
> An obvious and clean implementation of the idea.
> 
> The only slight worry I have is that nr must stay constant until the
> time the caller calls FREE(), but for the only three callsite pairs
> it is clear nparent will stay constant and this is local to the
> file.

Yep, I had the same worry. I think it's OK because the damage is limited
to tree-diff.c. I'm not sure I would promote this macro for general use.

-Peff

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2016-06-08  1:37 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-07 22:53 [PATCH] tree-diff: avoid alloca for large allocations Jeff King
2016-06-08  0:36 ` Junio C Hamano
2016-06-08  1:36   ` Jeff King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.