All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] diff_filespec cleanups and optimizations
@ 2014-01-17  1:18 Jeff King
  2014-01-17  1:19 ` [PATCH 1/5] diff_filespec: reorder dirty_submodule macro definitions Jeff King
                   ` (5 more replies)
  0 siblings, 6 replies; 10+ messages in thread
From: Jeff King @ 2014-01-17  1:18 UTC (permalink / raw)
  To: git

I recently came across a repository with a commit containing 100 million
paths in its tree. Cleverly, the whole repo fits into a 1.5K packfile
(can you guess how it was done?). Not so cleverly, running "diff-tree
--root" on that commit uses a large amount of memory. :)

I do not think it is worth optimizing for such a pathological
repository. But I was curious how much it would want (it OOM'd on my
64-bit 16G machine). The answer is roughly:

   100,000,000 * (
      8 bytes per pointer to diff_filepair in the diff_queue
    + 32 bytes per diff_filepair struct
    +  2 * (
         96 bytes per diff_filespec struct
       + 12 bytes per filename (in this case)
     )
  )

which is about 25G. Plus malloc overhead. So obviously this example is
unreasonable. A more reasonable large case is something like WebKit at
~150K files, doing a diff against the empty tree. That's only 37M.

But while looking at it, I noticed a bunch of cleanups for
diff_filespec.  With the patches below, sizeof(struct diff_filespec) on
my 64-bit machine goes from 96 bytes down to 80. Compiling with "-m32"
goes from 64 bytes down to 52.

The first few patches have cleanup value aside from the struct size
improvement. The last two are pure optimization. I doubt the
optimization is noticeable for any real-life cases, so I don't mind if
they get dropped. But they're quite trivial and obvious.

  [1/5]: diff_filespec: reorder dirty_submodule macro definitions
  [2/5]: diff_filespec: drop funcname_pattern_ident field
  [3/5]: diff_filespec: drop xfrm_flags field
  [4/5]: diff_filespec: reorder is_binary field
  [5/5]: diff_filespec: use only 2 bits for is_binary flag

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/5] diff_filespec: reorder dirty_submodule macro definitions
  2014-01-17  1:18 [PATCH 0/5] diff_filespec cleanups and optimizations Jeff King
@ 2014-01-17  1:19 ` Jeff King
  2014-01-17 18:46   ` Junio C Hamano
  2014-01-17  1:20 ` [PATCH 2/5] diff_filespec: drop funcname_pattern_ident field Jeff King
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2014-01-17  1:19 UTC (permalink / raw)
  To: git

diff_filespec has a 2-bit "dirty_submodule" field and
defines two flags as macros. Originally these were right
next to each other, but a new field was accidentally added
in between in commit 4682d85. This patch puts the field and
its flags back together.

Using an enum like:

  enum {
	  DIRTY_SUBMODULE_UNTRACKED = 1,
	  DIRTY_SUBMODULE_MODIFIED = 2
  } dirty_submodule;

would be more obvious, but it bloats the structure. Limiting
the enum size like:

  } dirty_submodule : 2;

might work, but it is not portable.

Signed-off-by: Jeff King <peff@peff.net>
---
 diffcore.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/diffcore.h b/diffcore.h
index 1c16c85..f822f9e 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -43,9 +43,9 @@ struct diff_filespec {
 	unsigned should_free : 1; /* data should be free()'ed */
 	unsigned should_munmap : 1; /* data should be munmap()'ed */
 	unsigned dirty_submodule : 2;  /* For submodules: its work tree is dirty */
-	unsigned is_stdin : 1;
 #define DIRTY_SUBMODULE_UNTRACKED 1
 #define DIRTY_SUBMODULE_MODIFIED  2
+	unsigned is_stdin : 1;
 	unsigned has_more_entries : 1; /* only appear in combined diff */
 	struct userdiff_driver *driver;
 	/* data should be considered "binary"; -1 means "don't know yet" */
-- 
1.8.5.2.500.g8060133

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/5] diff_filespec: drop funcname_pattern_ident field
  2014-01-17  1:18 [PATCH 0/5] diff_filespec cleanups and optimizations Jeff King
  2014-01-17  1:19 ` [PATCH 1/5] diff_filespec: reorder dirty_submodule macro definitions Jeff King
@ 2014-01-17  1:20 ` Jeff King
  2014-01-17  1:21 ` [PATCH 3/5] diff_filespec: drop xfrm_flags field Jeff King
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2014-01-17  1:20 UTC (permalink / raw)
  To: git

This struct field was obsoleted by be58e70 (diff: unify
external diff and funcname parsing code, 2008-10-05), but we
forgot to remove it.

Signed-off-by: Jeff King <peff@peff.net>
---
8 bytes of our size savings come here...

 diffcore.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/diffcore.h b/diffcore.h
index f822f9e..92e4d48 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -29,7 +29,6 @@ struct diff_filespec {
 	char *path;
 	void *data;
 	void *cnt_data;
-	const char *funcname_pattern_ident;
 	unsigned long size;
 	int count;               /* Reference count */
 	int xfrm_flags;		 /* for use by the xfrm */
-- 
1.8.5.2.500.g8060133

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/5] diff_filespec: drop xfrm_flags field
  2014-01-17  1:18 [PATCH 0/5] diff_filespec cleanups and optimizations Jeff King
  2014-01-17  1:19 ` [PATCH 1/5] diff_filespec: reorder dirty_submodule macro definitions Jeff King
  2014-01-17  1:20 ` [PATCH 2/5] diff_filespec: drop funcname_pattern_ident field Jeff King
@ 2014-01-17  1:21 ` Jeff King
  2014-01-17  1:22 ` [PATCH 4/5] diff_filespec: reorder is_binary field Jeff King
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2014-01-17  1:21 UTC (permalink / raw)
  To: git

The only mention of this field in the code is by some
debugging code which prints it out (and it will always be
zero, since we never touch it otherwise). It was obsoleted
very early on by 25d5ea4 ([PATCH] Redo rename/copy detection
logic., 2005-05-24).

Signed-off-by: Jeff King <peff@peff.net>
---
No savings here on 64-bit, since this ended up going to padding, but it
is what makes the next patch work.

 diff.c     | 4 ++--
 diffcore.h | 1 -
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/diff.c b/diff.c
index 6b4cd0e..8e4a6a9 100644
--- a/diff.c
+++ b/diff.c
@@ -4139,9 +4139,9 @@ void diff_debug_filespec(struct diff_filespec *s, int x, const char *one)
 		DIFF_FILE_VALID(s) ? "valid" : "invalid",
 		s->mode,
 		s->sha1_valid ? sha1_to_hex(s->sha1) : "");
-	fprintf(stderr, "queue[%d] %s size %lu flags %d\n",
+	fprintf(stderr, "queue[%d] %s size %lu\n",
 		x, one ? one : "",
-		s->size, s->xfrm_flags);
+		s->size);
 }
 
 void diff_debug_filepair(const struct diff_filepair *p, int i)
diff --git a/diffcore.h b/diffcore.h
index 92e4d48..22993e1 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -31,7 +31,6 @@ struct diff_filespec {
 	void *cnt_data;
 	unsigned long size;
 	int count;               /* Reference count */
-	int xfrm_flags;		 /* for use by the xfrm */
 	int rename_used;         /* Count of rename users */
 	unsigned short mode;	 /* file mode */
 	unsigned sha1_valid : 1; /* if true, use sha1 and trust mode;
-- 
1.8.5.2.500.g8060133

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 4/5] diff_filespec: reorder is_binary field
  2014-01-17  1:18 [PATCH 0/5] diff_filespec cleanups and optimizations Jeff King
                   ` (2 preceding siblings ...)
  2014-01-17  1:21 ` [PATCH 3/5] diff_filespec: drop xfrm_flags field Jeff King
@ 2014-01-17  1:22 ` Jeff King
  2014-01-17  1:25 ` [PATCH 5/5] diff_filespec: use only 2 bits for is_binary flag Jeff King
  2014-01-17 18:49 ` [PATCH 0/5] diff_filespec cleanups and optimizations Junio C Hamano
  5 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2014-01-17  1:22 UTC (permalink / raw)
  To: git

The middle of the diff_filespec struct contains a mixture of
ints, shorts, and bit-fields, followed by a pointer. On an
x86-64 system with an LP64 or LLP64 data model (i.e., most
of them), the integers and flags end up being padded out by
41 bits to put the pointer at an 8-byte boundary.

After the pointer, we have the "int is_binary" field, which
is only 32 bits. We end up wasting another 32 bits to pad
the struct size up to a multiple of 64 bits.

We can move the is_binary field before the pointer, which
lets the compiler store it where we used to have padding.
This shrinks the top padding to only 9 bits (from the
bit-fields), and eliminates the bottom padding entirely,
dropping the struct size from 88 to 80 bytes.

On a 32-bit system, there is no benefit, but nor should
there be any harm (we only need 4-byte alignment there, so
we were already using only 9 bits of padding).

Signed-off-by: Jeff King <peff@peff.net>
---

 diffcore.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/diffcore.h b/diffcore.h
index 22993e1..d911bf0 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -45,9 +45,9 @@ struct diff_filespec {
 #define DIRTY_SUBMODULE_MODIFIED  2
 	unsigned is_stdin : 1;
 	unsigned has_more_entries : 1; /* only appear in combined diff */
-	struct userdiff_driver *driver;
 	/* data should be considered "binary"; -1 means "don't know yet" */
 	int is_binary;
+	struct userdiff_driver *driver;
 };
 
 extern struct diff_filespec *alloc_filespec(const char *);
-- 
1.8.5.2.500.g8060133

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 5/5] diff_filespec: use only 2 bits for is_binary flag
  2014-01-17  1:18 [PATCH 0/5] diff_filespec cleanups and optimizations Jeff King
                   ` (3 preceding siblings ...)
  2014-01-17  1:22 ` [PATCH 4/5] diff_filespec: reorder is_binary field Jeff King
@ 2014-01-17  1:25 ` Jeff King
  2014-01-17 18:49 ` [PATCH 0/5] diff_filespec cleanups and optimizations Junio C Hamano
  5 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2014-01-17  1:25 UTC (permalink / raw)
  To: git

The is_binary flag needs only three values: -1, 0, and 1.
However, we use a whole 32-bit int for it on most systems
(both 32- and 64- bit).

Instead, we can mark it to use only 2 bits. On 32-bit
systems, this lets it end up as part of the bitfield above
(saving 4 bytes). On 64-bit systems, we don't see any change
(because the savings end up as padding), but it does leave
room for another "free" 32-bit value to be added later.

Signed-off-by: Jeff King <peff@peff.net>
---
I don't typically use bit-sized integers like this for anything but
unsigned integers to be used as flags. My understanding is that using
signed integers is explicitly permitted by the standard. I don't know if
we're guaranteed a 2's-complement representation, but I can't imagine an
implementation providing any range besides -2..1, which is what we need.

 diffcore.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/diffcore.h b/diffcore.h
index d911bf0..79de8cf 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -46,7 +46,7 @@ struct diff_filespec {
 	unsigned is_stdin : 1;
 	unsigned has_more_entries : 1; /* only appear in combined diff */
 	/* data should be considered "binary"; -1 means "don't know yet" */
-	int is_binary;
+	int is_binary : 2;
 	struct userdiff_driver *driver;
 };
 
-- 
1.8.5.2.500.g8060133

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/5] diff_filespec: reorder dirty_submodule macro definitions
  2014-01-17  1:19 ` [PATCH 1/5] diff_filespec: reorder dirty_submodule macro definitions Jeff King
@ 2014-01-17 18:46   ` Junio C Hamano
  2014-01-17 19:47     ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2014-01-17 18:46 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> diff_filespec has a 2-bit "dirty_submodule" field and
> defines two flags as macros. Originally these were right
> next to each other, but a new field was accidentally added
> in between in commit 4682d85.

Interesting.

 - 4682d852 (diff-index.c: "git diff" has no need to read blob from
   the standard input, 2012-06-27) wants to use this rule: all the
   bitfield definitions first, and then whatever macro constants
   next.

 - 25e5e2bf (combine-diff: support format_callback, 2011-08-19),
   wants to use a different rule: a run of (one bitfield definition
   and zero-or-more macro constants to be used in that bitfield).

When they were merged together at d7afe648 (Merge branch
'jc/refactor-diff-stdin', 2012-07-13), these two conflicting
philosophies crashed.

That is the commit to be blamed for this mess ;-)

I am of course fine with the end result this patch gives us.

Thanks.

> This patch puts the field and
> its flags back together.
>
> Using an enum like:
>
>   enum {
> 	  DIRTY_SUBMODULE_UNTRACKED = 1,
> 	  DIRTY_SUBMODULE_MODIFIED = 2
>   } dirty_submodule;
>
> would be more obvious, but it bloats the structure. Limiting
> the enum size like:
>
>   } dirty_submodule : 2;
>
> might work, but it is not portable.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  diffcore.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/diffcore.h b/diffcore.h
> index 1c16c85..f822f9e 100644
> --- a/diffcore.h
> +++ b/diffcore.h
> @@ -43,9 +43,9 @@ struct diff_filespec {
>  	unsigned should_free : 1; /* data should be free()'ed */
>  	unsigned should_munmap : 1; /* data should be munmap()'ed */
>  	unsigned dirty_submodule : 2;  /* For submodules: its work tree is dirty */
> -	unsigned is_stdin : 1;
>  #define DIRTY_SUBMODULE_UNTRACKED 1
>  #define DIRTY_SUBMODULE_MODIFIED  2
> +	unsigned is_stdin : 1;
>  	unsigned has_more_entries : 1; /* only appear in combined diff */
>  	struct userdiff_driver *driver;
>  	/* data should be considered "binary"; -1 means "don't know yet" */

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/5] diff_filespec cleanups and optimizations
  2014-01-17  1:18 [PATCH 0/5] diff_filespec cleanups and optimizations Jeff King
                   ` (4 preceding siblings ...)
  2014-01-17  1:25 ` [PATCH 5/5] diff_filespec: use only 2 bits for is_binary flag Jeff King
@ 2014-01-17 18:49 ` Junio C Hamano
  5 siblings, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2014-01-17 18:49 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> But while looking at it, I noticed a bunch of cleanups for
> diff_filespec.  With the patches below, sizeof(struct diff_filespec) on
> my 64-bit machine goes from 96 bytes down to 80. Compiling with "-m32"
> goes from 64 bytes down to 52.
>
> The first few patches have cleanup value aside from the struct size
> improvement. The last two are pure optimization. I doubt the
> optimization is noticeable for any real-life cases, so I don't mind if
> they get dropped. But they're quite trivial and obvious.

Thanks for a pleasant read.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/5] diff_filespec: reorder dirty_submodule macro definitions
  2014-01-17 18:46   ` Junio C Hamano
@ 2014-01-17 19:47     ` Jeff King
  2014-01-17 23:50       ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2014-01-17 19:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Fri, Jan 17, 2014 at 10:46:59AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > diff_filespec has a 2-bit "dirty_submodule" field and
> > defines two flags as macros. Originally these were right
> > next to each other, but a new field was accidentally added
> > in between in commit 4682d85.
> 
> Interesting.
> 
>  - 4682d852 (diff-index.c: "git diff" has no need to read blob from
>    the standard input, 2012-06-27) wants to use this rule: all the
>    bitfield definitions first, and then whatever macro constants
>    next.
> 
>  - 25e5e2bf (combine-diff: support format_callback, 2011-08-19),
>    wants to use a different rule: a run of (one bitfield definition
>    and zero-or-more macro constants to be used in that bitfield).
> 
> When they were merged together at d7afe648 (Merge branch
> 'jc/refactor-diff-stdin', 2012-07-13), these two conflicting
> philosophies crashed.
> 
> That is the commit to be blamed for this mess ;-)

That makes sense. I had assumed it was a mis-merge initially, so was
surprised to find 4682d85. It just looked like an error to me, but the
rule you gave above does at least make sense.

I'm happy with it either way. I almost just pulled the macro
definitions, including DIFF_FILE_VALID, out of the struct definition
completely. I see the value in having the flags near their bitfield, but
it makes the definition a bit harder to read.

> I am of course fine with the end result this patch gives us.

Me too, though if you feel like doing it the other way, I'm fine with
that, too.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/5] diff_filespec: reorder dirty_submodule macro definitions
  2014-01-17 19:47     ` Jeff King
@ 2014-01-17 23:50       ` Junio C Hamano
  0 siblings, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2014-01-17 23:50 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> I'm happy with it either way. I almost just pulled the macro
> definitions, including DIFF_FILE_VALID, out of the struct definition
> completely. I see the value in having the flags near their bitfield, but
> it makes the definition a bit harder to read.

Yeah, my thoughts exactly when I did those two conflicting changes.
I have a slight preference "Constants go with the fields they are
used in" over "fields and macros mixed together is harder to read",
so let's use your patch as-is.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-01-17 23:51 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-17  1:18 [PATCH 0/5] diff_filespec cleanups and optimizations Jeff King
2014-01-17  1:19 ` [PATCH 1/5] diff_filespec: reorder dirty_submodule macro definitions Jeff King
2014-01-17 18:46   ` Junio C Hamano
2014-01-17 19:47     ` Jeff King
2014-01-17 23:50       ` Junio C Hamano
2014-01-17  1:20 ` [PATCH 2/5] diff_filespec: drop funcname_pattern_ident field Jeff King
2014-01-17  1:21 ` [PATCH 3/5] diff_filespec: drop xfrm_flags field Jeff King
2014-01-17  1:22 ` [PATCH 4/5] diff_filespec: reorder is_binary field Jeff King
2014-01-17  1:25 ` [PATCH 5/5] diff_filespec: use only 2 bits for is_binary flag Jeff King
2014-01-17 18:49 ` [PATCH 0/5] diff_filespec cleanups and optimizations Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.