All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Show binary file size change in diff --stat
@ 2007-02-28 13:03 Andy Parkins
  2007-02-28 14:44 ` Johannes Schindelin
  2007-02-28 18:58 ` Rogan Dawes
  0 siblings, 2 replies; 25+ messages in thread
From: Andy Parkins @ 2007-02-28 13:03 UTC (permalink / raw)
  To: git

Previously, a binary file in the diffstat would show as:

 some-binary-file.bin       |  Bin

The space after the "Bin" was never used.  This patch changes binary
lines in the diffstat to be:

 some-binary-file.bin       |  Bin +123456B -12345B

The "+" item is the size of the new version, the "-" item is the size of
the old version.  If a size is 0 it's not shown (although it would
probably be better to treat no-file differently from zero-byte-file).

The "B" (for bytes) is shown to highlight the fact that these numbers
are not "number of lines", but actual bytes.

The user can see what changed in the binary file, and how big the new
file is.  This is in keeping with the information in the rest of the
diffstat.

The diffstat_t members "added" and "deleted" were unused when the file
was binary, so this patch loads them with the file sizes in
builtin_diffstat().  These figures are then read in show_stats() when
the file is marked binary.

Signed-off-by: Andy Parkins <andyparkins@gmail.com>
---
 diff.c |   15 ++++++++++++---
 1 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/diff.c b/diff.c
index d1eae72..174b84e 100644
--- a/diff.c
+++ b/diff.c
@@ -808,7 +808,14 @@ static void show_stats(struct diffstat_t* data, struct diff_options *options)
 
 		if (data->files[i]->is_binary) {
 			show_name(prefix, name, len, reset, set);
-			printf("  Bin\n");
+			printf("  Bin ");
+			if (added != 0)
+				printf("%s+%iB%s", add_c, added, reset);
+			if (added != 0 && deleted != 0 )
+				printf(" ");
+			if (deleted != 0)
+				printf("%s-%iB%s", del_c, deleted, reset);
+			printf("\n");
 			goto free_diffstat_file;
 		}
 		else if (data->files[i]->is_unmerged) {
@@ -1179,9 +1186,11 @@ static void builtin_diffstat(const char *name_a, const char *name_b,
 	if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0)
 		die("unable to read files to diff");
 
-	if (mmfile_is_binary(&mf1) || mmfile_is_binary(&mf2))
+	if (mmfile_is_binary(&mf1) || mmfile_is_binary(&mf2)) {
 		data->is_binary = 1;
-	else {
+		data->added = mf2.size;
+		data->deleted = mf1.size;
+	} else {
 		/* Crazy xdl interfaces.. */
 		xpparam_t xpp;
 		xdemitconf_t xecfg;
-- 
1.5.0.2.778.gdcb06

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-02-28 13:03 [PATCH] Show binary file size change in diff --stat Andy Parkins
@ 2007-02-28 14:44 ` Johannes Schindelin
  2007-02-28 14:51   ` Nicolas Pitre
                     ` (2 more replies)
  2007-02-28 18:58 ` Rogan Dawes
  1 sibling, 3 replies; 25+ messages in thread
From: Johannes Schindelin @ 2007-02-28 14:44 UTC (permalink / raw)
  To: Andy Parkins; +Cc: git

Hi,

On Wed, 28 Feb 2007, Andy Parkins wrote:

> Previously, a binary file in the diffstat would show as:
> 
>  some-binary-file.bin       |  Bin
> 
> The space after the "Bin" was never used.  This patch changes binary
> lines in the diffstat to be:
> 
>  some-binary-file.bin       |  Bin +123456B -12345B

This seemed as a good idea to me at first. But I think it is a little 
misleading: Imagine for example that you have raw image data (which I 
track with Git, too). If only part of that changes, this representation is 
misleading.

How about "Bin 123456 -> 12345 bytes" instead?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-02-28 14:44 ` Johannes Schindelin
@ 2007-02-28 14:51   ` Nicolas Pitre
  2007-02-28 15:15   ` Andy Parkins
  2007-02-28 15:26   ` Andy Parkins
  2 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2007-02-28 14:51 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Andy Parkins, git

On Wed, 28 Feb 2007, Johannes Schindelin wrote:

> How about "Bin 123456 -> 12345 bytes" instead?

I agree.


Nicolas

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] Show binary file size change in diff --stat
  2007-02-28 14:44 ` Johannes Schindelin
  2007-02-28 14:51   ` Nicolas Pitre
@ 2007-02-28 15:15   ` Andy Parkins
  2007-02-28 15:37     ` Johannes Schindelin
  2007-02-28 15:26   ` Andy Parkins
  2 siblings, 1 reply; 25+ messages in thread
From: Andy Parkins @ 2007-02-28 15:15 UTC (permalink / raw)
  To: git

Previously, a binary file in the diffstat would show as:

 some-binary-file.bin       |  Bin

The space after the "Bin" was never used.  This patch changes binary
lines in the diffstat to be:

 some-binary-file.bin       |  Bin 123456 -> 12345 bytes

The very nice "->" notation was suggested by Johannes Schindelin, and
shows the before and after sizes more clearly than "+" and "-" would.
If a size is 0 it's not shown (although it would probably be better to
treat no-file differently from zero-byte-file).

The user can see what changed in the binary file, and how big the new
file is.  This is in keeping with the information in the rest of the
diffstat.

The diffstat_t members "added" and "deleted" were unused when the file
was binary, so this patch loads them with the file sizes in
builtin_diffstat().  These figures are then read in show_stats() when
the file is marked binary.

Signed-off-by: Andy Parkins <andyparkins@gmail.com>
---

Good call.

I like it.  "->" is much better than "+" and "-"; they would
imply that a set of bytes had been removed and another set added to
replace them - obviously nonsense.

 diff.c |   17 ++++++++++++++---
 1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/diff.c b/diff.c
index d1eae72..ea322e2 100644
--- a/diff.c
+++ b/diff.c
@@ -808,7 +808,16 @@ static void show_stats(struct diffstat_t* data, struct diff_options *options)
 
 		if (data->files[i]->is_binary) {
 			show_name(prefix, name, len, reset, set);
-			printf("  Bin\n");
+			printf("  Bin ");
+			if (added != 0)
+				printf("%s%i%s", add_c, added, reset);
+			if (added != 0 && deleted != 0 )
+				printf(" -> ");
+			if (deleted != 0)
+				printf("%s%i%s", del_c, deleted, reset);
+			if (added != 0 || deleted != 0 )
+				printf(" bytes");
+			printf("\n");
 			goto free_diffstat_file;
 		}
 		else if (data->files[i]->is_unmerged) {
@@ -1179,9 +1188,11 @@ static void builtin_diffstat(const char *name_a, const char *name_b,
 	if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0)
 		die("unable to read files to diff");
 
-	if (mmfile_is_binary(&mf1) || mmfile_is_binary(&mf2))
+	if (mmfile_is_binary(&mf1) || mmfile_is_binary(&mf2)) {
 		data->is_binary = 1;
-	else {
+		data->added = mf2.size;
+		data->deleted = mf1.size;
+	} else {
 		/* Crazy xdl interfaces.. */
 		xpparam_t xpp;
 		xdemitconf_t xecfg;
-- 
1.5.0.2.205.gea888-dirty

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH] Show binary file size change in diff --stat
  2007-02-28 14:44 ` Johannes Schindelin
  2007-02-28 14:51   ` Nicolas Pitre
  2007-02-28 15:15   ` Andy Parkins
@ 2007-02-28 15:26   ` Andy Parkins
  2 siblings, 0 replies; 25+ messages in thread
From: Andy Parkins @ 2007-02-28 15:26 UTC (permalink / raw)
  To: git

Previously, a binary file in the diffstat would show as:

 some-binary-file.bin       |  Bin

The space after the "Bin" was never used.  This patch changes binary
lines in the diffstat to be:

 some-binary-file.bin       |  Bin 12345 -> 123456 bytes

The very nice "->" notation was suggested by Johannes Schindelin, and
shows the before and after sizes more clearly than "+" and "-" would.

The user can see what changed in the binary file, and how big the new
file is.  This is in keeping with the information in the rest of the
diffstat.

The diffstat_t members "added" and "deleted" were unused when the file
was binary, so this patch loads them with the file sizes in
builtin_diffstat().  These figures are then read in show_stats() when
the file is marked binary.

Signed-off-by: Andy Parkins <andyparkins@gmail.com>
---

GAHHH!!!

Ignore previous; I forgot to swap the order of the added and deleted items.
I've also removed the conditionals on the 0 byte sizes - they're always shown.


 diff.c |   13 ++++++++++---
 1 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/diff.c b/diff.c
index d1eae72..cfddbb4 100644
--- a/diff.c
+++ b/diff.c
@@ -808,7 +808,12 @@ static void show_stats(struct diffstat_t* data, struct diff_options *options)
 
 		if (data->files[i]->is_binary) {
 			show_name(prefix, name, len, reset, set);
-			printf("  Bin\n");
+			printf("  Bin ");
+			printf("%s%i%s", del_c, deleted, reset);
+			printf(" -> ");
+			printf("%s%i%s", add_c, added, reset);
+			printf(" bytes");
+			printf("\n");
 			goto free_diffstat_file;
 		}
 		else if (data->files[i]->is_unmerged) {
@@ -1179,9 +1184,11 @@ static void builtin_diffstat(const char *name_a, const char *name_b,
 	if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0)
 		die("unable to read files to diff");
 
-	if (mmfile_is_binary(&mf1) || mmfile_is_binary(&mf2))
+	if (mmfile_is_binary(&mf1) || mmfile_is_binary(&mf2)) {
 		data->is_binary = 1;
-	else {
+		data->added = mf2.size;
+		data->deleted = mf1.size;
+	} else {
 		/* Crazy xdl interfaces.. */
 		xpparam_t xpp;
 		xdemitconf_t xecfg;
-- 
1.5.0.2.205.gea888-dirty

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-02-28 15:15   ` Andy Parkins
@ 2007-02-28 15:37     ` Johannes Schindelin
  2007-02-28 18:42       ` Andy Parkins
  0 siblings, 1 reply; 25+ messages in thread
From: Johannes Schindelin @ 2007-02-28 15:37 UTC (permalink / raw)
  To: Andy Parkins; +Cc: git

Hi,

On Wed, 28 Feb 2007, Andy Parkins wrote:

> +				printf("%s%i%s", add_c, added, reset);

Please use "%d" instead of "%i" (I forgot why, but I remember Junio saying 
that we should do it...).

> +			if (added != 0 && deleted != 0 )
> +				printf(" -> ");

Either you want to have

	Bin +123456 bytes

or

	Bin 0 -> 123456 bytes

for added files (and the obvious thing for deleted ones), but with your 
patch, both added and deleted get

	Bin 123456 bytes

which is not so optimal.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-02-28 15:37     ` Johannes Schindelin
@ 2007-02-28 18:42       ` Andy Parkins
  2007-02-28 19:41         ` Johannes Schindelin
  0 siblings, 1 reply; 25+ messages in thread
From: Andy Parkins @ 2007-02-28 18:42 UTC (permalink / raw)
  To: git; +Cc: Johannes Schindelin

On Wednesday 2007, February 28, Johannes Schindelin wrote:

> for added files (and the obvious thing for deleted ones), but with
> your patch, both added and deleted get
>
> 	Bin 123456 bytes
>
> which is not so optimal.

See my corrected patch.  However, while you're interested: how would one 
tell the difference between 0 bytes and (no file).  I'm thinking of 
these possibilities:

 new -> 123456
 0 -> 123456
 123456 -> deleted
 123456 -> 0

For example - does mmfile_t.ptr set to NULL mean anything?



Andy
-- 
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-02-28 13:03 [PATCH] Show binary file size change in diff --stat Andy Parkins
  2007-02-28 14:44 ` Johannes Schindelin
@ 2007-02-28 18:58 ` Rogan Dawes
  2007-02-28 19:42   ` Johannes Schindelin
  1 sibling, 1 reply; 25+ messages in thread
From: Rogan Dawes @ 2007-02-28 18:58 UTC (permalink / raw)
  To: git

Andy Parkins wrote:
> Previously, a binary file in the diffstat would show as:
> 
>  some-binary-file.bin       |  Bin
> 
> The space after the "Bin" was never used.  This patch changes binary
> lines in the diffstat to be:
> 
>  some-binary-file.bin       |  Bin +123456B -12345B
> 
> The "+" item is the size of the new version, the "-" item is the size of
> the old version.  If a size is 0 it's not shown (although it would
> probably be better to treat no-file differently from zero-byte-file).
> 
> The "B" (for bytes) is shown to highlight the fact that these numbers
> are not "number of lines", but actual bytes.
> 
> The user can see what changed in the binary file, and how big the new
> file is.  This is in keeping with the information in the rest of the
> diffstat.
> 
> The diffstat_t members "added" and "deleted" were unused when the file
> was binary, so this patch loads them with the file sizes in
> builtin_diffstat().  These figures are then read in show_stats() when
> the file is marked binary.

How about showing the size of the changes between the 2 files via the 
libxdiff binary patch function?

It would probably not be that difficult to generate a binary patch, and 
then derive some stats from it.

Rogan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-02-28 18:42       ` Andy Parkins
@ 2007-02-28 19:41         ` Johannes Schindelin
  0 siblings, 0 replies; 25+ messages in thread
From: Johannes Schindelin @ 2007-02-28 19:41 UTC (permalink / raw)
  To: Andy Parkins; +Cc: git

Hi,

On Wed, 28 Feb 2007, Andy Parkins wrote:

> On Wednesday 2007, February 28, Johannes Schindelin wrote:
> 
> > for added files (and the obvious thing for deleted ones), but with
> > your patch, both added and deleted get
> >
> > 	Bin 123456 bytes
> >
> > which is not so optimal.
> 
> See my corrected patch.  However, while you're interested: how would one 
> tell the difference between 0 bytes and (no file).  I'm thinking of 
> these possibilities:
> 
>  new -> 123456
>  0 -> 123456
>  123456 -> deleted
>  123456 -> 0
> 
> For example - does mmfile_t.ptr set to NULL mean anything?

In non-binary case, we have only plusses or minusses, so you cannot tell 
either. You have to look into the summary for that.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-02-28 18:58 ` Rogan Dawes
@ 2007-02-28 19:42   ` Johannes Schindelin
  2007-02-28 21:27     ` Rogan Dawes
  0 siblings, 1 reply; 25+ messages in thread
From: Johannes Schindelin @ 2007-02-28 19:42 UTC (permalink / raw)
  To: Rogan Dawes; +Cc: git

Hi,

On Wed, 28 Feb 2007, Rogan Dawes wrote:

> How about showing the size of the changes between the 2 files via the 
> libxdiff binary patch function?

I briefly considered this, too. But what would it tell you in the case of 
a jpg? I think it has more disadvantages than advantages...

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-02-28 19:42   ` Johannes Schindelin
@ 2007-02-28 21:27     ` Rogan Dawes
  2007-03-01  1:09       ` Johannes Schindelin
  0 siblings, 1 reply; 25+ messages in thread
From: Rogan Dawes @ 2007-02-28 21:27 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

Johannes Schindelin wrote:
> Hi,
> 
> On Wed, 28 Feb 2007, Rogan Dawes wrote:
> 
>> How about showing the size of the changes between the 2 files via the 
>> libxdiff binary patch function?
> 
> I briefly considered this, too. But what would it tell you in the case of 
> a jpg? I think it has more disadvantages than advantages...
> 
> Ciao,
> Dscho

It would still tell you the extent of the changes. i.e. Did we change 
only 10 bytes of the file, or is it a dramatic change?

This is exactly what the text statistics show. +(new lines + modified 
lines) -(deleted lines + modified lines)

Instead of having a "line-based" record size, simply use individual 
bytes, since binary files don't have lines (most of them, anyway!).

Rogan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-02-28 21:27     ` Rogan Dawes
@ 2007-03-01  1:09       ` Johannes Schindelin
  2007-03-01  6:58         ` Rogan Dawes
  0 siblings, 1 reply; 25+ messages in thread
From: Johannes Schindelin @ 2007-03-01  1:09 UTC (permalink / raw)
  To: Rogan Dawes; +Cc: git

Hi,

On Wed, 28 Feb 2007, Rogan Dawes wrote:

> Johannes Schindelin wrote:
> 
> > On Wed, 28 Feb 2007, Rogan Dawes wrote:
> > 
> > > How about showing the size of the changes between the 2 files via 
> > > the libxdiff binary patch function?
> > 
> > I briefly considered this, too. But what would it tell you in the case 
> > of a jpg? I think it has more disadvantages than advantages...
> 
> It would still tell you the extent of the changes. i.e. Did we change 
> only 10 bytes of the file, or is it a dramatic change?

I was not explicit enough, okay. I was not so worried about the case where 
only 10 bytes changed. If you insert a single dot in a jpg image, chances 
are that your binary content will change _a lot_.

So, no problem deducing from 10 bytes changed that it was a minor change. 
But you cannot deduce the opposite of a 1MB change!

Hth,
Dscho

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-03-01  1:09       ` Johannes Schindelin
@ 2007-03-01  6:58         ` Rogan Dawes
  0 siblings, 0 replies; 25+ messages in thread
From: Rogan Dawes @ 2007-03-01  6:58 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

Johannes Schindelin wrote:
> Hi,
> 
> On Wed, 28 Feb 2007, Rogan Dawes wrote:
> 
>> Johannes Schindelin wrote:
>>
>>> On Wed, 28 Feb 2007, Rogan Dawes wrote:
>>>
>>>> How about showing the size of the changes between the 2 files via 
>>>> the libxdiff binary patch function?
>>> I briefly considered this, too. But what would it tell you in the case 
>>> of a jpg? I think it has more disadvantages than advantages...
>> It would still tell you the extent of the changes. i.e. Did we change 
>> only 10 bytes of the file, or is it a dramatic change?
> 
> I was not explicit enough, okay. I was not so worried about the case where 
> only 10 bytes changed. If you insert a single dot in a jpg image, chances 
> are that your binary content will change _a lot_.
> 
> So, no problem deducing from 10 bytes changed that it was a minor change. 
> But you cannot deduce the opposite of a 1MB change!
> 
> Hth,
> Dscho
> 

Yeah, I did understand that.

However, if we were to generalise, making a change to an XML document 
(wrapping the whole thing in an additional tag, with the associated 
indentation adjustments), we'd still see all the lines in the file 
change, even though the actual (semantic?) change is small. However, we 
would still report the full numbers of lines added and deleted regardless.

Why should a binary file be any different? It would be up to the 
observer to make a conclusion as to the significance of the numbers, 
based on their intrinsic knowledge of the file type, and its properties.

I just believe that showing bytes changed for binary files is 
_consistent_ with what we do for text.

Rogan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-04-04 16:40         ` Linus Torvalds
  2007-04-04 16:59           ` Johannes Schindelin
@ 2007-04-04 17:47           ` Junio C Hamano
  1 sibling, 0 replies; 25+ messages in thread
From: Junio C Hamano @ 2007-04-04 17:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Johannes Schindelin, Rogan Dawes, Andy Parkins, git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, 4 Apr 2007, Johannes Schindelin wrote:
>> 
>> The subtle difference: your approach is _expensive_ in terms of CPU time, 
>> while the byte change approach is _dirt cheap_.
>
> Well, you could do a combination (still dirt cheap):
>  - show the size before/after (and yes, new/delete should be separate from 
>    "zero size before/after")
>  - show the size of the binary patch.
>
> No "X added bytes" vs "Y bytes deleted", just "size of binary patch". It 
> could be really small, even if 10k was deleted, or the file was totally 
> re-organized by moving chunks around.
>
> It would still be a meaningful thing to know - if only because it tells 
> you how much space the delta takes.

I agree wrt the kind of information to give, except that I am
not so sure about new/delete vs zero before/after.  We do not do
that for a text file, and when people do care about the
distinction, they would use --summary.

I often have wished that we could make --stat imply --summary;
the only reason we did not do that is because the --stat option
started its life as an imitation of "diffstat".

I've seen our diff-delta change its output size once.  It was a
nice improvement to make the delta much smaller than before, but
I had to rewrite the rename similarity in diffcore not to depend
on the diff-delta algorithm change, to keep it stable across
diff-delta improvements (the alternative was to futz with the
default threshold).  I suspect we might see similar confusion if
we show the delta size, depending on people's expectations.
This is a very minor issue, but I thought I should mention it.

There is a machine readable output format for the same --stat
information called --numstat.  It currently signals the
binary-ness by showing '-' instead of line count.  We could
extend it by showing '-' + number of bytes.

So here are some more suggestions:

 (1) --stat for binary files to show preimage and postimage
     sizes like this (if we were to do delta size -- otherwise
     drop " (.*" at the end):

	penguin.jpg |  Bin 745245 -> 660689 (delta: 4434)

 (2) --numstat for binary files to show preimage and postimage
     sizes like this:

	penguin.jpg	-745245	-660689

 (3) independent from all of the above, make --stat imply
     --summary and perhaps introduce --no-summary if people do not
     want --summary given when they say --stat;

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-04-04 16:59           ` Johannes Schindelin
@ 2007-04-04 17:12             ` Linus Torvalds
  0 siblings, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2007-04-04 17:12 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Rogan Dawes, Andy Parkins, git



On Wed, 4 Apr 2007, Johannes Schindelin wrote:
> 
> ... and by this (size of binary patch) you mean the deltified object?

yes. Just the size of the delta. Although I guess you're right - we 
may not have generated that.

		Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-04-04 16:40         ` Linus Torvalds
@ 2007-04-04 16:59           ` Johannes Schindelin
  2007-04-04 17:12             ` Linus Torvalds
  2007-04-04 17:47           ` Junio C Hamano
  1 sibling, 1 reply; 25+ messages in thread
From: Johannes Schindelin @ 2007-04-04 16:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rogan Dawes, Andy Parkins, git

Hi,

On Wed, 4 Apr 2007, Linus Torvalds wrote:

> On Wed, 4 Apr 2007, Johannes Schindelin wrote:
> > 
> > The subtle difference: your approach is _expensive_ in terms of CPU time, 
> > while the byte change approach is _dirt cheap_.
> 
> Well, you could do a combination (still dirt cheap):
>  - show the size before/after (and yes, new/delete should be separate from 
>    "zero size before/after")
>  - show the size of the binary patch.

... and by this (size of binary patch) you mean the deltified object?

In general, we do not have the binary patch. And the generation of that 
binary patch is what I was referring to being expensive.

Remember, most binary files are way larger than the average source code 
files we have, since it is much, much easier to generate binary data than 
to write meaningful code. Therefore, the binary patch generation has to 
look at much larger pieces to begin with, which translates into CPU time.

Having said that, I have to admit that I don't have numbers backing this 
reasoning up. So, if somebody comes up with numbers contradicting my 
theory, I will gladly change my mind.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-04-04 16:22       ` Johannes Schindelin
  2007-04-04 16:26         ` Rogan Dawes
@ 2007-04-04 16:40         ` Linus Torvalds
  2007-04-04 16:59           ` Johannes Schindelin
  2007-04-04 17:47           ` Junio C Hamano
  1 sibling, 2 replies; 25+ messages in thread
From: Linus Torvalds @ 2007-04-04 16:40 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Rogan Dawes, Andy Parkins, git



On Wed, 4 Apr 2007, Johannes Schindelin wrote:
> 
> The subtle difference: your approach is _expensive_ in terms of CPU time, 
> while the byte change approach is _dirt cheap_.

Well, you could do a combination (still dirt cheap):
 - show the size before/after (and yes, new/delete should be separate from 
   "zero size before/after")
 - show the size of the binary patch.

No "X added bytes" vs "Y bytes deleted", just "size of binary patch". It 
could be really small, even if 10k was deleted, or the file was totally 
re-organized by moving chunks around.

It would still be a meaningful thing to know - if only because it tells 
you how much space the delta takes. So even if it's a .jpg, and the actual 
*picture* didn't change a lot (ie you did some new version with color 
correction or something: it looks similar to the old one, but the *diff* 
is basically "rewrite it all"), knowing the size of the delta at least has 
the meaning of "this is how basically much space it will take when you 
send the binary diff in an email".

That's fairly close to what "5 new lines, 1 deleted line" message means. 
It's another way to give you an approximate idea of how big the changes 
were.

		Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-04-04 16:22       ` Johannes Schindelin
@ 2007-04-04 16:26         ` Rogan Dawes
  2007-04-04 16:40         ` Linus Torvalds
  1 sibling, 0 replies; 25+ messages in thread
From: Rogan Dawes @ 2007-04-04 16:26 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Andy Parkins, git

Johannes Schindelin wrote:
> Hi,
> 
> On Wed, 4 Apr 2007, Rogan Dawes wrote:
> 
>> Andy Parkins wrote:
>>
>>> As an example: compress a file, change a byte, compress it again, 
>>> perform a binary diff; what is that diff telling you about the change?  
>>> (My answer is: not much).
>> Well, at least as much as the resulting sizes tell you, if not more.
> 
> The subtle difference: your approach is _expensive_ in terms of CPU time, 
> while the byte change approach is _dirt cheap_.

I don't dispute that for a second.

> Since it seems that there are gazillions of examples where one or the 
> other (or both) do not make sense, I'd rather have the fast one.

Fair enough.

Rogan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-04-04 15:51     ` Rogan Dawes
@ 2007-04-04 16:22       ` Johannes Schindelin
  2007-04-04 16:26         ` Rogan Dawes
  2007-04-04 16:40         ` Linus Torvalds
  0 siblings, 2 replies; 25+ messages in thread
From: Johannes Schindelin @ 2007-04-04 16:22 UTC (permalink / raw)
  To: Rogan Dawes; +Cc: Andy Parkins, git

Hi,

On Wed, 4 Apr 2007, Rogan Dawes wrote:

> Andy Parkins wrote:
>
> > As an example: compress a file, change a byte, compress it again, 
> > perform a binary diff; what is that diff telling you about the change?  
> > (My answer is: not much).
> 
> Well, at least as much as the resulting sizes tell you, if not more.

The subtle difference: your approach is _expensive_ in terms of CPU time, 
while the byte change approach is _dirt cheap_.

Since it seems that there are gazillions of examples where one or the 
other (or both) do not make sense, I'd rather have the fast one.

HOWEVER, there is something I really would like to have, namely a clear 
distinction between new file / deleted file, and 0->n / n->0 changes.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-04-04 14:40   ` Geert Bosch
@ 2007-04-04 16:00     ` Rogan Dawes
  0 siblings, 0 replies; 25+ messages in thread
From: Rogan Dawes @ 2007-04-04 16:00 UTC (permalink / raw)
  To: Geert Bosch; +Cc: Andy Parkins, git

Geert Bosch wrote:
> 
> On Apr 4, 2007, at 09:34, Rogan Dawes wrote:
> 
>> For binary files, it would be consistent to show the number of bytes 
>> added/deleted. I have not investigated the output format for the 
>> libxdiff binary patch format, but hopefully it would not be too 
>> difficult to calculate the deletions and additions.
> 
> For binary files it is impractical to do insert/delete type of differences.
> For text files, treating lines as indivisible entities to insert/delete
> make some sense. For binary files, you'd have to use some arbitrary
> context-defined breakpoints and then go from there. The result would
> be some very complicated and unclear algorithm that would have no use
> in the real world.
> 
> Many binary files, such as an images, waveforms or virtually any compressed
> stream, can change in a way that changes all bytes in the file, while
> the changes in the displayed image or the uncompressed stream are
> imperceptible or absent. Guessing semantic differences between binary
> blobs is hopeless and subjective, while differences in size are fact.
> 
>   -Geert

As per my mail to Andy, we *already* do this for text files. e.g. wrap 
an XML document in an additional tag, and update the indentation to match.

The semantic change is minimal (perhaps 2 new lines), but the reported 
change reflects n lines deleted, and n+2 added.

Exactly because we *don't* do any semantic analysis (for text or binary 
files), we should simply report the number of bytes changed, exactly as 
we do for text files (reporting number of lines changed). This is 
_consistent_ with what we do currently for text files.

Note that Andy's apparent preference (to know how the sizes have 
changed) can still largely be satisfied by this approach.

  somefile.bin  | 1000 -> 1000 bytes

and

  somefile.bin  | 500 bytes removed, 500 bytes added

You can still see that the overall size of the file has not changed, but 
you get the additional information about how many bytes were actually 
changed at the same time, which you don't get just showing the sizes.

Rogan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-04-04 14:40   ` Andy Parkins
@ 2007-04-04 15:51     ` Rogan Dawes
  2007-04-04 16:22       ` Johannes Schindelin
  0 siblings, 1 reply; 25+ messages in thread
From: Rogan Dawes @ 2007-04-04 15:51 UTC (permalink / raw)
  To: Andy Parkins; +Cc: git

Andy Parkins wrote:
> On Wednesday 2007 April 04 14:34, Rogan Dawes wrote:
> 
>> Well, how about my comments in <45E67978.9030805@dawes.za.net>,
>> suggesting that the edit difference (number of steps required to
>> transform one to the other) would be a better indication?
> 
> Perhaps.  There is certainly a difference between:
> 
>  somefile.bin  | 1000 -> 1000 bytes
> 
> and
> 
>  somefile.bin  | 500 bytes removed, 500 bytes added
> 
>> I think it is better because it is consistent with what we currently do
>> for text files: show the number of lines added/deleted.
> 
> The thing is, "lines" is an understandable unit for a text file, so it's 
> useful to show.  I'm not sure the same is true of "bytes" for a binary file.  
> Those bytes could represent anything; the true unit of a binary file is 
> dependent on its type.

I think bytes are the only reasonable unit for a binary file, since we 
have no idea what a meaningful divisor may be. So, defaulting to the 
smallest possible unit (other than going to the bit-level) makes perfect 
sense.

>> For binary files, it would be consistent to show the number of bytes
>> added/deleted. I have not investigated the output format for the
>> libxdiff binary patch format, but hopefully it would not be too
>> difficult to calculate the deletions and additions.
> 
> I'm inclined to agree with Johannes, while it's certainly something 
> that /could/ be shown - is it more useful?  There is no guarantee that a 
> small change in the underlying content is represented by a small change in 
> the binary diff.
> 
> As an example: compress a file, change a byte, compress it again, perform a 
> binary diff; what is that diff telling you about the change?  (My answer is: 
> not much).

Well, at least as much as the resulting sizes tell you, if not more.

Here is a counter example for a text file, where lines changed do not 
actually reflect the real changes in the file: the contents of an XML 
file being wrapped in an additional tag.

Semantically, all that has changed is an opening and closing tag. But, 
we still show that on a line by line basis, the entire file has changed 
(because the indentation changes). So you'd have n lines deleted, and 
n+2 lines added (for the additional opening and closing tag).

> Andy

I still maintain that showing bytes changed is the only consistent thing 
to do, unless we have additional logic that allows us to do "per 
file-type" diff statistics. Maybe .gitattributes will allow/enable this?

Regards,

Rogan

P.S. I'm not volunteering to inflict my novice C-skills on the git 
community, so this is really "just my 2c"

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-04-04 13:34 ` Rogan Dawes
  2007-04-04 14:40   ` Geert Bosch
@ 2007-04-04 14:40   ` Andy Parkins
  2007-04-04 15:51     ` Rogan Dawes
  1 sibling, 1 reply; 25+ messages in thread
From: Andy Parkins @ 2007-04-04 14:40 UTC (permalink / raw)
  To: git; +Cc: Rogan Dawes

On Wednesday 2007 April 04 14:34, Rogan Dawes wrote:

> Well, how about my comments in <45E67978.9030805@dawes.za.net>,
> suggesting that the edit difference (number of steps required to
> transform one to the other) would be a better indication?

Perhaps.  There is certainly a difference between:

 somefile.bin  | 1000 -> 1000 bytes

and

 somefile.bin  | 500 bytes removed, 500 bytes added

> I think it is better because it is consistent with what we currently do
> for text files: show the number of lines added/deleted.

The thing is, "lines" is an understandable unit for a text file, so it's 
useful to show.  I'm not sure the same is true of "bytes" for a binary file.  
Those bytes could represent anything; the true unit of a binary file is 
dependent on its type.

> For binary files, it would be consistent to show the number of bytes
> added/deleted. I have not investigated the output format for the
> libxdiff binary patch format, but hopefully it would not be too
> difficult to calculate the deletions and additions.

I'm inclined to agree with Johannes, while it's certainly something 
that /could/ be shown - is it more useful?  There is no guarantee that a 
small change in the underlying content is represented by a small change in 
the binary diff.

As an example: compress a file, change a byte, compress it again, perform a 
binary diff; what is that diff telling you about the change?  (My answer is: 
not much).


Andy

-- 
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-04-04 13:34 ` Rogan Dawes
@ 2007-04-04 14:40   ` Geert Bosch
  2007-04-04 16:00     ` Rogan Dawes
  2007-04-04 14:40   ` Andy Parkins
  1 sibling, 1 reply; 25+ messages in thread
From: Geert Bosch @ 2007-04-04 14:40 UTC (permalink / raw)
  To: Rogan Dawes; +Cc: Andy Parkins, git


On Apr 4, 2007, at 09:34, Rogan Dawes wrote:

> For binary files, it would be consistent to show the number of  
> bytes added/deleted. I have not investigated the output format for  
> the libxdiff binary patch format, but hopefully it would not be too  
> difficult to calculate the deletions and additions.

For binary files it is impractical to do insert/delete type of  
differences.
For text files, treating lines as indivisible entities to insert/delete
make some sense. For binary files, you'd have to use some arbitrary
context-defined breakpoints and then go from there. The result would
be some very complicated and unclear algorithm that would have no use
in the real world.

Many binary files, such as an images, waveforms or virtually any  
compressed
stream, can change in a way that changes all bytes in the file, while
the changes in the displayed image or the uncompressed stream are
imperceptible or absent. Guessing semantic differences between binary
blobs is hopeless and subjective, while differences in size are fact.

   -Geert

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Show binary file size change in diff --stat
  2007-04-04 13:14 Andy Parkins
@ 2007-04-04 13:34 ` Rogan Dawes
  2007-04-04 14:40   ` Geert Bosch
  2007-04-04 14:40   ` Andy Parkins
  0 siblings, 2 replies; 25+ messages in thread
From: Rogan Dawes @ 2007-04-04 13:34 UTC (permalink / raw)
  To: Andy Parkins; +Cc: git

Andy Parkins wrote:
> Previously, a binary file in the diffstat would show as:
> 
>  some-binary-file.bin       |  Bin
> 
> The space after the "Bin" was never used.  This patch changes binary
> lines in the diffstat to be:
> 
>  some-binary-file.bin       |  Bin 12345 -> 123456 bytes
> 
> The very nice "->" notation was suggested by Johannes Schindelin, and
> shows the before and after sizes more clearly than "+" and "-" would.
> If a size is 0 it's not shown (although it would probably be better to
> treat no-file differently from zero-byte-file).
> 
> The user can see what changed in the binary file, and how big the new
> file is.  This is in keeping with the information in the rest of the
> diffstat.
> 
> The diffstat_t members "added" and "deleted" were unused when the file
> was binary, so this patch loads them with the file sizes in
> builtin_diffstat().  These figures are then read in show_stats() when
> the file is marked binary.
> 
> Signed-off-by: Andy Parkins <andyparkins@gmail.com>
> ---
> This is a resend; I still wish for it every time I see a --stat with
> binary files changed.
> 
> Is there any objection to it that I can address?


Well, how about my comments in <45E67978.9030805@dawes.za.net>, 
suggesting that the edit difference (number of steps required to 
transform one to the other) would be a better indication?

I think it is better because it is consistent with what we currently do 
for text files: show the number of lines added/deleted.

For binary files, it would be consistent to show the number of bytes 
added/deleted. I have not investigated the output format for the 
libxdiff binary patch format, but hopefully it would not be too 
difficult to calculate the deletions and additions.

Rogan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] Show binary file size change in diff --stat
@ 2007-04-04 13:14 Andy Parkins
  2007-04-04 13:34 ` Rogan Dawes
  0 siblings, 1 reply; 25+ messages in thread
From: Andy Parkins @ 2007-04-04 13:14 UTC (permalink / raw)
  To: git

Previously, a binary file in the diffstat would show as:

 some-binary-file.bin       |  Bin

The space after the "Bin" was never used.  This patch changes binary
lines in the diffstat to be:

 some-binary-file.bin       |  Bin 12345 -> 123456 bytes

The very nice "->" notation was suggested by Johannes Schindelin, and
shows the before and after sizes more clearly than "+" and "-" would.
If a size is 0 it's not shown (although it would probably be better to
treat no-file differently from zero-byte-file).

The user can see what changed in the binary file, and how big the new
file is.  This is in keeping with the information in the rest of the
diffstat.

The diffstat_t members "added" and "deleted" were unused when the file
was binary, so this patch loads them with the file sizes in
builtin_diffstat().  These figures are then read in show_stats() when
the file is marked binary.

Signed-off-by: Andy Parkins <andyparkins@gmail.com>
---
This is a resend; I still wish for it every time I see a --stat with
binary files changed.

Is there any objection to it that I can address?


 diff.c |   13 ++++++++++---
 1 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/diff.c b/diff.c
index d8f9242..1a527e9 100644
--- a/diff.c
+++ b/diff.c
@@ -811,7 +811,12 @@ static void show_stats(struct diffstat_t* data, struct diff_options *options)
 
 		if (data->files[i]->is_binary) {
 			show_name(prefix, name, len, reset, set);
-			printf("  Bin\n");
+			printf("  Bin ");
+			printf("%s%i%s", del_c, deleted, reset);
+			printf(" -> ");
+			printf("%s%i%s", add_c, added, reset);
+			printf(" bytes");
+			printf("\n");
 			goto free_diffstat_file;
 		}
 		else if (data->files[i]->is_unmerged) {
@@ -1185,9 +1190,11 @@ static void builtin_diffstat(const char *name_a, const char *name_b,
 	if (fill_mmfile(&mf1, one) < 0 || fill_mmfile(&mf2, two) < 0)
 		die("unable to read files to diff");
 
-	if (mmfile_is_binary(&mf1) || mmfile_is_binary(&mf2))
+	if (mmfile_is_binary(&mf1) || mmfile_is_binary(&mf2)) {
 		data->is_binary = 1;
-	else {
+		data->added = mf2.size;
+		data->deleted = mf1.size;
+	} else {
 		/* Crazy xdl interfaces.. */
 		xpparam_t xpp;
 		xdemitconf_t xecfg;
-- 
1.5.1.rc3.703.g5709-dirty

^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2007-04-04 17:48 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-02-28 13:03 [PATCH] Show binary file size change in diff --stat Andy Parkins
2007-02-28 14:44 ` Johannes Schindelin
2007-02-28 14:51   ` Nicolas Pitre
2007-02-28 15:15   ` Andy Parkins
2007-02-28 15:37     ` Johannes Schindelin
2007-02-28 18:42       ` Andy Parkins
2007-02-28 19:41         ` Johannes Schindelin
2007-02-28 15:26   ` Andy Parkins
2007-02-28 18:58 ` Rogan Dawes
2007-02-28 19:42   ` Johannes Schindelin
2007-02-28 21:27     ` Rogan Dawes
2007-03-01  1:09       ` Johannes Schindelin
2007-03-01  6:58         ` Rogan Dawes
2007-04-04 13:14 Andy Parkins
2007-04-04 13:34 ` Rogan Dawes
2007-04-04 14:40   ` Geert Bosch
2007-04-04 16:00     ` Rogan Dawes
2007-04-04 14:40   ` Andy Parkins
2007-04-04 15:51     ` Rogan Dawes
2007-04-04 16:22       ` Johannes Schindelin
2007-04-04 16:26         ` Rogan Dawes
2007-04-04 16:40         ` Linus Torvalds
2007-04-04 16:59           ` Johannes Schindelin
2007-04-04 17:12             ` Linus Torvalds
2007-04-04 17:47           ` Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.