git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] Fixes for undefined behaviour
@ 2013-04-02 19:50 John Keeping
  2013-04-02 19:50 ` [PATCH 1/2] diffcore-break: don't divide by zero John Keeping
  2013-04-02 19:50 ` [PATCH 2/2] bisect: avoid signed integer overflow John Keeping
  0 siblings, 2 replies; 9+ messages in thread
From: John Keeping @ 2013-04-02 19:50 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, John Keeping

I've been playing with Clang's undefined behaviour sanitizer, which
points out a few potential issues in Git when running the test suite
(it's a runtime analysis that is compiled in by setting suitable
CFLAGS).

These patches fix one issue that I think we need to worry about and one
that's trivial to fix.

The remaining warnings are:

refs.c:2426:17: runtime error: index -1 out of bounds for type 'char [8192]'

  Caused by a loop walking backwards over the reflog which sets its scan
  pointer to be one before the start of the buffer in order to break out
  of the loop.  It seems unlikely that the (stack allocated) buffer will
  be at address zero so I don't think any sane compiler will cause us
  problems here.

tag.c:104:40: runtime error: member access within null pointer of type
'struct commit'

  This does "&lookup_commit(sha1)->object" which ends up being okay
  because "object" is the first item in struct commit.  I'm not sure
  it's worth the churn to change this.

xdiff/xutils.c:308:7: runtime error: load of misaligned address for type
'unsigned long', which requires 8 byte alignment

  This is in the XDL_FAST_HASH code, which should only be used on
  architectures where this is likely to be reasonably fast.  The commit
  introducing this code points at an LKML thread[1] discussing a similar
  implementation in the kernel, which discusses the impact of the
  unaligned access, the conclusion being that it's faster than any
  alternative.

[1] https://lkml.org/lkml/2012/3/2/452

John Keeping (2):
  diffcore-break: don't divide by zero
  bisect: avoid signed integer overflow

 bisect.c         | 2 +-
 diffcore-break.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

-- 
1.8.2.540.gf023cfe

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/2] diffcore-break: don't divide by zero
  2013-04-02 19:50 [PATCH 0/2] Fixes for undefined behaviour John Keeping
@ 2013-04-02 19:50 ` John Keeping
  2013-04-02 21:15   ` Junio C Hamano
  2013-04-02 19:50 ` [PATCH 2/2] bisect: avoid signed integer overflow John Keeping
  1 sibling, 1 reply; 9+ messages in thread
From: John Keeping @ 2013-04-02 19:50 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, John Keeping

When the source file is empty, the calculation of the merge score
results in a division by zero.  Since the merge score is initialized to
zero, it makes sense to just leave it as it is if the source size is
zero.  This means that we still use the extent of damage metric to
decide whether to break the filepair.

Signed-off-by: John Keeping <john@keeping.me.uk>
---
 diffcore-break.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/diffcore-break.c b/diffcore-break.c
index 44f8678..37d8d05 100644
--- a/diffcore-break.c
+++ b/diffcore-break.c
@@ -90,7 +90,8 @@ static int should_break(struct diff_filespec *src,
 	 * merge the surviving pair together if the score is
 	 * less than the minimum, after rename/copy runs.
 	 */
-	*merge_score_p = (int)(src_removed * MAX_SCORE / src->size);
+	if (src->size)
+		*merge_score_p = (int)(src_removed * MAX_SCORE / src->size);
 	if (*merge_score_p > break_score)
 		return 1;
 
-- 
1.8.2.540.gf023cfe

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/2] bisect: avoid signed integer overflow
  2013-04-02 19:50 [PATCH 0/2] Fixes for undefined behaviour John Keeping
  2013-04-02 19:50 ` [PATCH 1/2] diffcore-break: don't divide by zero John Keeping
@ 2013-04-02 19:50 ` John Keeping
  2013-04-02 21:09   ` Junio C Hamano
  1 sibling, 1 reply; 9+ messages in thread
From: John Keeping @ 2013-04-02 19:50 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, John Keeping

Signed-off-by: John Keeping <john@keeping.me.uk>
---
 bisect.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/bisect.c b/bisect.c
index bd1b7b5..0d33c6f 100644
--- a/bisect.c
+++ b/bisect.c
@@ -526,7 +526,7 @@ struct commit_list *filter_skipped(struct commit_list *list,
  * for this application.
  */
 static int get_prn(int count) {
-	count = count * 1103515245 + 12345;
+	count = ((unsigned) count) * 1103515245 + 12345;
 	return ((unsigned)(count/65536) % PRN_MODULO);
 }
 
-- 
1.8.2.540.gf023cfe

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] bisect: avoid signed integer overflow
  2013-04-02 19:50 ` [PATCH 2/2] bisect: avoid signed integer overflow John Keeping
@ 2013-04-02 21:09   ` Junio C Hamano
  2013-04-03 19:17     ` John Keeping
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2013-04-02 21:09 UTC (permalink / raw)
  To: John Keeping; +Cc: git

John Keeping <john@keeping.me.uk> writes:

> Signed-off-by: John Keeping <john@keeping.me.uk>
> ---
>  bisect.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/bisect.c b/bisect.c
> index bd1b7b5..0d33c6f 100644
> --- a/bisect.c
> +++ b/bisect.c
> @@ -526,7 +526,7 @@ struct commit_list *filter_skipped(struct commit_list *list,
>   * for this application.
>   */
>  static int get_prn(int count) {
> -	count = count * 1103515245 + 12345;
> +	count = ((unsigned) count) * 1103515245 + 12345;
>  	return ((unsigned)(count/65536) % PRN_MODULO);

I wonder

	static int get_prn(unsigned);

or even

	static unsigned get_prn(unsigned);

to lose the existing cast may be a better alternative.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] diffcore-break: don't divide by zero
  2013-04-02 19:50 ` [PATCH 1/2] diffcore-break: don't divide by zero John Keeping
@ 2013-04-02 21:15   ` Junio C Hamano
  2013-04-02 21:36     ` John Keeping
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2013-04-02 21:15 UTC (permalink / raw)
  To: John Keeping; +Cc: git

John Keeping <john@keeping.me.uk> writes:

> When the source file is empty, the calculation of the merge score
> results in a division by zero.  Since the merge score is initialized to
> zero, it makes sense to just leave it as it is if the source size is
> zero.  This means that we still use the extent of damage metric to
> decide whether to break the filepair.

Well spotted.  An empty blob to another blob that is larger than 400
bytes will trigger div0, and it makes sense to leave merge-score to
0 (i.e. do not show as whole-delete-then-whole-add after rename
detection is done and the broken filepair is merged back).

Actually, if src->size is 0, we probably shouldn't break the filepair
in the first place.  That is, if your preimage and postimage looked
like these:


     == preimage ==		== postimage ==

     F (empty file)		F (a large file)
			        E (a new empty file)

do we want to see F renamed to E and then a new file created as F
while running "git diff -B -M"?  I doubt it.

So in that sense, this might be a better solution.  I dunno.

 diffcore-break.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/diffcore-break.c b/diffcore-break.c
index 44f8678..eabafd5 100644
--- a/diffcore-break.c
+++ b/diffcore-break.c
@@ -67,6 +67,8 @@ static int should_break(struct diff_filespec *src,
 	max_size = ((src->size > dst->size) ? src->size : dst->size);
 	if (max_size < MINIMUM_BREAK_SIZE)
 		return 0; /* we do not break too small filepair */
+	if (src->size == 0)
+		return 0; /* we do not let empty files get renamed */
 
 	if (diffcore_count_changes(src, dst,
 				   &src->cnt_data, &dst->cnt_data,

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] diffcore-break: don't divide by zero
  2013-04-02 21:15   ` Junio C Hamano
@ 2013-04-02 21:36     ` John Keeping
  2013-04-02 22:41       ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: John Keeping @ 2013-04-02 21:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Tue, Apr 02, 2013 at 02:15:17PM -0700, Junio C Hamano wrote:
> John Keeping <john@keeping.me.uk> writes:
> 
> > When the source file is empty, the calculation of the merge score
> > results in a division by zero.  Since the merge score is initialized to
> > zero, it makes sense to just leave it as it is if the source size is
> > zero.  This means that we still use the extent of damage metric to
> > decide whether to break the filepair.
> 
> Well spotted.  An empty blob to another blob that is larger than 400
> bytes will trigger div0, and it makes sense to leave merge-score to
> 0 (i.e. do not show as whole-delete-then-whole-add after rename
> detection is done and the broken filepair is merged back).
> 
> Actually, if src->size is 0, we probably shouldn't break the filepair
> in the first place.  That is, if your preimage and postimage looked
> like these:
> 
> 
>      == preimage ==		== postimage ==
> 
>      F (empty file)		F (a large file)
> 			        E (a new empty file)
> 
> do we want to see F renamed to E and then a new file created as F
> while running "git diff -B -M"?  I doubt it.
> 
> So in that sense, this might be a better solution.  I dunno.

I considered this, but didn't quite understand the code well enough to
be sure that was the right thing to do.  Does it make sense to bail out
early if dst->size is zero as well?

The message for commit 6dd4b66 (Fix diffcore-break total breakage)
indicates that "don't bother to break small files" is wrong in some
cases, but it I wonder if "don't bother to break empty files" is okay.

>  diffcore-break.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/diffcore-break.c b/diffcore-break.c
> index 44f8678..eabafd5 100644
> --- a/diffcore-break.c
> +++ b/diffcore-break.c
> @@ -67,6 +67,8 @@ static int should_break(struct diff_filespec *src,
>  	max_size = ((src->size > dst->size) ? src->size : dst->size);
>  	if (max_size < MINIMUM_BREAK_SIZE)
>  		return 0; /* we do not break too small filepair */
> +	if (src->size == 0)
> +		return 0; /* we do not let empty files get renamed */
>  
>  	if (diffcore_count_changes(src, dst,
>  				   &src->cnt_data, &dst->cnt_data,

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] diffcore-break: don't divide by zero
  2013-04-02 21:36     ` John Keeping
@ 2013-04-02 22:41       ` Junio C Hamano
  2013-04-03 19:24         ` [PATCH v2] " John Keeping
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2013-04-02 22:41 UTC (permalink / raw)
  To: John Keeping; +Cc: git

John Keeping <john@keeping.me.uk> writes:

> The message for commit 6dd4b66 (Fix diffcore-break total breakage)
> indicates that "don't bother to break small files" is wrong in some
> cases, but it I wonder if "don't bother to break empty files" is okay.

This has a rather subtle ramifications, and we would need to think
carefully.  "break" does two very different things, and the criteria
you would want to use to decide to "break" a filepair depends on
which one you are interested in.

The very original motivation of "break" was to show a patch that
rewrites an existing file completely in a way different from the
usual "diff -u" output, which will try to minimize the output by
finding (rare) lines that happen to be the same between the preimage
and postimage, intersparsed in many lines of "-" (deletion) and "+"
(addition).  Such a change is often easier to understand when shown
as a single block of "-" (deletion) of the entire original contents,
followed by a single block of "+" (addition) of the entire new
contents.

A totally separate motivation of "break" is the one Linus talks
about in the log message of the said commit.  A path filename.h was
moved to filename_32.h, and a new (and much smaller) filename.h was
introduced, that "#include"s filename_32.h.  "diff -M" that pairs
deleted files with created files to compute renames in such a case
would not consider the original filename.h as a possible source of
filename_32.h that was created.  You want to break modification of
filename.h into (virtual) deletion and addition of filename.h.

For the purpose of the former, you would want not to break a file
too aggressively.  If you started from a file with 1000 lines in it,
and deleted 910 lines and added 10 lines to result in a file with
100 lines, you still have 90 lines of shared contents between the
preimage and the postimage, and you do not want to show it as
"delete 1000 lines and add 100 lines".  You would want to base your
decision on how much common material exists between the two.

For the purpose of the latter, however, it is better to err on the
side of breaking than not breaking.  After breaking a suspicious
modification into addition and deletion, if rename comparison does
not find a suitable destination for the virtual deletion, the broken
halves will be merged back, so breaking too little can hurt by
missing potential renames, but breaking too much will not.  You do
want to break the "900 deleted out of 1000 and then added 10" case,
as that 900 lines may have gone to another file that was created in
the same commit.  "But we have 90 lines of common material" does not
matter for the decision to break.

In today's code, the return value of should_break() is about the
latter, while the score it stores in *merge_score is about the
former.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] bisect: avoid signed integer overflow
  2013-04-02 21:09   ` Junio C Hamano
@ 2013-04-03 19:17     ` John Keeping
  0 siblings, 0 replies; 9+ messages in thread
From: John Keeping @ 2013-04-03 19:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git


Signed-off-by: John Keeping <john@keeping.me.uk>
---
Changes since v1:
- Change the function signature instead of casting a value in the
  function.
- This lets us remove an existing cast.

 bisect.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/bisect.c b/bisect.c
index bd1b7b5..374d9e2 100644
--- a/bisect.c
+++ b/bisect.c
@@ -525,9 +525,9 @@ struct commit_list *filter_skipped(struct commit_list *list,
  * is increased by one between each call, but that should not matter
  * for this application.
  */
-static int get_prn(int count) {
+static unsigned get_prn(unsigned count) {
 	count = count * 1103515245 + 12345;
-	return ((unsigned)(count/65536) % PRN_MODULO);
+	return (count/65536) % PRN_MODULO;
 }
 
 /*
-- 
1.8.2.540.gf023cfe

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2] diffcore-break: don't divide by zero
  2013-04-02 22:41       ` Junio C Hamano
@ 2013-04-03 19:24         ` John Keeping
  0 siblings, 0 replies; 9+ messages in thread
From: John Keeping @ 2013-04-03 19:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

When the source file is empty, the calculation of the merge score
results in a division by zero.  In the situation:

     == preimage ==             == postimage ==

     F (empty file)             F (a large file)
                                E (a new empty file)

it does not make sense to consider F->E as a rename, so it is better not
to break the pre- and post-image of F.

Bail out early in this case to avoid hitting the divide-by-zero.  This
causes the merge score to be left at zero.

Signed-off-by: John Keeping <john@keeping.me.uk>
---
On Tue, Apr 02, 2013 at 03:41:10PM -0700, Junio C Hamano wrote:
> John Keeping <john@keeping.me.uk> writes:
> 
> > The message for commit 6dd4b66 (Fix diffcore-break total breakage)
> > indicates that "don't bother to break small files" is wrong in some
> > cases, but it I wonder if "don't bother to break empty files" is okay.
> 
> This has a rather subtle ramifications, and we would need to think
> carefully.  "break" does two very different things, and the criteria
> you would want to use to decide to "break" a filepair depends on
> which one you are interested in.
> 
> The very original motivation of "break" was to show a patch that
> rewrites an existing file completely in a way different from the
> usual "diff -u" output, which will try to minimize the output by
> finding (rare) lines that happen to be the same between the preimage
> and postimage, intersparsed in many lines of "-" (deletion) and "+"
> (addition).  Such a change is often easier to understand when shown
> as a single block of "-" (deletion) of the entire original contents,
> followed by a single block of "+" (addition) of the entire new
> contents.
> 
> A totally separate motivation of "break" is the one Linus talks
> about in the log message of the said commit.  A path filename.h was
> moved to filename_32.h, and a new (and much smaller) filename.h was
> introduced, that "#include"s filename_32.h.  "diff -M" that pairs
> deleted files with created files to compute renames in such a case
> would not consider the original filename.h as a possible source of
> filename_32.h that was created.  You want to break modification of
> filename.h into (virtual) deletion and addition of filename.h.
> 
> For the purpose of the former, you would want not to break a file
> too aggressively.  If you started from a file with 1000 lines in it,
> and deleted 910 lines and added 10 lines to result in a file with
> 100 lines, you still have 90 lines of shared contents between the
> preimage and the postimage, and you do not want to show it as
> "delete 1000 lines and add 100 lines".  You would want to base your
> decision on how much common material exists between the two.
> 
> For the purpose of the latter, however, it is better to err on the
> side of breaking than not breaking.  After breaking a suspicious
> modification into addition and deletion, if rename comparison does
> not find a suitable destination for the virtual deletion, the broken
> halves will be merged back, so breaking too little can hurt by
> missing potential renames, but breaking too much will not.  You do
> want to break the "900 deleted out of 1000 and then added 10" case,
> as that 900 lines may have gone to another file that was created in
> the same commit.  "But we have 90 lines of common material" does not
> matter for the decision to break.
> 
> In today's code, the return value of should_break() is about the
> latter, while the score it stores in *merge_score is about the
> former.

Thanks for this explanation.  Following it through, it seems that
bailing out early when the destination is empty will do the wrong thing,
whereas letting the code carry on should result in a merge score of
MAX_SCORE and the function returning 1.

So it seems that the patch you proposed is the best way to handle this.

 diffcore-break.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/diffcore-break.c b/diffcore-break.c
index 44f8678..1d9e530 100644
--- a/diffcore-break.c
+++ b/diffcore-break.c
@@ -68,6 +68,9 @@ static int should_break(struct diff_filespec *src,
 	if (max_size < MINIMUM_BREAK_SIZE)
 		return 0; /* we do not break too small filepair */
 
+	if (!src->size)
+		return 0; /* we do not let empty files get renamed */
+
 	if (diffcore_count_changes(src, dst,
 				   &src->cnt_data, &dst->cnt_data,
 				   0,
-- 
1.8.2.540.gf023cfe

^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2013-04-03 19:24 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-04-02 19:50 [PATCH 0/2] Fixes for undefined behaviour John Keeping
2013-04-02 19:50 ` [PATCH 1/2] diffcore-break: don't divide by zero John Keeping
2013-04-02 21:15   ` Junio C Hamano
2013-04-02 21:36     ` John Keeping
2013-04-02 22:41       ` Junio C Hamano
2013-04-03 19:24         ` [PATCH v2] " John Keeping
2013-04-02 19:50 ` [PATCH 2/2] bisect: avoid signed integer overflow John Keeping
2013-04-02 21:09   ` Junio C Hamano
2013-04-03 19:17     ` John Keeping

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).