git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] test-tool.h: include git-compat-util.h
@ 2018-08-21 18:41 Jeff King
  2018-08-21 19:03 ` Junio C Hamano
                   ` (6 more replies)
  0 siblings, 7 replies; 55+ messages in thread
From: Jeff King @ 2018-08-21 18:41 UTC (permalink / raw)
  To: git

The test-tool programs include "test-tool.h" as their first
include, which breaks our CodingGuideline of "the first
include must be git-compat-util.h or an equivalent".

Rather than change them all, let's instead make test-tool.h
one of those equivalents, just like we do for builtin.h
(which many of the actual git builtins include first).

Signed-off-by: Jeff King <peff@peff.net>
---
Repost, as the original was in a larger thread about includes and didn't
get any comment.

 t/helper/test-tool.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index e926c416ea..e954e8c522 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -1,6 +1,8 @@
 #ifndef __TEST_TOOL_H__
 #define __TEST_TOOL_H__
 
+#include "git-compat-util.h"
+
 int cmd__chmtime(int argc, const char **argv);
 int cmd__config(int argc, const char **argv);
 int cmd__ctype(int argc, const char **argv);
-- 
2.19.0.rc0.398.g138a08f6f6

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] test-tool.h: include git-compat-util.h
  2018-08-21 18:41 [PATCH] test-tool.h: include git-compat-util.h Jeff King
@ 2018-08-21 19:03 ` Junio C Hamano
  2018-08-21 19:06 ` [PATCH 1/6] t/perf: factor boilerplate out of test_perf Jeff King
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 55+ messages in thread
From: Junio C Hamano @ 2018-08-21 19:03 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> The test-tool programs include "test-tool.h" as their first
> include, which breaks our CodingGuideline of "the first
> include must be git-compat-util.h or an equivalent".
>
> Rather than change them all, let's instead make test-tool.h
> one of those equivalents, just like we do for builtin.h
> (which many of the actual git builtins include first).
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> Repost, as the original was in a larger thread about includes and didn't
> get any comment.

Thanks.  Will queue.

>
>  t/helper/test-tool.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
> index e926c416ea..e954e8c522 100644
> --- a/t/helper/test-tool.h
> +++ b/t/helper/test-tool.h
> @@ -1,6 +1,8 @@
>  #ifndef __TEST_TOOL_H__
>  #define __TEST_TOOL_H__
>  
> +#include "git-compat-util.h"
> +
>  int cmd__chmtime(int argc, const char **argv);
>  int cmd__config(int argc, const char **argv);
>  int cmd__ctype(int argc, const char **argv);

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/6] t/perf: factor boilerplate out of test_perf
  2018-08-21 18:41 [PATCH] test-tool.h: include git-compat-util.h Jeff King
  2018-08-21 19:03 ` Junio C Hamano
@ 2018-08-21 19:06 ` Jeff King
  2018-08-21 19:06 ` [PATCH 2/6] t/perf: factor out percent calculations Jeff King
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-21 19:06 UTC (permalink / raw)
  To: git

About half of test_perf() is boilerplate preparing to run
_any_ test, and the other half is specifically running a
timing test. Let's split it into two functions, so that we
can reuse the boilerplate in future commits.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/perf/perf-lib.sh | 61 ++++++++++++++++++++++++++--------------------
 1 file changed, 35 insertions(+), 26 deletions(-)

diff --git a/t/perf/perf-lib.sh b/t/perf/perf-lib.sh
index e4c343a6b7..a54be09516 100644
--- a/t/perf/perf-lib.sh
+++ b/t/perf/perf-lib.sh
@@ -179,8 +179,8 @@ exit $ret' >&3 2>&4
 	return "$eval_ret"
 }
 
-
-test_perf () {
+test_wrapper_ () {
+	test_wrapper_func_=$1; shift
 	test_start_
 	test "$#" = 3 && { test_prereq=$1; shift; } || test_prereq=
 	test "$#" = 2 ||
@@ -191,35 +191,44 @@ test_perf () {
 		base=$(basename "$0" .sh)
 		echo "$test_count" >>"$perf_results_dir"/$base.subtests
 		echo "$1" >"$perf_results_dir"/$base.$test_count.descr
-		if test -z "$verbose"; then
-			printf "%s" "perf $test_count - $1:"
-		else
-			echo "perf $test_count - $1:"
-		fi
-		for i in $(test_seq 1 $GIT_PERF_REPEAT_COUNT); do
-			say >&3 "running: $2"
-			if test_run_perf_ "$2"
-			then
-				if test -z "$verbose"; then
-					printf " %s" "$i"
-				else
-					echo "* timing run $i/$GIT_PERF_REPEAT_COUNT:"
-				fi
+		base="$perf_results_dir"/"$perf_results_prefix$(basename "$0" .sh)"."$test_count"
+		"$test_wrapper_func_" "$@"
+	fi
+
+	test_finish_
+}
+
+test_perf_ () {
+	if test -z "$verbose"; then
+		printf "%s" "perf $test_count - $1:"
+	else
+		echo "perf $test_count - $1:"
+	fi
+	for i in $(test_seq 1 $GIT_PERF_REPEAT_COUNT); do
+		say >&3 "running: $2"
+		if test_run_perf_ "$2"
+		then
+			if test -z "$verbose"; then
+				printf " %s" "$i"
 			else
-				test -z "$verbose" && echo
-				test_failure_ "$@"
-				break
+				echo "* timing run $i/$GIT_PERF_REPEAT_COUNT:"
 			fi
-		done
-		if test -z "$verbose"; then
-			echo " ok"
 		else
-			test_ok_ "$1"
+			test -z "$verbose" && echo
+			test_failure_ "$@"
+			break
 		fi
-		base="$perf_results_dir"/"$perf_results_prefix$(basename "$0" .sh)"."$test_count"
-		"$TEST_DIRECTORY"/perf/min_time.perl test_time.* >"$base".times
+	done
+	if test -z "$verbose"; then
+		echo " ok"
+	else
+		test_ok_ "$1"
 	fi
-	test_finish_
+	"$TEST_DIRECTORY"/perf/min_time.perl test_time.* >"$base".times
+}
+
+test_perf () {
+	test_wrapper_ test_perf_ "$@"
 }
 
 # We extend test_done to print timings at the end (./run disables this
-- 
2.19.0.rc0.398.g138a08f6f6


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 2/6] t/perf: factor out percent calculations
  2018-08-21 18:41 [PATCH] test-tool.h: include git-compat-util.h Jeff King
  2018-08-21 19:03 ` Junio C Hamano
  2018-08-21 19:06 ` [PATCH 1/6] t/perf: factor boilerplate out of test_perf Jeff King
@ 2018-08-21 19:06 ` Jeff King
  2018-08-21 19:06 ` [PATCH 3/6] t/perf: add infrastructure for measuring sizes Jeff King
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-21 19:06 UTC (permalink / raw)
  To: git

This will let us reuse the code when we add new values to
aggregate besides times.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/perf/aggregate.perl | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/t/perf/aggregate.perl b/t/perf/aggregate.perl
index bc865160e7..3181b087ab 100755
--- a/t/perf/aggregate.perl
+++ b/t/perf/aggregate.perl
@@ -19,21 +19,24 @@ sub get_times {
 	return ($rt, $4, $5);
 }
 
+sub relative_change {
+	my ($r, $firstr) = @_;
+	if ($firstr > 0) {
+		return sprintf "%+.1f%%", 100.0*($r-$firstr)/$firstr;
+	} elsif ($r == 0) {
+		return "=";
+	} else {
+		return "+inf";
+	}
+}
+
 sub format_times {
 	my ($r, $u, $s, $firstr) = @_;
 	if (!defined $r) {
 		return "<missing>";
 	}
 	my $out = sprintf "%.2f(%.2f+%.2f)", $r, $u, $s;
-	if (defined $firstr) {
-		if ($firstr > 0) {
-			$out .= sprintf " %+.1f%%", 100.0*($r-$firstr)/$firstr;
-		} elsif ($r == 0) {
-			$out .= " =";
-		} else {
-			$out .= " +inf";
-		}
-	}
+	$out .= ' ' . relative_change($r, $firstr) if defined $firstr;
 	return $out;
 }
 
-- 
2.19.0.rc0.398.g138a08f6f6


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 3/6] t/perf: add infrastructure for measuring sizes
  2018-08-21 18:41 [PATCH] test-tool.h: include git-compat-util.h Jeff King
                   ` (2 preceding siblings ...)
  2018-08-21 19:06 ` [PATCH 2/6] t/perf: factor out percent calculations Jeff King
@ 2018-08-21 19:06 ` Jeff King
  2018-08-22 13:40   ` Derrick Stolee
  2018-08-21 19:06 ` [PATCH 4/6] t/perf: add perf tests for fetches from a bitmapped server Jeff King
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 55+ messages in thread
From: Jeff King @ 2018-08-21 19:06 UTC (permalink / raw)
  To: git

The main objective of scripts in the perf framework is to
run "test_perf", which measures the time it takes to run
some operation. However, it can also be interesting to see
the change in the output size of certain operations.

This patch introduces test_size, which records a single
numeric output from the test and shows it in the aggregated
output (with pretty printing and relative size comparison).

Signed-off-by: Jeff King <peff@peff.net>
---
 t/perf/README         | 25 ++++++++++++++++++++++
 t/perf/aggregate.perl | 48 ++++++++++++++++++++++++++++++++++++++-----
 t/perf/perf-lib.sh    | 13 ++++++++++++
 3 files changed, 81 insertions(+), 5 deletions(-)

diff --git a/t/perf/README b/t/perf/README
index 21321a0f36..be12090c38 100644
--- a/t/perf/README
+++ b/t/perf/README
@@ -168,3 +168,28 @@ that
   While we have tried to make sure that it can cope with embedded
   whitespace and other special characters, it will not work with
   multi-line data.
+
+Rather than tracking the performance by run-time as `test_perf` does, you
+may also track output size by using `test_size`. The stdout of the
+function should be a single numeric value, which will be captured and
+shown in the aggregated output. For example:
+
+	test_perf 'time foo' '
+		./foo >foo.out
+	'
+
+	test_size 'output size'
+		wc -c <foo.out
+	'
+
+might produce output like:
+
+	Test                origin           HEAD
+	-------------------------------------------------------------
+	1234.1 time foo     0.37(0.79+0.02)  0.26(0.51+0.02) -29.7%
+	1234.2 output size             4.3M             3.6M -14.7%
+
+The item being measured (and its units) is up to the test; the context
+and the test title should make it clear to the user whether bigger or
+smaller numbers are better. Unlike test_perf, the test code will only be
+run once, since output sizes tend to be more deterministic than timings.
diff --git a/t/perf/aggregate.perl b/t/perf/aggregate.perl
index 3181b087ab..494907a892 100755
--- a/t/perf/aggregate.perl
+++ b/t/perf/aggregate.perl
@@ -13,10 +13,16 @@ sub get_times {
 	my $line = <$fh>;
 	return undef if not defined $line;
 	close $fh or die "cannot close $name: $!";
-	$line =~ /^(?:(\d+):)?(\d+):(\d+(?:\.\d+)?) (\d+(?:\.\d+)?) (\d+(?:\.\d+)?)$/
-		or die "bad input line: $line";
-	my $rt = ((defined $1 ? $1 : 0.0)*60+$2)*60+$3;
-	return ($rt, $4, $5);
+	# times
+	if ($line =~ /^(?:(\d+):)?(\d+):(\d+(?:\.\d+)?) (\d+(?:\.\d+)?) (\d+(?:\.\d+)?)$/) {
+		my $rt = ((defined $1 ? $1 : 0.0)*60+$2)*60+$3;
+		return ($rt, $4, $5);
+	# size
+	} elsif ($line =~ /^\d+$/) {
+		return $&;
+	} else {
+		die "bad input line: $line";
+	}
 }
 
 sub relative_change {
@@ -32,9 +38,15 @@ sub relative_change {
 
 sub format_times {
 	my ($r, $u, $s, $firstr) = @_;
+	# no value means we did not finish the test
 	if (!defined $r) {
 		return "<missing>";
 	}
+	# a single value means we have a size, not times
+	if (!defined $u) {
+		return format_size($r, $firstr);
+	}
+	# otherwise, we have real/user/system times
 	my $out = sprintf "%.2f(%.2f+%.2f)", $r, $u, $s;
 	$out .= ' ' . relative_change($r, $firstr) if defined $firstr;
 	return $out;
@@ -54,6 +66,25 @@ sub usage {
 	exit(1);
 }
 
+sub human_size {
+	my $n = shift;
+	my @units = ('', qw(K M G));
+	while ($n > 900 && @units > 1) {
+		$n /= 1000;
+		shift @units;
+	}
+	return $n unless length $units[0];
+	return sprintf '%.1f%s', $n, $units[0];
+}
+
+sub format_size {
+	my ($size, $first) = @_;
+	# match the width of a time: 0.00(0.00+0.00)
+	my $out = sprintf '%15s', human_size($size);
+	$out .= ' ' . relative_change($size, $first) if defined $first;
+	return $out;
+}
+
 my (@dirs, %dirnames, %dirabbrevs, %prefixes, @tests,
     $codespeed, $sortby, $subsection, $reponame);
 
@@ -184,7 +215,14 @@ sub print_default_results {
 		my $firstr;
 		for my $i (0..$#dirs) {
 			my $d = $dirs[$i];
-			$times{$prefixes{$d}.$t} = [get_times("$resultsdir/$prefixes{$d}$t.times")];
+			my $base = "$resultsdir/$prefixes{$d}$t";
+			$times{$prefixes{$d}.$t} = [];
+			foreach my $type (qw(times size)) {
+				if (-e "$base.$type") {
+					$times{$prefixes{$d}.$t} = [get_times("$base.$type")];
+					last;
+				}
+			}
 			my ($r,$u,$s) = @{$times{$prefixes{$d}.$t}};
 			my $w = length format_times($r,$u,$s,$firstr);
 			$colwidth[$i] = $w if $w > $colwidth[$i];
diff --git a/t/perf/perf-lib.sh b/t/perf/perf-lib.sh
index a54be09516..11d1922cf5 100644
--- a/t/perf/perf-lib.sh
+++ b/t/perf/perf-lib.sh
@@ -231,6 +231,19 @@ test_perf () {
 	test_wrapper_ test_perf_ "$@"
 }
 
+test_size_ () {
+	say >&3 "running: $2"
+	if test_eval_ "$2" 3>"$base".size; then
+		test_ok_ "$1"
+	else
+		test_failure_ "$@"
+	fi
+}
+
+test_size () {
+	test_wrapper_ test_size_ "$@"
+}
+
 # We extend test_done to print timings at the end (./run disables this
 # and does it after running everything)
 test_at_end_hook_ () {
-- 
2.19.0.rc0.398.g138a08f6f6


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 4/6] t/perf: add perf tests for fetches from a bitmapped server
  2018-08-21 18:41 [PATCH] test-tool.h: include git-compat-util.h Jeff King
                   ` (3 preceding siblings ...)
  2018-08-21 19:06 ` [PATCH 3/6] t/perf: add infrastructure for measuring sizes Jeff King
@ 2018-08-21 19:06 ` Jeff King
  2018-08-21 19:07 ` [PATCH 5/6] pack-bitmap: save "have" bitmap from walk Jeff King
  2018-08-21 19:07 ` [PATCH 6/6] pack-objects: reuse on-disk deltas for thin "have" objects Jeff King
  6 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-21 19:06 UTC (permalink / raw)
  To: git

A server with bitmapped packs can serve a clone very
quickly. However, fetches are not necessarily made any
faster, because we spend a lot less time in object traversal
(which is what bitmaps help with) and more time finding
deltas (because we may have to throw out on-disk deltas if
the client does not have the base).

As a first step to making this faster, this patch introduces
a new perf script to measure fetches into a repo of various
ages from a fully-bitmapped server.

We separately measure the work done by the server (in
pack-objects) and that done by the client (in index-pack).
Furthermore, we measure the size of the resulting pack.

Breaking it down like this (instead of just doing a regular
"git fetch") lets us see how much each side benefits from
any changes. And since we know the pack size, if we estimate
the network speed, then one could calculate a complete
wall-clock time for the operation (though the script does
not do this automatically).

Signed-off-by: Jeff King <peff@peff.net>
---
 t/perf/p5311-pack-bitmaps-fetch.sh | 45 ++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)
 create mode 100755 t/perf/p5311-pack-bitmaps-fetch.sh

diff --git a/t/perf/p5311-pack-bitmaps-fetch.sh b/t/perf/p5311-pack-bitmaps-fetch.sh
new file mode 100755
index 0000000000..b04575951f
--- /dev/null
+++ b/t/perf/p5311-pack-bitmaps-fetch.sh
@@ -0,0 +1,45 @@
+#!/bin/sh
+
+test_description='performance of fetches from bitmapped packs'
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+test_expect_success 'create bitmapped server repo' '
+	git config pack.writebitmaps true &&
+	git config pack.writebitmaphashcache true &&
+	git repack -ad
+'
+
+# simulate a fetch from a repository that last fetched N days ago, for
+# various values of N. We do so by following the first-parent chain,
+# and assume the first entry in the chain that is N days older than the current
+# HEAD is where the HEAD would have been then.
+for days in 1 2 4 8 16 32 64 128; do
+	title=$(printf '%10s' "($days days)")
+	test_expect_success "setup revs from $days days ago" '
+		now=$(git log -1 --format=%ct HEAD) &&
+		then=$(($now - ($days * 86400))) &&
+		tip=$(git rev-list -1 --first-parent --until=$then HEAD) &&
+		{
+			echo HEAD &&
+			echo ^$tip
+		} >revs
+	'
+
+	test_perf "server $title" '
+		git pack-objects --stdout --revs \
+				 --thin --delta-base-offset \
+				 <revs >tmp.pack
+	'
+
+	test_size "size   $title" '
+		wc -c <tmp.pack
+	'
+
+	test_perf "client $title" '
+		git index-pack --stdin --fix-thin <tmp.pack
+	'
+done
+
+test_done
-- 
2.19.0.rc0.398.g138a08f6f6


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 5/6] pack-bitmap: save "have" bitmap from walk
  2018-08-21 18:41 [PATCH] test-tool.h: include git-compat-util.h Jeff King
                   ` (4 preceding siblings ...)
  2018-08-21 19:06 ` [PATCH 4/6] t/perf: add perf tests for fetches from a bitmapped server Jeff King
@ 2018-08-21 19:07 ` Jeff King
  2018-08-21 19:47   ` Derrick Stolee
  2018-08-31 15:23   ` Ævar Arnfjörð Bjarmason
  2018-08-21 19:07 ` [PATCH 6/6] pack-objects: reuse on-disk deltas for thin "have" objects Jeff King
  6 siblings, 2 replies; 55+ messages in thread
From: Jeff King @ 2018-08-21 19:07 UTC (permalink / raw)
  To: git

When we do a bitmap walk, we save the result, which
represents (WANTs & ~HAVEs); i.e., every object we care
about visiting in our walk. However, we throw away the
haves bitmap, which can sometimes be useful, too. Save it
and provide an access function so code which has performed a
walk can query it.

A few notes on the accessor interface:

 - the bitmap code calls these "haves" because it grew out
   of the want/have negotiation for fetches. But really,
   these are simply the objects that would be flagged
   UNINTERESTING in a regular traversal. Let's use that
   more universal nomenclature for the external module
   interface. We may want to change the internal naming
   inside the bitmap code, but that's outside the scope of
   this patch.

 - it still uses a bare "sha1" rather than "oid". That's
   true of all of the bitmap code. And in this particular
   instance, our caller in pack-objects is dealing with the
   bare sha1 that comes from a packed REF_DELTA (we're
   pointing directly to the mmap'd pack on disk). That's
   something we'll have to deal with as we transition to a
   new hash, but we can wait and see how the caller ends up
   being fixed and adjust this interface accordingly.

Signed-off-by: Jeff King <peff@peff.net>
---
 pack-bitmap.c | 25 ++++++++++++++++++++++++-
 pack-bitmap.h |  7 +++++++
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index f0a1937a1c..c3231ef9ef 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -86,6 +86,9 @@ struct bitmap_index {
 	/* Bitmap result of the last performed walk */
 	struct bitmap *result;
 
+	/* "have" bitmap from the last performed walk */
+	struct bitmap *haves;
+
 	/* Version of the bitmap index */
 	unsigned int version;
 
@@ -759,8 +762,8 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
 		bitmap_and_not(wants_bitmap, haves_bitmap);
 
 	bitmap_git->result = wants_bitmap;
+	bitmap_git->haves = haves_bitmap;
 
-	bitmap_free(haves_bitmap);
 	return bitmap_git;
 
 cleanup:
@@ -1114,5 +1117,25 @@ void free_bitmap_index(struct bitmap_index *b)
 	free(b->ext_index.objects);
 	free(b->ext_index.hashes);
 	bitmap_free(b->result);
+	bitmap_free(b->haves);
 	free(b);
 }
+
+int bitmap_has_sha1_in_uninteresting(struct bitmap_index *bitmap_git,
+				     const unsigned char *sha1)
+{
+	int pos;
+
+	if (!bitmap_git)
+		return 0; /* no bitmap loaded */
+	if (!bitmap_git->result)
+		BUG("failed to perform bitmap walk before querying");
+	if (!bitmap_git->haves)
+		return 0; /* walk had no "haves" */
+
+	pos = bitmap_position_packfile(bitmap_git, sha1);
+	if (pos < 0)
+		return 0;
+
+	return bitmap_get(bitmap_git->haves, pos);
+}
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 8a04741e12..c633bf5238 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -53,6 +53,13 @@ int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping
 			     khash_sha1 *reused_bitmaps, int show_progress);
 void free_bitmap_index(struct bitmap_index *);
 
+/*
+ * After a traversal has been performed on the bitmap_index, this can be
+ * queried to see if a particular object was reachable from any of the
+ * objects flagged as UNINTERESTING.
+ */
+int bitmap_has_sha1_in_uninteresting(struct bitmap_index *, const unsigned char *sha1);
+
 void bitmap_writer_show_progress(int show);
 void bitmap_writer_set_checksum(unsigned char *sha1);
 void bitmap_writer_build_type_index(struct packing_data *to_pack,
-- 
2.19.0.rc0.398.g138a08f6f6


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 6/6] pack-objects: reuse on-disk deltas for thin "have" objects
  2018-08-21 18:41 [PATCH] test-tool.h: include git-compat-util.h Jeff King
                   ` (5 preceding siblings ...)
  2018-08-21 19:07 ` [PATCH 5/6] pack-bitmap: save "have" bitmap from walk Jeff King
@ 2018-08-21 19:07 ` Jeff King
  2018-08-21 19:43   ` Junio C Hamano
  6 siblings, 1 reply; 55+ messages in thread
From: Jeff King @ 2018-08-21 19:07 UTC (permalink / raw)
  To: git

When we serve a fetch, we pass the "wants" and "haves" from
the fetch negotiation to pack-objects. That tells us not
only which objects we need to send, but we also use the
boundary commits as "preferred bases": their trees and blobs
are candidates for delta bases, both for reusing on-disk
deltas and for finding new ones.

However, this misses some opportunities. Modulo some special
cases like shallow or partial clones, we know that every
object reachable from the "haves" could be a preferred base.
We don't use all of them for two reasons:

  1. It's expensive to traverse the whole history and
     enumerate all of the objects the other side has.

  2. The delta search is expensive, so we want to keep the
     number of candidate bases sane. The boundary commits
     are the most likely to work.

When we have reachability bitmaps, though, reason 1 no
longer applies. We can efficiently compute the set of
reachable objects on the other side (and in fact already did
so as part of the bitmap set-difference to get the list of
interesting objects). And using this set conveniently
covers the shallow and partial cases, since we have to
disable the use of bitmaps for those anyway.

The second reason argues against using these bases in the
search for new deltas. But there's one case where we can use
this information for free: when we have an existing on-disk
delta that we're considering reusing, we can do so if we
know the other side has the base object. This in fact saves
time during the delta search, because it's one less delta we
have to compute.

And that's exactly what this patch does: when we're
considering whether to reuse an on-disk delta, if bitmaps
tell us the other side has the object (and we're making a
thin-pack), then we reuse it.

Here are the results on p5311 using linux.git, which
simulates a client fetching after `N` days since their last
fetch:

Test                         origin              HEAD
--------------------------------------------------------------------------
5311.3: server   (1 days)    0.27(0.27+0.04)     0.12(0.09+0.03) -55.6%
5311.4: size     (1 days)               0.9M              237.0K -73.7%
5311.5: client   (1 days)    0.04(0.05+0.00)     0.10(0.10+0.00) +150.0%
5311.7: server   (2 days)    0.34(0.42+0.04)     0.13(0.10+0.03) -61.8%
5311.8: size     (2 days)               1.5M              347.7K -76.5%
5311.9: client   (2 days)    0.07(0.08+0.00)     0.16(0.15+0.01) +128.6%
5311.11: server   (4 days)   0.56(0.77+0.08)     0.13(0.10+0.02) -76.8%
5311.12: size     (4 days)              2.8M              566.6K -79.8%
5311.13: client   (4 days)   0.13(0.15+0.00)     0.34(0.31+0.02) +161.5%
5311.15: server   (8 days)   0.97(1.39+0.11)     0.30(0.25+0.05) -69.1%
5311.16: size     (8 days)              4.3M                1.0M -76.0%
5311.17: client   (8 days)   0.20(0.22+0.01)     0.53(0.52+0.01) +165.0%
5311.19: server  (16 days)   1.52(2.51+0.12)     0.30(0.26+0.03) -80.3%
5311.20: size    (16 days)              8.0M                2.0M -74.5%
5311.21: client  (16 days)   0.40(0.47+0.03)     1.01(0.98+0.04) +152.5%
5311.23: server  (32 days)   2.40(4.44+0.20)     0.31(0.26+0.04) -87.1%
5311.24: size    (32 days)             14.1M                4.1M -70.9%
5311.25: client  (32 days)   0.70(0.90+0.03)     1.81(1.75+0.06) +158.6%
5311.27: server  (64 days)   11.76(26.57+0.29)   0.55(0.50+0.08) -95.3%
5311.28: size    (64 days)             89.4M               47.4M -47.0%
5311.29: client  (64 days)   5.71(9.31+0.27)     15.20(15.20+0.32) +166.2%
5311.31: server (128 days)   16.15(36.87+0.40)   0.91(0.82+0.14) -94.4%
5311.32: size   (128 days)            134.8M              100.4M -25.5%
5311.33: client (128 days)   9.42(16.86+0.49)    25.34(25.80+0.46) +169.0%

In all cases we save CPU time on the server (sometimes
significant) and the resulting pack is smaller. We do spend
more CPU time on the client side, because it has to
reconstruct more deltas. But that's the right tradeoff to
make, since clients tend to outnumber servers. It just means
the thin pack mechanism is doing its job.

From the user's perspective, the end-to-end time of the
operation will generally be faster. E.g., in the 128-day
case, we saved 15s on the server at a cost of 16s on the
client. Since the resulting pack is 34MB smaller, this is a
net win if the network speed is less than 270Mbit/s. And
that's actually the worst case. The 64-day case saves just
over 11s at a cost of just under 11s. So it's a slight win
at any network speed, and the 40MB saved is pure bonus. That
trend continues for the smaller fetches.

The implementation itself is mostly straightforward, with
the new logic going into check_object(). But there are two
tricky bits.

The first is that check_object() needs access to the
relevant information (the thin flag and bitmap result). We
can do this by pushing these into program-lifetime globals.

The second is that the rest of the code assumes that any
reused delta will point to another "struct object_entry" as
its base. But of course the case we are interested in here
is the one where don't have such an entry!

I looked at a number of options that didn't quite work:

 - we could use a flag to signal a reused delta, but it's
   not a single bit. We have to actually store the oid of
   the base, which is normally done by pointing to the
   existing object_entry. And we'd have to modify all the
   code which looks at deltas.

 - we could add the reused bases to the end of the existing
   object_entry array. While this does create some extra
   work as later stages consider the extra entries, it's
   actually not too bad (we're not sending them, so they
   don't cost much in the delta search, and at most we'd
   have 2*N of them).

   But there's a more subtle problem. Adding to the existing
   array means we might need to grow it with realloc, which
   could move the earlier entries around. While many of the
   references to other entries are done by integer index,
   some (including ones on the stack) use pointers, which
   would become invalidated.

   This isn't insurmountable, but it would require quite a
   bit of refactoring (and it's hard to know that you've got
   it all, since it may work _most_ of the time and then
   fail subtly based on memory allocation patterns).

 - we could allocate a new one-off entry for the base. In
   fact, this is what an earlier version of this patch did.
   However, since the refactoring brought in by ad635e82d6
   (Merge branch 'nd/pack-objects-pack-struct', 2018-05-23),
   the delta_idx code requires that both entries be in the
   main packing list.

So taking all of those options into account, what I ended up
with is a separate list of "external bases" that are not
part of the main packing list. Each delta entry that points
to an external base has a single-bit flag to do so; we have a
little breathing room in the bitfield section of
object_entry.

This lets us limit the change primarily to the oe_delta()
and oe_set_delta_ext() functions. And as a bonus, most of
the rest of the code does not consider these dummy entries
at all, saving both runtime CPU and code complexity.

Signed-off-by: Jeff King <peff@peff.net>

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/pack-objects.c | 28 +++++++++++++++++++---------
 pack-objects.c         | 19 +++++++++++++++++++
 pack-objects.h         | 20 ++++++++++++++++++--
 3 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 0d80dee2ba..1bd43432f7 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -40,6 +40,7 @@
 #define DELTA_CHILD(obj) oe_delta_child(&to_pack, obj)
 #define DELTA_SIBLING(obj) oe_delta_sibling(&to_pack, obj)
 #define SET_DELTA(obj, val) oe_set_delta(&to_pack, obj, val)
+#define SET_DELTA_EXT(obj, oid) oe_set_delta_ext(&to_pack, obj, oid)
 #define SET_DELTA_SIZE(obj, val) oe_set_delta_size(&to_pack, obj, val)
 #define SET_DELTA_CHILD(obj, val) oe_set_delta_child(&to_pack, obj, val)
 #define SET_DELTA_SIBLING(obj, val) oe_set_delta_sibling(&to_pack, obj, val)
@@ -59,6 +60,7 @@ static struct packing_data to_pack;
 
 static struct pack_idx_entry **written_list;
 static uint32_t nr_result, nr_written, nr_seen;
+static struct bitmap_index *bitmap_git;
 
 static int non_empty;
 static int reuse_delta = 1, reuse_object = 1;
@@ -79,6 +81,7 @@ static unsigned long pack_size_limit;
 static int depth = 50;
 static int delta_search_threads;
 static int pack_to_stdout;
+static int thin;
 static int num_preferred_base;
 static struct progress *progress_state;
 
@@ -1510,11 +1513,15 @@ static void check_object(struct object_entry *entry)
 			break;
 		}
 
-		if (base_ref && (base_entry = packlist_find(&to_pack, base_ref, NULL))) {
+		if (base_ref && (
+		    (base_entry = packlist_find(&to_pack, base_ref, NULL)) ||
+		    (thin &&
+		     bitmap_has_sha1_in_uninteresting(bitmap_git, base_ref)))) {
 			/*
 			 * If base_ref was set above that means we wish to
-			 * reuse delta data, and we even found that base
-			 * in the list of objects we want to pack. Goodie!
+			 * reuse delta data, and either we found that object in
+			 * the list of objects we want to pack, or it's one we
+			 * know the receiver has.
 			 *
 			 * Depth value does not matter - find_deltas() will
 			 * never consider reused delta as the base object to
@@ -1523,10 +1530,16 @@ static void check_object(struct object_entry *entry)
 			 */
 			oe_set_type(entry, entry->in_pack_type);
 			SET_SIZE(entry, in_pack_size); /* delta size */
-			SET_DELTA(entry, base_entry);
 			SET_DELTA_SIZE(entry, in_pack_size);
-			entry->delta_sibling_idx = base_entry->delta_child_idx;
-			SET_DELTA_CHILD(base_entry, entry);
+
+			if (base_entry) {
+				SET_DELTA(entry, base_entry);
+				entry->delta_sibling_idx = base_entry->delta_child_idx;
+				SET_DELTA_CHILD(base_entry, entry);
+			} else {
+				SET_DELTA_EXT(entry, base_ref);
+			}
+
 			unuse_pack(&w_curs);
 			return;
 		}
@@ -2954,7 +2967,6 @@ static int pack_options_allow_reuse(void)
 
 static int get_object_list_from_bitmap(struct rev_info *revs)
 {
-	struct bitmap_index *bitmap_git;
 	if (!(bitmap_git = prepare_bitmap_walk(revs)))
 		return -1;
 
@@ -2970,7 +2982,6 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 	}
 
 	traverse_bitmap_commit_list(bitmap_git, &add_object_entry_from_bitmap);
-	free_bitmap_index(bitmap_git);
 	return 0;
 }
 
@@ -3118,7 +3129,6 @@ static int option_parse_unpack_unreachable(const struct option *opt,
 int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 {
 	int use_internal_rev_list = 0;
-	int thin = 0;
 	int shallow = 0;
 	int all_progress_implied = 0;
 	struct argv_array rp = ARGV_ARRAY_INIT;
diff --git a/pack-objects.c b/pack-objects.c
index 92708522e7..9ae0cecd81 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -177,3 +177,22 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 
 	return new_entry;
 }
+
+void oe_set_delta_ext(struct packing_data *pdata,
+		      struct object_entry *delta,
+		      const unsigned char *sha1)
+{
+	struct object_entry *base;
+
+	ALLOC_GROW(pdata->ext_bases, pdata->nr_ext + 1, pdata->alloc_ext);
+	base = &pdata->ext_bases[pdata->nr_ext++];
+	memset(base, 0, sizeof(*base));
+	hashcpy(base->idx.oid.hash, sha1);
+
+	/* These flags mark that we are not part of the actual pack output. */
+	base->preferred_base = 1;
+	base->filled = 1;
+
+	delta->ext_base = 1;
+	delta->delta_idx = base - pdata->ext_bases + 1;
+}
diff --git a/pack-objects.h b/pack-objects.h
index 08c6b57d49..3fd10b15c9 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -111,6 +111,7 @@ struct object_entry {
 	unsigned dfs_state:OE_DFS_STATE_BITS;
 	unsigned char in_pack_header_size;
 	unsigned depth:OE_DEPTH_BITS;
+	unsigned ext_base:1; /* delta_idx points outside packlist */
 
 	/*
 	 * pahole results on 64-bit linux (gcc and clang)
@@ -141,6 +142,14 @@ struct packing_data {
 	struct packed_git **in_pack_by_idx;
 	struct packed_git **in_pack;
 
+	/*
+	 * This list contains entries for bases which we know the other side
+	 * has (e.g., via reachability bitmaps), but which aren't in our
+	 * "objects" list.
+	 */
+	struct object_entry *ext_bases;
+	uint32_t nr_ext, alloc_ext;
+
 	uintmax_t oe_size_limit;
 };
 
@@ -228,9 +237,12 @@ static inline struct object_entry *oe_delta(
 		const struct packing_data *pack,
 		const struct object_entry *e)
 {
-	if (e->delta_idx)
+	if (!e->delta_idx)
+		return NULL;
+	if (e->ext_base)
+		return &pack->ext_bases[e->delta_idx - 1];
+	else
 		return &pack->objects[e->delta_idx - 1];
-	return NULL;
 }
 
 static inline void oe_set_delta(struct packing_data *pack,
@@ -243,6 +255,10 @@ static inline void oe_set_delta(struct packing_data *pack,
 		e->delta_idx = 0;
 }
 
+void oe_set_delta_ext(struct packing_data *pack,
+		      struct object_entry *e,
+		      const unsigned char *sha1);
+
 static inline struct object_entry *oe_delta_child(
 		const struct packing_data *pack,
 		const struct object_entry *e)
-- 
2.19.0.rc0.398.g138a08f6f6

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/6] pack-objects: reuse on-disk deltas for thin "have" objects
  2018-08-21 19:07 ` [PATCH 6/6] pack-objects: reuse on-disk deltas for thin "have" objects Jeff King
@ 2018-08-21 19:43   ` Junio C Hamano
  2018-08-21 19:50     ` Junio C Hamano
  2018-08-21 20:00     ` [PATCH 6/6] pack-objects: reuse on-disk deltas for thin "have" objects Jeff King
  0 siblings, 2 replies; 55+ messages in thread
From: Junio C Hamano @ 2018-08-21 19:43 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> When we serve a fetch, we pass the "wants" and "haves" from
> ...
> This lets us limit the change primarily to the oe_delta()
> and oe_set_delta_ext() functions. And as a bonus, most of
> the rest of the code does not consider these dummy entries
> at all, saving both runtime CPU and code complexity.
>
> Signed-off-by: Jeff King <peff@peff.net>
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---

Sorry for commenting on something completely off-topic, but when
applied with "git am -s", I get a resulting commit with 3 S-o-b (the
above two, plus the one added by "-s"), with a blank line in between
them.  I can understand the first blank line (the one between your
two S-o-b), as the first S-o-b does not even appear to be part of
the trailer block, but cannot explain why I get an extra one before
the one added by "-s".  Puzzled...

> @@ -79,6 +81,7 @@ static unsigned long pack_size_limit;
>  static int depth = 50;
>  static int delta_search_threads;
>  static int pack_to_stdout;
> +static int thin;

It appears that this line is the only change since the previous
round.  The remainder of the patch looks cleanly done and readable.

Thanks.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 5/6] pack-bitmap: save "have" bitmap from walk
  2018-08-21 19:07 ` [PATCH 5/6] pack-bitmap: save "have" bitmap from walk Jeff King
@ 2018-08-21 19:47   ` Derrick Stolee
  2018-08-21 19:54     ` Jeff King
  2018-08-31 15:23   ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 55+ messages in thread
From: Derrick Stolee @ 2018-08-21 19:47 UTC (permalink / raw)
  To: Jeff King, git

On 8/21/2018 3:07 PM, Jeff King wrote:
> When we do a bitmap walk, we save the result, which
> represents (WANTs & ~HAVEs); i.e., every object we care
> about visiting in our walk. However, we throw away the
> haves bitmap, which can sometimes be useful, too. Save it
> and provide an access function so code which has performed a
> walk can query it.

This makes a lot of sense. Based on the amount of time the "Counting 
Objects" blog post [1] spent talking about delta bases, I would have 
assumed this "haves" bitmap was already part of it. But, I can also see 
how it could be dropped if you are focusing on performance for 'git clone'.


> A few notes on the accessor interface:
>
>   - the bitmap code calls these "haves" because it grew out
>     of the want/have negotiation for fetches. But really,
>     these are simply the objects that would be flagged
>     UNINTERESTING in a regular traversal. Let's use that
>     more universal nomenclature for the external module
>     interface. We may want to change the internal naming
>     inside the bitmap code, but that's outside the scope of
>     this patch.

I saw the uninteresting-vs-haves name confusion in the patch below, but 
I agree with your logic here.

Sorry that I'm late to the party, but I was interested in the topic.

Thanks,

-Stolee

[1] https://githubengineering.com/counting-objects/

     "Counting Objects" by Vincent Martí


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/6] pack-objects: reuse on-disk deltas for thin "have" objects
  2018-08-21 19:43   ` Junio C Hamano
@ 2018-08-21 19:50     ` Junio C Hamano
  2018-08-21 20:07       ` Jeff King
  2018-08-21 20:00     ` [PATCH 6/6] pack-objects: reuse on-disk deltas for thin "have" objects Jeff King
  1 sibling, 1 reply; 55+ messages in thread
From: Junio C Hamano @ 2018-08-21 19:50 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> Jeff King <peff@peff.net> writes:
>
>> When we serve a fetch, we pass the "wants" and "haves" from
>> ...
>> This lets us limit the change primarily to the oe_delta()
>> and oe_set_delta_ext() functions. And as a bonus, most of
>> the rest of the code does not consider these dummy entries
>> at all, saving both runtime CPU and code complexity.
>>
>> Signed-off-by: Jeff King <peff@peff.net>
>>
>> Signed-off-by: Jeff King <peff@peff.net>
>> ---
>
> Sorry for commenting on something completely off-topic, but when
> applied with "git am -s", I get a resulting commit with 3 S-o-b (the
> above two, plus the one added by "-s"), with a blank line in between
> them.  I can understand the first blank line (the one between your
> two S-o-b), as the first S-o-b does not even appear to be part of
> the trailer block, but cannot explain why I get an extra one before
> the one added by "-s".  Puzzled...

I think your original "two s-o-b with a blank line in between" was
caused by the same problem, and "git commit --amend -s" perhaps
added an extra one at the end, and added a blank line before the
last "paragraph" while at it?

My suspicion is the long horizontal line at the beginning of the
table, triggers it.  I haven't followed the code closely yet,
though.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 5/6] pack-bitmap: save "have" bitmap from walk
  2018-08-21 19:47   ` Derrick Stolee
@ 2018-08-21 19:54     ` Jeff King
  0 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-21 19:54 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git

On Tue, Aug 21, 2018 at 03:47:36PM -0400, Derrick Stolee wrote:

> On 8/21/2018 3:07 PM, Jeff King wrote:
> > When we do a bitmap walk, we save the result, which
> > represents (WANTs & ~HAVEs); i.e., every object we care
> > about visiting in our walk. However, we throw away the
> > haves bitmap, which can sometimes be useful, too. Save it
> > and provide an access function so code which has performed a
> > walk can query it.
> 
> This makes a lot of sense. Based on the amount of time the "Counting
> Objects" blog post [1] spent talking about delta bases, I would have assumed
> this "haves" bitmap was already part of it. But, I can also see how it could
> be dropped if you are focusing on performance for 'git clone'.

I think that blog post was written after we were already using this
patch. ;)

> > A few notes on the accessor interface:
> > 
> >   - the bitmap code calls these "haves" because it grew out
> >     of the want/have negotiation for fetches. But really,
> >     these are simply the objects that would be flagged
> >     UNINTERESTING in a regular traversal. Let's use that
> >     more universal nomenclature for the external module
> >     interface. We may want to change the internal naming
> >     inside the bitmap code, but that's outside the scope of
> >     this patch.
> 
> I saw the uninteresting-vs-haves name confusion in the patch below, but I
> agree with your logic here.
> 
> Sorry that I'm late to the party, but I was interested in the topic.

This has already missed the freeze for v2.19, so I think there is plenty
of time. More review would be welcome.

-Peff

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/6] pack-objects: reuse on-disk deltas for thin "have" objects
  2018-08-21 19:43   ` Junio C Hamano
  2018-08-21 19:50     ` Junio C Hamano
@ 2018-08-21 20:00     ` Jeff King
  1 sibling, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-21 20:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Tue, Aug 21, 2018 at 12:43:49PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > When we serve a fetch, we pass the "wants" and "haves" from
> > ...
> > This lets us limit the change primarily to the oe_delta()
> > and oe_set_delta_ext() functions. And as a bonus, most of
> > the rest of the code does not consider these dummy entries
> > at all, saving both runtime CPU and code complexity.
> >
> > Signed-off-by: Jeff King <peff@peff.net>
> >
> > Signed-off-by: Jeff King <peff@peff.net>
> > ---
> 
> Sorry for commenting on something completely off-topic, but when
> applied with "git am -s", I get a resulting commit with 3 S-o-b (the
> above two, plus the one added by "-s"), with a blank line in between
> them.  I can understand the first blank line (the one between your
> two S-o-b), as the first S-o-b does not even appear to be part of
> the trailer block, but cannot explain why I get an extra one before
> the one added by "-s".  Puzzled...

Hmm. This case is most curious.

I don't normally have a S-o-b in my commits themselves, but in this case
I had done quite a bit of editing in my MUA while sending out the
patches, so I picked up the result for my local branch with a "git
reset --hard origin && git am ~/what-i-sent".

Then for sending v2, I did my usual "format-patch -s". For the first 5
patches, it worked fine, but for this one, it created the funny
duplicate, which should have been suppressed. So something about this
commit message seems to confuse our trailer-parsing code. And indeed,
running:

  git show -s --format='%(trailers)'

shows nothing, which explains the other behavior (we don't think there's
a trailer block, so we make a new one). I wonder if it gets confused by
the big block of perf results (which all start with "^%[0-9.]:").

-Peff

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/6] pack-objects: reuse on-disk deltas for thin "have" objects
  2018-08-21 19:50     ` Junio C Hamano
@ 2018-08-21 20:07       ` Jeff King
  2018-08-21 20:14         ` Jeff King
  2018-08-21 20:57         ` Junio C Hamano
  0 siblings, 2 replies; 55+ messages in thread
From: Jeff King @ 2018-08-21 20:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Tue, Aug 21, 2018 at 12:50:09PM -0700, Junio C Hamano wrote:

> > Sorry for commenting on something completely off-topic, but when
> > applied with "git am -s", I get a resulting commit with 3 S-o-b (the
> > above two, plus the one added by "-s"), with a blank line in between
> > them.  I can understand the first blank line (the one between your
> > two S-o-b), as the first S-o-b does not even appear to be part of
> > the trailer block, but cannot explain why I get an extra one before
> > the one added by "-s".  Puzzled...
> 
> I think your original "two s-o-b with a blank line in between" was
> caused by the same problem, and "git commit --amend -s" perhaps
> added an extra one at the end, and added a blank line before the
> last "paragraph" while at it?
> 
> My suspicion is the long horizontal line at the beginning of the
> table, triggers it.  I haven't followed the code closely yet,
> though.

Ah, yeah, I think you're right. We call find_patch_start(), which thinks
the "---" line is the end of the commit message. That makes sense when
parsing trailers out of "format-patch" output, but not when we know we
have just the commit message.

So one obvious fix is a new option for the trailer code to tell it we
have _just_ a commit message. That would still leave this obvious false
positive for the format-patch case, but I'm not sure it can be helped.

-Peff

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/6] pack-objects: reuse on-disk deltas for thin "have" objects
  2018-08-21 20:07       ` Jeff King
@ 2018-08-21 20:14         ` Jeff King
  2018-08-21 20:52           ` Junio C Hamano
  2018-08-21 20:57         ` Junio C Hamano
  1 sibling, 1 reply; 55+ messages in thread
From: Jeff King @ 2018-08-21 20:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Tue, Aug 21, 2018 at 04:07:47PM -0400, Jeff King wrote:

> On Tue, Aug 21, 2018 at 12:50:09PM -0700, Junio C Hamano wrote:
> 
> > > Sorry for commenting on something completely off-topic, but when
> > > applied with "git am -s", I get a resulting commit with 3 S-o-b (the
> > > above two, plus the one added by "-s"), with a blank line in between
> > > them.  I can understand the first blank line (the one between your
> > > two S-o-b), as the first S-o-b does not even appear to be part of
> > > the trailer block, but cannot explain why I get an extra one before
> > > the one added by "-s".  Puzzled...
> > 
> > I think your original "two s-o-b with a blank line in between" was
> > caused by the same problem, and "git commit --amend -s" perhaps
> > added an extra one at the end, and added a blank line before the
> > last "paragraph" while at it?
> > 
> > My suspicion is the long horizontal line at the beginning of the
> > table, triggers it.  I haven't followed the code closely yet,
> > though.
> 
> Ah, yeah, I think you're right. We call find_patch_start(), which thinks
> the "---" line is the end of the commit message. That makes sense when
> parsing trailers out of "format-patch" output, but not when we know we
> have just the commit message.
> 
> So one obvious fix is a new option for the trailer code to tell it we
> have _just_ a commit message. That would still leave this obvious false
> positive for the format-patch case, but I'm not sure it can be helped.

Another is to tighten the check. Something like this seems more
sensible:

diff --git a/trailer.c b/trailer.c
index 4e309460d1..92ec5cae82 100644
--- a/trailer.c
+++ b/trailer.c
@@ -793,7 +793,8 @@ static int find_patch_start(const char *str)
 	const char *s;
 
 	for (s = str; *s; s = next_line(s)) {
-		if (starts_with(s, "---"))
+		const char *v;
+		if (skip_prefix(s, "---", &v) && isspace(*v))
 			return s - str;
 	}
 

as it would catch "--- /some/file", "--- ", "---\n", and "---\r\n", but
not longer dashed lines. I wondered what "git am" does, though, and I
think it is mailinfo.c:patchbreak(), which has a few other cases to
handle things that git itself would not have generated. I don't know if
that's worth supporting or not.

I think there really are two bugs here, though. The find_patch_start()
check is overly lax, but we also should not have to use it at all when
we know there is no patch.

-Peff

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/6] pack-objects: reuse on-disk deltas for thin "have" objects
  2018-08-21 20:14         ` Jeff King
@ 2018-08-21 20:52           ` Junio C Hamano
  2018-08-21 21:30             ` Jeff King
  0 siblings, 1 reply; 55+ messages in thread
From: Junio C Hamano @ 2018-08-21 20:52 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Christian Couder

Jeff King <peff@peff.net> writes:

> Another is to tighten the check. Something like this seems more
> sensible:

Great minds think alike; is it a space was exactly what I was toying
with before I went to lunch.  FWIW I saw the full test suite passes
when I came back and then saw you had an identical patch ;-)

> I think there really are two bugs here, though. The find_patch_start()
> check is overly lax, but we also should not have to use it at all when
> we know there is no patch.

Yes, I was grepping for callchains, and it appeared none of them
actually expected us to feed "log plus --- plus patch" format.  The
obvious candidate to take it is "am" but we ask mailinfo to do the
splitting before the log message part even hits the rest of the
system.  So my inclination right now is to see if that is truly the
case and get rid of that bogus "patch start" thing, and if not, add
a flag to let the caller say "I know we only have message" to the
callchain.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/6] pack-objects: reuse on-disk deltas for thin "have" objects
  2018-08-21 20:07       ` Jeff King
  2018-08-21 20:14         ` Jeff King
@ 2018-08-21 20:57         ` Junio C Hamano
  2018-08-21 21:32           ` Jeff King
  2018-08-23  0:43           ` [PATCH 0/9] trailer-parsing false positives Jeff King
  1 sibling, 2 replies; 55+ messages in thread
From: Junio C Hamano @ 2018-08-21 20:57 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> Ah, yeah, I think you're right. We call find_patch_start(), which thinks
> the "---" line is the end of the commit message. That makes sense when
> parsing trailers out of "format-patch" output, but not when we know we
> have just the commit message.

Yes, but that does not explain what we are seeing.  If the code
mistakenly thinks that the log message ends before that table, then
it should have inserted the S-o-b: _before_ that table, but that is
not happening.

So there are three issues; (1) find-patch-start uses too weak a
logic to find the beginning of a patch section (2) even if it found
the right place, its caller does not tell "commit --amend -s" where
the log message ends correctly and (3) some callchains that get
there know they only have a log message but there is no way to take
advantage of that information and skip the call to find-patch-start.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/6] pack-objects: reuse on-disk deltas for thin "have" objects
  2018-08-21 20:52           ` Junio C Hamano
@ 2018-08-21 21:30             ` Jeff King
  0 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-21 21:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Christian Couder

On Tue, Aug 21, 2018 at 01:52:33PM -0700, Junio C Hamano wrote:

> > I think there really are two bugs here, though. The find_patch_start()
> > check is overly lax, but we also should not have to use it at all when
> > we know there is no patch.
> 
> Yes, I was grepping for callchains, and it appeared none of them
> actually expected us to feed "log plus --- plus patch" format.  The
> obvious candidate to take it is "am" but we ask mailinfo to do the
> splitting before the log message part even hits the rest of the
> system.  So my inclination right now is to see if that is truly the
> case and get rid of that bogus "patch start" thing, and if not, add
> a flag to let the caller say "I know we only have message" to the
> callchain.

I suspect this same logic is used by git-interpret-trailers, which is
taking an arbitrary message on stdin. That is probably the lone caller
that needs to retain this magic.

-Peff

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/6] pack-objects: reuse on-disk deltas for thin "have" objects
  2018-08-21 20:57         ` Junio C Hamano
@ 2018-08-21 21:32           ` Jeff King
  2018-08-23  0:43           ` [PATCH 0/9] trailer-parsing false positives Jeff King
  1 sibling, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-21 21:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Tue, Aug 21, 2018 at 01:57:07PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Ah, yeah, I think you're right. We call find_patch_start(), which thinks
> > the "---" line is the end of the commit message. That makes sense when
> > parsing trailers out of "format-patch" output, but not when we know we
> > have just the commit message.
> 
> Yes, but that does not explain what we are seeing.  If the code
> mistakenly thinks that the log message ends before that table, then
> it should have inserted the S-o-b: _before_ that table, but that is
> not happening.
> 
> So there are three issues; (1) find-patch-start uses too weak a
> logic to find the beginning of a patch section (2) even if it found
> the right place, its caller does not tell "commit --amend -s" where
> the log message ends correctly and (3) some callchains that get
> there know they only have a log message but there is no way to take
> advantage of that information and skip the call to find-patch-start.

Yes, I'd agree with all of that.

I'm going offline, so if you are part-way through a patch, please do not
worry that we might duplicate effort. :)

-Peff

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/6] t/perf: add infrastructure for measuring sizes
  2018-08-21 19:06 ` [PATCH 3/6] t/perf: add infrastructure for measuring sizes Jeff King
@ 2018-08-22 13:40   ` Derrick Stolee
  2018-08-22 15:31     ` Jeff King
  0 siblings, 1 reply; 55+ messages in thread
From: Derrick Stolee @ 2018-08-22 13:40 UTC (permalink / raw)
  To: Jeff King, git

On 8/21/2018 3:06 PM, Jeff King wrote:
> The main objective of scripts in the perf framework is to
> run "test_perf", which measures the time it takes to run
> some operation. However, it can also be interesting to see
> the change in the output size of certain operations.
>
> This patch introduces test_size, which records a single
> numeric output from the test and shows it in the aggregated
> output (with pretty printing and relative size comparison).

I'm interested in exploring this test_size mechanism. The other area 
that could benefit from size testing is 'git repack', but I don't have 
any plans to change our compression or delta strategies. If we _did_ 
look into that, then using test_size would be a natural fit.

Overall, I really like this series!

Thanks
-Stolee



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/6] t/perf: add infrastructure for measuring sizes
  2018-08-22 13:40   ` Derrick Stolee
@ 2018-08-22 15:31     ` Jeff King
  0 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-22 15:31 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git

On Wed, Aug 22, 2018 at 09:40:55AM -0400, Derrick Stolee wrote:

> On 8/21/2018 3:06 PM, Jeff King wrote:
> > The main objective of scripts in the perf framework is to
> > run "test_perf", which measures the time it takes to run
> > some operation. However, it can also be interesting to see
> > the change in the output size of certain operations.
> > 
> > This patch introduces test_size, which records a single
> > numeric output from the test and shows it in the aggregated
> > output (with pretty printing and relative size comparison).
> 
> I'm interested in exploring this test_size mechanism. The other area that
> could benefit from size testing is 'git repack', but I don't have any plans
> to change our compression or delta strategies. If we _did_ look into that,
> then using test_size would be a natural fit.

Yeah, I agree it would be a good tool for showing off improvements
there.

It may also be useful for catching regressions in topics that are trying
to speed things up, but _don't_ intend to change the size. We could even
do that proactively now. I.e., something like:

  test_perf 'repack' 'git repack -adf'
  test_size 'pack size' 'wc -c <.git/objects/pack/*.pack'

just to see if it ever changes.  But I suspect its usefulness may depend
on how you are packing (e.g., is "-f" more likely to catch issues than
without?).

The new tests I added in this series cover packs created for fetches.
There's no guarantee that will overlap with the behavior of an on-disk
repack, but at least we have some generic coverage of pack-objects
output sizes now. Absent any suspicion of a regression for a particular
case, that's probably an acceptable canary in the coal mine.

-Peff

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 0/9] trailer-parsing false positives
  2018-08-21 20:57         ` Junio C Hamano
  2018-08-21 21:32           ` Jeff King
@ 2018-08-23  0:43           ` Jeff King
  2018-08-23  0:44             ` [PATCH 1/9] trailer: use size_t for string offsets Jeff King
                               ` (9 more replies)
  1 sibling, 10 replies; 55+ messages in thread
From: Jeff King @ 2018-08-23  0:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, git

On Tue, Aug 21, 2018 at 01:57:07PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Ah, yeah, I think you're right. We call find_patch_start(), which thinks
> > the "---" line is the end of the commit message. That makes sense when
> > parsing trailers out of "format-patch" output, but not when we know we
> > have just the commit message.
> 
> Yes, but that does not explain what we are seeing.  If the code
> mistakenly thinks that the log message ends before that table, then
> it should have inserted the S-o-b: _before_ that table, but that is
> not happening.
> 
> So there are three issues; (1) find-patch-start uses too weak a
> logic to find the beginning of a patch section (2) even if it found
> the right place, its caller does not tell "commit --amend -s" where
> the log message ends correctly and (3) some callchains that get
> there know they only have a log message but there is no way to take
> advantage of that information and skip the call to find-patch-start.

So this turned into a bit of a rabbit hole. Here's what I have so far
(which I think is not quite ready, but I wanted to start discussing).

Issues (2) and (3) are actually the same issue. The caller that does the
bogus appending for (2) is always append_signoff(). But it always has
just a commit message (modulo some complications, which I'll get to in a
minute), and so fixing it with respect to (3) magically solves (2).
I.e., the code was simply not prepared for the case of "end of string is
not end of trailers". But since that case cannot exist, we do not have
to deal with it. :)

Now here's the tricky part. I think patches 1-8 are mostly sensible. But
while doing 7/8, I noticed some weirdness around the ignore_footer
parameter. It seems git-commit strips some cruft off the end of the
buffer, including "#" comments, before looking for the trailers. But
that doesn't make any sense. We should just handle the cleaned-up commit
message. Worse, the stripping does not take into account options like
--cleanup=verbatim, where "#" comments are _not_ cruft anymore. Double
worse, it uses the current notion of core.commentChar, but we might be
operating on historical data (i.e., something created with a different
comment char). Triple-worse, this same cleanup is used in the trailer
parsing code!

So:

  git interpret-trailers --parse <<-\EOF
  subject

  body

  Signed-off-by: me

  # is this a comment or not?
  EOF

will find that s-o-b. Should that "#" thing be a comment? I'd say no. We
are not dealing with a commit message template at all. And even in
git-commit, I think we aren't (or at least ought not to be) passing the
template around to the signoff code.

So I think there may be further opportunities for cleanup here. I'm not
sure if we'd need to retain this behavior for git-interpret-trailers.
AFAICT it is not documented, and I suspect is mostly historical
accident, and not anything anybody ever wanted.

If we do keep it by default, then the "--no-divider" option I added in
patch 4 should probably get a more generic name and cover this.
Something like "--verbatim-input".

I'm going to sleep on it and see how I feel tomorrow.

  [1/9]: trailer: use size_t for string offsets
  [2/9]: trailer: use size_t for iterating trailer list
  [3/9]: trailer: pass process_trailer_opts to trailer_info_get()
  [4/9]: interpret-trailers: tighten check for "---" patch boundary
  [5/9]: interpret-trailers: allow suppressing "---" divider
  [6/9]: pretty, ref-filter: format %(trailers) with no_divider option
  [7/9]: sequencer: ignore "---" divider when parsing trailers
  [8/9]: append_signoff: use size_t for string offsets
  [9/9]: sequencer: handle ignore_footer when parsing trailers

 Documentation/git-interpret-trailers.txt | 10 +++-
 builtin/interpret-trailers.c             |  1 +
 commit.c                                 |  6 +--
 commit.h                                 |  2 +-
 pretty.c                                 |  3 ++
 ref-filter.c                             |  2 +
 sequencer.c                              | 20 ++++++--
 sequencer.h                              |  9 +++-
 t/t4205-log-pretty-formats.sh            | 23 +++++++++
 t/t6300-for-each-ref.sh                  | 23 +++++++++
 t/t7501-commit.sh                        | 16 ++++++
 t/t7513-interpret-trailers.sh            | 42 ++++++++++++++++
 trailer.c                                | 62 +++++++++++++-----------
 trailer.h                                |  4 +-
 14 files changed, 184 insertions(+), 39 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/9] trailer: use size_t for string offsets
  2018-08-23  0:43           ` [PATCH 0/9] trailer-parsing false positives Jeff King
@ 2018-08-23  0:44             ` Jeff King
  2018-08-23  0:45             ` [PATCH 2/9] trailer: use size_t for iterating trailer list Jeff King
                               ` (8 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-23  0:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, git

Many of the string-parsing functions inside trailer.c return
integer offsets into the string (e.g., to point to the end
of the trailer block). Several of these use an "int" to
return or store the offsets. On a system where "size_t" is
much larger than "int" (e.g., most 64-bit ones), it's easy
to feed a gigantic commit message that results in a negative
offset. This can result in us reading memory before the
string (if the int is used as an index) or far after (if
it's implicitly cast to a size_t by passing to a strbuf
function).

Let's fix this by using size_t for all string offsets. Note
that several of the functions need ssize_t, since they use
"-1" as a sentinel value. The interactions here can be
pretty subtle. E.g., end_of_title in find_trailer_start()
does not itself need to be signed, but it is compared to the
result of last_line(), which is. That promotes the latter to
unsigned, and the ">=" does not behave as you might expect.

Signed-off-by: Jeff King <peff@peff.net>
---
Noticed while I was in the area. I did pretty easily reproduce a
segfault with this.

 trailer.c | 39 ++++++++++++++++++++-------------------
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/trailer.c b/trailer.c
index 4e309460d1..88b35b8e89 100644
--- a/trailer.c
+++ b/trailer.c
@@ -585,7 +585,7 @@ static const char *token_from_item(struct arg_item *item, char *tok)
 	return item->conf.name;
 }
 
-static int token_matches_item(const char *tok, struct arg_item *item, int tok_len)
+static int token_matches_item(const char *tok, struct arg_item *item, size_t tok_len)
 {
 	if (!strncasecmp(tok, item->conf.name, tok_len))
 		return 1;
@@ -603,7 +603,7 @@ static int token_matches_item(const char *tok, struct arg_item *item, int tok_le
  * distinguished from the non-well-formed-line case (in which this function
  * returns -1) because some callers of this function need such a distinction.
  */
-static int find_separator(const char *line, const char *separators)
+static ssize_t find_separator(const char *line, const char *separators)
 {
 	int whitespace_found = 0;
 	const char *c;
@@ -630,10 +630,10 @@ static int find_separator(const char *line, const char *separators)
  */
 static void parse_trailer(struct strbuf *tok, struct strbuf *val,
 			 const struct conf_info **conf, const char *trailer,
-			 int separator_pos)
+			 ssize_t separator_pos)
 {
 	struct arg_item *item;
-	int tok_len;
+	size_t tok_len;
 	struct list_head *pos;
 
 	if (separator_pos != -1) {
@@ -721,7 +721,7 @@ static void process_command_line_args(struct list_head *arg_head,
 	list_for_each(pos, new_trailer_head) {
 		struct new_trailer_item *tr =
 			list_entry(pos, struct new_trailer_item, list);
-		int separator_pos = find_separator(tr->text, cl_separators);
+		ssize_t separator_pos = find_separator(tr->text, cl_separators);
 
 		if (separator_pos == 0) {
 			struct strbuf sb = STRBUF_INIT;
@@ -763,9 +763,9 @@ static const char *next_line(const char *str)
 /*
  * Return the position of the start of the last line. If len is 0, return -1.
  */
-static int last_line(const char *buf, size_t len)
+static ssize_t last_line(const char *buf, size_t len)
 {
-	int i;
+	ssize_t i;
 	if (len == 0)
 		return -1;
 	if (len == 1)
@@ -788,7 +788,7 @@ static int last_line(const char *buf, size_t len)
  * Return the position of the start of the patch or the length of str if there
  * is no patch in the message.
  */
-static int find_patch_start(const char *str)
+static size_t find_patch_start(const char *str)
 {
 	const char *s;
 
@@ -804,10 +804,11 @@ static int find_patch_start(const char *str)
  * Return the position of the first trailer line or len if there are no
  * trailers.
  */
-static int find_trailer_start(const char *buf, size_t len)
+static size_t find_trailer_start(const char *buf, size_t len)
 {
 	const char *s;
-	int end_of_title, l, only_spaces = 1;
+	ssize_t end_of_title, l;
+	int only_spaces = 1;
 	int recognized_prefix = 0, trailer_lines = 0, non_trailer_lines = 0;
 	/*
 	 * Number of possible continuation lines encountered. This will be
@@ -838,7 +839,7 @@ static int find_trailer_start(const char *buf, size_t len)
 	     l = last_line(buf, l)) {
 		const char *bol = buf + l;
 		const char **p;
-		int separator_pos;
+		ssize_t separator_pos;
 
 		if (bol[0] == comment_line_char) {
 			non_trailer_lines += possible_continuation_lines;
@@ -899,14 +900,14 @@ static int find_trailer_start(const char *buf, size_t len)
 }
 
 /* Return the position of the end of the trailers. */
-static int find_trailer_end(const char *buf, size_t len)
+static size_t find_trailer_end(const char *buf, size_t len)
 {
 	return len - ignore_non_trailer(buf, len);
 }
 
 static int ends_with_blank_line(const char *buf, size_t len)
 {
-	int ll = last_line(buf, len);
+	ssize_t ll = last_line(buf, len);
 	if (ll < 0)
 		return 0;
 	return is_blank_line(buf + ll);
@@ -939,10 +940,10 @@ static void unfold_value(struct strbuf *val)
 	strbuf_release(&out);
 }
 
-static int process_input_file(FILE *outfile,
-			      const char *str,
-			      struct list_head *head,
-			      const struct process_trailer_options *opts)
+static size_t process_input_file(FILE *outfile,
+				 const char *str,
+				 struct list_head *head,
+				 const struct process_trailer_options *opts)
 {
 	struct trailer_info info;
 	struct strbuf tok = STRBUF_INIT;
@@ -1032,7 +1033,7 @@ void process_trailers(const char *file,
 {
 	LIST_HEAD(head);
 	struct strbuf sb = STRBUF_INIT;
-	int trailer_end;
+	size_t trailer_end;
 	FILE *outfile = stdout;
 
 	ensure_configured();
@@ -1132,7 +1133,7 @@ static void format_trailer_info(struct strbuf *out,
 
 	for (i = 0; i < info->trailer_nr; i++) {
 		char *trailer = info->trailers[i];
-		int separator_pos = find_separator(trailer, separators);
+		ssize_t separator_pos = find_separator(trailer, separators);
 
 		if (separator_pos >= 1) {
 			struct strbuf tok = STRBUF_INIT;
-- 
2.19.0.rc0.412.g7005db4e88


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 2/9] trailer: use size_t for iterating trailer list
  2018-08-23  0:43           ` [PATCH 0/9] trailer-parsing false positives Jeff King
  2018-08-23  0:44             ` [PATCH 1/9] trailer: use size_t for string offsets Jeff King
@ 2018-08-23  0:45             ` Jeff King
  2018-08-23  0:46             ` [PATCH 3/9] trailer: pass process_trailer_opts to trailer_info_get() Jeff King
                               ` (7 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-23  0:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, git

We store the length of the trailers list in a size_t. So on
a 64-bit system with a 32-bit int, in the unlikely case that
we manage to actually allocate a list with 2^31 entries,
we'd loop forever trying to iterate over it (our "int" would
wrap to negative before exceeding info->trailer_nr).

This probably doesn't matter in practice. Each entry is at
least a pointer plus a non-empty string, so even without
malloc overhead or the memory to hold the original string
we're parsing from, you'd need to allocate tens of
gigabytes. But it's easy enough to do it right.

Signed-off-by: Jeff King <peff@peff.net>
---
 sequencer.c | 2 +-
 trailer.c   | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/sequencer.c b/sequencer.c
index 65d371c746..c81b276239 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -228,7 +228,7 @@ static int has_conforming_footer(struct strbuf *sb, struct strbuf *sob,
 	int ignore_footer)
 {
 	struct trailer_info info;
-	int i;
+	size_t i;
 	int found_sob = 0, found_sob_last = 0;
 
 	trailer_info_get(&info, sb->buf);
diff --git a/trailer.c b/trailer.c
index 88b35b8e89..40eef8880e 100644
--- a/trailer.c
+++ b/trailer.c
@@ -948,7 +948,7 @@ static size_t process_input_file(FILE *outfile,
 	struct trailer_info info;
 	struct strbuf tok = STRBUF_INIT;
 	struct strbuf val = STRBUF_INIT;
-	int i;
+	size_t i;
 
 	trailer_info_get(&info, str);
 
@@ -1112,7 +1112,7 @@ void trailer_info_get(struct trailer_info *info, const char *str)
 
 void trailer_info_release(struct trailer_info *info)
 {
-	int i;
+	size_t i;
 	for (i = 0; i < info->trailer_nr; i++)
 		free(info->trailers[i]);
 	free(info->trailers);
@@ -1122,7 +1122,7 @@ static void format_trailer_info(struct strbuf *out,
 				const struct trailer_info *info,
 				const struct process_trailer_options *opts)
 {
-	int i;
+	size_t i;
 
 	/* If we want the whole block untouched, we can take the fast path. */
 	if (!opts->only_trailers && !opts->unfold) {
-- 
2.19.0.rc0.412.g7005db4e88


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 3/9] trailer: pass process_trailer_opts to trailer_info_get()
  2018-08-23  0:43           ` [PATCH 0/9] trailer-parsing false positives Jeff King
  2018-08-23  0:44             ` [PATCH 1/9] trailer: use size_t for string offsets Jeff King
  2018-08-23  0:45             ` [PATCH 2/9] trailer: use size_t for iterating trailer list Jeff King
@ 2018-08-23  0:46             ` Jeff King
  2018-08-23  0:48             ` [PATCH 4/9] interpret-trailers: tighten check for "---" patch boundary Jeff King
                               ` (6 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-23  0:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, git

Most of the trailer code has an "opts" struct which is
filled in by the caller. We don't pass it down to
trailer_info_get(), which does the initial parsing, because
there hasn't yet been a need to do so.

Let's start passing it down in preparation for adding new
options. Note that there's a single caller which doesn't
otherwise have such an options struct. Since it's just one
caller (that we'd have to modify anyway), let's not bother
with any special treatment like accepting a NULL options
struct, and just have it allocate one with the defaults.

Signed-off-by: Jeff King <peff@peff.net>
---
We end up needing to pass an option in that caller later anyway...

 sequencer.c | 3 ++-
 trailer.c   | 7 ++++---
 trailer.h   | 3 ++-
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/sequencer.c b/sequencer.c
index c81b276239..3e15faa94e 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -227,11 +227,12 @@ static const char *get_todo_path(const struct replay_opts *opts)
 static int has_conforming_footer(struct strbuf *sb, struct strbuf *sob,
 	int ignore_footer)
 {
+	struct process_trailer_options opts = PROCESS_TRAILER_OPTIONS_INIT;
 	struct trailer_info info;
 	size_t i;
 	int found_sob = 0, found_sob_last = 0;
 
-	trailer_info_get(&info, sb->buf);
+	trailer_info_get(&info, sb->buf, &opts);
 
 	if (info.trailer_start == info.trailer_end)
 		return 0;
diff --git a/trailer.c b/trailer.c
index 40eef8880e..e769c5b72c 100644
--- a/trailer.c
+++ b/trailer.c
@@ -950,7 +950,7 @@ static size_t process_input_file(FILE *outfile,
 	struct strbuf val = STRBUF_INIT;
 	size_t i;
 
-	trailer_info_get(&info, str);
+	trailer_info_get(&info, str, opts);
 
 	/* Print lines before the trailers as is */
 	if (!opts->only_trailers)
@@ -1067,7 +1067,8 @@ void process_trailers(const char *file,
 	strbuf_release(&sb);
 }
 
-void trailer_info_get(struct trailer_info *info, const char *str)
+void trailer_info_get(struct trailer_info *info, const char *str,
+		      const struct process_trailer_options *opts)
 {
 	int patch_start, trailer_end, trailer_start;
 	struct strbuf **trailer_lines, **ptr;
@@ -1159,7 +1160,7 @@ void format_trailers_from_commit(struct strbuf *out, const char *msg,
 {
 	struct trailer_info info;
 
-	trailer_info_get(&info, msg);
+	trailer_info_get(&info, msg, opts);
 	format_trailer_info(out, &info, opts);
 	trailer_info_release(&info);
 }
diff --git a/trailer.h b/trailer.h
index 9c10026c35..b38c8f0d16 100644
--- a/trailer.h
+++ b/trailer.h
@@ -79,7 +79,8 @@ void process_trailers(const char *file,
 		      const struct process_trailer_options *opts,
 		      struct list_head *new_trailer_head);
 
-void trailer_info_get(struct trailer_info *info, const char *str);
+void trailer_info_get(struct trailer_info *info, const char *str,
+		      const struct process_trailer_options *opts);
 
 void trailer_info_release(struct trailer_info *info);
 
-- 
2.19.0.rc0.412.g7005db4e88


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 4/9] interpret-trailers: tighten check for "---" patch boundary
  2018-08-23  0:43           ` [PATCH 0/9] trailer-parsing false positives Jeff King
                               ` (2 preceding siblings ...)
  2018-08-23  0:46             ` [PATCH 3/9] trailer: pass process_trailer_opts to trailer_info_get() Jeff King
@ 2018-08-23  0:48             ` Jeff King
  2018-08-23  0:49             ` [PATCH 5/9] interpret-trailers: allow suppressing "---" divider Jeff King
                               ` (5 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-23  0:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, git

The interpret-trailers command accepts not only raw commit
messages, but it also can manipulate trailers in
format-patch output. That means it must find the "---"
boundary separating the commit message from the patch.
However, it does so by looking for any line starting with
"---", regardless of whether there is further content.

This is overly lax compared to the parsing done in
mailinfo.c's patchbreak(), and may cause false positives
(e.g., t/perf output tables uses dashes; if you cut and
paste them into your commit message, it fools the parser).

We could try to reuse patchbreak() here, but it actually has
several heuristics that are not of interest to us (e.g.,
matching "diff -" without a three-dash separator or even a
CVS "Index:" line). We're not interested in taking in
whatever random cruft people may send, but rather handling
git-formatted patches.

Note that the existing documentation was written in a loose
way, so technically we are changing the behavior from what
it said. But this should implement the original intent in a
more accurate way.

Signed-off-by: Jeff King <peff@peff.net>
---
By itself, this would fix the case that spawned this
discussion, though we still have false positives on "---" in
a commit message.

 Documentation/git-interpret-trailers.txt |  5 +++--
 t/t7513-interpret-trailers.sh            | 22 ++++++++++++++++++++++
 trailer.c                                |  4 +++-
 3 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-interpret-trailers.txt b/Documentation/git-interpret-trailers.txt
index b8fafb1e8b..7385bfdb45 100644
--- a/Documentation/git-interpret-trailers.txt
+++ b/Documentation/git-interpret-trailers.txt
@@ -56,8 +56,9 @@ least one Git-generated or user-configured trailer and consists of at
 least 25% trailers.
 The group must be preceded by one or more empty (or whitespace-only) lines.
 The group must either be at the end of the message or be the last
-non-whitespace lines before a line that starts with '---'. Such three
-minus signs start the patch part of the message.
+non-whitespace lines before a line that starts with '---' (followed by a
+space or the end of the line). Such three minus signs start the patch
+part of the message.
 
 When reading trailers, there can be whitespaces after the
 token, the separator and the value. There can also be whitespaces
diff --git a/t/t7513-interpret-trailers.sh b/t/t7513-interpret-trailers.sh
index 164719d1c9..e13b40b43f 100755
--- a/t/t7513-interpret-trailers.sh
+++ b/t/t7513-interpret-trailers.sh
@@ -1417,4 +1417,26 @@ test_expect_success 'unfold' '
 	test_cmp expected actual
 '
 
+test_expect_success 'handling of --- lines in input' '
+	echo "real-trailer: just right" >expected &&
+
+	git interpret-trailers --parse >actual <<-\EOF &&
+	subject
+
+	body
+
+	not-a-trailer: too soon
+	------ this is just a line in the commit message with a bunch of
+	------ dashes; it does not have any syntactic meaning.
+
+	real-trailer: just right
+	---
+	below the dashed line may be a patch, etc.
+
+	not-a-trailer: too late
+	EOF
+
+	test_cmp expected actual
+'
+
 test_done
diff --git a/trailer.c b/trailer.c
index e769c5b72c..8392c6c030 100644
--- a/trailer.c
+++ b/trailer.c
@@ -793,7 +793,9 @@ static size_t find_patch_start(const char *str)
 	const char *s;
 
 	for (s = str; *s; s = next_line(s)) {
-		if (starts_with(s, "---"))
+		const char *v;
+
+		if (skip_prefix(s, "---", &v) && isspace(*v))
 			return s - str;
 	}
 
-- 
2.19.0.rc0.412.g7005db4e88


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 5/9] interpret-trailers: allow suppressing "---" divider
  2018-08-23  0:43           ` [PATCH 0/9] trailer-parsing false positives Jeff King
                               ` (3 preceding siblings ...)
  2018-08-23  0:48             ` [PATCH 4/9] interpret-trailers: tighten check for "---" patch boundary Jeff King
@ 2018-08-23  0:49             ` Jeff King
  2018-08-23  0:50             ` [PATCH 6/9] pretty, ref-filter: format %(trailers) with no_divider option Jeff King
                               ` (4 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-23  0:49 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, git

Even with the newly-tightened "---" parser, it's still
possible for a commit message to trigger a false positive if
it contains something like "--- foo". If the caller knows
that it has only a single commit message, it can now tell us
with the "--no-divider" option, eliminating any false
positives.

If we were designing this from scratch, I'd probably make
this the default. But we've advertised the "---" behavior in
the documentation since interpret-trailers has existed.
Since it's meant to be scripted, breaking that would be a
bad idea.

Note that the logic is in the underlying trailer.c code,
which is used elsewhere. The default there will keep the
current behavior, but many callers will benefit from setting
this new option. That's left for future patches.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-interpret-trailers.txt |  7 ++++++-
 builtin/interpret-trailers.c             |  1 +
 t/t7513-interpret-trailers.sh            | 20 ++++++++++++++++++++
 trailer.c                                |  6 +++++-
 trailer.h                                |  1 +
 5 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-interpret-trailers.txt b/Documentation/git-interpret-trailers.txt
index 7385bfdb45..a5e8b36f62 100644
--- a/Documentation/git-interpret-trailers.txt
+++ b/Documentation/git-interpret-trailers.txt
@@ -58,7 +58,7 @@ The group must be preceded by one or more empty (or whitespace-only) lines.
 The group must either be at the end of the message or be the last
 non-whitespace lines before a line that starts with '---' (followed by a
 space or the end of the line). Such three minus signs start the patch
-part of the message.
+part of the message. See also `--no-divider` below.
 
 When reading trailers, there can be whitespaces after the
 token, the separator and the value. There can also be whitespaces
@@ -126,6 +126,11 @@ OPTIONS
 	A convenience alias for `--only-trailers --only-input
 	--unfold`.
 
+--no-divider::
+	Do not treat `---` as the end of the commit message. Use this
+	when you know your input contains just the commit message itself
+	(and not an email or the output of `git format-patch`).
+
 CONFIGURATION VARIABLES
 -----------------------
 
diff --git a/builtin/interpret-trailers.c b/builtin/interpret-trailers.c
index b742539d4d..4b87e0dd2e 100644
--- a/builtin/interpret-trailers.c
+++ b/builtin/interpret-trailers.c
@@ -104,6 +104,7 @@ int cmd_interpret_trailers(int argc, const char **argv, const char *prefix)
 		OPT_BOOL(0, "unfold", &opts.unfold, N_("join whitespace-continued values")),
 		{ OPTION_CALLBACK, 0, "parse", &opts, NULL, N_("set parsing options"),
 			PARSE_OPT_NOARG | PARSE_OPT_NONEG, parse_opt_parse },
+		OPT_BOOL(0, "no-divider", &opts.no_divider, N_("do not treat --- specially")),
 		OPT_CALLBACK(0, "trailer", &trailers, N_("trailer"),
 				N_("trailer(s) to add"), option_parse_trailer),
 		OPT_END()
diff --git a/t/t7513-interpret-trailers.sh b/t/t7513-interpret-trailers.sh
index e13b40b43f..c441861331 100755
--- a/t/t7513-interpret-trailers.sh
+++ b/t/t7513-interpret-trailers.sh
@@ -1439,4 +1439,24 @@ test_expect_success 'handling of --- lines in input' '
 	test_cmp expected actual
 '
 
+test_expect_success 'suppress --- handling' '
+	echo "real-trailer: just right" >expected &&
+
+	git interpret-trailers --parse --no-divider >actual <<-\EOF &&
+	subject
+
+	This commit message has a "---" in it, but because we tell
+	interpret-trailers not to respect that, it has no effect.
+
+	not-a-trailer: too soon
+	---
+
+	This is still the commit message body.
+
+	real-trailer: just right
+	EOF
+
+	test_cmp expected actual
+'
+
 test_done
diff --git a/trailer.c b/trailer.c
index 8392c6c030..0796f326b3 100644
--- a/trailer.c
+++ b/trailer.c
@@ -1080,7 +1080,11 @@ void trailer_info_get(struct trailer_info *info, const char *str,
 
 	ensure_configured();
 
-	patch_start = find_patch_start(str);
+	if (opts->no_divider)
+		patch_start = strlen(str);
+	else
+		patch_start = find_patch_start(str);
+
 	trailer_end = find_trailer_end(str, patch_start);
 	trailer_start = find_trailer_start(str, trailer_end);
 
diff --git a/trailer.h b/trailer.h
index b38c8f0d16..b997739649 100644
--- a/trailer.h
+++ b/trailer.h
@@ -71,6 +71,7 @@ struct process_trailer_options {
 	int only_trailers;
 	int only_input;
 	int unfold;
+	int no_divider;
 };
 
 #define PROCESS_TRAILER_OPTIONS_INIT {0}
-- 
2.19.0.rc0.412.g7005db4e88


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 6/9] pretty, ref-filter: format %(trailers) with no_divider option
  2018-08-23  0:43           ` [PATCH 0/9] trailer-parsing false positives Jeff King
                               ` (4 preceding siblings ...)
  2018-08-23  0:49             ` [PATCH 5/9] interpret-trailers: allow suppressing "---" divider Jeff King
@ 2018-08-23  0:50             ` Jeff King
  2018-08-23  0:50             ` [PATCH 7/9] sequencer: ignore "---" divider when parsing trailers Jeff King
                               ` (3 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-23  0:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, git

In both of these cases we know that we are feeding the
trailer-parsing code a pure commit message. We should tell
it so, which avoids false positives for a commit message
that contains a "---" line.

Signed-off-by: Jeff King <peff@peff.net>
---
 pretty.c                      |  3 +++
 ref-filter.c                  |  2 ++
 t/t4205-log-pretty-formats.sh | 23 +++++++++++++++++++++++
 t/t6300-for-each-ref.sh       | 23 +++++++++++++++++++++++
 4 files changed, 51 insertions(+)

diff --git a/pretty.c b/pretty.c
index 98cf5228f9..8ca29e9281 100644
--- a/pretty.c
+++ b/pretty.c
@@ -1304,6 +1304,9 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
 
 	if (skip_prefix(placeholder, "(trailers", &arg)) {
 		struct process_trailer_options opts = PROCESS_TRAILER_OPTIONS_INIT;
+
+		opts.no_divider = 1;
+
 		if (*arg == ':') {
 			arg++;
 			for (;;) {
diff --git a/ref-filter.c b/ref-filter.c
index 0bccfceff2..3e8ee04d09 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -263,6 +263,8 @@ static int trailers_atom_parser(const struct ref_format *format, struct used_ato
 	struct string_list params = STRING_LIST_INIT_DUP;
 	int i;
 
+	atom->u.contents.trailer_opts.no_divider = 1;
+
 	if (arg) {
 		string_list_split(&params, arg, ',', -1);
 		for (i = 0; i < params.nr; i++) {
diff --git a/t/t4205-log-pretty-formats.sh b/t/t4205-log-pretty-formats.sh
index 2052cadb11..978a8a66ff 100755
--- a/t/t4205-log-pretty-formats.sh
+++ b/t/t4205-log-pretty-formats.sh
@@ -598,4 +598,27 @@ test_expect_success ':only and :unfold work together' '
 	test_cmp expect actual
 '
 
+test_expect_success 'trailer parsing not fooled by --- line' '
+	git commit --allow-empty -F - <<-\EOF &&
+	this is the subject
+
+	This is the body. The message has a "---" line which would confuse a
+	message+patch parser. But here we know we have only a commit message,
+	so we get it right.
+
+	trailer: wrong
+	---
+	This is more body.
+
+	trailer: right
+	EOF
+
+	{
+		echo "trailer: right" &&
+		echo
+	} >expect &&
+	git log --no-walk --format="%(trailers)" >actual &&
+	test_cmp expect actual
+'
+
 test_done
diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index 024f8c06f7..97bfbee6e8 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -715,6 +715,29 @@ test_expect_success 'basic atom: head contents:trailers' '
 	test_cmp expect actual.clean
 '
 
+test_expect_success 'trailer parsing not fooled by --- line' '
+	git commit --allow-empty -F - <<-\EOF &&
+	this is the subject
+
+	This is the body. The message has a "---" line which would confuse a
+	message+patch parser. But here we know we have only a commit message,
+	so we get it right.
+
+	trailer: wrong
+	---
+	This is more body.
+
+	trailer: right
+	EOF
+
+	{
+		echo "trailer: right" &&
+		echo
+	} >expect &&
+	git for-each-ref --format="%(trailers)" refs/heads/master >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'Add symbolic ref for the following tests' '
 	git symbolic-ref refs/heads/sym refs/heads/master
 '
-- 
2.19.0.rc0.412.g7005db4e88


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 7/9] sequencer: ignore "---" divider when parsing trailers
  2018-08-23  0:43           ` [PATCH 0/9] trailer-parsing false positives Jeff King
                               ` (5 preceding siblings ...)
  2018-08-23  0:50             ` [PATCH 6/9] pretty, ref-filter: format %(trailers) with no_divider option Jeff King
@ 2018-08-23  0:50             ` Jeff King
  2018-08-23  0:50             ` [PATCH 8/9] append_signoff: use size_t for string offsets Jeff King
                               ` (2 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-23  0:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, git

When the sequencer code appends a signoff or cherry-pick
origin, it uses the default trailer-parsing options, which
treat "---" as the end of the commit message. As a result,
it may be fooled by a commit message that contains that
string and fail to find the existing trailer block. Even
more confusing, the actual append code does not know about
"---", and always appends to the end of the string. This can
lead to bizarre results. E.g., appending a signoff to a
commit message like this:

  subject

  body
  ---
  these dashes confuse the parser!

  Signed-off-by: A

results in output with a final block like:

  Signed-off-by: A

  Signed-off-by: A

The parser thinks the final line of the message is "body",
and ignores everything else, claiming there are no trailers.
So we output an extra newline separator (wrong) and add a
duplicate signoff (also wrong).

Since we know we are feeding a pure commit message, we can
simply tell the parser to ignore the "---" divider.

Signed-off-by: Jeff King <peff@peff.net>
---
 sequencer.c       |  2 ++
 t/t7501-commit.sh | 16 ++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/sequencer.c b/sequencer.c
index 3e15faa94e..25bdbfcce1 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -232,6 +232,8 @@ static int has_conforming_footer(struct strbuf *sb, struct strbuf *sob,
 	size_t i;
 	int found_sob = 0, found_sob_last = 0;
 
+	opts.no_divider = 1;
+
 	trailer_info_get(&info, sb->buf, &opts);
 
 	if (info.trailer_start == info.trailer_end)
diff --git a/t/t7501-commit.sh b/t/t7501-commit.sh
index 51646d8019..1cabb8b8ed 100755
--- a/t/t7501-commit.sh
+++ b/t/t7501-commit.sh
@@ -517,6 +517,22 @@ Myfooter: x" &&
 	test_cmp expected actual
 '
 
+test_expect_success 'signoff not confused by ---' '
+	cat >expected <<-EOF &&
+		subject
+
+		body
+		---
+		these dashes confuse the parser!
+
+		Signed-off-by: $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL>
+	EOF
+	# should be a noop, since we already signed
+	git commit --allow-empty --signoff -F expected &&
+	git log -1 --pretty=format:%B >actual &&
+	test_cmp expected actual
+'
+
 test_expect_success 'multiple -m' '
 
 	>negative &&
-- 
2.19.0.rc0.412.g7005db4e88


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 8/9] append_signoff: use size_t for string offsets
  2018-08-23  0:43           ` [PATCH 0/9] trailer-parsing false positives Jeff King
                               ` (6 preceding siblings ...)
  2018-08-23  0:50             ` [PATCH 7/9] sequencer: ignore "---" divider when parsing trailers Jeff King
@ 2018-08-23  0:50             ` Jeff King
  2018-08-23  0:51             ` [PATCH 9/9] sequencer: handle ignore_footer when parsing trailers Jeff King
  2018-08-23 18:30             ` [PATCH 0/9] trailer-parsing false positives Junio C Hamano
  9 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-23  0:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, git

The append_signoff() function takes an "int" to specify the
number of bytes to ignore. Most callers just pass 0, and the
remainder use ignore_non_trailer() to skip over cruft.
That function also returns an int, and uses them internally.

On systems where size_t is larger than an int (i.e., most
64-bit systems), dealing with a ridiculously large commit
message could end up overflowing an int, producing
surprising results (e.g., returning a negative offset, which
would cause us to look outside the original string).

Let's consistently use size_t for these offsets through this
whole stack. As a bonus, this makes the meaning of
"ignore_footer" as an offset (and not a boolean) more clear.
But while we're here, let's also document the interface.

Signed-off-by: Jeff King <peff@peff.net>
---
 commit.c    | 6 +++---
 commit.h    | 2 +-
 sequencer.c | 4 ++--
 sequencer.h | 9 ++++++++-
 4 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/commit.c b/commit.c
index 30d1af2b20..9fb67d785d 100644
--- a/commit.c
+++ b/commit.c
@@ -1784,10 +1784,10 @@ const char *find_commit_header(const char *msg, const char *key, size_t *out_len
  * Returns the number of bytes from the tail to ignore, to be fed as
  * the second parameter to append_signoff().
  */
-int ignore_non_trailer(const char *buf, size_t len)
+size_t ignore_non_trailer(const char *buf, size_t len)
 {
-	int boc = 0;
-	int bol = 0;
+	size_t boc = 0;
+	size_t bol = 0;
 	int in_old_conflicts_block = 0;
 	size_t cutoff = wt_status_locate_end(buf, len);
 
diff --git a/commit.h b/commit.h
index da0db36eba..43f93b973d 100644
--- a/commit.h
+++ b/commit.h
@@ -322,7 +322,7 @@ extern const char *find_commit_header(const char *msg, const char *key,
 				      size_t *out_len);
 
 /* Find the end of the log message, the right place for a new trailer. */
-extern int ignore_non_trailer(const char *buf, size_t len);
+extern size_t ignore_non_trailer(const char *buf, size_t len);
 
 typedef int (*each_mergetag_fn)(struct commit *commit, struct commit_extra_header *extra,
 				 void *cb_data);
diff --git a/sequencer.c b/sequencer.c
index 25bdbfcce1..c01ff79ab0 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -225,7 +225,7 @@ static const char *get_todo_path(const struct replay_opts *opts)
  * Returns 3 when sob exists within conforming footer as last entry
  */
 static int has_conforming_footer(struct strbuf *sb, struct strbuf *sob,
-	int ignore_footer)
+	size_t ignore_footer)
 {
 	struct process_trailer_options opts = PROCESS_TRAILER_OPTIONS_INIT;
 	struct trailer_info info;
@@ -3802,7 +3802,7 @@ int sequencer_pick_revisions(struct replay_opts *opts)
 	return res;
 }
 
-void append_signoff(struct strbuf *msgbuf, int ignore_footer, unsigned flag)
+void append_signoff(struct strbuf *msgbuf, size_t ignore_footer, unsigned flag)
 {
 	unsigned no_dup_sob = flag & APPEND_SIGNOFF_DEDUP;
 	struct strbuf sob = STRBUF_INIT;
diff --git a/sequencer.h b/sequencer.h
index c751c9d6e4..c986bc8251 100644
--- a/sequencer.h
+++ b/sequencer.h
@@ -90,7 +90,14 @@ int rearrange_squash(void);
 
 extern const char sign_off_header[];
 
-void append_signoff(struct strbuf *msgbuf, int ignore_footer, unsigned flag);
+/*
+ * Append a signoff to the commit message in "msgbuf". The ignore_footer
+ * parameter specifies the number of bytes at the end of msgbuf that should
+ * not be considered at all. I.e., they are not checked for existing trailers,
+ * and the new signoff will be spliced into the buffer before those bytes.
+ */
+void append_signoff(struct strbuf *msgbuf, size_t ignore_footer, unsigned flag);
+
 void append_conflicts_hint(struct strbuf *msgbuf);
 int message_is_empty(const struct strbuf *sb,
 		     enum commit_msg_cleanup_mode cleanup_mode);
-- 
2.19.0.rc0.412.g7005db4e88


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 9/9] sequencer: handle ignore_footer when parsing trailers
  2018-08-23  0:43           ` [PATCH 0/9] trailer-parsing false positives Jeff King
                               ` (7 preceding siblings ...)
  2018-08-23  0:50             ` [PATCH 8/9] append_signoff: use size_t for string offsets Jeff King
@ 2018-08-23  0:51             ` Jeff King
  2018-08-23 18:30             ` [PATCH 0/9] trailer-parsing false positives Junio C Hamano
  9 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-23  0:51 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, git

The append_signoff() function takes an "ignore_footer"
argument, which specifies a number of bytes at the end of
the message buffer which should not be considered (they
cannot contain trailers, and the trailer is spliced in
before them).

But to find the existing trailers, it calls into
has_conforming_trailer(). That function takes an
ignore_footer parameter, but since 967dfd4d56 (sequencer:
use trailer's trailer layout, 2016-11-02) the parameter is
completely ignored.

The trailer interface we're using takes a single string,
with no option to tell it to use part of the string.
However, since we have a mutable strbuf, we can work around
this by simply overwriting (and later restoring) the
boundary with a NUL.

I'm not sure if this can actually trigger a bug in practice.
It's easy to get a non-zero ignore_footer by doing something
like this:

  git commit -F - --cleanup=verbatim <<-EOF
  subject

  body

  Signed-off-by: me

  # this looks like a comment, but is actually in the
  # message! That makes the earlier s-o-b fake.
  EOF

  git commit --amend -s

There git-commit calls ignore_non_trailer() to count up the
"#" cruft, which becomes the ignore_footer header. But it
works even without this patch! That's because the trailer
code _also_ calls ignore_non_trailer() and skips the cruft,
too. So it happens to work because the only callers with a
non-zero ignore_footer are using the exact same function
that the trailer parser uses internally.

And that seems true for all of the current callers, but
there's nothing guaranteeing it. We're better off only
feeding the correct buffer to the trailer code in the first
place.

Signed-off-by: Jeff King <peff@peff.net>
---
I think this is probably worth doing, even if it's a noop. But I'm
really leaning towards the idea that the trailer code calling
ignore_non_trailer() is probably the wrong thing (in which case this
would be very important, since we'd be relying on the caller to remove
any cruft).

 sequencer.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/sequencer.c b/sequencer.c
index c01ff79ab0..a1f0f17bcd 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -231,11 +231,20 @@ static int has_conforming_footer(struct strbuf *sb, struct strbuf *sob,
 	struct trailer_info info;
 	size_t i;
 	int found_sob = 0, found_sob_last = 0;
+	char saved_char;
 
 	opts.no_divider = 1;
 
+	if (ignore_footer) {
+		saved_char = sb->buf[sb->len - ignore_footer];
+		sb->buf[sb->len - ignore_footer] = '\0';
+	}
+
 	trailer_info_get(&info, sb->buf, &opts);
 
+	if (ignore_footer)
+		sb->buf[sb->len - ignore_footer] = saved_char;
+
 	if (info.trailer_start == info.trailer_end)
 		return 0;
 
-- 
2.19.0.rc0.412.g7005db4e88

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/9] trailer-parsing false positives
  2018-08-23  0:43           ` [PATCH 0/9] trailer-parsing false positives Jeff King
                               ` (8 preceding siblings ...)
  2018-08-23  0:51             ` [PATCH 9/9] sequencer: handle ignore_footer when parsing trailers Jeff King
@ 2018-08-23 18:30             ` Junio C Hamano
  2018-08-24  7:26               ` Jeff King
  9 siblings, 1 reply; 55+ messages in thread
From: Junio C Hamano @ 2018-08-23 18:30 UTC (permalink / raw)
  To: Jeff King; +Cc: Christian Couder, git

Jeff King <peff@peff.net> writes:

> So this turned into a bit of a rabbit hole. Here's what I have so far
> (which I think is not quite ready, but I wanted to start discussing).
>
> Issues (2) and (3) are actually the same issue. The caller that does the
> bogus appending for (2) is always append_signoff(). But it always has
> just a commit message (modulo some complications, which I'll get to in a
> minute), and so fixing it with respect to (3) magically solves (2).
> I.e., the code was simply not prepared for the case of "end of string is
> not end of trailers". But since that case cannot exist, we do not have
> to deal with it. :)
>
> Now here's the tricky part. I think patches 1-8 are mostly sensible.

Yeah, nothing that made me go "Huh?" in these 8 patches.  Thanks.

> So I think there may be further opportunities for cleanup here. I'm not
> sure if we'd need to retain this behavior for git-interpret-trailers.
> AFAICT it is not documented, and I suspect is mostly historical
> accident, and not anything anybody ever wanted.

I tend to think the behaviour was not designed but it "just happens
to work that way".

> If we do keep it by default, then the "--no-divider" option I added in
> patch 4 should probably get a more generic name and cover this.
> Something like "--verbatim-input".

Perhaps.  Even if this is not covered, --verbatim-input would be a
good name for the option ;-)

> I'm going to sleep on it and see how I feel tomorrow.
>
>   [1/9]: trailer: use size_t for string offsets
>   [2/9]: trailer: use size_t for iterating trailer list
>   [3/9]: trailer: pass process_trailer_opts to trailer_info_get()
>   [4/9]: interpret-trailers: tighten check for "---" patch boundary
>   [5/9]: interpret-trailers: allow suppressing "---" divider
>   [6/9]: pretty, ref-filter: format %(trailers) with no_divider option
>   [7/9]: sequencer: ignore "---" divider when parsing trailers
>   [8/9]: append_signoff: use size_t for string offsets
>   [9/9]: sequencer: handle ignore_footer when parsing trailers
>
>  Documentation/git-interpret-trailers.txt | 10 +++-
>  builtin/interpret-trailers.c             |  1 +
>  commit.c                                 |  6 +--
>  commit.h                                 |  2 +-
>  pretty.c                                 |  3 ++
>  ref-filter.c                             |  2 +
>  sequencer.c                              | 20 ++++++--
>  sequencer.h                              |  9 +++-
>  t/t4205-log-pretty-formats.sh            | 23 +++++++++
>  t/t6300-for-each-ref.sh                  | 23 +++++++++
>  t/t7501-commit.sh                        | 16 ++++++
>  t/t7513-interpret-trailers.sh            | 42 ++++++++++++++++
>  trailer.c                                | 62 +++++++++++++-----------
>  trailer.h                                |  4 +-
>  14 files changed, 184 insertions(+), 39 deletions(-)
>
> -Peff

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/9] trailer-parsing false positives
  2018-08-23 18:30             ` [PATCH 0/9] trailer-parsing false positives Junio C Hamano
@ 2018-08-24  7:26               ` Jeff King
  0 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-24  7:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Christian Couder, git

On Thu, Aug 23, 2018 at 11:30:54AM -0700, Junio C Hamano wrote:

> > Now here's the tricky part. I think patches 1-8 are mostly sensible.
> 
> Yeah, nothing that made me go "Huh?" in these 8 patches.  Thanks.
> 
> > So I think there may be further opportunities for cleanup here. I'm not
> > sure if we'd need to retain this behavior for git-interpret-trailers.
> > AFAICT it is not documented, and I suspect is mostly historical
> > accident, and not anything anybody ever wanted.
> 
> I tend to think the behaviour was not designed but it "just happens
> to work that way".

OK, so I've slept on it as promised, and looked at ignore_non_trailer()
a bit more closely.

It actually ignores three things:

 - comment blocks

 - "commit -v" cut lines

 - "Conflicts:" blocks for merges

I think I could see an argument for handling the third type everywhere,
if we wanted trailers to go above such a block (since they really are
part of the actual commit message). Of course, we try to make those
"Conflicts" blocks into comments these days. And looking at our history,
I had trouble finding many examples, since most of the old merges do not
have sign-offs in the first place.

There are a handful with the S-o-b below the conflict block (e.g.,
a24a32ddb). There are even some gems like 425b139313 and a5db4b127b)
where there is one sign-off above the block and one below. In our
history, there's nothing more recent than 2015, which is not incredibly
long after 261f315beb (merge & sequencer: turn "Conflicts:" hint into a
comment, 2014-10-28).

But in linux.git, I can find many examples from this year of Conflicts
blocks with the signoffs afterwards. And a few with the signoffs before
(e.g., ed09f6d05c).  I think the current code will do the right thing
either way when parsing those (if the trailers are after, we won't
exclude those, but if they're before, then we'll skip over the Conflicts
block).

So frankly, that all makes me afraid of touching any of it. I do think
it's probably doing the wrong thing in some cases, and we should
probably just make a rule like "trailers always go at the bottom, end of
story". But there are enough weird and historical cases, not to mention
potential interactions with git-commit, that I'd be quite likely to
regress some other case.

So my inclination is to punt on it for now, and go with my patches 1-8,
which fix an actual case that we saw in the real world, without creating
other problems.

I think patch 9 is not hurting anything and may later help us, but I
could take or leave it.

> > If we do keep it by default, then the "--no-divider" option I added in
> > patch 4 should probably get a more generic name and cover this.
> > Something like "--verbatim-input".
> 
> Perhaps.  Even if this is not covered, --verbatim-input would be a
> good name for the option ;-)

Possibly. :) What I was worried about is realizing that it's not really
"verbatim", though, but rather "some mystical set of rules including
nonsense like git-commit cut-lines". And so we should not over-promise
with the option name.

I'd also be OK to call it "verbatim" and consider it a to-be-fixed bug
that it still respects these weird rules. I'm just not sure I want to
spend more time digging on those weird rules now.

-Peff

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 5/6] pack-bitmap: save "have" bitmap from walk
  2018-08-21 19:07 ` [PATCH 5/6] pack-bitmap: save "have" bitmap from walk Jeff King
  2018-08-21 19:47   ` Derrick Stolee
@ 2018-08-31 15:23   ` Ævar Arnfjörð Bjarmason
  2018-08-31 22:55     ` Jeff King
  1 sibling, 1 reply; 55+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-08-31 15:23 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano


On Tue, Aug 21 2018, Jeff King wrote:

> +int bitmap_has_sha1_in_uninteresting(struct bitmap_index *bitmap_git,
> +				     const unsigned char *sha1)
> +{
> +	int pos;
> +
> +	if (!bitmap_git)
> +		return 0; /* no bitmap loaded */
> +	if (!bitmap_git->result)
> +		BUG("failed to perform bitmap walk before querying");

Some part of what calls this completely breaks pushing from the "next"
branch when you have local bitmaps (we *really* should have some tests
for this...).

I don't have time to dig now, but just on my local git.git on next @
b1634b371dc2e46f9b43c45fd1857c2e2688f96e:

    u git (next $=) $ git repack -Ad --window=10 --depth=10 --write-bitmap-index --pack-kept-objects
    Enumerating objects: 616909, done.
    Counting objects: 100% (616909/616909), done.
    Delta compression using up to 8 threads
    Compressing objects: 100% (146475/146475), done.
    Writing objects: 100% (616909/616909), done.
    Selecting bitmap commits: 168027, done.                                                                                                                                                                           BBuilding bitmaps: 100% (338/338), done.
    Total 616909 (delta 497494), reused 582609 (delta 467530)
    u git (next $=) $ ./git --exec-path=$PWD push avar next:avar/next-push
    Enumerating objects: 1330, done.
    BUG: pack-bitmap.c:1132: failed to perform bitmap walk before querying
    error: pack-objects died of signal 6
    error: pack-objects died of signal 6
    error: remote unpack failed: eof before pack header was fully read
    error: failed to push some refs to 'git@github.com:avar/git.git'

Removing the bitmap makes it work again:

    u git (next $=) $ rm .git/objects/pack/pack-496088d9464cd79dfcac50dd0d72dcd6faee8253.bitmap
    rm: remove write-protected regular file '.git/objects/pack/pack-496088d9464cd79dfcac50dd0d72dcd6faee8253.bitmap'? y
    u git (next $=) $ ./git --exec-path=$PWD push avar next:avar/next-push
    Enumerating objects: 2834, done.
    Counting objects: 100% (1799/1799), done.
    Delta compression using up to 8 threads
    Compressing objects: 100% (591/591), done.
    Writing objects: 100% (1390/1390), 430.11 KiB | 15.36 MiB/s, done.
    Total 1390 (delta 1100), reused 1072 (delta 798)
    remote: Resolving deltas: 100% (1100/1100), completed with 214 local objects.
    To github.com:avar/git.git
     * [new branch]            next -> avar/next-push

Today I deployed next + my patches @ work. Broke pushes in one place
where repacks were being done manually with --write-bitmap-index.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 5/6] pack-bitmap: save "have" bitmap from walk
  2018-08-31 15:23   ` Ævar Arnfjörð Bjarmason
@ 2018-08-31 22:55     ` Jeff King
  2018-09-01  7:41       ` [PATCH 0/4] un-breaking pack-objects with bitmaps Jeff King
  0 siblings, 1 reply; 55+ messages in thread
From: Jeff King @ 2018-08-31 22:55 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, Junio C Hamano

On Fri, Aug 31, 2018 at 05:23:17PM +0200, Ævar Arnfjörð Bjarmason wrote:

> On Tue, Aug 21 2018, Jeff King wrote:
> 
> > +int bitmap_has_sha1_in_uninteresting(struct bitmap_index *bitmap_git,
> > +				     const unsigned char *sha1)
> > +{
> > +	int pos;
> > +
> > +	if (!bitmap_git)
> > +		return 0; /* no bitmap loaded */
> > +	if (!bitmap_git->result)
> > +		BUG("failed to perform bitmap walk before querying");
> 
> Some part of what calls this completely breaks pushing from the "next"
> branch when you have local bitmaps (we *really* should have some tests
> for this...).

Yikes, thanks for reporting. I agree we need better tests here.

This assertion is totally bogus, as traverse_bitmap_commit_list() will
actually free (and NULL) the result field! This is a holdover from the
original bitmap code, where bitmap_git was a static, and this is how you
might prepare a second walk (though AFAIK only ever do one walk per
process). These days prepare_bitmap_walk() actually returns a fresh
bitmap_index struct.

So there's no way to know if traverse_bitmap_commit_list() was called.
But as it turns out, we don't care. We want to know if
"bitmap_git->have" was set up, which is done in prepare_bitmap_walk().
So there's no way[1] to get a bitmap_git that hasn't been properly set
up. This BUG() check can just go away.

I'll prepare a patch and some tests later tonight.

-Peff

[1] Actually, there is also prepare_bitmap_git(), but it is not really
    for general use by callers. It should be made static, or better yet,
    I suspect it can be folded into its callers.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 0/4] un-breaking pack-objects with bitmaps
  2018-08-31 22:55     ` Jeff King
@ 2018-09-01  7:41       ` Jeff King
  2018-09-01  7:44         ` [PATCH 1/4] bitmap_has_sha1_in_uninteresting(): drop BUG check Jeff King
                           ` (5 more replies)
  0 siblings, 6 replies; 55+ messages in thread
From: Jeff King @ 2018-09-01  7:41 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Stefan Beller, git, Junio C Hamano

On Fri, Aug 31, 2018 at 06:55:58PM -0400, Jeff King wrote:

> On Fri, Aug 31, 2018 at 05:23:17PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> > On Tue, Aug 21 2018, Jeff King wrote:
> > 
> > > +int bitmap_has_sha1_in_uninteresting(struct bitmap_index *bitmap_git,
> > > +				     const unsigned char *sha1)
> > > +{
> > > +	int pos;
> > > +
> > > +	if (!bitmap_git)
> > > +		return 0; /* no bitmap loaded */
> > > +	if (!bitmap_git->result)
> > > +		BUG("failed to perform bitmap walk before querying");
> > 
> > Some part of what calls this completely breaks pushing from the "next"
> > branch when you have local bitmaps (we *really* should have some tests
> > for this...).
> 
> Yikes, thanks for reporting. I agree we need better tests here.

OK, here is the fix. Since the problem is in 'next', this is done as a
patch on top of jk/pack-delta-reuse-with-bitmap. But since we're set to
rewind 'next' post-release anyway, we could squash it directly into
30cdc33fba from the original series. That would help later bisections
from running into it, which may be worth it as it's a pretty severe
breakage.  Or maybe not:

  1. The test suite doesn't actually fail, because it's toy repos are
     too small.

  2. It only triggers in the real-world if you have bitmaps turned on,
     which are not the default.

So it may not be that likely in practice to bother a hypothetical future
bisecting developer.

> [1] Actually, there is also prepare_bitmap_git(), but it is not really
>     for general use by callers. It should be made static, or better yet,
>     I suspect it can be folded into its callers.

This actually turned out not to work. There's a caller over in
pack-bitmap-write.c, and it makes things worse to try to expand the
logic there. So it technically _is_ possible to have a bitmap_index
without a "have" field, but it also doesn't make sense to ask about
"uninteresting" objects there. You haven't done (and cannot do) a
traversal on such an object.

Which I think goes back to Stefan's original question: is this just a
crappy API? And the answer is "yes, to some degree". There are really
two uses of bitmaps:

  - you want to do a traverse_commit_list() walk, but faster

  - you want to selectively query the on-disk bitmaps (e.g., you are
    walking for --contains and want to ask "do we have a bitmap for this
    object?"

Those currently use the same struct bitmap_index, but with two different
constructors (prepare_bitmap_git and prepare_bitmap_walk). It probably
ought to be two different ones (with the "walk" variant using the
"query" variant under the hood). I've punted on that full conversion for
now, but did clean up a few confusing bits.

  [1/4]: bitmap_has_sha1_in_uninteresting(): drop BUG check

    The actual fix. This should get merged to next ASAP (or the original
    topic just reverted).

  [2/4]: t5310: test delta reuse with bitmaps

    I did this separately to give us flexibility to squash or merge
    quickly. But it does find Ævar's bug on a git without patch 1.

  [3/4]: traverse_bitmap_commit_list(): don't free result

    The original assert should have simply been useless, but it was the
    surprising behavior of this function that turned it into a bug.

  [4/4]: pack-bitmap: drop "loaded" flag

    And this is just an annoyance I ran into, which is a fallout from
    our conversion to using an allocated bitmap_index struct.

 pack-bitmap.c           | 14 ++-----
 pack-bitmap.h           |  2 +-
 t/t5310-pack-bitmaps.sh | 93 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 97 insertions(+), 12 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/4] bitmap_has_sha1_in_uninteresting(): drop BUG check
  2018-09-01  7:41       ` [PATCH 0/4] un-breaking pack-objects with bitmaps Jeff King
@ 2018-09-01  7:44         ` Jeff King
  2018-09-01  7:48         ` [PATCH 2/4] t5310: test delta reuse with bitmaps Jeff King
                           ` (4 subsequent siblings)
  5 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-09-01  7:44 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Stefan Beller, git, Junio C Hamano

Commit 30cdc33fba (pack-bitmap: save "have" bitmap from
walk, 2018-08-21) introduced a new function for looking at
the "have" side of a bitmap walk. Because it only makes
sense to do so after we've finished the walk, we added an
extra safety assertion, making sure that bitmap_git->result
is non-NULL.

However, this safety is misguided. It was trying to catch
the case where we had called prepare_bitmap_walk() to give
us a "struct bitmap_index", but had not yet called
traverse_bitmap_commit_list() to walk it. But all of the
interesting computation (including setting up the result and
"have" bitmaps) happens in the first function! The latter
function only delivers the result to a callback function.

So the case we were worried about is impossible; if you get
a non-NULL result from prepare_bitmap_walk(), then its
"have" field will be fully formed.

But much worse, traverse_bitmap_commit_list() actually frees
the result field as it finishes. Which means that this
assertion is worse than useless: it's almost guaranteed to
trigger!

Our test suite didn't catch this because the function isn't
actually exercised at all. The only caller comes from
6a1e32d532 (pack-objects: reuse on-disk deltas for thin
"have" objects, 2018-08-21), and that's triggered only when
you fetch or push history that contains an object with a
base that is found deep in history. Our test suite fetches
and pushes either don't use bitmaps, or use too-small
example repositories. But any reasonably-sized real-world
push or fetch (with bitmaps) would trigger this.

This patch drops the harmful assertion and tweaks the
docstring for the function to make the precondition clear.
The tests need to be improved to exercise this new
pack-objects feature, but we'll do that in a separate
commit.

Signed-off-by: Jeff King <peff@peff.net>
---
 pack-bitmap.c | 2 --
 pack-bitmap.h | 2 +-
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index c3231ef9ef..76fd93a3de 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1128,8 +1128,6 @@ int bitmap_has_sha1_in_uninteresting(struct bitmap_index *bitmap_git,
 
 	if (!bitmap_git)
 		return 0; /* no bitmap loaded */
-	if (!bitmap_git->result)
-		BUG("failed to perform bitmap walk before querying");
 	if (!bitmap_git->haves)
 		return 0; /* walk had no "haves" */
 
diff --git a/pack-bitmap.h b/pack-bitmap.h
index c633bf5238..189dd68ad3 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -54,7 +54,7 @@ int rebuild_existing_bitmaps(struct bitmap_index *, struct packing_data *mapping
 void free_bitmap_index(struct bitmap_index *);
 
 /*
- * After a traversal has been performed on the bitmap_index, this can be
+ * After a traversal has been performed by prepare_bitmap_walk(), this can be
  * queried to see if a particular object was reachable from any of the
  * objects flagged as UNINTERESTING.
  */
-- 
2.19.0.rc1.549.g6ce4dc0f08


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 2/4] t5310: test delta reuse with bitmaps
  2018-09-01  7:41       ` [PATCH 0/4] un-breaking pack-objects with bitmaps Jeff King
  2018-09-01  7:44         ` [PATCH 1/4] bitmap_has_sha1_in_uninteresting(): drop BUG check Jeff King
@ 2018-09-01  7:48         ` Jeff King
  2018-09-01  8:03           ` Jeff King
  2018-09-01  7:49         ` [PATCH 3/4] traverse_bitmap_commit_list(): don't free result Jeff King
                           ` (3 subsequent siblings)
  5 siblings, 1 reply; 55+ messages in thread
From: Jeff King @ 2018-09-01  7:48 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Stefan Beller, git, Junio C Hamano

Commit 6a1e32d532 (pack-objects: reuse on-disk deltas for
thin "have" objects, 2018-08-21) taught pack-objects a new
optimization trick. Since this wasn't meant to change
user-visible behavior, but only produce smaller packs more
quickly, testing focused on t/perf/p5311.

However, since people don't run perf tests very often, we
should make sure that the feature is exercised in the
regular test suite. This patch does so.

Signed-off-by: Jeff King <peff@peff.net>
---
Side note: If we do squash patch 1 into the eariler series, that will
invalidate the commit id mentioned in the commit message here.

 t/t5310-pack-bitmaps.sh | 93 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 93 insertions(+)

diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 557bd0d0c0..af38776054 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -342,4 +342,97 @@ test_expect_success 'truncated bitmap fails gracefully' '
 	test_i18ngrep corrupt stderr
 '
 
+# have_delta <obj> <expected_base>
+#
+# Note that because this relies on cat-file, it might find _any_ copy of an
+# object in the repository. The caller is responsible for making sure
+# there's only one (e.g., via "repack -ad", or having just fetched a copy).
+have_delta () {
+	echo $2 >expect &&
+	echo $1 | git cat-file --batch-check="%(deltabase)" >actual &&
+	test_cmp expect actual
+}
+
+# Create a state of history with these properties:
+#
+#  - refs that allow a client to fetch some new history, while sharing some old
+#    history with the server; we use branches delta-reuse-old and
+#    delta-reuse-new here
+#
+#  - the new history contains an object that is stored on the server as a delta
+#    against a base that is in the old history
+#
+#  - the base object is not immediately reachable from the tip of the old
+#    history; finding it would involve digging down through history we know the
+#    other side has
+#
+# This should result in a state where fetching from old->new would not
+# traditionally reuse the on-disk delta (because we'd have to dig to realize
+# that the client has it), but we will do so if bitmaps can tell us cheaply
+# that the other side has it.
+test_expect_success 'set up thin delta-reuse parent' '
+	# This first commit contains the buried base object.
+	test-tool genrandom delta 16384 >file &&
+	git add file &&
+	git commit -m "delta base" &&
+	base=$(git rev-parse --verify HEAD:file) &&
+
+	# These intermediate commits bury the base back in history.
+	# This becomes the "old" state.
+	for i in 1 2 3 4 5
+	do
+		echo $i >file &&
+		git commit -am "intermediate $i" || return 1
+	done &&
+	git branch delta-reuse-old &&
+
+	# And now our new history has a delta against the buried base. Note
+	# that this must be smaller than the original file, since pack-objects
+	# prefers to create deltas from smaller objects to larger.
+	test-tool genrandom delta 16300 >file &&
+	git commit -am "delta result" &&
+	delta=$(git rev-parse --verify HEAD:file) &&
+	git branch delta-reuse-new &&
+
+	# Repack with bitmaps and double check that we have the expected delta
+	# relationship.
+	git repack -adb &&
+	have_delta $delta $base
+'
+
+# Now we can sanity-check the non-bitmap behavior (that the server is not able
+# to reuse the delta). This isn't strictly something we care about, so this
+# test could be scrapped in the future. But it makes sure that the next test is
+# actually triggering the feature we want.
+#
+# Note that our tools for working with on-the-wire "thin" packs are limited. So
+# we actually perform the fetch, retain the resulting pack, and inspect the
+# result.
+test_expect_success 'fetch without bitmaps ignores delta against old base' '
+	test_config pack.usebitmaps false &&
+	test_when_finished "rm -rf client.git" &&
+	git init --bare client.git &&
+	(
+		cd client.git &&
+		git config transfer.unpackLimit 1 &&
+		git fetch .. delta-reuse-old:delta-reuse-old &&
+		git fetch .. delta-reuse-new:delta-reuse-new &&
+		have_delta $delta $ZERO_OID
+	)
+'
+
+# And do the same for the bitmap case, where we do expect to find the delta.
+test_expect_success 'fetch with bitmaps can reuse old base' '
+	test_config pack.usebitmaps true &&
+	test_when_finished "rm -rf client.git" &&
+	git init --bare client.git &&
+	(
+		cd client.git &&
+		git config transfer.unpackLimit 1 &&
+		git fetch .. delta-reuse-old:delta-reuse-old &&
+		git fetch .. delta-reuse-new:delta-reuse-new &&
+		have_delta $delta $base
+	)
+'
+
 test_done
-- 
2.19.0.rc1.549.g6ce4dc0f08


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 3/4] traverse_bitmap_commit_list(): don't free result
  2018-09-01  7:41       ` [PATCH 0/4] un-breaking pack-objects with bitmaps Jeff King
  2018-09-01  7:44         ` [PATCH 1/4] bitmap_has_sha1_in_uninteresting(): drop BUG check Jeff King
  2018-09-01  7:48         ` [PATCH 2/4] t5310: test delta reuse with bitmaps Jeff King
@ 2018-09-01  7:49         ` Jeff King
  2018-09-01  7:50         ` [PATCH 4/4] pack-bitmap: drop "loaded" flag Jeff King
                           ` (2 subsequent siblings)
  5 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-09-01  7:49 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Stefan Beller, git, Junio C Hamano

Since it was introduced in fff42755ef (pack-bitmap: add
support for bitmap indexes, 2013-12-21), this function has
freed the result after traversing it. That is an artifact of
the early days of the bitmap code, when we had a single
static "struct bitmap_index". Back then, it was intended
that you would do:

  prepare_bitmap_walk(&revs);
  traverse_bitmap_commit_list(&revs);

Since the actual bitmap_index struct was totally behind the
scenes, it was convenient for traverse_bitmap_commit_list()
to clean it up, clearing the way for another traversal.

But since 3ae5fa0768 (pack-bitmap: remove bitmap_git global
variable, 2018-06-07), the caller explicitly manages the
bitmap_index struct itself, like this:

  b = prepare_bitmap_walk(&revs);
  traverse_bitmap_commit_list(b, &revs);
  free_bitmap_index(b);

It no longer makes sense to auto-free the result after the
traversal. If you want to do another traversal, you'd just
create a new bitmap_index. And while nobody tries to call
traverse_bitmap_commit_list() twice, the fact that it throws
away the result might be surprising, and is better avoided.

Note that in the "old" way it was possible for two walks to
amortize the cost of opening the on-disk .bitmap file (since
it was stored in the global bitmap_index), but we lost that
in 3ae5fa0768. However, no code actually does this, so it's
not worth addressing now. The solution might involve a new:

  reset_bitmap_walk(b, &revs);

call. Or we might even attach the bitmap data to its
matching packed_git struct, so that multiple
prepare_bitmap_walk() calls could use it. That can wait
until somebody actually has need of the optimization (and
until then, we'll do the correct, unsurprising thing).

Signed-off-by: Jeff King <peff@peff.net>
---
 pack-bitmap.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 76fd93a3de..8c1af1cca2 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -848,9 +848,6 @@ void traverse_bitmap_commit_list(struct bitmap_index *bitmap_git,
 		OBJ_TAG, show_reachable);
 
 	show_extended_objects(bitmap_git, show_reachable);
-
-	bitmap_free(bitmap_git->result);
-	bitmap_git->result = NULL;
 }
 
 static uint32_t count_object_type(struct bitmap_index *bitmap_git,
-- 
2.19.0.rc1.549.g6ce4dc0f08


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 4/4] pack-bitmap: drop "loaded" flag
  2018-09-01  7:41       ` [PATCH 0/4] un-breaking pack-objects with bitmaps Jeff King
                           ` (2 preceding siblings ...)
  2018-09-01  7:49         ` [PATCH 3/4] traverse_bitmap_commit_list(): don't free result Jeff King
@ 2018-09-01  7:50         ` Jeff King
  2018-09-04 19:30         ` [PATCH 0/4] un-breaking pack-objects with bitmaps Stefan Beller
  2018-09-08  6:43         ` Ævar Arnfjörð Bjarmason
  5 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-09-01  7:50 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Stefan Beller, git, Junio C Hamano

In the early days of the bitmap code, there was a single
static bitmap_index struct that was used behind the scenes,
and any bitmap-related functions could lazily check
bitmap_git.loaded to see if they needed to read the on-disk
data.

But since 3ae5fa0768 (pack-bitmap: remove bitmap_git global
variable, 2018-06-07), the caller is responsible for the
lifetime of the bitmap_index struct, and we return it from
prepare_bitmap_git() and prepare_bitmap_walk(), both of
which load the on-disk data (or return NULL).

So outside of these functions, it's not possible to have a
bitmap_index for which the loaded flag is not true. Nor is
it possible to accidentally pass an already-loaded
bitmap_index to the loading function (which is static-local
to the file).

We can drop this unnecessary and confusing flag.

Signed-off-by: Jeff King <peff@peff.net>
---
 pack-bitmap.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 8c1af1cca2..40debd5e20 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -91,8 +91,6 @@ struct bitmap_index {
 
 	/* Version of the bitmap index */
 	unsigned int version;
-
-	unsigned loaded : 1;
 };
 
 static struct ewah_bitmap *lookup_stored_bitmap(struct stored_bitmap *st)
@@ -306,7 +304,7 @@ static int open_pack_bitmap_1(struct bitmap_index *bitmap_git, struct packed_git
 
 static int load_pack_bitmap(struct bitmap_index *bitmap_git)
 {
-	assert(bitmap_git->map && !bitmap_git->loaded);
+	assert(bitmap_git->map);
 
 	bitmap_git->bitmaps = kh_init_sha1();
 	bitmap_git->ext_index.positions = kh_init_sha1_pos();
@@ -321,7 +319,6 @@ static int load_pack_bitmap(struct bitmap_index *bitmap_git)
 	if (load_bitmap_entries_v1(bitmap_git) < 0)
 		goto failed;
 
-	bitmap_git->loaded = 1;
 	return 0;
 
 failed:
@@ -336,7 +333,7 @@ static int open_pack_bitmap(struct bitmap_index *bitmap_git)
 	struct packed_git *p;
 	int ret = -1;
 
-	assert(!bitmap_git->map && !bitmap_git->loaded);
+	assert(!bitmap_git->map);
 
 	for (p = get_packed_git(the_repository); p; p = p->next) {
 		if (open_pack_bitmap_1(bitmap_git, p) == 0)
@@ -738,7 +735,7 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
 	 * from disk. this is the point of no return; after this the rev_list
 	 * becomes invalidated and we must perform the revwalk through bitmaps
 	 */
-	if (!bitmap_git->loaded && load_pack_bitmap(bitmap_git) < 0)
+	if (load_pack_bitmap(bitmap_git) < 0)
 		goto cleanup;
 
 	object_array_clear(&revs->pending);
-- 
2.19.0.rc1.549.g6ce4dc0f08

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/4] t5310: test delta reuse with bitmaps
  2018-09-01  7:48         ` [PATCH 2/4] t5310: test delta reuse with bitmaps Jeff King
@ 2018-09-01  8:03           ` Jeff King
  2018-09-01 20:29             ` Ævar Arnfjörð Bjarmason
  2018-09-04 19:05             ` Stefan Beller
  0 siblings, 2 replies; 55+ messages in thread
From: Jeff King @ 2018-09-01  8:03 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Stefan Beller, git, Junio C Hamano

On Sat, Sep 01, 2018 at 03:48:13AM -0400, Jeff King wrote:

> Commit 6a1e32d532 (pack-objects: reuse on-disk deltas for
> thin "have" objects, 2018-08-21) taught pack-objects a new
> optimization trick. Since this wasn't meant to change
> user-visible behavior, but only produce smaller packs more
> quickly, testing focused on t/perf/p5311.
> 
> However, since people don't run perf tests very often, we
> should make sure that the feature is exercised in the
> regular test suite. This patch does so.

This, by the way, is the crux of how such an obvious and severe bug made
it to 'next'.

The original series was tested quite extensively via t/perf and in
production at GitHub. When I re-rolled v2, the only change was the
addition of the assertion, so I didn't bother re-doing the perf tests,
since they're slow and there wouldn't be a measurable impact.

I did run the normal test suite (as I'm sure Junio did, too) as a
double-check for correctness, but as we noticed, the code wasn't
actually exercised there.

Nor had I yet backported the revised series to the version we run at
GitHub, so it hadn't been run there, either.

And all of that coupled with the fact that it only triggers with
bitmaps, so day-to-day use of the buggy Git (like Junio trying to push
out the result ;) ) wouldn't show it.

Anyway. Not that exciting, and kind of obviously dumb in retrospect. But
I think it was worth analyzing to see what went wrong. If there's an
immediate lesson, it is probably: add tests even for changes that aren't
really user-visible to make sure the code is exercised.

There may be a larger lesson about tracking code coverage, but I don't
know that most general code coverage tools would have helped (any
overall percentage number would be too large to move). A tool that
looked at the diff and said "of the N lines you added/touched, this
percent is exercised in the test suite" might have been useful.

-Peff

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/4] t5310: test delta reuse with bitmaps
  2018-09-01  8:03           ` Jeff King
@ 2018-09-01 20:29             ` Ævar Arnfjörð Bjarmason
  2018-09-01 22:46               ` Ben Peart
  2018-09-02  5:51               ` Jeff King
  2018-09-04 19:05             ` Stefan Beller
  1 sibling, 2 replies; 55+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-09-01 20:29 UTC (permalink / raw)
  To: Jeff King; +Cc: Stefan Beller, git, Junio C Hamano, Derrick Stolee, Ben Peart


On Sat, Sep 01 2018, Jeff King wrote:

> On Sat, Sep 01, 2018 at 03:48:13AM -0400, Jeff King wrote:
>
>> Commit 6a1e32d532 (pack-objects: reuse on-disk deltas for
>> thin "have" objects, 2018-08-21) taught pack-objects a new
>> optimization trick. Since this wasn't meant to change
>> user-visible behavior, but only produce smaller packs more
>> quickly, testing focused on t/perf/p5311.
>>
>> However, since people don't run perf tests very often, we
>> should make sure that the feature is exercised in the
>> regular test suite. This patch does so.
>
> This, by the way, is the crux of how such an obvious and severe bug made
> it to 'next'.
>
> The original series was tested quite extensively via t/perf and in
> production at GitHub. When I re-rolled v2, the only change was the
> addition of the assertion, so I didn't bother re-doing the perf tests,
> since they're slow and there wouldn't be a measurable impact.
>
> I did run the normal test suite (as I'm sure Junio did, too) as a
> double-check for correctness, but as we noticed, the code wasn't
> actually exercised there.
>
> Nor had I yet backported the revised series to the version we run at
> GitHub, so it hadn't been run there, either.
>
> And all of that coupled with the fact that it only triggers with
> bitmaps, so day-to-day use of the buggy Git (like Junio trying to push
> out the result ;) ) wouldn't show it.
>
> Anyway. Not that exciting, and kind of obviously dumb in retrospect. But
> I think it was worth analyzing to see what went wrong. If there's an
> immediate lesson, it is probably: add tests even for changes that aren't
> really user-visible to make sure the code is exercised.

Test-wise, isn't the problem rather that that we didn't have something
like what's described in t/README as "Running tests with special setups"
for bitmaps? I.e. stuff like GIT_TEST_SPLIT_INDEX=<bool>, or running it
with GIT_FSMONITOR_TEST=$PWD/t7519/fsmonitor-all to stress the fsmonitor
code.

That comment b.t.w. is not meant as a "you should have done that!"
blame, but just musings on how we could make things better.

Git has things like bitmaps, midx, commit graph, and probably a few
other things I'm forgetting which all have their own tests, but really
fall more in the category of something like the split index in that they
can potentially impact every test in some unexpected way.

So we could add some option to the test suite to e.g. run a custom
command before every "git push" or "git fetch", and then just do a gc
with a repack/commit graph write/midx write etc. in that codepath, along
with (in the case of stuff like midx) setting any neede config knobs to
turn it on.

Of course the utility of that sort of thing is limited unless we have
some dedicated smoke testers or CI capacity to run the various
combinations of those options. But FWIW when I build our own in-house
git I build the package with:

    # Set "false" to test the build procedure itself
    if true
    then
        export BKNG_GIT_HARNESS_OPTIONS="%{?_smp_mflags} --state=failed,slow,save --timer"
        echo Testing without any custom options:
        (cd t && /usr/bin/prove $BKNG_GIT_HARNESS_OPTIONS t[0-9]*.sh)

        echo Testing while roundtripping everything through the fsmonitor codepath:
        (cd t && GIT_FSMONITOR_TEST=$PWD/t7519/fsmonitor-all GIT_SKIP_TESTS="t3404.7 t7411.3 t7411.4" /usr/bin/prove $BKNG_GIT_HARNESS_OPTIONS t[0-9]*.sh)

        echo Testing split index
        (cd t && GIT_TEST_SPLIT_INDEX=true GIT_SKIP_TESTS="t3903 t4015.77" /usr/bin/prove $BKNG_GIT_HARNESS_OPTIONS t[0-9]*.sh)

        echo Testing uncommon pack modes. See ci/run-tests.sh in git
        (cd t && GIT_TEST_FULL_IN_PACK_ARRAY=true GIT_TEST_OE_SIZE=10 /usr/bin/prove $BKNG_GIT_HARNESS_OPTIONS t[0-9]*.sh)
    fi

Those skipped tests are various intermittent bugs related to those
codpaths which I haven't had time to track down / report yet.

So if there was a "test bitmaps everywhere" mode that would have been
caught during the build, unless I've misunderstood how this particular
bug manifests, but then again, it happened on just a plain git.git after
repack, so wasn't any bitmap + push pretty much all that was needed?, I
haven't read your patches in any detail.

B.t.w. for Ben or anyone else who knows about the fsmonitor part of
this: I've long been running the whole test suite with
`GIT_FSMONITOR_TEST=$PWD/t7519/fsmonitor-all prove ...` (also along with
GIT_TEST_SPLIT_INDEX=) after all the main tests pass as additional
stress testing.

It's not documented under the "special setups" section. So I was going
to add it, but I see that in 5c8cdcfd80 ("fsmonitor: add test cases for
fsmonitor extension", 2017-09-22) it's documented that you should also
set GIT_FORCE_PRELOAD_TEST=true, is that needed for GIT_FSMONITOR_TEST?
Or is it yet another mode, and if so to be combined with fsmonitor in
particular, or stand-alone?

> There may be a larger lesson about tracking code coverage, but I don't
> know that most general code coverage tools would have helped (any
> overall percentage number would be too large to move). A tool that
> looked at the diff and said "of the N lines you added/touched, this
> percent is exercised in the test suite" might have been useful.

This would be very useful.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/4] t5310: test delta reuse with bitmaps
  2018-09-01 20:29             ` Ævar Arnfjörð Bjarmason
@ 2018-09-01 22:46               ` Ben Peart
  2018-09-02  5:51               ` Jeff King
  1 sibling, 0 replies; 55+ messages in thread
From: Ben Peart @ 2018-09-01 22:46 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Jeff King
  Cc: Stefan Beller, git, Junio C Hamano, Derrick Stolee, Ben Peart



On 9/1/2018 4:29 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> 
> B.t.w. for Ben or anyone else who knows about the fsmonitor part of
> this: I've long been running the whole test suite with
> `GIT_FSMONITOR_TEST=$PWD/t7519/fsmonitor-all prove ...` (also along with
> GIT_TEST_SPLIT_INDEX=) after all the main tests pass as additional
> stress testing.
> 
> It's not documented under the "special setups" section. So I was going
> to add it, but I see that in 5c8cdcfd80 ("fsmonitor: add test cases for
> fsmonitor extension", 2017-09-22) it's documented that you should also
> set GIT_FORCE_PRELOAD_TEST=true, is that needed for GIT_FSMONITOR_TEST?
> Or is it yet another mode, and if so to be combined with fsmonitor in
> particular, or stand-alone?
> 

I was unaware of the "special setups" section until recently so if you 
would add the fsmonitor information that would be great.

I added the GIT_FORCE_PRELOAD_TEST as when I went to test fsmonitor I 
realized there was no way to manually override the default behavior and 
force preload to happen. I added tests into t7519-status-fsmonitor.sh 
that use this capability to test fsmonitor with and without preload to 
ensure they play nice together as I had to add logic into the preload 
code to mark cache entries as clean for fsmonitor.

GIT_FORCE_PRELOAD_TEST should probably be added to the "special setups" 
as well so that the entire test suite can be used to exercise that code 
path.

Ultimately, GIT_FSMONITOR_TEST and GIT_FORCE_PRELOAD_TEST can be used to 
test their respective code paths using the test suite.  The challenge is 
that the number of potential combinations (including split index, 
uncommon pack modes, etc) gets large pretty quickly.

>> There may be a larger lesson about tracking code coverage, but I don't
>> know that most general code coverage tools would have helped (any
>> overall percentage number would be too large to move). A tool that
>> looked at the diff and said "of the N lines you added/touched, this
>> percent is exercised in the test suite" might have been useful.
> 
> This would be very useful.
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/4] t5310: test delta reuse with bitmaps
  2018-09-01 20:29             ` Ævar Arnfjörð Bjarmason
  2018-09-01 22:46               ` Ben Peart
@ 2018-09-02  5:51               ` Jeff King
  1 sibling, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-09-02  5:51 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Stefan Beller, git, Junio C Hamano, Derrick Stolee, Ben Peart

On Sat, Sep 01, 2018 at 10:29:25PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > Anyway. Not that exciting, and kind of obviously dumb in retrospect. But
> > I think it was worth analyzing to see what went wrong. If there's an
> > immediate lesson, it is probably: add tests even for changes that aren't
> > really user-visible to make sure the code is exercised.
> 
> Test-wise, isn't the problem rather that that we didn't have something
> like what's described in t/README as "Running tests with special setups"
> for bitmaps? I.e. stuff like GIT_TEST_SPLIT_INDEX=<bool>, or running it
> with GIT_FSMONITOR_TEST=$PWD/t7519/fsmonitor-all to stress the fsmonitor
> code.

I'm a little skeptical of that as a general approach, for two reasons:

 - the test suite is filled with toy examples, so they often don't
   trigger the interesting cases

 - the tests are often looking for very particular outcomes or
   repository state; munging that state is going to confuse them

> So we could add some option to the test suite to e.g. run a custom
> command before every "git push" or "git fetch", and then just do a gc
> with a repack/commit graph write/midx write etc. in that codepath, along
> with (in the case of stuff like midx) setting any neede config knobs to
> turn it on.

We can try that out with something like this:

diff --git a/builtin/upload-pack.c b/builtin/upload-pack.c
index 42dc4da5a1..f40e0b7a04 100644
--- a/builtin/upload-pack.c
+++ b/builtin/upload-pack.c
@@ -30,6 +30,14 @@ int cmd_upload_pack(int argc, const char **argv, const char *prefix)
 		OPT_END()
 	};
 
+	/*
+	 * This environment variable hackery is necessary because repack in a
+	 * partial clone might actually try to fetch, spawning an infinite
+	 * recursion.
+	 */
+	system("test -z \"$SUPPRESS_BITMAP_MAGIC\" && "
+	       "SUPPRESS_BITMAP_MAGIC=1 git repack -adb >/dev/null 2>&1");
+
 	packet_trace_identity("upload-pack");
 	read_replace_refs = 0;
 

It actually _does_ find the bug in question (which I wasn't at all sure
it would). But it also turns up several other unrelated test failures.

And it's only triggering in some limited cases (fetches). If we tried to
get better coverage of bitmaps in general, I'm not sure where we would
slip in such a repack command. But I think you'd get even more false
positives.

> Of course the utility of that sort of thing is limited unless we have
> some dedicated smoke testers or CI capacity to run the various
> combinations of those options. But FWIW when I build our own in-house
> git I build the package with:

Yes, it also gets really expensive. That can be helped with more CI
machines, but even neglecting cost, I'm not sure our CI workflow is that
great right now (for example, I still haven't figured out the simple
feature of: if I push up a commit that fails, I'd like to get an email).

> So if there was a "test bitmaps everywhere" mode that would have been
> caught during the build, unless I've misunderstood how this particular
> bug manifests, but then again, it happened on just a plain git.git after
> repack, so wasn't any bitmap + push pretty much all that was needed?, I
> haven't read your patches in any detail.

The test patch describes the minimal scenario you need. Which would be
pretty common on any decent sized repo, but rare on toy ones (though as
I said above, there are a few cases in the test suite big enough to
trigger this).

-Peff

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/4] t5310: test delta reuse with bitmaps
  2018-09-01  8:03           ` Jeff King
  2018-09-01 20:29             ` Ævar Arnfjörð Bjarmason
@ 2018-09-04 19:05             ` Stefan Beller
  2018-09-04 19:45               ` Junio C Hamano
  2018-09-04 20:02               ` Jeff King
  1 sibling, 2 replies; 55+ messages in thread
From: Stefan Beller @ 2018-09-04 19:05 UTC (permalink / raw)
  To: Jeff King, Johannes Schindelin
  Cc: Ævar Arnfjörð Bjarmason, git, Junio C Hamano

On Sat, Sep 1, 2018 at 1:03 AM Jeff King <peff@peff.net> wrote:
>
> On Sat, Sep 01, 2018 at 03:48:13AM -0400, Jeff King wrote:
>
> > Commit 6a1e32d532 (pack-objects: reuse on-disk deltas for
> > thin "have" objects, 2018-08-21) taught pack-objects a new
> > optimization trick. Since this wasn't meant to change
> > user-visible behavior, but only produce smaller packs more
> > quickly, testing focused on t/perf/p5311.
> >
> > However, since people don't run perf tests very often, we
> > should make sure that the feature is exercised in the
> > regular test suite. This patch does so.
>
> This, by the way, is the crux of how such an obvious and severe bug made
> it to 'next'.
>
> The original series was tested quite extensively via t/perf and in
> production at GitHub. When I re-rolled v2, the only change was the
> addition of the assertion, so I didn't bother re-doing the perf tests,
> since they're slow and there wouldn't be a measurable impact.
>
> I did run the normal test suite (as I'm sure Junio did, too) as a
> double-check for correctness, but as we noticed, the code wasn't
> actually exercised there.
>
> Nor had I yet backported the revised series to the version we run at
> GitHub, so it hadn't been run there, either.
>
> And all of that coupled with the fact that it only triggers with
> bitmaps, so day-to-day use of the buggy Git (like Junio trying to push
> out the result ;) ) wouldn't show it.
>
> Anyway. Not that exciting, and kind of obviously dumb in retrospect. But
> I think it was worth analyzing to see what went wrong. If there's an
> immediate lesson, it is probably: add tests even for changes that aren't
> really user-visible to make sure the code is exercised.

Yeah, maybe we need to ask for more tests in the 'real' test suite, and not
just in some special corner (such as t/perf/ or any of the environment
variable proposals nearby).

I wonder if we can make use of git.git in the test suite for similar things,
e.g. after reading the thread about "index corruption with git commit -p" [1],
I thought that we only have these toy examples in the test suite. Toy examples
show that the new feature barely works, and doesn't show it is working at scale.

[1] https://public-inbox.org/git/20180901214157.hxlqmbz3fds7hsdl@ltop.local/

> There may be a larger lesson about tracking code coverage, but I don't
> know that most general code coverage tools would have helped (any
> overall percentage number would be too large to move). A tool that
> looked at the diff and said "of the N lines you added/touched, this
> percent is exercised in the test suite" might have been useful.

From some offline discussion, maybe we want to adapt a philosophy of

  Each patch needs to add a test, that fails when the patch
  is not applied, but succeeds when it is applied. This shows
  that _some_ code in the patch is exercised at least.

(and automatically/strongly enforce this going forwards; however
enforcing such a strict thing is hard, not sure how we'd do it.)

Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/4] un-breaking pack-objects with bitmaps
  2018-09-01  7:41       ` [PATCH 0/4] un-breaking pack-objects with bitmaps Jeff King
                           ` (3 preceding siblings ...)
  2018-09-01  7:50         ` [PATCH 4/4] pack-bitmap: drop "loaded" flag Jeff King
@ 2018-09-04 19:30         ` Stefan Beller
  2018-09-04 20:03           ` Jeff King
  2018-09-08  6:43         ` Ævar Arnfjörð Bjarmason
  5 siblings, 1 reply; 55+ messages in thread
From: Stefan Beller @ 2018-09-04 19:30 UTC (permalink / raw)
  To: Jeff King; +Cc: Ævar Arnfjörð Bjarmason, git, Junio C Hamano

>   [1/4]: bitmap_has_sha1_in_uninteresting(): drop BUG check
>
>     The actual fix. This should get merged to next ASAP (or the original
>     topic just reverted).
>
>   [2/4]: t5310: test delta reuse with bitmaps
>
>     I did this separately to give us flexibility to squash or merge
>     quickly. But it does find Ævar's bug on a git without patch 1.
>
>   [3/4]: traverse_bitmap_commit_list(): don't free result
>
>     The original assert should have simply been useless, but it was the
>     surprising behavior of this function that turned it into a bug.
>
>   [4/4]: pack-bitmap: drop "loaded" flag
>
>     And this is just an annoyance I ran into, which is a fallout from
>     our conversion to using an allocated bitmap_index struct.

FWIW, the whole series is
Reviewed-by: Stefan Beller <sbeller@google.com>
I am sorry for suggesting the BUG() in the first place.

Stefan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/4] t5310: test delta reuse with bitmaps
  2018-09-04 19:05             ` Stefan Beller
@ 2018-09-04 19:45               ` Junio C Hamano
  2018-09-04 20:02               ` Jeff King
  1 sibling, 0 replies; 55+ messages in thread
From: Junio C Hamano @ 2018-09-04 19:45 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Jeff King, Johannes Schindelin,
	Ævar Arnfjörð Bjarmason, git

Stefan Beller <sbeller@google.com> writes:

> From some offline discussion, maybe we want to adapt a philosophy of
>
>   Each patch needs to add a test, that fails when the patch
>   is not applied, but succeeds when it is applied. This shows
>   that _some_ code in the patch is exercised at least.
>
> (and automatically/strongly enforce this going forwards; however
> enforcing such a strict thing is hard, not sure how we'd do it.)

Some patches lack test, but when one comes with a test, I do revert
the code change and try to run the tests before I create a topic
branch for it, when I have time.

An enforcement mechanism may be good (e.g. submitGit could learn to
do that "does it come with a test" check followed by "let's see if
the new test fail without the change to the code part" before
allowing the patch to escape to the list), but 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/4] t5310: test delta reuse with bitmaps
  2018-09-04 19:05             ` Stefan Beller
  2018-09-04 19:45               ` Junio C Hamano
@ 2018-09-04 20:02               ` Jeff King
  1 sibling, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-09-04 20:02 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Johannes Schindelin, Ævar Arnfjörð Bjarmason, git,
	Junio C Hamano

On Tue, Sep 04, 2018 at 12:05:58PM -0700, Stefan Beller wrote:

> Yeah, maybe we need to ask for more tests in the 'real' test suite, and not
> just in some special corner (such as t/perf/ or any of the environment
> variable proposals nearby).
> 
> I wonder if we can make use of git.git in the test suite for similar things,
> e.g. after reading the thread about "index corruption with git commit -p" [1],
> I thought that we only have these toy examples in the test suite. Toy examples
> show that the new feature barely works, and doesn't show it is working at scale.

I think the toy examples do both. Often they drill down directly to a
useful but rare corner case that _wouldn't_ be hit during normal
operation. And being toys, they are a lot quicker to set up then trying
to work with a 50MB repository.

Take the "commit -p" one for example. It's not really about the
repository shape but about a particular set of actions. If you don't
test those actions, you won't reproduce the bug.

> > There may be a larger lesson about tracking code coverage, but I don't
> > know that most general code coverage tools would have helped (any
> > overall percentage number would be too large to move). A tool that
> > looked at the diff and said "of the N lines you added/touched, this
> > percent is exercised in the test suite" might have been useful.
> 
> From some offline discussion, maybe we want to adapt a philosophy of
> 
>   Each patch needs to add a test, that fails when the patch
>   is not applied, but succeeds when it is applied. This shows
>   that _some_ code in the patch is exercised at least.
> 
> (and automatically/strongly enforce this going forwards; however
> enforcing such a strict thing is hard, not sure how we'd do it.)

I wouldn't want a hard-and-fast rule like that. If you're fixing a bug,
sure, I think it's good to make sure it's exercised. And if you're
adding a feature, you almost always add some basic tests (and almost
certainly leave some corner without code coverage).

But if you're writing an optimization, there's often no before/after
test.  Presumably it worked before, and hopefully it still works after,
and it just got faster. You're generally relying on existing regression
tests (from when that code was introduced) to save you from bugs. You
might need to _write_ those tests if nobody did before. But it's hard to
know without digging if there are decent tests or not. That's why I
think code coverage of the lines in your diff is probably the best
measure.

-Peff

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/4] un-breaking pack-objects with bitmaps
  2018-09-04 19:30         ` [PATCH 0/4] un-breaking pack-objects with bitmaps Stefan Beller
@ 2018-09-04 20:03           ` Jeff King
  0 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-09-04 20:03 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Ævar Arnfjörð Bjarmason, git, Junio C Hamano

On Tue, Sep 04, 2018 at 12:30:14PM -0700, Stefan Beller wrote:

> >   [1/4]: bitmap_has_sha1_in_uninteresting(): drop BUG check
> >
> >     The actual fix. This should get merged to next ASAP (or the original
> >     topic just reverted).
> >
> >   [2/4]: t5310: test delta reuse with bitmaps
> >
> >     I did this separately to give us flexibility to squash or merge
> >     quickly. But it does find Ævar's bug on a git without patch 1.
> >
> >   [3/4]: traverse_bitmap_commit_list(): don't free result
> >
> >     The original assert should have simply been useless, but it was the
> >     surprising behavior of this function that turned it into a bug.
> >
> >   [4/4]: pack-bitmap: drop "loaded" flag
> >
> >     And this is just an annoyance I ran into, which is a fallout from
> >     our conversion to using an allocated bitmap_index struct.
> 
> FWIW, the whole series is
> Reviewed-by: Stefan Beller <sbeller@google.com>
> I am sorry for suggesting the BUG() in the first place.

Heh. You only asked a question about the interface. I was the one who
was _supposed_ to be familiar with the code, and put in the actual
assertion. So if your suggestion was dumb, my response was doubly so. :)

-Peff

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/4] un-breaking pack-objects with bitmaps
  2018-09-01  7:41       ` [PATCH 0/4] un-breaking pack-objects with bitmaps Jeff King
                           ` (4 preceding siblings ...)
  2018-09-04 19:30         ` [PATCH 0/4] un-breaking pack-objects with bitmaps Stefan Beller
@ 2018-09-08  6:43         ` Ævar Arnfjörð Bjarmason
  2018-09-10 16:53           ` Junio C Hamano
  5 siblings, 1 reply; 55+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-09-08  6:43 UTC (permalink / raw)
  To: Jeff King; +Cc: Stefan Beller, git, Junio C Hamano


On Sat, Sep 01 2018, Jeff King wrote:

> On Fri, Aug 31, 2018 at 06:55:58PM -0400, Jeff King wrote:
>
>> On Fri, Aug 31, 2018 at 05:23:17PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>
>> > On Tue, Aug 21 2018, Jeff King wrote:
>> >
>> > > +int bitmap_has_sha1_in_uninteresting(struct bitmap_index *bitmap_git,
>> > > +				     const unsigned char *sha1)
>> > > +{
>> > > +	int pos;
>> > > +
>> > > +	if (!bitmap_git)
>> > > +		return 0; /* no bitmap loaded */
>> > > +	if (!bitmap_git->result)
>> > > +		BUG("failed to perform bitmap walk before querying");
>> >
>> > Some part of what calls this completely breaks pushing from the "next"
>> > branch when you have local bitmaps (we *really* should have some tests
>> > for this...).
>>
>> Yikes, thanks for reporting. I agree we need better tests here.
>
> OK, here is the fix. Since the problem is in 'next', this is done as a
> patch on top of jk/pack-delta-reuse-with-bitmap. But since we're set to
> rewind 'next' post-release anyway, we could squash it directly into
> 30cdc33fba from the original series. That would help later bisections
> from running into it, which may be worth it as it's a pretty severe
> breakage.  Or maybe not:

Junio: Just a reminder that next is still broken with this, and I see
e.g. the Debian "experimental" has the bug but not the fix at this
point.

I'm just reverting jk/pack-delta-reuse-with-bitmap out of next when
building my own package of git, but I think this really should be fixed
in that branch, either by merging the fix down or reverting the original
series out of next, I think just merging the fix down makes sense, but
have no strong opinion on it.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/4] un-breaking pack-objects with bitmaps
  2018-09-08  6:43         ` Ævar Arnfjörð Bjarmason
@ 2018-09-10 16:53           ` Junio C Hamano
  2018-09-10 18:48             ` Jeff King
  0 siblings, 1 reply; 55+ messages in thread
From: Junio C Hamano @ 2018-09-10 16:53 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Jeff King, Stefan Beller, git

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> I'm just reverting jk/pack-delta-reuse-with-bitmap out of next when
> building my own package of git, but I think this really should be fixed
> in that branch, either by merging the fix down or reverting the original
> series out of next, I think just merging the fix down makes sense, but
> have no strong opinion on it.

Either is fine.  I am not moving 'next' beyond what is necessary for
this release cycle during the pre-release freeze period, and because
I thought that Peff was in favor of squashing in the changes to the
original when the next cycle starts, I haven't bothered to merge the
fix there yet.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/4] un-breaking pack-objects with bitmaps
  2018-09-10 16:53           ` Junio C Hamano
@ 2018-09-10 18:48             ` Jeff King
  2018-09-10 19:23               ` Junio C Hamano
  0 siblings, 1 reply; 55+ messages in thread
From: Jeff King @ 2018-09-10 18:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ævar Arnfjörð Bjarmason, Stefan Beller, git

On Mon, Sep 10, 2018 at 09:53:56AM -0700, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> 
> > I'm just reverting jk/pack-delta-reuse-with-bitmap out of next when
> > building my own package of git, but I think this really should be fixed
> > in that branch, either by merging the fix down or reverting the original
> > series out of next, I think just merging the fix down makes sense, but
> > have no strong opinion on it.
> 
> Either is fine.  I am not moving 'next' beyond what is necessary for
> this release cycle during the pre-release freeze period, and because
> I thought that Peff was in favor of squashing in the changes to the
> original when the next cycle starts, I haven't bothered to merge the
> fix there yet.

I think Ævar's point is just that the current tip of next is badly
broken if you use bitmaps, and it's worth hot-fixing that in the
meantime.

After the release, I'm OK with either one of squashing that first patch
in, or just merging as-is.

-Peff

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/4] un-breaking pack-objects with bitmaps
  2018-09-10 18:48             ` Jeff King
@ 2018-09-10 19:23               ` Junio C Hamano
  0 siblings, 0 replies; 55+ messages in thread
From: Junio C Hamano @ 2018-09-10 19:23 UTC (permalink / raw)
  To: Jeff King; +Cc: Ævar Arnfjörð Bjarmason, Stefan Beller, git

Jeff King <peff@peff.net> writes:

>> Either is fine.  I am not moving 'next' beyond what is necessary for
>> this release cycle during the pre-release freeze period, and because
>> I thought that Peff was in favor of squashing in the changes to the
>> original when the next cycle starts, I haven't bothered to merge the
>> fix there yet.
>
> I think Ævar's point is just that the current tip of next is badly
> broken if you use bitmaps, and it's worth hot-fixing that in the
> meantime.

I know that ;-)

My point is that anybody who *needs* handholding from the upstream
project (i.e. not Ævar, but those whom he is trying to help with the
suggestion) should not be looking at 'next' during the prerelease
freeze.  They should be looking at 'master' instead, and that is why
I am not spending more than minimum cycles of my own worrying about
'next' during the prerelease.

In any case, I think the final is ready to go, almost.  I still need
to do the preformatted docs, though.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 3/6] t/perf: add infrastructure for measuring sizes
  2018-08-17 20:54 [PATCH 0/6] reuse on-disk deltas for fetches with bitmaps Jeff King
@ 2018-08-17 20:56 ` Jeff King
  0 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2018-08-17 20:56 UTC (permalink / raw)
  To: git

The main objective of scripts in the perf framework is to
run "test_perf", which measures the time it takes to run
some operation. However, it can also be interesting to see
the change in the output size of certain operations.

This patch introduces test_size, which records a single
numeric output from the test and shows it in the aggregated
output (with pretty printing and relative size comparison).

Signed-off-by: Jeff King <peff@peff.net>
---
 t/perf/README         | 25 ++++++++++++++++++++++
 t/perf/aggregate.perl | 48 ++++++++++++++++++++++++++++++++++++++-----
 t/perf/perf-lib.sh    | 13 ++++++++++++
 3 files changed, 81 insertions(+), 5 deletions(-)

diff --git a/t/perf/README b/t/perf/README
index 21321a0f36..be12090c38 100644
--- a/t/perf/README
+++ b/t/perf/README
@@ -168,3 +168,28 @@ that
   While we have tried to make sure that it can cope with embedded
   whitespace and other special characters, it will not work with
   multi-line data.
+
+Rather than tracking the performance by run-time as `test_perf` does, you
+may also track output size by using `test_size`. The stdout of the
+function should be a single numeric value, which will be captured and
+shown in the aggregated output. For example:
+
+	test_perf 'time foo' '
+		./foo >foo.out
+	'
+
+	test_size 'output size'
+		wc -c <foo.out
+	'
+
+might produce output like:
+
+	Test                origin           HEAD
+	-------------------------------------------------------------
+	1234.1 time foo     0.37(0.79+0.02)  0.26(0.51+0.02) -29.7%
+	1234.2 output size             4.3M             3.6M -14.7%
+
+The item being measured (and its units) is up to the test; the context
+and the test title should make it clear to the user whether bigger or
+smaller numbers are better. Unlike test_perf, the test code will only be
+run once, since output sizes tend to be more deterministic than timings.
diff --git a/t/perf/aggregate.perl b/t/perf/aggregate.perl
index 3181b087ab..494907a892 100755
--- a/t/perf/aggregate.perl
+++ b/t/perf/aggregate.perl
@@ -13,10 +13,16 @@ sub get_times {
 	my $line = <$fh>;
 	return undef if not defined $line;
 	close $fh or die "cannot close $name: $!";
-	$line =~ /^(?:(\d+):)?(\d+):(\d+(?:\.\d+)?) (\d+(?:\.\d+)?) (\d+(?:\.\d+)?)$/
-		or die "bad input line: $line";
-	my $rt = ((defined $1 ? $1 : 0.0)*60+$2)*60+$3;
-	return ($rt, $4, $5);
+	# times
+	if ($line =~ /^(?:(\d+):)?(\d+):(\d+(?:\.\d+)?) (\d+(?:\.\d+)?) (\d+(?:\.\d+)?)$/) {
+		my $rt = ((defined $1 ? $1 : 0.0)*60+$2)*60+$3;
+		return ($rt, $4, $5);
+	# size
+	} elsif ($line =~ /^\d+$/) {
+		return $&;
+	} else {
+		die "bad input line: $line";
+	}
 }
 
 sub relative_change {
@@ -32,9 +38,15 @@ sub relative_change {
 
 sub format_times {
 	my ($r, $u, $s, $firstr) = @_;
+	# no value means we did not finish the test
 	if (!defined $r) {
 		return "<missing>";
 	}
+	# a single value means we have a size, not times
+	if (!defined $u) {
+		return format_size($r, $firstr);
+	}
+	# otherwise, we have real/user/system times
 	my $out = sprintf "%.2f(%.2f+%.2f)", $r, $u, $s;
 	$out .= ' ' . relative_change($r, $firstr) if defined $firstr;
 	return $out;
@@ -54,6 +66,25 @@ sub usage {
 	exit(1);
 }
 
+sub human_size {
+	my $n = shift;
+	my @units = ('', qw(K M G));
+	while ($n > 900 && @units > 1) {
+		$n /= 1000;
+		shift @units;
+	}
+	return $n unless length $units[0];
+	return sprintf '%.1f%s', $n, $units[0];
+}
+
+sub format_size {
+	my ($size, $first) = @_;
+	# match the width of a time: 0.00(0.00+0.00)
+	my $out = sprintf '%15s', human_size($size);
+	$out .= ' ' . relative_change($size, $first) if defined $first;
+	return $out;
+}
+
 my (@dirs, %dirnames, %dirabbrevs, %prefixes, @tests,
     $codespeed, $sortby, $subsection, $reponame);
 
@@ -184,7 +215,14 @@ sub print_default_results {
 		my $firstr;
 		for my $i (0..$#dirs) {
 			my $d = $dirs[$i];
-			$times{$prefixes{$d}.$t} = [get_times("$resultsdir/$prefixes{$d}$t.times")];
+			my $base = "$resultsdir/$prefixes{$d}$t";
+			$times{$prefixes{$d}.$t} = [];
+			foreach my $type (qw(times size)) {
+				if (-e "$base.$type") {
+					$times{$prefixes{$d}.$t} = [get_times("$base.$type")];
+					last;
+				}
+			}
 			my ($r,$u,$s) = @{$times{$prefixes{$d}.$t}};
 			my $w = length format_times($r,$u,$s,$firstr);
 			$colwidth[$i] = $w if $w > $colwidth[$i];
diff --git a/t/perf/perf-lib.sh b/t/perf/perf-lib.sh
index a54be09516..11d1922cf5 100644
--- a/t/perf/perf-lib.sh
+++ b/t/perf/perf-lib.sh
@@ -231,6 +231,19 @@ test_perf () {
 	test_wrapper_ test_perf_ "$@"
 }
 
+test_size_ () {
+	say >&3 "running: $2"
+	if test_eval_ "$2" 3>"$base".size; then
+		test_ok_ "$1"
+	else
+		test_failure_ "$@"
+	fi
+}
+
+test_size () {
+	test_wrapper_ test_size_ "$@"
+}
+
 # We extend test_done to print timings at the end (./run disables this
 # and does it after running everything)
 test_at_end_hook_ () {
-- 
2.18.0.1205.g3878b1e64a


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 3/6] t/perf: add infrastructure for measuring sizes
  2014-03-26  7:22 [PATCH/RFC 0/6] reuse deltas found by bitmaps Jeff King
@ 2014-03-26  7:22 ` Jeff King
  0 siblings, 0 replies; 55+ messages in thread
From: Jeff King @ 2014-03-26  7:22 UTC (permalink / raw)
  To: git; +Cc: Ben Maurer, Siddharth Agarwal

The main objective of scripts in the perf framework is to
run "test_perf", which measures the time it takes to run
some operation. However, it can also be interesting to see
the change in the output size of certain operations.

This patch introduces test_size, which records a single
numeric output from the test and shows it in the aggregated
output (with pretty printing and relative size comparison).

Signed-off-by: Jeff King <peff@peff.net>
---
 t/perf/README         | 20 ++++++++++++++++++++
 t/perf/aggregate.perl | 48 +++++++++++++++++++++++++++++++++++++++++++-----
 t/perf/perf-lib.sh    | 13 +++++++++++++
 3 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/t/perf/README b/t/perf/README
index 8848c14..09c400f 100644
--- a/t/perf/README
+++ b/t/perf/README
@@ -144,3 +144,23 @@ that
   While we have tried to make sure that it can cope with embedded
   whitespace and other special characters, it will not work with
   multi-line data.
+
+Rather than tracking the performance by run-time as `test_perf` does, you
+may also track output size by using `test_size`. The stdout of the
+function should be a single numeric value, which will be captured and
+shown in the aggregated output. For example:
+
+	test_perf 'time foo' '
+		./foo >foo.out
+	'
+
+	test_size 'output size'
+		wc -c <foo.out
+	'
+
+might produce output like:
+
+	Test                origin           HEAD
+	-------------------------------------------------------------
+	1234.1 time foo     0.37(0.79+0.02)  0.26(0.51+0.02) -29.7%
+	1234.2 output size             4.3M             3.6M -14.7%
diff --git a/t/perf/aggregate.perl b/t/perf/aggregate.perl
index 690cd8c..42739a5 100755
--- a/t/perf/aggregate.perl
+++ b/t/perf/aggregate.perl
@@ -10,10 +10,16 @@ sub get_times {
 	my $line = <$fh>;
 	return undef if not defined $line;
 	close $fh or die "cannot close $name: $!";
-	$line =~ /^(?:(\d+):)?(\d+):(\d+(?:\.\d+)?) (\d+(?:\.\d+)?) (\d+(?:\.\d+)?)$/
-		or die "bad input line: $line";
-	my $rt = ((defined $1 ? $1 : 0.0)*60+$2)*60+$3;
-	return ($rt, $4, $5);
+	# times
+	if ($line =~ /^(?:(\d+):)?(\d+):(\d+(?:\.\d+)?) (\d+(?:\.\d+)?) (\d+(?:\.\d+)?)$/) {
+		my $rt = ((defined $1 ? $1 : 0.0)*60+$2)*60+$3;
+		return ($rt, $4, $5);
+	# size
+	} elsif ($line =~ /^\d+$/) {
+		return $&;
+	} else {
+		die "bad input line: $line";
+	}
 }
 
 sub relative_change {
@@ -29,14 +35,39 @@ sub relative_change {
 
 sub format_times {
 	my ($r, $u, $s, $firstr) = @_;
+	# no value means we did not finish the test
 	if (!defined $r) {
 		return "<missing>";
 	}
+	# a single value means we have a size, not times
+	if (!defined $u) {
+		return format_size($r, $firstr);
+	}
+	# otherwise, we have real/user/system times
 	my $out = sprintf "%.2f(%.2f+%.2f)", $r, $u, $s;
 	$out .= ' ' . relative_change($r, $firstr) if defined $firstr;
 	return $out;
 }
 
+sub human_size {
+	my $n = shift;
+	my @units = ('', qw(K M G));
+	while ($n > 900 && @units > 1) {
+		$n /= 1000;
+		shift @units;
+	}
+	return $n unless length $units[0];
+	return sprintf '%.1f%s', $n, $units[0];
+}
+
+sub format_size {
+	my ($size, $first) = @_;
+	# match the width of a time: 0.00(0.00+0.00)
+	my $out = sprintf '%15s', human_size($size);
+	$out .= ' ' . relative_change($size, $first) if defined $first;
+	return $out;
+}
+
 my (@dirs, %dirnames, %dirabbrevs, %prefixes, @tests);
 while (scalar @ARGV) {
 	my $arg = $ARGV[0];
@@ -139,7 +170,14 @@ sub have_slash {
 	my $firstr;
 	for my $i (0..$#dirs) {
 		my $d = $dirs[$i];
-		$times{$prefixes{$d}.$t} = [get_times("test-results/$prefixes{$d}$t.times")];
+		my $base = "test-results/$prefixes{$d}$t";
+		$times{$prefixes{$d}.$t} = [];
+		foreach my $type (qw(times size)) {
+			if (-e "$base.$type") {
+				$times{$prefixes{$d}.$t} = [get_times("$base.$type")];
+				last;
+			}
+		}
 		my ($r,$u,$s) = @{$times{$prefixes{$d}.$t}};
 		my $w = length format_times($r,$u,$s,$firstr);
 		$colwidth[$i] = $w if $w > $colwidth[$i];
diff --git a/t/perf/perf-lib.sh b/t/perf/perf-lib.sh
index 20f306a..fb8e017 100644
--- a/t/perf/perf-lib.sh
+++ b/t/perf/perf-lib.sh
@@ -202,6 +202,19 @@ test_perf () {
 	test_wrapper_ test_perf_ "$@"
 }
 
+test_size_ () {
+	say >&3 "running: $2"
+	if test_eval_ "$2" 3>"$base".size; then
+		test_ok_ "$1"
+	else
+		test_failure_ "$@"
+	fi
+}
+
+test_size () {
+	test_wrapper_ test_size_ "$@"
+}
+
 # We extend test_done to print timings at the end (./run disables this
 # and does it after running everything)
 test_at_end_hook_ () {
-- 
1.9.1.601.g7ec968e

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2018-09-10 19:23 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-21 18:41 [PATCH] test-tool.h: include git-compat-util.h Jeff King
2018-08-21 19:03 ` Junio C Hamano
2018-08-21 19:06 ` [PATCH 1/6] t/perf: factor boilerplate out of test_perf Jeff King
2018-08-21 19:06 ` [PATCH 2/6] t/perf: factor out percent calculations Jeff King
2018-08-21 19:06 ` [PATCH 3/6] t/perf: add infrastructure for measuring sizes Jeff King
2018-08-22 13:40   ` Derrick Stolee
2018-08-22 15:31     ` Jeff King
2018-08-21 19:06 ` [PATCH 4/6] t/perf: add perf tests for fetches from a bitmapped server Jeff King
2018-08-21 19:07 ` [PATCH 5/6] pack-bitmap: save "have" bitmap from walk Jeff King
2018-08-21 19:47   ` Derrick Stolee
2018-08-21 19:54     ` Jeff King
2018-08-31 15:23   ` Ævar Arnfjörð Bjarmason
2018-08-31 22:55     ` Jeff King
2018-09-01  7:41       ` [PATCH 0/4] un-breaking pack-objects with bitmaps Jeff King
2018-09-01  7:44         ` [PATCH 1/4] bitmap_has_sha1_in_uninteresting(): drop BUG check Jeff King
2018-09-01  7:48         ` [PATCH 2/4] t5310: test delta reuse with bitmaps Jeff King
2018-09-01  8:03           ` Jeff King
2018-09-01 20:29             ` Ævar Arnfjörð Bjarmason
2018-09-01 22:46               ` Ben Peart
2018-09-02  5:51               ` Jeff King
2018-09-04 19:05             ` Stefan Beller
2018-09-04 19:45               ` Junio C Hamano
2018-09-04 20:02               ` Jeff King
2018-09-01  7:49         ` [PATCH 3/4] traverse_bitmap_commit_list(): don't free result Jeff King
2018-09-01  7:50         ` [PATCH 4/4] pack-bitmap: drop "loaded" flag Jeff King
2018-09-04 19:30         ` [PATCH 0/4] un-breaking pack-objects with bitmaps Stefan Beller
2018-09-04 20:03           ` Jeff King
2018-09-08  6:43         ` Ævar Arnfjörð Bjarmason
2018-09-10 16:53           ` Junio C Hamano
2018-09-10 18:48             ` Jeff King
2018-09-10 19:23               ` Junio C Hamano
2018-08-21 19:07 ` [PATCH 6/6] pack-objects: reuse on-disk deltas for thin "have" objects Jeff King
2018-08-21 19:43   ` Junio C Hamano
2018-08-21 19:50     ` Junio C Hamano
2018-08-21 20:07       ` Jeff King
2018-08-21 20:14         ` Jeff King
2018-08-21 20:52           ` Junio C Hamano
2018-08-21 21:30             ` Jeff King
2018-08-21 20:57         ` Junio C Hamano
2018-08-21 21:32           ` Jeff King
2018-08-23  0:43           ` [PATCH 0/9] trailer-parsing false positives Jeff King
2018-08-23  0:44             ` [PATCH 1/9] trailer: use size_t for string offsets Jeff King
2018-08-23  0:45             ` [PATCH 2/9] trailer: use size_t for iterating trailer list Jeff King
2018-08-23  0:46             ` [PATCH 3/9] trailer: pass process_trailer_opts to trailer_info_get() Jeff King
2018-08-23  0:48             ` [PATCH 4/9] interpret-trailers: tighten check for "---" patch boundary Jeff King
2018-08-23  0:49             ` [PATCH 5/9] interpret-trailers: allow suppressing "---" divider Jeff King
2018-08-23  0:50             ` [PATCH 6/9] pretty, ref-filter: format %(trailers) with no_divider option Jeff King
2018-08-23  0:50             ` [PATCH 7/9] sequencer: ignore "---" divider when parsing trailers Jeff King
2018-08-23  0:50             ` [PATCH 8/9] append_signoff: use size_t for string offsets Jeff King
2018-08-23  0:51             ` [PATCH 9/9] sequencer: handle ignore_footer when parsing trailers Jeff King
2018-08-23 18:30             ` [PATCH 0/9] trailer-parsing false positives Junio C Hamano
2018-08-24  7:26               ` Jeff King
2018-08-21 20:00     ` [PATCH 6/6] pack-objects: reuse on-disk deltas for thin "have" objects Jeff King
  -- strict thread matches above, loose matches on Subject: below --
2018-08-17 20:54 [PATCH 0/6] reuse on-disk deltas for fetches with bitmaps Jeff King
2018-08-17 20:56 ` [PATCH 3/6] t/perf: add infrastructure for measuring sizes Jeff King
2014-03-26  7:22 [PATCH/RFC 0/6] reuse deltas found by bitmaps Jeff King
2014-03-26  7:22 ` [PATCH 3/6] t/perf: add infrastructure for measuring sizes Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).