All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] shortlog fixes and optimizations
@ 2016-01-15 17:06 Jeff King
  2016-01-15 17:08 ` [PATCH 1/6] shortlog: match both "Author:" and "author" on stdin Jeff King
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Jeff King @ 2016-01-15 17:06 UTC (permalink / raw)
  To: git

These are split out from my the shortlog-trailer series I sent a few
weeks ago. The "trailer" parts still need some re-working, but there is
no need to hold these fixes hostage.

I also dropped the early part of the series, adding skip_prefix_icase().
After digging into the history, I ended up reworking the first patch
here to do a more thorough parse, so we don't need it anymore.

Thanks to Eric Sunshine for review on the first iteration; this
incorporates his comments.

  [1/6]: shortlog: match both "Author:" and "author" on stdin
  [2/6]: shortlog: use strbufs to read from stdin
  [3/6]: shortlog: replace hand-parsing of author with pretty-printer
  [4/6]: shortlog: optimize "--summary" mode
  [5/6]: shortlog: optimize out useless "<none>" normalization
  [6/6]: shortlog: optimize out useless string list

[1] http://thread.gmane.org/gmane.comp.version-control.git/283091

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/6] shortlog: match both "Author:" and "author" on stdin
  2016-01-15 17:06 [PATCH 0/6] shortlog fixes and optimizations Jeff King
@ 2016-01-15 17:08 ` Jeff King
  2016-01-15 23:19   ` Eric Sunshine
  2016-01-18 19:26   ` Jeff King
  2016-01-15 17:08 ` [PATCH 2/6] shortlog: use strbufs to read from stdin Jeff King
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 22+ messages in thread
From: Jeff King @ 2016-01-15 17:08 UTC (permalink / raw)
  To: git

The original git-shortlog could read both the normal "git
log" output as well as "git log --format=raw". However, when it was
converted to C by b8ec592 (Build in shortlog, 2006-10-22),
the trailing colon became mandatory, and we no longer
matched the raw output.

Given the amount of intervening time without any bug
reports, it's probable that nobody cares. But given that
it's easy to fix, and that the end result is hopefully more
obvious and flexible (it could now easily accomodate matching
"Committer"), let's just make it work.

Signed-off-by: Jeff King <peff@peff.net>
---
Another option would be to assume that nobody cares about
"--format=raw" and just do:

  if (!skip_prefix(author, "Author: ", &v))
	continue;

That technically breaks somebody who was feeding shortlog output that
contains "author: ", but since Git itself doesn't generate that, it
seems rather unlikely.

 builtin/shortlog.c  | 27 ++++++++++++++++++++++++---
 t/t4201-shortlog.sh |  6 ++++++
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index 35ebd17..fe9fa2f 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -89,13 +89,34 @@ static void insert_one_record(struct shortlog *log,
 	string_list_append(item->util, buffer);
 }
 
+/*
+ * If header is "author", match candidate against the regex /[Aa]uthor:? /,
+ * and return a pointer to the remainder of the string in out_value.
+ */
+static int match_ident_header(const char *candidate, const char *header,
+			      const char **out_value)
+{
+	const char *v;
+
+	if (tolower(*candidate++) != tolower(*header++))
+		return 0;
+	if (!skip_prefix(candidate, header, &v))
+		return 0;
+	if (*v == ':')
+		v++;
+	if (*v++ != ' ')
+		return 0;
+	*out_value = v;
+	return 1;
+}
+
 static void read_from_stdin(struct shortlog *log)
 {
 	char author[1024], oneline[1024];
 
 	while (fgets(author, sizeof(author), stdin) != NULL) {
-		if (!(author[0] == 'A' || author[0] == 'a') ||
-		    !starts_with(author + 1, "uthor: "))
+		const char *v;
+		if (!match_ident_header(author, "author", &v))
 			continue;
 		while (fgets(oneline, sizeof(oneline), stdin) &&
 		       oneline[0] != '\n')
@@ -103,7 +124,7 @@ static void read_from_stdin(struct shortlog *log)
 		while (fgets(oneline, sizeof(oneline), stdin) &&
 		       oneline[0] == '\n')
 			; /* discard blanks */
-		insert_one_record(log, author + 8, oneline);
+		insert_one_record(log, v, oneline);
 	}
 }
 
diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh
index 7600a3e..82b2314 100755
--- a/t/t4201-shortlog.sh
+++ b/t/t4201-shortlog.sh
@@ -120,6 +120,12 @@ test_expect_success !MINGW 'shortlog from non-git directory' '
 	test_cmp expect out
 '
 
+test_expect_success !MINGW 'shortlog can read --format=raw output' '
+	git log --format=raw HEAD >log &&
+	GIT_DIR=non-existing git shortlog -w <log >out &&
+	test_cmp expect out
+'
+
 test_expect_success 'shortlog should add newline when input line matches wraplen' '
 	cat >expect <<\EOF &&
 A U Thor (2):
-- 
2.7.0.244.g0701a9d

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/6] shortlog: use strbufs to read from stdin
  2016-01-15 17:06 [PATCH 0/6] shortlog fixes and optimizations Jeff King
  2016-01-15 17:08 ` [PATCH 1/6] shortlog: match both "Author:" and "author" on stdin Jeff King
@ 2016-01-15 17:08 ` Jeff King
  2016-01-15 17:09 ` [PATCH 3/6] shortlog: replace hand-parsing of author with pretty-printer Jeff King
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2016-01-15 17:08 UTC (permalink / raw)
  To: git

We currently use fixed-size buffers with fgets(), which
could lead to incorrect results in the unlikely event that a
line had something like "Author:" at exactly its 1024th
character.

But it's easy to convert this to a strbuf, and because we
can reuse the same buffer through the loop, we don't even
pay the extra allocation cost.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/shortlog.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index fe9fa2f..bb2ee5b 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -112,20 +112,23 @@ static int match_ident_header(const char *candidate, const char *header,
 
 static void read_from_stdin(struct shortlog *log)
 {
-	char author[1024], oneline[1024];
+	struct strbuf author = STRBUF_INIT;
+	struct strbuf oneline = STRBUF_INIT;
 
-	while (fgets(author, sizeof(author), stdin) != NULL) {
+	while (strbuf_getline(&author, stdin, '\n') != EOF) {
 		const char *v;
-		if (!match_ident_header(author, "author", &v))
+		if (!match_ident_header(author.buf, "author", &v))
 			continue;
-		while (fgets(oneline, sizeof(oneline), stdin) &&
-		       oneline[0] != '\n')
+		while (strbuf_getline(&oneline, stdin, '\n') != EOF &&
+		       oneline.len)
 			; /* discard headers */
-		while (fgets(oneline, sizeof(oneline), stdin) &&
-		       oneline[0] == '\n')
+		while (strbuf_getline(&oneline, stdin, '\n') != EOF &&
+		       !oneline.len)
 			; /* discard blanks */
-		insert_one_record(log, v, oneline);
+		insert_one_record(log, v, oneline.buf);
 	}
+	strbuf_release(&author);
+	strbuf_release(&oneline);
 }
 
 void shortlog_add_commit(struct shortlog *log, struct commit *commit)
-- 
2.7.0.244.g0701a9d

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/6] shortlog: replace hand-parsing of author with pretty-printer
  2016-01-15 17:06 [PATCH 0/6] shortlog fixes and optimizations Jeff King
  2016-01-15 17:08 ` [PATCH 1/6] shortlog: match both "Author:" and "author" on stdin Jeff King
  2016-01-15 17:08 ` [PATCH 2/6] shortlog: use strbufs to read from stdin Jeff King
@ 2016-01-15 17:09 ` Jeff King
  2016-01-15 17:09 ` [PATCH 4/6] shortlog: optimize "--summary" mode Jeff King
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2016-01-15 17:09 UTC (permalink / raw)
  To: git

When gathering the author and oneline subject for each
commit, we hand-parse the commit headers to find the
"author" line, and then continue past to the blank line at
the end of the header.

We can replace this tricky hand-parsing by simply asking the
pretty-printer for the relevant items. This also decouples
the author and oneline parsing, opening up some new
optimizations in further commits.

One reason to avoid the pretty-printer is that it might be
less efficient than hand-parsing. However, I measured no
slowdown at all running "git shortlog -ns HEAD" on
linux.git.

As a bonus, we also fix a memory leak in the (uncommon) case
that the author field is blank.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/shortlog.c | 62 +++++++++++++++++++++++-------------------------------
 1 file changed, 26 insertions(+), 36 deletions(-)

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index bb2ee5b..256b868 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -133,45 +133,35 @@ static void read_from_stdin(struct shortlog *log)
 
 void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 {
-	const char *author = NULL, *buffer;
-	struct strbuf buf = STRBUF_INIT;
-	struct strbuf ufbuf = STRBUF_INIT;
-
-	pp_commit_easy(CMIT_FMT_RAW, commit, &buf);
-	buffer = buf.buf;
-	while (*buffer && *buffer != '\n') {
-		const char *eol = strchr(buffer, '\n');
-
-		if (eol == NULL)
-			eol = buffer + strlen(buffer);
-		else
-			eol++;
-
-		if (starts_with(buffer, "author "))
-			author = buffer + 7;
-		buffer = eol;
-	}
-	if (!author) {
+	struct strbuf author = STRBUF_INIT;
+	struct strbuf oneline = STRBUF_INIT;
+	struct pretty_print_context ctx = {0};
+
+	ctx.fmt = CMIT_FMT_USERFORMAT;
+	ctx.abbrev = log->abbrev;
+	ctx.subject = "";
+	ctx.after_subject = "";
+	ctx.date_mode.type = DATE_NORMAL;
+	ctx.output_encoding = get_log_output_encoding();
+
+	format_commit_message(commit, "%an <%ae>", &author, &ctx);
+	/* we can detect a total failure only by seeing " <>" in the output */
+	if (author.len <= 3) {
 		warning(_("Missing author: %s"),
 		    oid_to_hex(&commit->object.oid));
-		return;
-	}
-	if (log->user_format) {
-		struct pretty_print_context ctx = {0};
-		ctx.fmt = CMIT_FMT_USERFORMAT;
-		ctx.abbrev = log->abbrev;
-		ctx.subject = "";
-		ctx.after_subject = "";
-		ctx.date_mode.type = DATE_NORMAL;
-		ctx.output_encoding = get_log_output_encoding();
-		pretty_print_commit(&ctx, commit, &ufbuf);
-		buffer = ufbuf.buf;
-	} else if (*buffer) {
-		buffer++;
+		goto out;
 	}
-	insert_one_record(log, author, !*buffer ? "<none>" : buffer);
-	strbuf_release(&ufbuf);
-	strbuf_release(&buf);
+
+	if (log->user_format)
+		pretty_print_commit(&ctx, commit, &oneline);
+	else
+		format_commit_message(commit, "%s", &oneline, &ctx);
+
+	insert_one_record(log, author.buf, oneline.len ? oneline.buf : "<none>");
+
+out:
+	strbuf_release(&author);
+	strbuf_release(&oneline);
 }
 
 static void get_from_rev(struct rev_info *rev, struct shortlog *log)
-- 
2.7.0.244.g0701a9d

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/6] shortlog: optimize "--summary" mode
  2016-01-15 17:06 [PATCH 0/6] shortlog fixes and optimizations Jeff King
                   ` (2 preceding siblings ...)
  2016-01-15 17:09 ` [PATCH 3/6] shortlog: replace hand-parsing of author with pretty-printer Jeff King
@ 2016-01-15 17:09 ` Jeff King
  2016-01-15 17:10 ` [PATCH 5/6] shortlog: optimize out useless "<none>" normalization Jeff King
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2016-01-15 17:09 UTC (permalink / raw)
  To: git

If the user asked us only to show counts for each author,
rather than the individual summary lines, then there is no
point in us generating the summaries only to throw them
away. With this patch, I measured the following speedup for
"git shortlog -ns HEAD" on linux.git (best-of-five):

  [before]
  real    0m5.644s
  user    0m5.472s
  sys     0m0.176s

  [after]
  real    0m5.257s
  user    0m5.104s
  sys     0m0.156s

That's only ~7%, but it's so easy to do, there's no good
reason not to. We don't have to touch any downstream code,
since we already fill in the magic string "<none>" to handle
commits without a message.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/shortlog.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index 256b868..3e1e0d6 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -152,10 +152,12 @@ void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 		goto out;
 	}
 
-	if (log->user_format)
-		pretty_print_commit(&ctx, commit, &oneline);
-	else
-		format_commit_message(commit, "%s", &oneline, &ctx);
+	if (!log->summary) {
+		if (log->user_format)
+			pretty_print_commit(&ctx, commit, &oneline);
+		else
+			format_commit_message(commit, "%s", &oneline, &ctx);
+	}
 
 	insert_one_record(log, author.buf, oneline.len ? oneline.buf : "<none>");
 
-- 
2.7.0.244.g0701a9d

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/6] shortlog: optimize out useless "<none>" normalization
  2016-01-15 17:06 [PATCH 0/6] shortlog fixes and optimizations Jeff King
                   ` (3 preceding siblings ...)
  2016-01-15 17:09 ` [PATCH 4/6] shortlog: optimize "--summary" mode Jeff King
@ 2016-01-15 17:10 ` Jeff King
  2016-01-15 17:10 ` [PATCH 6/6] shortlog: optimize out useless string list Jeff King
  2016-01-15 22:11 ` [PATCH 0/6] shortlog fixes and optimizations Junio C Hamano
  6 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2016-01-15 17:10 UTC (permalink / raw)
  To: git

If we are in --summary mode, we will always pass <none> to
insert_one_record, which will then do some normalization
(e.g., cutting out "[PATCH]"). There's no point in doing so
if we aren't going to use the result anyway.

This drops my best-of-five for "git shortlog -ns HEAD" on
linux.git from:

  real    0m5.257s
  user    0m5.104s
  sys     0m0.156s

to:

  real    0m5.194s
  user    0m5.028s
  sys     0m0.168s

That's only 1%, but arguably the result is clearer to read,
as we're able to group our variable declarations inside the
conditional block. It also opens up further optimization
possibilities for future patches.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/shortlog.c | 63 +++++++++++++++++++++++++++++-------------------------
 1 file changed, 34 insertions(+), 29 deletions(-)

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index 3e1e0d6..c2bf9b2 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -31,13 +31,9 @@ static void insert_one_record(struct shortlog *log,
 			      const char *author,
 			      const char *oneline)
 {
-	const char *dot3 = log->common_repo_prefix;
-	char *buffer, *p;
 	struct string_list_item *item;
 	const char *mailbuf, *namebuf;
 	size_t namelen, maillen;
-	const char *eol;
-	struct strbuf subject = STRBUF_INIT;
 	struct strbuf namemailbuf = STRBUF_INIT;
 	struct ident_split ident;
 
@@ -59,34 +55,43 @@ static void insert_one_record(struct shortlog *log,
 	if (item->util == NULL)
 		item->util = xcalloc(1, sizeof(struct string_list));
 
-	/* Skip any leading whitespace, including any blank lines. */
-	while (*oneline && isspace(*oneline))
-		oneline++;
-	eol = strchr(oneline, '\n');
-	if (!eol)
-		eol = oneline + strlen(oneline);
-	if (starts_with(oneline, "[PATCH")) {
-		char *eob = strchr(oneline, ']');
-		if (eob && (!eol || eob < eol))
-			oneline = eob + 1;
-	}
-	while (*oneline && isspace(*oneline) && *oneline != '\n')
-		oneline++;
-	format_subject(&subject, oneline, " ");
-	buffer = strbuf_detach(&subject, NULL);
-
-	if (dot3) {
-		int dot3len = strlen(dot3);
-		if (dot3len > 5) {
-			while ((p = strstr(buffer, dot3)) != NULL) {
-				int taillen = strlen(p) - dot3len;
-				memcpy(p, "/.../", 5);
-				memmove(p + 5, p + dot3len, taillen + 1);
+	if (log->summary)
+		string_list_append(item->util, xstrdup(""));
+	else {
+		const char *dot3 = log->common_repo_prefix;
+		char *buffer, *p;
+		struct strbuf subject = STRBUF_INIT;
+		const char *eol;
+
+		/* Skip any leading whitespace, including any blank lines. */
+		while (*oneline && isspace(*oneline))
+			oneline++;
+		eol = strchr(oneline, '\n');
+		if (!eol)
+			eol = oneline + strlen(oneline);
+		if (starts_with(oneline, "[PATCH")) {
+			char *eob = strchr(oneline, ']');
+			if (eob && (!eol || eob < eol))
+				oneline = eob + 1;
+		}
+		while (*oneline && isspace(*oneline) && *oneline != '\n')
+			oneline++;
+		format_subject(&subject, oneline, " ");
+		buffer = strbuf_detach(&subject, NULL);
+
+		if (dot3) {
+			int dot3len = strlen(dot3);
+			if (dot3len > 5) {
+				while ((p = strstr(buffer, dot3)) != NULL) {
+					int taillen = strlen(p) - dot3len;
+					memcpy(p, "/.../", 5);
+					memmove(p + 5, p + dot3len, taillen + 1);
+				}
 			}
 		}
-	}
 
-	string_list_append(item->util, buffer);
+		string_list_append(item->util, buffer);
+	}
 }
 
 /*
-- 
2.7.0.244.g0701a9d

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 6/6] shortlog: optimize out useless string list
  2016-01-15 17:06 [PATCH 0/6] shortlog fixes and optimizations Jeff King
                   ` (4 preceding siblings ...)
  2016-01-15 17:10 ` [PATCH 5/6] shortlog: optimize out useless "<none>" normalization Jeff King
@ 2016-01-15 17:10 ` Jeff King
  2016-01-15 22:11 ` [PATCH 0/6] shortlog fixes and optimizations Junio C Hamano
  6 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2016-01-15 17:10 UTC (permalink / raw)
  To: git

If we are in "--summary" mode, then we do not care about the
actual list of subject onelines associated with each author.
We care only about the number. So rather than store a
string-list for each author full of "<none>", let's just
keep a count.

This drops my best-of-five for "git shortlog -ns HEAD" on
linux.git from:

  real    0m5.194s
  user    0m5.028s
  sys     0m0.168s

to:

  real    0m5.057s
  user    0m4.916s
  sys     0m0.144s

That's about 2.5%.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/shortlog.c | 43 +++++++++++++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 12 deletions(-)

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index c2bf9b2..3a0c863 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -14,7 +14,26 @@ static char const * const shortlog_usage[] = {
 	NULL
 };
 
-static int compare_by_number(const void *a1, const void *a2)
+/*
+ * The util field of our string_list_items will contain one of two things:
+ *
+ *   - if --summary is not in use, it will point to a string list of the
+ *     oneline subjects assigned to this author
+ *
+ *   - if --summary is in use, we don't need that list; we only need to know
+ *     its size. So we abuse the pointer slot to store our integer counter.
+ *
+ *  This macro accesses the latter.
+ */
+#define UTIL_TO_INT(x) ((intptr_t)(x)->util)
+
+static int compare_by_counter(const void *a1, const void *a2)
+{
+	const struct string_list_item *i1 = a1, *i2 = a2;
+	return UTIL_TO_INT(i2) - UTIL_TO_INT(i1);
+}
+
+static int compare_by_list(const void *a1, const void *a2)
 {
 	const struct string_list_item *i1 = a1, *i2 = a2;
 	const struct string_list *l1 = i1->util, *l2 = i2->util;
@@ -52,11 +71,9 @@ static void insert_one_record(struct shortlog *log,
 		strbuf_addf(&namemailbuf, " <%.*s>", (int)maillen, mailbuf);
 
 	item = string_list_insert(&log->list, namemailbuf.buf);
-	if (item->util == NULL)
-		item->util = xcalloc(1, sizeof(struct string_list));
 
 	if (log->summary)
-		string_list_append(item->util, xstrdup(""));
+		item->util = (void *)(UTIL_TO_INT(item) + 1);
 	else {
 		const char *dot3 = log->common_repo_prefix;
 		char *buffer, *p;
@@ -90,6 +107,8 @@ static void insert_one_record(struct shortlog *log,
 			}
 		}
 
+		if (item->util == NULL)
+			item->util = xcalloc(1, sizeof(struct string_list));
 		string_list_append(item->util, buffer);
 	}
 }
@@ -315,14 +334,14 @@ void shortlog_output(struct shortlog *log)
 
 	if (log->sort_by_number)
 		qsort(log->list.items, log->list.nr, sizeof(struct string_list_item),
-			compare_by_number);
+		      log->summary ? compare_by_counter : compare_by_list);
 	for (i = 0; i < log->list.nr; i++) {
-		struct string_list *onelines = log->list.items[i].util;
-
+		const struct string_list_item *item = &log->list.items[i];
 		if (log->summary) {
-			printf("%6d\t%s\n", onelines->nr, log->list.items[i].string);
+			printf("%6d\t%s\n", (int)UTIL_TO_INT(item), item->string);
 		} else {
-			printf("%s (%d):\n", log->list.items[i].string, onelines->nr);
+			struct string_list *onelines = item->util;
+			printf("%s (%d):\n", item->string, onelines->nr);
 			for (j = onelines->nr - 1; j >= 0; j--) {
 				const char *msg = onelines->items[j].string;
 
@@ -335,11 +354,11 @@ void shortlog_output(struct shortlog *log)
 					printf("      %s\n", msg);
 			}
 			putchar('\n');
+			onelines->strdup_strings = 1;
+			string_list_clear(onelines, 0);
+			free(onelines);
 		}
 
-		onelines->strdup_strings = 1;
-		string_list_clear(onelines, 0);
-		free(onelines);
 		log->list.items[i].util = NULL;
 	}
 
-- 
2.7.0.244.g0701a9d

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/6] shortlog fixes and optimizations
  2016-01-15 17:06 [PATCH 0/6] shortlog fixes and optimizations Jeff King
                   ` (5 preceding siblings ...)
  2016-01-15 17:10 ` [PATCH 6/6] shortlog: optimize out useless string list Jeff King
@ 2016-01-15 22:11 ` Junio C Hamano
  6 siblings, 0 replies; 22+ messages in thread
From: Junio C Hamano @ 2016-01-15 22:11 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> These are split out from my the shortlog-trailer series I sent a few
> weeks ago. The "trailer" parts still need some re-working, but there is
> no need to hold these fixes hostage.
>
> I also dropped the early part of the series, adding skip_prefix_icase().
> After digging into the history, I ended up reworking the first patch
> here to do a more thorough parse, so we don't need it anymore.
>
> Thanks to Eric Sunshine for review on the first iteration; this
> incorporates his comments.
>
>   [1/6]: shortlog: match both "Author:" and "author" on stdin
>   [2/6]: shortlog: use strbufs to read from stdin
>   [3/6]: shortlog: replace hand-parsing of author with pretty-printer
>   [4/6]: shortlog: optimize "--summary" mode
>   [5/6]: shortlog: optimize out useless "<none>" normalization
>   [6/6]: shortlog: optimize out useless string list
>
> [1] http://thread.gmane.org/gmane.comp.version-control.git/283091

Thanks for resurrecting these patches.  They all look good.

[PATCH 2/6] is a bit unfortunate timing-wise, but I think we can
manage ;-).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/6] shortlog: match both "Author:" and "author" on stdin
  2016-01-15 17:08 ` [PATCH 1/6] shortlog: match both "Author:" and "author" on stdin Jeff King
@ 2016-01-15 23:19   ` Eric Sunshine
  2016-01-18 19:27     ` Jeff King
  2016-01-18 19:26   ` Jeff King
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Sunshine @ 2016-01-15 23:19 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Friday, January 15, 2016, Jeff King <peff@peff.net> wrote:
> The original git-shortlog could read both the normal "git
> log" output as well as "git log --format=raw". However, when it was
> converted to C by b8ec592 (Build in shortlog, 2006-10-22),
> the trailing colon became mandatory, and we no longer
> matched the raw output.
>
> Given the amount of intervening time without any bug
> reports, it's probable that nobody cares. But given that
> it's easy to fix, and that the end result is hopefully more
> obvious and flexible (it could now easily accomodate matching
> "Committer"), let's just make it work.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> diff --git a/builtin/shortlog.c b/builtin/shortlog.c
> @@ -89,13 +89,34 @@ static void insert_one_record(struct shortlog *log,
> +/*
> + * If header is "author", match candidate against the regex /[Aa]uthor:? /,
> + * and return a pointer to the remainder of the string in out_value.
> + */
> +static int match_ident_header(const char *candidate, const char *header,
> +                             const char **out_value)
> +{
> +       const char *v;
> +
> +       if (tolower(*candidate++) != tolower(*header++))
> +               return 0;

Presumably, this will never be invoked as match_ident_header("", "",
...) so we don't have to worry about it accessing beyond end-of-string
when it gets past this conditional. Does it deserve an
assert(*candidate) at the top of the function, though, or is that
overkill?

> +       if (!skip_prefix(candidate, header, &v))
> +               return 0;
> +       if (*v == ':')
> +               v++;
> +       if (*v++ != ' ')
> +               return 0;
> +       *out_value = v;
> +       return 1;
> +}

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/6] shortlog: match both "Author:" and "author" on stdin
  2016-01-15 17:08 ` [PATCH 1/6] shortlog: match both "Author:" and "author" on stdin Jeff King
  2016-01-15 23:19   ` Eric Sunshine
@ 2016-01-18 19:26   ` Jeff King
  2016-01-18 19:55     ` Junio C Hamano
  1 sibling, 1 reply; 22+ messages in thread
From: Jeff King @ 2016-01-18 19:26 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Eric Sunshine

On Fri, Jan 15, 2016 at 12:08:23PM -0500, Jeff King wrote:

> The original git-shortlog could read both the normal "git
> log" output as well as "git log --format=raw". However, when it was
> converted to C by b8ec592 (Build in shortlog, 2006-10-22),
> the trailing colon became mandatory, and we no longer
> matched the raw output.
> 
> Given the amount of intervening time without any bug
> reports, it's probable that nobody cares. But given that
> it's easy to fix, and that the end result is hopefully more
> obvious and flexible (it could now easily accomodate matching
> "Committer"), let's just make it work.

I rebased the rest of my shortlog-trailer series on this, and sadly,
this final sentence isn't quite true.

The regular "git log" output uses "Commit:" for the committer line, and
the raw output uses "committer". So the match_ident_header function
_can't_ be reused.

So it's not wrong, but it's perhaps more complicated than it needs to
be. We could scrap this patch in favor of just:

  if (!skip_prefix(author, "Author: ", &v) &&
      !skip_prefix(author, "author ", &v))
          continue;

That is technically more strict (it does not take "author: ", which is
accepted by the current code), but matches "git log" and "git log --raw"
output, and misses nothing that git has ever generated. And it extends
naturally to:

  if (!skip_prefix(author, "Commit: ", &v) &&
      !skip_prefix(author, "committer ", &v))
          continue;

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/6] shortlog: match both "Author:" and "author" on stdin
  2016-01-15 23:19   ` Eric Sunshine
@ 2016-01-18 19:27     ` Jeff King
  0 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2016-01-18 19:27 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: git

On Fri, Jan 15, 2016 at 06:19:30PM -0500, Eric Sunshine wrote:

> > +/*
> > + * If header is "author", match candidate against the regex /[Aa]uthor:? /,
> > + * and return a pointer to the remainder of the string in out_value.
> > + */
> > +static int match_ident_header(const char *candidate, const char *header,
> > +                             const char **out_value)
> > +{
> > +       const char *v;
> > +
> > +       if (tolower(*candidate++) != tolower(*header++))
> > +               return 0;
> 
> Presumably, this will never be invoked as match_ident_header("", "",
> ...) so we don't have to worry about it accessing beyond end-of-string
> when it gets past this conditional. Does it deserve an
> assert(*candidate) at the top of the function, though, or is that
> overkill?

Good point. It shouldn't happen (we will always feed a string literal),
but it never hurts to document assumptions with an assertion.

However, there is some reason to think this isn't the ideal function;
see the message I just posted elsewhere in the thread.

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/6] shortlog: match both "Author:" and "author" on stdin
  2016-01-18 19:26   ` Jeff King
@ 2016-01-18 19:55     ` Junio C Hamano
  2016-01-18 20:01       ` [PATCH v2 0/6] shortlog fixes and optimizations Jeff King
  0 siblings, 1 reply; 22+ messages in thread
From: Junio C Hamano @ 2016-01-18 19:55 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Eric Sunshine

Jeff King <peff@peff.net> writes:

> So it's not wrong, but it's perhaps more complicated than it needs to
> be. We could scrap this patch in favor of just:
>
>   if (!skip_prefix(author, "Author: ", &v) &&
>       !skip_prefix(author, "author ", &v))
>           continue;
>
> That is technically more strict (it does not take "author: ", which is
> accepted by the current code), but matches "git log" and "git log --raw"
> output, and misses nothing that git has ever generated. And it extends
> naturally to:
>
>   if (!skip_prefix(author, "Commit: ", &v) &&
>       !skip_prefix(author, "committer ", &v))
>           continue;

Yeah, I agree that the above long-hand would be more readable.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 0/6] shortlog fixes and optimizations
  2016-01-18 19:55     ` Junio C Hamano
@ 2016-01-18 20:01       ` Jeff King
  2016-01-18 20:02         ` [PATCH 1/6] shortlog: match both "Author:" and "author" on stdin Jeff King
                           ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Jeff King @ 2016-01-18 20:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Eric Sunshine

On Mon, Jan 18, 2016 at 11:55:18AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > So it's not wrong, but it's perhaps more complicated than it needs to
> > be. We could scrap this patch in favor of just:
> >
> >   if (!skip_prefix(author, "Author: ", &v) &&
> >       !skip_prefix(author, "author ", &v))
> >           continue;
> >
> > That is technically more strict (it does not take "author: ", which is
> > accepted by the current code), but matches "git log" and "git log --raw"
> > output, and misses nothing that git has ever generated. And it extends
> > naturally to:
> >
> >   if (!skip_prefix(author, "Commit: ", &v) &&
> >       !skip_prefix(author, "committer ", &v))
> >           continue;
> 
> Yeah, I agree that the above long-hand would be more readable.

OK. Here it is again (the whole series, since the change creates minor
conflicts some of the later patches).

The interdiff is:

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index 3a0c863..adbf1fd 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -113,27 +113,6 @@ static void insert_one_record(struct shortlog *log,
 	}
 }
 
-/*
- * If header is "author", match candidate against the regex /[Aa]uthor:? /,
- * and return a pointer to the remainder of the string in out_value.
- */
-static int match_ident_header(const char *candidate, const char *header,
-			      const char **out_value)
-{
-	const char *v;
-
-	if (tolower(*candidate++) != tolower(*header++))
-		return 0;
-	if (!skip_prefix(candidate, header, &v))
-		return 0;
-	if (*v == ':')
-		v++;
-	if (*v++ != ' ')
-		return 0;
-	*out_value = v;
-	return 1;
-}
-
 static void read_from_stdin(struct shortlog *log)
 {
 	struct strbuf author = STRBUF_INIT;
@@ -141,7 +120,8 @@ static void read_from_stdin(struct shortlog *log)
 
 	while (strbuf_getline(&author, stdin, '\n') != EOF) {
 		const char *v;
-		if (!match_ident_header(author.buf, "author", &v))
+		if (!skip_prefix(author.buf, "Author: ", &v) &&
+		    !skip_prefix(author.buf, "author ", &v))
 			continue;
 		while (strbuf_getline(&oneline, stdin, '\n') != EOF &&
 		       oneline.len)

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 1/6] shortlog: match both "Author:" and "author" on stdin
  2016-01-18 20:01       ` [PATCH v2 0/6] shortlog fixes and optimizations Jeff King
@ 2016-01-18 20:02         ` Jeff King
  2016-01-18 20:02         ` [PATCH 2/6] shortlog: use strbufs to read from stdin Jeff King
                           ` (4 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2016-01-18 20:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Eric Sunshine

The original git-shortlog could read both the normal "git
log" output as well as "git log --format=raw". However, when
it was converted to C by b8ec592 (Build in shortlog,
2006-10-22), the trailing colon became mandatory, and we no
longer matched the raw output.

Given the amount of intervening time without any bug
reports, it's probable that nobody cares. But it's
relatively easy to fix, and the end result is hopefully more
readable than the original.

Note that this no longer matches "author: ", which we did
before, but that has never been a format generated by git.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/shortlog.c  | 7 ++++---
 t/t4201-shortlog.sh | 6 ++++++
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index 35ebd17..ab25b44 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -94,8 +94,9 @@ static void read_from_stdin(struct shortlog *log)
 	char author[1024], oneline[1024];
 
 	while (fgets(author, sizeof(author), stdin) != NULL) {
-		if (!(author[0] == 'A' || author[0] == 'a') ||
-		    !starts_with(author + 1, "uthor: "))
+		const char *v;
+		if (!skip_prefix(author, "Author: ", &v) &&
+		    !skip_prefix(author, "author ", &v))
 			continue;
 		while (fgets(oneline, sizeof(oneline), stdin) &&
 		       oneline[0] != '\n')
@@ -103,7 +104,7 @@ static void read_from_stdin(struct shortlog *log)
 		while (fgets(oneline, sizeof(oneline), stdin) &&
 		       oneline[0] == '\n')
 			; /* discard blanks */
-		insert_one_record(log, author + 8, oneline);
+		insert_one_record(log, v, oneline);
 	}
 }
 
diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh
index 7600a3e..82b2314 100755
--- a/t/t4201-shortlog.sh
+++ b/t/t4201-shortlog.sh
@@ -120,6 +120,12 @@ test_expect_success !MINGW 'shortlog from non-git directory' '
 	test_cmp expect out
 '
 
+test_expect_success !MINGW 'shortlog can read --format=raw output' '
+	git log --format=raw HEAD >log &&
+	GIT_DIR=non-existing git shortlog -w <log >out &&
+	test_cmp expect out
+'
+
 test_expect_success 'shortlog should add newline when input line matches wraplen' '
 	cat >expect <<\EOF &&
 A U Thor (2):
-- 
2.7.0.248.g5eafd77

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/6] shortlog: use strbufs to read from stdin
  2016-01-18 20:01       ` [PATCH v2 0/6] shortlog fixes and optimizations Jeff King
  2016-01-18 20:02         ` [PATCH 1/6] shortlog: match both "Author:" and "author" on stdin Jeff King
@ 2016-01-18 20:02         ` Jeff King
  2016-01-18 20:02         ` [PATCH 3/6] shortlog: replace hand-parsing of author with pretty-printer Jeff King
                           ` (3 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2016-01-18 20:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Eric Sunshine

We currently use fixed-size buffers with fgets(), which
could lead to incorrect results in the unlikely event that a
line had something like "Author:" at exactly its 1024th
character.

But it's easy to convert this to a strbuf, and because we
can reuse the same buffer through the loop, we don't even
pay the extra allocation cost.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/shortlog.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index ab25b44..6c0a72e 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -91,21 +91,24 @@ static void insert_one_record(struct shortlog *log,
 
 static void read_from_stdin(struct shortlog *log)
 {
-	char author[1024], oneline[1024];
+	struct strbuf author = STRBUF_INIT;
+	struct strbuf oneline = STRBUF_INIT;
 
-	while (fgets(author, sizeof(author), stdin) != NULL) {
+	while (strbuf_getline(&author, stdin, '\n') != EOF) {
 		const char *v;
-		if (!skip_prefix(author, "Author: ", &v) &&
-		    !skip_prefix(author, "author ", &v))
+		if (!skip_prefix(author.buf, "Author: ", &v) &&
+		    !skip_prefix(author.buf, "author ", &v))
 			continue;
-		while (fgets(oneline, sizeof(oneline), stdin) &&
-		       oneline[0] != '\n')
+		while (strbuf_getline(&oneline, stdin, '\n') != EOF &&
+		       oneline.len)
 			; /* discard headers */
-		while (fgets(oneline, sizeof(oneline), stdin) &&
-		       oneline[0] == '\n')
+		while (strbuf_getline(&oneline, stdin, '\n') != EOF &&
+		       !oneline.len)
 			; /* discard blanks */
-		insert_one_record(log, v, oneline);
+		insert_one_record(log, v, oneline.buf);
 	}
+	strbuf_release(&author);
+	strbuf_release(&oneline);
 }
 
 void shortlog_add_commit(struct shortlog *log, struct commit *commit)
-- 
2.7.0.248.g5eafd77

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/6] shortlog: replace hand-parsing of author with pretty-printer
  2016-01-18 20:01       ` [PATCH v2 0/6] shortlog fixes and optimizations Jeff King
  2016-01-18 20:02         ` [PATCH 1/6] shortlog: match both "Author:" and "author" on stdin Jeff King
  2016-01-18 20:02         ` [PATCH 2/6] shortlog: use strbufs to read from stdin Jeff King
@ 2016-01-18 20:02         ` Jeff King
  2016-01-18 20:13           ` Jeff King
  2016-01-18 20:02         ` [PATCH 4/6] shortlog: optimize "--summary" mode Jeff King
                           ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2016-01-18 20:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Eric Sunshine

When gathering the author and oneline subject for each
commit, we hand-parse the commit headers to find the
"author" line, and then continue past to the blank line at
the end of the header.

We can replace this tricky hand-parsing by simply asking the
pretty-printer for the relevant items. This also decouples
the author and oneline parsing, opening up some new
optimizations in further commits.

One reason to avoid the pretty-printer is that it might be
less efficient than hand-parsing. However, I measured no
slowdown at all running "git shortlog -ns HEAD" on
linux.git.

As a bonus, we also fix a memory leak in the (uncommon) case
that the author field is blank.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/shortlog.c | 62 +++++++++++++++++++++++-------------------------------
 1 file changed, 26 insertions(+), 36 deletions(-)

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index 6c0a72e..1261ec4 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -113,45 +113,35 @@ static void read_from_stdin(struct shortlog *log)
 
 void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 {
-	const char *author = NULL, *buffer;
-	struct strbuf buf = STRBUF_INIT;
-	struct strbuf ufbuf = STRBUF_INIT;
-
-	pp_commit_easy(CMIT_FMT_RAW, commit, &buf);
-	buffer = buf.buf;
-	while (*buffer && *buffer != '\n') {
-		const char *eol = strchr(buffer, '\n');
-
-		if (eol == NULL)
-			eol = buffer + strlen(buffer);
-		else
-			eol++;
-
-		if (starts_with(buffer, "author "))
-			author = buffer + 7;
-		buffer = eol;
-	}
-	if (!author) {
+	struct strbuf author = STRBUF_INIT;
+	struct strbuf oneline = STRBUF_INIT;
+	struct pretty_print_context ctx = {0};
+
+	ctx.fmt = CMIT_FMT_USERFORMAT;
+	ctx.abbrev = log->abbrev;
+	ctx.subject = "";
+	ctx.after_subject = "";
+	ctx.date_mode.type = DATE_NORMAL;
+	ctx.output_encoding = get_log_output_encoding();
+
+	format_commit_message(commit, "%an <%ae>", &author, &ctx);
+	/* we can detect a total failure only by seeing " <>" in the output */
+	if (author.len <= 3) {
 		warning(_("Missing author: %s"),
 		    oid_to_hex(&commit->object.oid));
-		return;
-	}
-	if (log->user_format) {
-		struct pretty_print_context ctx = {0};
-		ctx.fmt = CMIT_FMT_USERFORMAT;
-		ctx.abbrev = log->abbrev;
-		ctx.subject = "";
-		ctx.after_subject = "";
-		ctx.date_mode.type = DATE_NORMAL;
-		ctx.output_encoding = get_log_output_encoding();
-		pretty_print_commit(&ctx, commit, &ufbuf);
-		buffer = ufbuf.buf;
-	} else if (*buffer) {
-		buffer++;
+		goto out;
 	}
-	insert_one_record(log, author, !*buffer ? "<none>" : buffer);
-	strbuf_release(&ufbuf);
-	strbuf_release(&buf);
+
+	if (log->user_format)
+		pretty_print_commit(&ctx, commit, &oneline);
+	else
+		format_commit_message(commit, "%s", &oneline, &ctx);
+
+	insert_one_record(log, author.buf, oneline.len ? oneline.buf : "<none>");
+
+out:
+	strbuf_release(&author);
+	strbuf_release(&oneline);
 }
 
 static void get_from_rev(struct rev_info *rev, struct shortlog *log)
-- 
2.7.0.248.g5eafd77

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/6] shortlog: optimize "--summary" mode
  2016-01-18 20:01       ` [PATCH v2 0/6] shortlog fixes and optimizations Jeff King
                           ` (2 preceding siblings ...)
  2016-01-18 20:02         ` [PATCH 3/6] shortlog: replace hand-parsing of author with pretty-printer Jeff King
@ 2016-01-18 20:02         ` Jeff King
  2016-01-18 20:02         ` [PATCH 5/6] shortlog: optimize out useless "<none>" normalization Jeff King
  2016-01-18 20:02         ` [PATCH 6/6] shortlog: optimize out useless string list Jeff King
  5 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2016-01-18 20:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Eric Sunshine

If the user asked us only to show counts for each author,
rather than the individual summary lines, then there is no
point in us generating the summaries only to throw them
away. With this patch, I measured the following speedup for
"git shortlog -ns HEAD" on linux.git (best-of-five):

  [before]
  real    0m5.644s
  user    0m5.472s
  sys     0m0.176s

  [after]
  real    0m5.257s
  user    0m5.104s
  sys     0m0.156s

That's only ~7%, but it's so easy to do, there's no good
reason not to. We don't have to touch any downstream code,
since we already fill in the magic string "<none>" to handle
commits without a message.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/shortlog.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index 1261ec4..973b50d 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -132,10 +132,12 @@ void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 		goto out;
 	}
 
-	if (log->user_format)
-		pretty_print_commit(&ctx, commit, &oneline);
-	else
-		format_commit_message(commit, "%s", &oneline, &ctx);
+	if (!log->summary) {
+		if (log->user_format)
+			pretty_print_commit(&ctx, commit, &oneline);
+		else
+			format_commit_message(commit, "%s", &oneline, &ctx);
+	}
 
 	insert_one_record(log, author.buf, oneline.len ? oneline.buf : "<none>");
 
-- 
2.7.0.248.g5eafd77

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/6] shortlog: optimize out useless "<none>" normalization
  2016-01-18 20:01       ` [PATCH v2 0/6] shortlog fixes and optimizations Jeff King
                           ` (3 preceding siblings ...)
  2016-01-18 20:02         ` [PATCH 4/6] shortlog: optimize "--summary" mode Jeff King
@ 2016-01-18 20:02         ` Jeff King
  2016-01-18 20:02         ` [PATCH 6/6] shortlog: optimize out useless string list Jeff King
  5 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2016-01-18 20:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Eric Sunshine

If we are in --summary mode, we will always pass <none> to
insert_one_record, which will then do some normalization
(e.g., cutting out "[PATCH]"). There's no point in doing so
if we aren't going to use the result anyway.

This drops my best-of-five for "git shortlog -ns HEAD" on
linux.git from:

  real    0m5.257s
  user    0m5.104s
  sys     0m0.156s

to:

  real    0m5.194s
  user    0m5.028s
  sys     0m0.168s

That's only 1%, but arguably the result is clearer to read,
as we're able to group our variable declarations inside the
conditional block. It also opens up further optimization
possibilities for future patches.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/shortlog.c | 63 +++++++++++++++++++++++++++++-------------------------
 1 file changed, 34 insertions(+), 29 deletions(-)

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index 973b50d..a7708c3 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -31,13 +31,9 @@ static void insert_one_record(struct shortlog *log,
 			      const char *author,
 			      const char *oneline)
 {
-	const char *dot3 = log->common_repo_prefix;
-	char *buffer, *p;
 	struct string_list_item *item;
 	const char *mailbuf, *namebuf;
 	size_t namelen, maillen;
-	const char *eol;
-	struct strbuf subject = STRBUF_INIT;
 	struct strbuf namemailbuf = STRBUF_INIT;
 	struct ident_split ident;
 
@@ -59,34 +55,43 @@ static void insert_one_record(struct shortlog *log,
 	if (item->util == NULL)
 		item->util = xcalloc(1, sizeof(struct string_list));
 
-	/* Skip any leading whitespace, including any blank lines. */
-	while (*oneline && isspace(*oneline))
-		oneline++;
-	eol = strchr(oneline, '\n');
-	if (!eol)
-		eol = oneline + strlen(oneline);
-	if (starts_with(oneline, "[PATCH")) {
-		char *eob = strchr(oneline, ']');
-		if (eob && (!eol || eob < eol))
-			oneline = eob + 1;
-	}
-	while (*oneline && isspace(*oneline) && *oneline != '\n')
-		oneline++;
-	format_subject(&subject, oneline, " ");
-	buffer = strbuf_detach(&subject, NULL);
-
-	if (dot3) {
-		int dot3len = strlen(dot3);
-		if (dot3len > 5) {
-			while ((p = strstr(buffer, dot3)) != NULL) {
-				int taillen = strlen(p) - dot3len;
-				memcpy(p, "/.../", 5);
-				memmove(p + 5, p + dot3len, taillen + 1);
+	if (log->summary)
+		string_list_append(item->util, xstrdup(""));
+	else {
+		const char *dot3 = log->common_repo_prefix;
+		char *buffer, *p;
+		struct strbuf subject = STRBUF_INIT;
+		const char *eol;
+
+		/* Skip any leading whitespace, including any blank lines. */
+		while (*oneline && isspace(*oneline))
+			oneline++;
+		eol = strchr(oneline, '\n');
+		if (!eol)
+			eol = oneline + strlen(oneline);
+		if (starts_with(oneline, "[PATCH")) {
+			char *eob = strchr(oneline, ']');
+			if (eob && (!eol || eob < eol))
+				oneline = eob + 1;
+		}
+		while (*oneline && isspace(*oneline) && *oneline != '\n')
+			oneline++;
+		format_subject(&subject, oneline, " ");
+		buffer = strbuf_detach(&subject, NULL);
+
+		if (dot3) {
+			int dot3len = strlen(dot3);
+			if (dot3len > 5) {
+				while ((p = strstr(buffer, dot3)) != NULL) {
+					int taillen = strlen(p) - dot3len;
+					memcpy(p, "/.../", 5);
+					memmove(p + 5, p + dot3len, taillen + 1);
+				}
 			}
 		}
-	}
 
-	string_list_append(item->util, buffer);
+		string_list_append(item->util, buffer);
+	}
 }
 
 static void read_from_stdin(struct shortlog *log)
-- 
2.7.0.248.g5eafd77

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 6/6] shortlog: optimize out useless string list
  2016-01-18 20:01       ` [PATCH v2 0/6] shortlog fixes and optimizations Jeff King
                           ` (4 preceding siblings ...)
  2016-01-18 20:02         ` [PATCH 5/6] shortlog: optimize out useless "<none>" normalization Jeff King
@ 2016-01-18 20:02         ` Jeff King
  5 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2016-01-18 20:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Eric Sunshine

If we are in "--summary" mode, then we do not care about the
actual list of subject onelines associated with each author.
We care only about the number. So rather than store a
string-list for each author full of "<none>", let's just
keep a count.

This drops my best-of-five for "git shortlog -ns HEAD" on
linux.git from:

  real    0m5.194s
  user    0m5.028s
  sys     0m0.168s

to:

  real    0m5.057s
  user    0m4.916s
  sys     0m0.144s

That's about 2.5%.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/shortlog.c | 43 +++++++++++++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 12 deletions(-)

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index a7708c3..adbf1fd 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -14,7 +14,26 @@ static char const * const shortlog_usage[] = {
 	NULL
 };
 
-static int compare_by_number(const void *a1, const void *a2)
+/*
+ * The util field of our string_list_items will contain one of two things:
+ *
+ *   - if --summary is not in use, it will point to a string list of the
+ *     oneline subjects assigned to this author
+ *
+ *   - if --summary is in use, we don't need that list; we only need to know
+ *     its size. So we abuse the pointer slot to store our integer counter.
+ *
+ *  This macro accesses the latter.
+ */
+#define UTIL_TO_INT(x) ((intptr_t)(x)->util)
+
+static int compare_by_counter(const void *a1, const void *a2)
+{
+	const struct string_list_item *i1 = a1, *i2 = a2;
+	return UTIL_TO_INT(i2) - UTIL_TO_INT(i1);
+}
+
+static int compare_by_list(const void *a1, const void *a2)
 {
 	const struct string_list_item *i1 = a1, *i2 = a2;
 	const struct string_list *l1 = i1->util, *l2 = i2->util;
@@ -52,11 +71,9 @@ static void insert_one_record(struct shortlog *log,
 		strbuf_addf(&namemailbuf, " <%.*s>", (int)maillen, mailbuf);
 
 	item = string_list_insert(&log->list, namemailbuf.buf);
-	if (item->util == NULL)
-		item->util = xcalloc(1, sizeof(struct string_list));
 
 	if (log->summary)
-		string_list_append(item->util, xstrdup(""));
+		item->util = (void *)(UTIL_TO_INT(item) + 1);
 	else {
 		const char *dot3 = log->common_repo_prefix;
 		char *buffer, *p;
@@ -90,6 +107,8 @@ static void insert_one_record(struct shortlog *log,
 			}
 		}
 
+		if (item->util == NULL)
+			item->util = xcalloc(1, sizeof(struct string_list));
 		string_list_append(item->util, buffer);
 	}
 }
@@ -295,14 +314,14 @@ void shortlog_output(struct shortlog *log)
 
 	if (log->sort_by_number)
 		qsort(log->list.items, log->list.nr, sizeof(struct string_list_item),
-			compare_by_number);
+		      log->summary ? compare_by_counter : compare_by_list);
 	for (i = 0; i < log->list.nr; i++) {
-		struct string_list *onelines = log->list.items[i].util;
-
+		const struct string_list_item *item = &log->list.items[i];
 		if (log->summary) {
-			printf("%6d\t%s\n", onelines->nr, log->list.items[i].string);
+			printf("%6d\t%s\n", (int)UTIL_TO_INT(item), item->string);
 		} else {
-			printf("%s (%d):\n", log->list.items[i].string, onelines->nr);
+			struct string_list *onelines = item->util;
+			printf("%s (%d):\n", item->string, onelines->nr);
 			for (j = onelines->nr - 1; j >= 0; j--) {
 				const char *msg = onelines->items[j].string;
 
@@ -315,11 +334,11 @@ void shortlog_output(struct shortlog *log)
 					printf("      %s\n", msg);
 			}
 			putchar('\n');
+			onelines->strdup_strings = 1;
+			string_list_clear(onelines, 0);
+			free(onelines);
 		}
 
-		onelines->strdup_strings = 1;
-		string_list_clear(onelines, 0);
-		free(onelines);
 		log->list.items[i].util = NULL;
 	}
 
-- 
2.7.0.248.g5eafd77

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/6] shortlog: replace hand-parsing of author with pretty-printer
  2016-01-18 20:02         ` [PATCH 3/6] shortlog: replace hand-parsing of author with pretty-printer Jeff King
@ 2016-01-18 20:13           ` Jeff King
  2016-01-18 23:04             ` Jeff King
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2016-01-18 20:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Eric Sunshine

On Mon, Jan 18, 2016 at 03:02:48PM -0500, Jeff King wrote:

> +	format_commit_message(commit, "%an <%ae>", &author, &ctx);
> +	/* we can detect a total failure only by seeing " <>" in the output */
> +	if (author.len <= 3) {
>  		warning(_("Missing author: %s"),
>  		    oid_to_hex(&commit->object.oid));
> [...]
> +		goto out;
>  	}

One note on this. In the linux.git tree, this warning actually triggers,
because there is a commit with a bogus empty author:

  $ git log -1 --format=raw af25e94d4dc | grep ^author
  author  <> 1120285620 -0700

Whereas in the original code, you really do get a line with a blank
name.

I think what the new code does is quite reasonable, but I'm not sure if:

  1. People really want a syntactically valid empty name to be
     represented.

and

  2. Regardless of the output, if kernel folks will be annoyed by the
     warning whenever they run a full-repo shortlog.

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/6] shortlog: replace hand-parsing of author with pretty-printer
  2016-01-18 20:13           ` Jeff King
@ 2016-01-18 23:04             ` Jeff King
  2016-01-19  3:47               ` Junio C Hamano
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2016-01-18 23:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Eric Sunshine

On Mon, Jan 18, 2016 at 03:13:37PM -0500, Jeff King wrote:

> On Mon, Jan 18, 2016 at 03:02:48PM -0500, Jeff King wrote:
> 
> > +	format_commit_message(commit, "%an <%ae>", &author, &ctx);
> > +	/* we can detect a total failure only by seeing " <>" in the output */
> > +	if (author.len <= 3) {
> >  		warning(_("Missing author: %s"),
> >  		    oid_to_hex(&commit->object.oid));
> > [...]
> > +		goto out;
> >  	}
> 
> One note on this. In the linux.git tree, this warning actually triggers,
> because there is a commit with a bogus empty author:
> 
>   $ git log -1 --format=raw af25e94d4dc | grep ^author
>   author  <> 1120285620 -0700
> 
> Whereas in the original code, you really do get a line with a blank
> name.
> 
> I think what the new code does is quite reasonable, but I'm not sure if:
> 
>   1. People really want a syntactically valid empty name to be
>      represented.
> 
> and
> 
>   2. Regardless of the output, if kernel folks will be annoyed by the
>      warning whenever they run a full-repo shortlog.

After thinking on this, I'm in favor of removing this warning entirely.
My reasons are given in the commit message below, which can apply on top
of the series.  It could also be squashed in to 2/6, but given that it
is removing the test added by cd4f09e (shortlog: ignore commits with
missing authors, 2013-09-18), I think it's worth recording as a separate
commit.

-- >8 --
Subject: [PATCH] shortlog: don't warn on empty author

Git tries to avoid creating a commit with an empty author
name or email. However, commits created by older, less
strict versions of git may still be in the history.  There's
not much point in issuing a warning to stderr for an empty
author. The user can't do anything about it now, and we are
better off to simply include it in the shortlog output as an
empty name/email, and let the caller process it however they
see fit.

Older versions of shortlog differentiated between "author
header not present" (which complained) and "author
name/email are blank" (which included the empty ident in the
output).  But since switching to format_commit_message, we
complain to stderr about either case (linux.git has a blank
author deep in its history which triggers this).

We could try to restore the older behavior (complaining only
about the missing header), but in retrospect, there's not
much point in differentiating these cases. A missing
author header is bogus, but as for the "blank" case, the
only useful behavior is to add it to the "empty name"
collection.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/shortlog.c  |  8 --------
 t/t4201-shortlog.sh | 16 ----------------
 2 files changed, 24 deletions(-)

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index adbf1fd..e32be39 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -149,13 +149,6 @@ void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 	ctx.output_encoding = get_log_output_encoding();
 
 	format_commit_message(commit, "%an <%ae>", &author, &ctx);
-	/* we can detect a total failure only by seeing " <>" in the output */
-	if (author.len <= 3) {
-		warning(_("Missing author: %s"),
-		    oid_to_hex(&commit->object.oid));
-		goto out;
-	}
-
 	if (!log->summary) {
 		if (log->user_format)
 			pretty_print_commit(&ctx, commit, &oneline);
@@ -165,7 +158,6 @@ void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 
 	insert_one_record(log, author.buf, oneline.len ? oneline.buf : "<none>");
 
-out:
 	strbuf_release(&author);
 	strbuf_release(&oneline);
 }
diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh
index 82b2314..f5e6367 100755
--- a/t/t4201-shortlog.sh
+++ b/t/t4201-shortlog.sh
@@ -178,22 +178,6 @@ test_expect_success !MINGW 'shortlog encoding' '
 	git shortlog HEAD~2.. > out &&
 test_cmp expect out'
 
-test_expect_success 'shortlog ignores commits with missing authors' '
-	git commit --allow-empty -m normal &&
-	git commit --allow-empty -m soon-to-be-broken &&
-	git cat-file commit HEAD >commit.tmp &&
-	sed "/^author/d" commit.tmp >broken.tmp &&
-	commit=$(git hash-object -w -t commit --stdin <broken.tmp) &&
-	git update-ref HEAD $commit &&
-	cat >expect <<-\EOF &&
-	A U Thor (1):
-	      normal
-
-	EOF
-	git shortlog HEAD~2.. >actual &&
-	test_cmp expect actual
-'
-
 test_expect_success 'shortlog with revision pseudo options' '
 	git shortlog --all &&
 	git shortlog --branches &&
-- 
2.7.0.248.g5eafd77

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/6] shortlog: replace hand-parsing of author with pretty-printer
  2016-01-18 23:04             ` Jeff King
@ 2016-01-19  3:47               ` Junio C Hamano
  0 siblings, 0 replies; 22+ messages in thread
From: Junio C Hamano @ 2016-01-19  3:47 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Eric Sunshine

Jeff King <peff@peff.net> writes:

> After thinking on this, I'm in favor of removing this warning entirely.
> My reasons are given in the commit message below, which can apply on top
> of the series.  It could also be squashed in to 2/6, but given that it
> is removing the test added by cd4f09e (shortlog: ignore commits with
> missing authors, 2013-09-18), I think it's worth recording as a separate
> commit.

I agree 100% with the reasoning. Thanks for thinking things through.

>
> -- >8 --
> Subject: [PATCH] shortlog: don't warn on empty author
>
> Git tries to avoid creating a commit with an empty author
> name or email. However, commits created by older, less
> strict versions of git may still be in the history.  There's
> not much point in issuing a warning to stderr for an empty
> author. The user can't do anything about it now, and we are
> better off to simply include it in the shortlog output as an
> empty name/email, and let the caller process it however they
> see fit.
>
> Older versions of shortlog differentiated between "author
> header not present" (which complained) and "author
> name/email are blank" (which included the empty ident in the
> output).  But since switching to format_commit_message, we
> complain to stderr about either case (linux.git has a blank
> author deep in its history which triggers this).
>
> We could try to restore the older behavior (complaining only
> about the missing header), but in retrospect, there's not
> much point in differentiating these cases. A missing
> author header is bogus, but as for the "blank" case, the
> only useful behavior is to add it to the "empty name"
> collection.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  builtin/shortlog.c  |  8 --------
>  t/t4201-shortlog.sh | 16 ----------------
>  2 files changed, 24 deletions(-)
>
> diff --git a/builtin/shortlog.c b/builtin/shortlog.c
> index adbf1fd..e32be39 100644
> --- a/builtin/shortlog.c
> +++ b/builtin/shortlog.c
> @@ -149,13 +149,6 @@ void shortlog_add_commit(struct shortlog *log, struct commit *commit)
>  	ctx.output_encoding = get_log_output_encoding();
>  
>  	format_commit_message(commit, "%an <%ae>", &author, &ctx);
> -	/* we can detect a total failure only by seeing " <>" in the output */
> -	if (author.len <= 3) {
> -		warning(_("Missing author: %s"),
> -		    oid_to_hex(&commit->object.oid));
> -		goto out;
> -	}
> -
>  	if (!log->summary) {
>  		if (log->user_format)
>  			pretty_print_commit(&ctx, commit, &oneline);
> @@ -165,7 +158,6 @@ void shortlog_add_commit(struct shortlog *log, struct commit *commit)
>  
>  	insert_one_record(log, author.buf, oneline.len ? oneline.buf : "<none>");
>  
> -out:
>  	strbuf_release(&author);
>  	strbuf_release(&oneline);
>  }
> diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh
> index 82b2314..f5e6367 100755
> --- a/t/t4201-shortlog.sh
> +++ b/t/t4201-shortlog.sh
> @@ -178,22 +178,6 @@ test_expect_success !MINGW 'shortlog encoding' '
>  	git shortlog HEAD~2.. > out &&
>  test_cmp expect out'
>  
> -test_expect_success 'shortlog ignores commits with missing authors' '
> -	git commit --allow-empty -m normal &&
> -	git commit --allow-empty -m soon-to-be-broken &&
> -	git cat-file commit HEAD >commit.tmp &&
> -	sed "/^author/d" commit.tmp >broken.tmp &&
> -	commit=$(git hash-object -w -t commit --stdin <broken.tmp) &&
> -	git update-ref HEAD $commit &&
> -	cat >expect <<-\EOF &&
> -	A U Thor (1):
> -	      normal
> -
> -	EOF
> -	git shortlog HEAD~2.. >actual &&
> -	test_cmp expect actual
> -'
> -
>  test_expect_success 'shortlog with revision pseudo options' '
>  	git shortlog --all &&
>  	git shortlog --branches &&

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2016-01-19  3:48 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-15 17:06 [PATCH 0/6] shortlog fixes and optimizations Jeff King
2016-01-15 17:08 ` [PATCH 1/6] shortlog: match both "Author:" and "author" on stdin Jeff King
2016-01-15 23:19   ` Eric Sunshine
2016-01-18 19:27     ` Jeff King
2016-01-18 19:26   ` Jeff King
2016-01-18 19:55     ` Junio C Hamano
2016-01-18 20:01       ` [PATCH v2 0/6] shortlog fixes and optimizations Jeff King
2016-01-18 20:02         ` [PATCH 1/6] shortlog: match both "Author:" and "author" on stdin Jeff King
2016-01-18 20:02         ` [PATCH 2/6] shortlog: use strbufs to read from stdin Jeff King
2016-01-18 20:02         ` [PATCH 3/6] shortlog: replace hand-parsing of author with pretty-printer Jeff King
2016-01-18 20:13           ` Jeff King
2016-01-18 23:04             ` Jeff King
2016-01-19  3:47               ` Junio C Hamano
2016-01-18 20:02         ` [PATCH 4/6] shortlog: optimize "--summary" mode Jeff King
2016-01-18 20:02         ` [PATCH 5/6] shortlog: optimize out useless "<none>" normalization Jeff King
2016-01-18 20:02         ` [PATCH 6/6] shortlog: optimize out useless string list Jeff King
2016-01-15 17:08 ` [PATCH 2/6] shortlog: use strbufs to read from stdin Jeff King
2016-01-15 17:09 ` [PATCH 3/6] shortlog: replace hand-parsing of author with pretty-printer Jeff King
2016-01-15 17:09 ` [PATCH 4/6] shortlog: optimize "--summary" mode Jeff King
2016-01-15 17:10 ` [PATCH 5/6] shortlog: optimize out useless "<none>" normalization Jeff King
2016-01-15 17:10 ` [PATCH 6/6] shortlog: optimize out useless string list Jeff King
2016-01-15 22:11 ` [PATCH 0/6] shortlog fixes and optimizations Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.