All of lore.kernel.org
 help / color / mirror / Atom feed
From: tboegi@web.de
To: git@vger.kernel.org, alexander.s.m@gmail.com, Johannes.Schindelin@gmx.de
Cc: "Torsten Bögershausen" <tboegi@web.de>
Subject: [PATCH v4 2/2] diff.c: More changes and tests around utf8_strwidth()
Date: Sat,  3 Sep 2022 07:39:34 +0200	[thread overview]
Message-ID: <20220903053934.15629-1-tboegi@web.de> (raw)
In-Reply-To: <CA+VDVVVmi99i6ZY64tg8RkVXDc5gOzQP_SH12zhDKRkUnhWFgw@mail.gmail.com>

From: Torsten Bögershausen <tboegi@web.de>

When unicode filenames (encoded in UTF-8) are used, the visible width
on the screen is not the same as strlen(filename).

For example, `git log --stat` may produce an output like this:

[snip the header]

 Arger.txt  | 1 +
 Ärger.txt | 1 +
 2 files changed, 2 insertions(+)

The last commit uses utf8_strwidth() instead of strlen() in diff.c
and it is time to test the change.

And now we detect another problem that is fixed here: code like this
strbuf_addf(&out, "%-*s", len, name);
(or using the underlying snprintf() function) does not align the
buffer to a minimum of len measured in screen-width, but uses the
memory count.

One could be tempted to wish that snprintf() was UTF-8 aware.
That doesn't seem to be the case anywhere (tested on Linux and Mac),
probably snprintf() uses the "bytes in memory"/strlen() approach to be
compatible with older versions and this will never change.

The basic idea is to change code in diff.c like this
strbuf_addf(&out, "%-*s", len, name);

into something like this:
int padding = len - utf8_strwidth(name);
if (padding < 0)
	padding = 0;
strbuf_addf(&out, " %s%*s", name, padding, "");

The real change is slighty bigger, as it, as well, integrates two calls
of strbuf_addf() into one.

Tests:
Two things need to be tested:
 - The calculation of the maximum width
 - The calculation of padding

The name "textfile" is changed into "tëxtfilë", both have a width of 8.
If strlen() was used, to get the maximum width, the shorter "binfile" would
have been mis-aligned:
 binfile    | [snip]
 tëxtfilë | [snip]

If only "binfile" would be renamed into "binfilë":
 binfilë | [snip]
 textfile | [snip]

In order to verify that the width is calculated correctly everywhere,
"binfile" is renamed into "binfilë", giving 1 bytes more in strlen()
"tëxtfile" is renamed into "tëxtfilë", 2 byte more in strlen().

The updated t4012-diff-binary.sh checks the correct aligment:
 binfilë  | [snip]
 tëxtfilë | [snip]

Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 diff.c                 | 23 ++++++++++++++---------
 t/t4012-diff-binary.sh | 14 +++++++-------
 2 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/diff.c b/diff.c
index b5df464de5..35b9da90fe 100644
--- a/diff.c
+++ b/diff.c
@@ -2734,7 +2734,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		char *name = file->print_name;
 		uintmax_t added = file->added;
 		uintmax_t deleted = file->deleted;
-		int name_len;
+		int name_len, padding;

 		if (!file->is_interesting && (added + deleted == 0))
 			continue;
@@ -2753,10 +2753,14 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			if (slash)
 				name = slash;
 		}
+		padding = len - utf8_strwidth(name);
+		if (padding < 0)
+			padding = 0;

 		if (file->is_binary) {
-			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-			strbuf_addf(&out, " %*s", number_width, "Bin");
+			strbuf_addf(&out, " %s%s%*s | %*s",
+				    prefix, name, padding, "",
+				    number_width, "Bin");
 			if (!added && !deleted) {
 				strbuf_addch(&out, '\n');
 				emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
@@ -2776,8 +2780,9 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		else if (file->is_unmerged) {
-			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-			strbuf_addstr(&out, " Unmerged\n");
+			strbuf_addf(&out, " %s%s%*s | %*s",
+				    prefix, name, padding, "",
+				    number_width, "Unmerged");
 			emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
 					 out.buf, out.len, 0);
 			strbuf_reset(&out);
@@ -2803,10 +2808,10 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
 				add = total - del;
 			}
 		}
-		strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-		strbuf_addf(&out, " %*"PRIuMAX"%s",
-			number_width, added + deleted,
-			added + deleted ? " " : "");
+		strbuf_addf(&out, " %s%s%*s | %*"PRIuMAX"%s",
+			    prefix, name, padding, "",
+			    number_width, added + deleted,
+			    added + deleted ? " " : "");
 		show_graph(&out, '+', add, add_c, reset);
 		show_graph(&out, '-', del, del_c, reset);
 		strbuf_addch(&out, '\n');
diff --git a/t/t4012-diff-binary.sh b/t/t4012-diff-binary.sh
index c509143c81..c64d9d2f40 100755
--- a/t/t4012-diff-binary.sh
+++ b/t/t4012-diff-binary.sh
@@ -113,20 +113,20 @@ test_expect_success 'diff --no-index with binary creation' '
 '

 cat >expect <<EOF
- binfile  |   Bin 0 -> 1026 bytes
- textfile | 10000 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ binfilë  |   Bin 0 -> 1026 bytes
+ tëxtfilë | 10000 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 EOF

 test_expect_success 'diff --stat with binary files and big change count' '
-	printf "\01\00%1024d" 1 >binfile &&
-	git add binfile &&
+	printf "\01\00%1024d" 1 >binfilë &&
+	git add binfilë &&
 	i=0 &&
 	while test $i -lt 10000; do
 		echo $i &&
 		i=$(($i + 1)) || return 1
-	done >textfile &&
-	git add textfile &&
-	git diff --cached --stat binfile textfile >output &&
+	done >tëxtfilë &&
+	git add tëxtfilë &&
+	git -c core.quotepath=false diff --cached --stat binfilë tëxtfilë >output &&
 	grep " | " output >actual &&
 	test_cmp expect actual
 '
--
2.34.0


  parent reply	other threads:[~2022-09-03  5:39 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-09 13:11 [BUG] Unicode filenames handling in `git log --stat` Alexander Meshcheryakov
2022-08-09 18:20 ` Calvin Wan
2022-08-09 19:03   ` Alexander Meshcheryakov
2022-08-09 21:36     ` Calvin Wan
2022-08-10  5:55   ` Junio C Hamano
2022-08-10  8:40     ` Torsten Bögershausen
2022-08-10  8:56       ` Alexander Meshcheryakov
2022-08-10  9:51         ` Torsten Bögershausen
2022-08-10 11:41           ` Torsten Bögershausen
2022-08-10 15:53       ` Junio C Hamano
2022-08-10 17:35         ` Torsten Bögershausen
2022-08-14 13:35 ` [PATCH/RFC 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
2022-08-14 23:12   ` Junio C Hamano
2022-08-15  6:34     ` Torsten Bögershausen
2022-08-18 21:00       ` Junio C Hamano
2022-08-27  8:50 ` [PATCH v2 " tboegi
2022-08-27  8:54   ` Torsten Bögershausen
2022-08-27  9:50     ` Eric Sunshine
2022-08-29 12:04   ` Johannes Schindelin
2022-08-29 17:54     ` Torsten Bögershausen
2022-08-29 18:37       ` Junio C Hamano
2022-09-02  9:47       ` Johannes Schindelin
2022-09-02  4:21 ` [PATCH v3 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
2022-09-02  9:39   ` Johannes Schindelin
2022-09-02  4:21 ` [PATCH v3 2/2] diff.c: More changes and tests around utf8_strwidth() tboegi
2022-09-02 10:12   ` Johannes Schindelin
2022-09-03  5:39 ` [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
2022-09-05 20:46   ` Junio C Hamano
2022-09-07  4:30     ` Torsten Bögershausen
2022-09-07 18:31       ` Junio C Hamano
2022-09-03  5:39 ` tboegi [this message]
2022-09-05 10:13   ` [PATCH v4 2/2] diff.c: More changes and tests around utf8_strwidth() Johannes Schindelin
2022-09-14 15:13 ` [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
2022-09-14 16:40   ` Junio C Hamano
2022-09-26 18:43     ` Torsten Bögershausen
2022-10-10 21:58       ` Junio C Hamano
2022-10-20 15:46         ` Torsten Bögershausen
2022-10-20 17:43           ` Junio C Hamano
2022-10-21 15:19             ` Torsten Bögershausen
2022-10-21 21:59               ` Junio C Hamano
2022-10-23 20:02                 ` Torsten Bögershausen
2022-09-15  2:57   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220903053934.15629-1-tboegi@web.de \
    --to=tboegi@web.de \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=alexander.s.m@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.