git.vger.kernel.org archive mirror
From: Junio C Hamano <gitster@pobox.com>
To: tboegi@web.de
Cc: git@vger.kernel.org, alexander.s.m@gmail.com, Johannes.Schindelin@gmx.de
Subject: Re: [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1
Date: Mon, 05 Sep 2022 13:46:57 -0700
Message-ID: <xmqqv8q1zgzi.fsf@gitster.g>
In-Reply-To: <20220903053931.15611-1-tboegi@web.de> (tboegi@web.de's message of "Sat, 3 Sep 2022 07:39:31 +0200")

tboegi@web.de writes:

> From: Torsten Bögershausen <tboegi@web.de>
> Subject: Re: [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1

Given 2/2 does not share a similar title, "part1" sounds somewhat
strange.  In any case, 'when appropriate,' is probably best unsaid,
as it is almost a given.  We won't deliberately use something that
is not appropriate anyway.  Even if we were to keep that
word, downcase "When".


> When unicode filenames (encoded in UTF-8) are used, the visible width
> on the screen is not the same as strlen(filename).
>
> For example, `git log --stat` may produce an output like this:
>
> [snip the header]
>
>  Arger.txt  | 1 +
>  Ärger.txt | 1 +
>  2 files changed, 2 insertions(+)
>
> A side note: the original report was about cyrillic filenames.
> After some investigations it turned out that
> a) This is not a problem with "ambiguous characters" in unicode
> b) The same problem exists for all unicode code points (so we
>   can use Latin based Umlauts for demonstrations below)
>
> The 'Ä' takes the same space on the screen as the 'A'.
> But needs one more byte in memory, so the `git log --stat` output
> for "Arger.txt" (!) gets mis-aligned:
> The maximum length is derived from "Ärger.txt", 10 bytes in memory,
> 9 positions on the screen. That is why "Arger.txt" gets one extra ' '
> for alignment, it needs 9 bytes in memory.
> If there was a file "Ö", it would be correctly aligned by chance,
> but "Öhö" would not.
>
> The solution is of course, to use utf8_strwidth() instead of strlen()
> when dealing with the width on screen.
>
> Side note 1:
> Needed changes for this fix are split into 2 commits:
> This commit only changes strlen() into utf8_strwidth() in diff.c:
> The next commit will add tests and further needed changes.

I am not sure if it makes sense to split them into two.  It is hard
for us to demonstrate the need for this step if it does not come
with its own test.
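
To make the byte-count versus screen-width discrepancy from the quoted
log message concrete, here is a minimal sketch.  The helper
approx_display_width() is hypothetical and is NOT git's
utf8_strwidth(); it merely counts UTF-8 code points, on the assumption
that the input is valid UTF-8 and that every code point occupies
exactly one column (true for 'Ä' and for Cyrillic letters, not true
for wide or combining characters).

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not git's utf8_strwidth(): count UTF-8 code
 * points by skipping continuation bytes (bit pattern 10xxxxxx).
 * Assumes valid UTF-8 where each code point is one column wide.
 */
static int approx_display_width(const char *s)
{
	int width = 0;

	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			width++;
	return width;
}

/* Number of bytes strlen() counts beyond the approximate width. */
static int excess_bytes(const char *s)
{
	return (int)strlen(s) - approx_display_width(s);
}
```

With "Ärger.txt" spelled as "\xC3\x84rger.txt", strlen() reports 10
while the helper reports 9, so excess_bytes() is 1; for "Arger.txt"
both report 9.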

> Side note 2:
> Junio C Hamano suspects that there is probably more work to be done,
> in a separate commit:
> Code in diff.c::pprint_rename() that "abbreviates" overly long pathnames
> and "transforms" renames lines like
> "a/b/c -> a/B/c" into the shorter
> "a/{b->B}/c" form, and IIRC this is all byte based.

I already said that I suspect {b->B} conversion is OK, so the side
note is probably more noise than being useful.
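
One plausible reason a byte-based abbreviation can be harmless (an
assumption on my part, not something spelled out in this message) is
that it only ever cuts at '/' bytes, and the byte 0x2F never occurs
inside a multi-byte UTF-8 sequence.  The sketch below is a
hypothetical, simplified stand-in, not git's pprint_rename(); the name
abbreviate_rename() and the "->" spelling (taken from the quoted side
note) are illustrative only.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical, simplified rename abbreviation; not git's
 * pprint_rename().  Works purely on bytes, but every cut point
 * is a '/' byte.
 */
static void abbreviate_rename(char *out, size_t outlen,
			      const char *a, const char *b)
{
	int len_a = (int)strlen(a), len_b = (int)strlen(b);
	int pfx = 0, sfx = 0, low, i, k, mid_a, mid_b;

	/* Common prefix: only ever cut right after a shared '/'. */
	for (i = 0; a[i] && b[i] && a[i] == b[i]; i++)
		if (a[i] == '/')
			pfx = i + 1;

	/*
	 * Common suffix: only ever cut at a shared '/'.  It may reuse
	 * the slash that ends the prefix, but nothing before it.
	 */
	low = pfx ? pfx - 1 : 0;
	for (k = 1; len_a - k >= low && len_b - k >= low &&
		    a[len_a - k] == b[len_b - k]; k++)
		if (a[len_a - k] == '/')
			sfx = k;

	/* Nothing in common at a directory boundary: no braces. */
	if (!pfx && !sfx) {
		snprintf(out, outlen, "%s->%s", a, b);
		return;
	}

	/* The two middles can be empty when prefix and suffix overlap. */
	mid_a = len_a - pfx - sfx;
	mid_b = len_b - pfx - sfx;
	if (mid_a < 0)
		mid_a = 0;
	if (mid_b < 0)
		mid_b = 0;

	snprintf(out, outlen, "%.*s{%.*s->%.*s}%s",
		 pfx, a, mid_a, a + pfx, mid_b, b + pfx,
		 a + len_a - sfx);
}
```

Given "d/Ärger/x" and "d/Öl/x", whose differing components happen to
share their first byte (0xC3), the prefix is still cut after "d/" and
neither 'Ä' nor 'Ö' is split.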
>
> Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
> Signed-off-by: Torsten Bögershausen <tboegi@web.de>
> ---
>  diff.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/diff.c b/diff.c
> index 974626a621..b5df464de5 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2620,7 +2620,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			continue;
>  		}
>  		fill_print_name(file);
> -		len = strlen(file->print_name);
> +		len = utf8_strwidth(file->print_name);
>  		if (max_len < len)
>  			max_len = len;
>
> @@ -2743,7 +2743,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  		 * "scale" the filename
>  		 */
>  		len = name_width;
> -		name_len = strlen(name);
> +		name_len = utf8_strwidth(name);
>  		if (name_width < name_len) {
>  			char *slash;
>  			prefix = "...";
> --
> 2.34.0
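
As a rough illustration of what the two hunks are after, the sketch
below computes the column width from display width and pads each name
accordingly.  It is not the code in diff.c: approx_display_width(),
max_display_width() and format_stat_name() are hypothetical names, and
the width helper assumes valid UTF-8 with one column per code point,
which utf8_strwidth() does not need to assume.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for utf8_strwidth(): counts code points. */
static int approx_display_width(const char *s)
{
	int width = 0;

	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			width++;
	return width;
}

/* Widest name in columns, mirroring the max_len loop in spirit. */
static int max_display_width(const char **names, int n)
{
	int i, max_len = 0;

	for (i = 0; i < n; i++) {
		int len = approx_display_width(names[i]);
		if (max_len < len)
			max_len = len;
	}
	return max_len;
}

/*
 * Write " <name><padding> |" into out.  The padding is computed in
 * columns, because "%-*s" would pad by bytes and misalign names
 * that contain multi-byte characters.
 */
static void format_stat_name(char *out, size_t outlen,
			     const char *name, int name_width)
{
	int pad = name_width - approx_display_width(name);

	if (pad < 0)
		pad = 0;
	snprintf(out, outlen, " %s%*s |", name, pad, "");
}
```

For the two names in the quoted example the maximum width is 9, and
both formatted lines then occupy the same number of columns even
though they differ in byte length.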

Thread overview: 42+ messages
2022-08-09 13:11 [BUG] Unicode filenames handling in `git log --stat` Alexander Meshcheryakov
2022-08-09 18:20 ` Calvin Wan
2022-08-09 19:03   ` Alexander Meshcheryakov
2022-08-09 21:36     ` Calvin Wan
2022-08-10  5:55   ` Junio C Hamano
2022-08-10  8:40     ` Torsten Bögershausen
2022-08-10  8:56       ` Alexander Meshcheryakov
2022-08-10  9:51         ` Torsten Bögershausen
2022-08-10 11:41           ` Torsten Bögershausen
2022-08-10 15:53       ` Junio C Hamano
2022-08-10 17:35         ` Torsten Bögershausen
2022-08-14 13:35 ` [PATCH/RFC 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
2022-08-14 23:12   ` Junio C Hamano
2022-08-15  6:34     ` Torsten Bögershausen
2022-08-18 21:00       ` Junio C Hamano
2022-08-27  8:50 ` [PATCH v2 " tboegi
2022-08-27  8:54   ` Torsten Bögershausen
2022-08-27  9:50     ` Eric Sunshine
2022-08-29 12:04   ` Johannes Schindelin
2022-08-29 17:54     ` Torsten Bögershausen
2022-08-29 18:37       ` Junio C Hamano
2022-09-02  9:47       ` Johannes Schindelin
2022-09-02  4:21 ` [PATCH v3 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
2022-09-02  9:39   ` Johannes Schindelin
2022-09-02  4:21 ` [PATCH v3 2/2] diff.c: More changes and tests around utf8_strwidth() tboegi
2022-09-02 10:12   ` Johannes Schindelin
2022-09-03  5:39 ` [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1 tboegi
2022-09-05 20:46   ` Junio C Hamano [this message]
2022-09-07  4:30     ` Torsten Bögershausen
2022-09-07 18:31       ` Junio C Hamano
2022-09-03  5:39 ` [PATCH v4 2/2] diff.c: More changes and tests around utf8_strwidth() tboegi
2022-09-05 10:13   ` Johannes Schindelin
2022-09-14 15:13 ` [PATCH v5 1/1] diff.c: When appropriate, use utf8_strwidth() tboegi
2022-09-14 16:40   ` Junio C Hamano
2022-09-26 18:43     ` Torsten Bögershausen
2022-10-10 21:58       ` Junio C Hamano
2022-10-20 15:46         ` Torsten Bögershausen
2022-10-20 17:43           ` Junio C Hamano
2022-10-21 15:19             ` Torsten Bögershausen
2022-10-21 21:59               ` Junio C Hamano
2022-10-23 20:02                 ` Torsten Bögershausen
2022-09-15  2:57   ` Junio C Hamano
