git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] Linux with musl libc improvement
@ 2019-10-31  9:26 Doan Tran Cong Danh
  2019-10-31  9:26 ` [PATCH 1/3] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
                   ` (7 more replies)
  0 siblings, 8 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-10-31  9:26 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

This series of patch tries to ease the life of musl libc user (by running
configure once) and clear out those last failure on musl.

Doan Tran Cong Danh (3):
  t0028: eliminate non-standard usage of printf
  configure.ac: define ICONV_OMITS_BOM if necessary
  sequencer: reencode to utf-8 before arrange rebase's todo list

 configure.ac                     | 22 ++++++++++++++++++++++
 sequencer.c                      |  2 +-
 t/t0028-working-tree-encoding.sh |  4 ++--
 3 files changed, 25 insertions(+), 3 deletions(-)

-- 
2.24.0.rc1.3.g89530838a3.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH 1/3] t0028: eliminate non-standard usage of printf
  2019-10-31  9:26 [PATCH 0/3] Linux with musl libc improvement Doan Tran Cong Danh
@ 2019-10-31  9:26 ` Doan Tran Cong Danh
  2019-10-31 17:41   ` Jeff King
  2019-10-31 19:50   ` brian m. carlson
  2019-10-31  9:26 ` [PATCH 2/3] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-10-31  9:26 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

man 1p printf:
   In addition to the escape sequences shown in the Base Definitions
   volume of POSIX.1‐2008, Chapter 5, File Format Notation ('\\',
   '\a', '\b', '\f', '\n', '\r', '\t', '\v'), "\ddd", where ddd is a
   one, two, or three-digit octal number, shall be written as a byte
   with the numeric value specified by the octal number.

printf '\xfe\xff' in an extension of some libc.

With dash:
$ printf '\xfe\xff' | xxd
00000000: 5c78 6665 5c78 6666                      \xfe\xff

Correct its usage.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---

Notes:
    Despite that dash's printf doesn't accept \x escape sequence.
    
    My glibc box (with sh linked to dash) can run the test just fine.
    But my musl box couldn't run the test, (because the header).

 t/t0028-working-tree-encoding.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 7aa0945d8d..bfc4fb9af5 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -17,7 +17,7 @@ test_lazy_prereq NO_UTF32_BOM '
 write_utf16 () {
 	if test_have_prereq NO_UTF16_BOM
 	then
-		printf '\xfe\xff'
+		printf '\376\377'
 	fi &&
 	iconv -f UTF-8 -t UTF-16
 }
@@ -25,7 +25,7 @@ write_utf16 () {
 write_utf32 () {
 	if test_have_prereq NO_UTF32_BOM
 	then
-		printf '\x00\x00\xfe\xff'
+		printf '\0\0\376\377'
 	fi &&
 	iconv -f UTF-8 -t UTF-32
 }
-- 
2.24.0.rc1.3.g89530838a3.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH 2/3] configure.ac: define ICONV_OMITS_BOM if necessary
  2019-10-31  9:26 [PATCH 0/3] Linux with musl libc improvement Doan Tran Cong Danh
  2019-10-31  9:26 ` [PATCH 1/3] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
@ 2019-10-31  9:26 ` Doan Tran Cong Danh
  2019-10-31 18:11   ` Jeff King
  2019-10-31  9:26 ` [PATCH 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-10-31  9:26 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

From commit 79444c9294, ("utf8: handle systems that don't write BOM for
UTF-16", 2019-02-12), we're supporting those systems with iconv that
omits BOM with:

    make ICONV_OMITS_BOM=Yes

However, typing the flag all the time is cumbersome and error-prone.

Add a checking into configure script to detect this flag automatically.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---

Notes:
    We deliberately fail for ac_cv_iconv_omits_bom on cross-compiling,
    in order to ask builder provide the value for the target.
    
    We're relied on this technik for ac_cv_fread_reads_directories and
    ac_cv_snprintf_returns_bogus.
    
    Adding one more failure for configuring on cross-compiling
    is not going to be a burden for distro.

 configure.ac | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/configure.ac b/configure.ac
index a43b476402..790b53bbdc 100644
--- a/configure.ac
+++ b/configure.ac
@@ -690,6 +690,28 @@ fi
 
 fi
 
+#
+# Define ICONV_OMITS_BOM if you are on a system which
+# iconv omits bom for utf-{16,32}
+if test -z "$NO_ICONV"; then
+AC_CACHE_CHECK([whether iconv omits bom for utf-16 and utf-32],
+ [ac_cv_iconv_omits_bom],
+[
+if test "x$cross_compiling" = xyes; then
+	AC_MSG_FAILURE([please provide ac_cv_iconv_omits_bom])
+elif test `printf a | iconv -f utf-8 -t utf-16 | wc -c` = 2; then
+	ac_cv_iconv_omits_bom=yes
+else
+	ac_cv_iconv_omits_bom=no
+fi
+])
+if test "x$ac_cv_iconv_omits_bom" = xyes; then
+	ICONV_OMITS_BOM=Yes
+else
+	ICONV_OMITS_BOM=
+fi
+GIT_CONF_SUBST([ICONV_OMITS_BOM])
+fi
 #
 # Define NO_DEFLATE_BOUND if deflateBound is missing from zlib.
 
-- 
2.24.0.rc1.3.g89530838a3.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-10-31  9:26 [PATCH 0/3] Linux with musl libc improvement Doan Tran Cong Danh
  2019-10-31  9:26 ` [PATCH 1/3] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
  2019-10-31  9:26 ` [PATCH 2/3] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
@ 2019-10-31  9:26 ` Doan Tran Cong Danh
  2019-10-31 10:38   ` Johannes Schindelin
  2019-11-01  8:25 ` [PATCH v2 0/3] Linux with musl libc improvement Doan Tran Cong Danh
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-10-31  9:26 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

On musl libc, ISO-2022-JP encoder is too eager to switch back to
1 byte encoding, musl's iconv always switch back after every combining
character. Comparing glibc and musl's output for this command
$ sed q t/t3900/ISO-2022-JP.txt| iconv -f ISO-2022-JP -t utf-8 |
	iconv -f utf-8 -t ISO-2022-JP | xxd

glibc:
00000000: 1b24 4224 4f24 6c24 5224 5b24 551b 2842  .$B$O$l$R$[$U.(B
00000010: 0a                                       .

musl:
00000000: 1b24 4224 4f1b 2842 1b24 4224 6c1b 2842  .$B$O.(B.$B$l.(B
00000010: 1b24 4224 521b 2842 1b24 4224 5b1b 2842  .$B$R.(B.$B$[.(B
00000020: 1b24 4224 551b 2842 0a                   .$B$U.(B.

Although musl iconv's output isn't optimal, it's still correct.

From commit 7d509878b8, ("pretty.c: format string with truncate respects
logOutputEncoding", 2014-05-21), we're encoding the message to utf-8
first, then format it and convert the message to the actual output
encoding on git commit --squash.

Thus, t3900 is failing on Linux with musl libc.

Reencode to utf-8 before arranging rebase's todo list.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---

Notes:
    The todo list shown to user has already been reencoded by sequencer_make_script,
    without this patch it looks like this:
    
    $ head -3 .git/rebase-merge/git-rebase-todo | xxd
    00000000: 7069 636b 2065 6633 3961 3033 201b 2442  pick ef39a03 .$B
    00000010: 244f 1b28 421b 2442 246c 1b28 421b 2442  $O.(B.$B$l.(B.$B
    00000020: 2452 1b28 421b 2442 245b 1b28 421b 2442  $R.(B.$B$[.(B.$B
    00000030: 2455 1b28 420a 7069 636b 2062 3832 3931  $U.(B.pick b8291
    00000040: 3336 2073 7175 6173 6821 201b 2442 244f  36 squash! .$B$O
    00000050: 1b28 421b 2442 246c 1b28 421b 2442 2452  .(B.$B$l.(B.$B$R
    00000060: 1b28 421b 2442 245b 1b28 421b 2442 2455  .(B.$B$[.(B.$B$U
    00000070: 1b28 420a 7069 636b 2062 3532 3132 6437  .(B.pick b5212d7
    00000080: 2069 6e74 6572 6d65 6469 6174 6520 636f   intermediate co
    00000090: 6d6d 6974 0a                             mmit.

 sequencer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sequencer.c b/sequencer.c
index 9d5964fd81..69430fe23f 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -5169,7 +5169,7 @@ int todo_list_rearrange_squash(struct todo_list *todo_list)
 		*commit_todo_item_at(&commit_todo, item->commit) = item;
 
 		parse_commit(item->commit);
-		commit_buffer = get_commit_buffer(item->commit, NULL);
+		commit_buffer = logmsg_reencode(item->commit, NULL, "UTF-8");
 		find_commit_subject(commit_buffer, &subject);
 		format_subject(&buf, subject, " ");
 		subject = subjects[i] = strbuf_detach(&buf, &subject_len);
-- 
2.24.0.rc1.3.g89530838a3.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-10-31  9:26 ` [PATCH 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
@ 2019-10-31 10:38   ` Johannes Schindelin
  2019-10-31 19:26     ` Jeff King
  0 siblings, 1 reply; 89+ messages in thread
From: Johannes Schindelin @ 2019-10-31 10:38 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git

Hi,


On Thu, 31 Oct 2019, Doan Tran Cong Danh wrote:

> On musl libc, ISO-2022-JP encoder is too eager to switch back to
> 1 byte encoding, musl's iconv always switch back after every combining
> character. Comparing glibc and musl's output for this command
> $ sed q t/t3900/ISO-2022-JP.txt| iconv -f ISO-2022-JP -t utf-8 |
> 	iconv -f utf-8 -t ISO-2022-JP | xxd
>
> glibc:
> 00000000: 1b24 4224 4f24 6c24 5224 5b24 551b 2842  .$B$O$l$R$[$U.(B
> 00000010: 0a                                       .
>
> musl:
> 00000000: 1b24 4224 4f1b 2842 1b24 4224 6c1b 2842  .$B$O.(B.$B$l.(B
> 00000010: 1b24 4224 521b 2842 1b24 4224 5b1b 2842  .$B$R.(B.$B$[.(B
> 00000020: 1b24 4224 551b 2842 0a                   .$B$U.(B.
>
> Although musl iconv's output isn't optimal, it's still correct.
>
> From commit 7d509878b8, ("pretty.c: format string with truncate respects
> logOutputEncoding", 2014-05-21), we're encoding the message to utf-8
> first, then format it and convert the message to the actual output
> encoding on git commit --squash.
>
> Thus, t3900 is failing on Linux with musl libc.
>
> Reencode to utf-8 before arranging rebase's todo list.

Since the re-encoded commit messages are only used for figuring out the
relationships between the `fixup!`/`squash!` commits and their targets,
but are not used in any of the lines that are written out, this change
looks good to me.

In fact, all three patches look good to me.

Thanks for unbreaking Git with musl!
Johannes

>
> Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
> ---
>
> Notes:
>     The todo list shown to user has already been reencoded by sequencer_make_script,
>     without this patch it looks like this:
>
>     $ head -3 .git/rebase-merge/git-rebase-todo | xxd
>     00000000: 7069 636b 2065 6633 3961 3033 201b 2442  pick ef39a03 .$B
>     00000010: 244f 1b28 421b 2442 246c 1b28 421b 2442  $O.(B.$B$l.(B.$B
>     00000020: 2452 1b28 421b 2442 245b 1b28 421b 2442  $R.(B.$B$[.(B.$B
>     00000030: 2455 1b28 420a 7069 636b 2062 3832 3931  $U.(B.pick b8291
>     00000040: 3336 2073 7175 6173 6821 201b 2442 244f  36 squash! .$B$O
>     00000050: 1b28 421b 2442 246c 1b28 421b 2442 2452  .(B.$B$l.(B.$B$R
>     00000060: 1b28 421b 2442 245b 1b28 421b 2442 2455  .(B.$B$[.(B.$B$U
>     00000070: 1b28 420a 7069 636b 2062 3532 3132 6437  .(B.pick b5212d7
>     00000080: 2069 6e74 6572 6d65 6469 6174 6520 636f   intermediate co
>     00000090: 6d6d 6974 0a                             mmit.
>
>  sequencer.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/sequencer.c b/sequencer.c
> index 9d5964fd81..69430fe23f 100644
> --- a/sequencer.c
> +++ b/sequencer.c
> @@ -5169,7 +5169,7 @@ int todo_list_rearrange_squash(struct todo_list *todo_list)
>  		*commit_todo_item_at(&commit_todo, item->commit) = item;
>
>  		parse_commit(item->commit);
> -		commit_buffer = get_commit_buffer(item->commit, NULL);
> +		commit_buffer = logmsg_reencode(item->commit, NULL, "UTF-8");
>  		find_commit_subject(commit_buffer, &subject);
>  		format_subject(&buf, subject, " ");
>  		subject = subjects[i] = strbuf_detach(&buf, &subject_len);
> --
> 2.24.0.rc1.3.g89530838a3.dirty
>
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 1/3] t0028: eliminate non-standard usage of printf
  2019-10-31  9:26 ` [PATCH 1/3] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
@ 2019-10-31 17:41   ` Jeff King
  2019-11-01  1:33     ` Danh Doan
  2019-10-31 19:50   ` brian m. carlson
  1 sibling, 1 reply; 89+ messages in thread
From: Jeff King @ 2019-10-31 17:41 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git

On Thu, Oct 31, 2019 at 04:26:16PM +0700, Doan Tran Cong Danh wrote:

> man 1p printf:
>    In addition to the escape sequences shown in the Base Definitions
>    volume of POSIX.1‐2008, Chapter 5, File Format Notation ('\\',
>    '\a', '\b', '\f', '\n', '\r', '\t', '\v'), "\ddd", where ddd is a
>    one, two, or three-digit octal number, shall be written as a byte
>    with the numeric value specified by the octal number.
> 
> printf '\xfe\xff' in an extension of some libc.

I don't think it's libc here at all, but rather the shell.

And as you note, dash does not handle this:

> With dash:
> $ printf '\xfe\xff' | xxd
> 00000000: 5c78 6665 5c78 6666                      \xfe\xff

Which makes me wonder how these tests worked at all for people on say,
Debian systems.

I think the answer is that they don't trigger this prereq:

>  write_utf16 () {
>  	if test_have_prereq NO_UTF16_BOM
>  	then
> -		printf '\xfe\xff'
> +		printf '\376\377'
>  	fi &&
>  	iconv -f UTF-8 -t UTF-16
>  }

which comes from:

  test_lazy_prereq NO_UTF16_BOM '
          test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
  '

Presumably on your system iconv is using musl's internal iconv, which
behaves differently than the glibc one.

So I think your patch is the right thing to do (hex escapes in shell
printf are definitely not portable), but we might want to note something
like:

  This wasn't caught by most people running the tests, even though
  common shells like dash don't handle hex escapes, because their
  systems don't trigger the NO_UTF16_BOM prereq. But systems with musl
  libc do; when combined with dash, the test fails.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 2/3] configure.ac: define ICONV_OMITS_BOM if necessary
  2019-10-31  9:26 ` [PATCH 2/3] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
@ 2019-10-31 18:11   ` Jeff King
  2019-10-31 20:02     ` brian m. carlson
  2019-11-01  1:40     ` Danh Doan
  0 siblings, 2 replies; 89+ messages in thread
From: Jeff King @ 2019-10-31 18:11 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git

On Thu, Oct 31, 2019 at 04:26:17PM +0700, Doan Tran Cong Danh wrote:

> From commit 79444c9294, ("utf8: handle systems that don't write BOM for
> UTF-16", 2019-02-12), we're supporting those systems with iconv that
> omits BOM with:
> 
>     make ICONV_OMITS_BOM=Yes
> 
> However, typing the flag all the time is cumbersome and error-prone.
> 
> Add a checking into configure script to detect this flag automatically.

I think it's worth adding this to the configure script, but note that
you can also do:

  echo ICONV_OMITS_BOM=Yes >config.mak

and then "make" by itself will do what you want (it's still annoying and
error-prone to realize you have to specify the flag in the first place).

> diff --git a/configure.ac b/configure.ac
> index a43b476402..790b53bbdc 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -690,6 +690,28 @@ fi
>  
>  fi
>  
> +#
> +# Define ICONV_OMITS_BOM if you are on a system which
> +# iconv omits bom for utf-{16,32}
> +if test -z "$NO_ICONV"; then
> +AC_CACHE_CHECK([whether iconv omits bom for utf-16 and utf-32],
> + [ac_cv_iconv_omits_bom],
> +[
> +if test "x$cross_compiling" = xyes; then
> +	AC_MSG_FAILURE([please provide ac_cv_iconv_omits_bom])
> +elif test `printf a | iconv -f utf-8 -t utf-16 | wc -c` = 2; then

The ICONV_OMITS_BOM flag is about the libc iconv that Git will be linked
against. But this is checking the iconv tool. For a system that is using
musl across the board, that would work. But it might not always be the
case (in particular, I don't know if people statically link some
binaries against musl; certainly I've seen people do it with dietlibc).

I think we should be test-compiling a small program, similar to the way
OLD_ICONV works (though I guess we may even need to run the result to
see what happens).

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-10-31 10:38   ` Johannes Schindelin
@ 2019-10-31 19:26     ` Jeff King
  2019-11-01  4:49       ` Danh Doan
  0 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2019-10-31 19:26 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Doan Tran Cong Danh, git

On Thu, Oct 31, 2019 at 11:38:14AM +0100, Johannes Schindelin wrote:

> On Thu, 31 Oct 2019, Doan Tran Cong Danh wrote:
> 
> > On musl libc, ISO-2022-JP encoder is too eager to switch back to
> > 1 byte encoding, musl's iconv always switch back after every combining
> > character. Comparing glibc and musl's output for this command
> > $ sed q t/t3900/ISO-2022-JP.txt| iconv -f ISO-2022-JP -t utf-8 |
> > 	iconv -f utf-8 -t ISO-2022-JP | xxd
> >
> > glibc:
> > 00000000: 1b24 4224 4f24 6c24 5224 5b24 551b 2842  .$B$O$l$R$[$U.(B
> > 00000010: 0a                                       .
> >
> > musl:
> > 00000000: 1b24 4224 4f1b 2842 1b24 4224 6c1b 2842  .$B$O.(B.$B$l.(B
> > 00000010: 1b24 4224 521b 2842 1b24 4224 5b1b 2842  .$B$R.(B.$B$[.(B
> > 00000020: 1b24 4224 551b 2842 0a                   .$B$U.(B.
> >
> > Although musl iconv's output isn't optimal, it's still correct.
> >
> > From commit 7d509878b8, ("pretty.c: format string with truncate respects
> > logOutputEncoding", 2014-05-21), we're encoding the message to utf-8
> > first, then format it and convert the message to the actual output
> > encoding on git commit --squash.
> >
> > Thus, t3900 is failing on Linux with musl libc.
> >
> > Reencode to utf-8 before arranging rebase's todo list.
> 
> Since the re-encoded commit messages are only used for figuring out the
> relationships between the `fixup!`/`squash!` commits and their targets,
> but are not used in any of the lines that are written out, this change
> looks good to me.

I'm confused about a few things here, though. I agree with you that the
subjects here are only used for finding the fixup/squash relationships.
But I don't understand the musl connection.

Wouldn't failure to reencode here always be a problem? E.g., if I do:

  for encoding in utf-8 iso-8859-1; do
    # commit using the encoding
    echo $encoding >file && git add file
    echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
      git -c i18n.commitEncoding=$encoding commit -F -
    # and then fixup without it
    echo "$encoding fixed" >file && git add file
    git commit --fixup HEAD
  done
  
  GIT_EDITOR='echo; grep -v ^#' git rebase -i --root --autosquash

then the resulting todo-list output (on my glibc system) is:

  pick 3a5bace éñcödèd with utf-8
  fixup aa9f09c fixup! éñcödèd with utf-8
  pick 6e85d32 éñcödèd with iso-8859-1
  pick 3ceac05 fixup! éñcödèd with iso-8859-1

I.e., we don't actually match up the second pair, and I think we
probably ought to.

I guess the test in t3900 is less exotic; it uses the same encoding for
both commits. And it's just that "foo" and "!fixup foo" can (and do in
musl) end up with different encodings (because of the specific language,
and the vagaries of each iconv implementation).

Would we have similar problems in all of the other functions which use
get_commit_buffer() without reencoding? For instance if I do this:

  echo base >file && git add file && git commit -m base
  for encoding in utf-8 iso-8859-1; do
    echo $encoding >file && git add file
    echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
      git -c i18n.commitEncoding=$encoding commit -F -
  done
  git checkout -b side HEAD~2
  git cherry-pick master master^
  cat .git/sequencer/todo

then the resulting todo file has a mix of iso-8859-1 and utf-8.

It seems to me that we should always be working with the subjects in a
single encoding internally, and likewise outputting in that format
(which should probably be git_log_output_encoding(), for the instances
where we show it to the user).

I.e., we should always call logmsg_reencode() instead of
get_commit_buffer().

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 1/3] t0028: eliminate non-standard usage of printf
  2019-10-31  9:26 ` [PATCH 1/3] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
  2019-10-31 17:41   ` Jeff King
@ 2019-10-31 19:50   ` brian m. carlson
  1 sibling, 0 replies; 89+ messages in thread
From: brian m. carlson @ 2019-10-31 19:50 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1105 bytes --]

On 2019-10-31 at 09:26:16, Doan Tran Cong Danh wrote:
> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> index 7aa0945d8d..bfc4fb9af5 100755
> --- a/t/t0028-working-tree-encoding.sh
> +++ b/t/t0028-working-tree-encoding.sh
> @@ -17,7 +17,7 @@ test_lazy_prereq NO_UTF32_BOM '
>  write_utf16 () {
>  	if test_have_prereq NO_UTF16_BOM
>  	then
> -		printf '\xfe\xff'
> +		printf '\376\377'
>  	fi &&
>  	iconv -f UTF-8 -t UTF-16
>  }
> @@ -25,7 +25,7 @@ write_utf16 () {
>  write_utf32 () {
>  	if test_have_prereq NO_UTF32_BOM
>  	then
> -		printf '\x00\x00\xfe\xff'
> +		printf '\0\0\376\377'
>  	fi &&
>  	iconv -f UTF-8 -t UTF-32
>  }

Yeah, this patch looks obviously correct.  For some reason, I knew that
printf(1) didn't accept hex sequences and yet I still wrote them.

As Peff pointed out, this is because we only trigger this case on musl;
musl is the only known implementation of iconv that produces big-endian
data without a BOM for UTF-16 and UTF-32.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 2/3] configure.ac: define ICONV_OMITS_BOM if necessary
  2019-10-31 18:11   ` Jeff King
@ 2019-10-31 20:02     ` brian m. carlson
  2019-11-01  1:40     ` Danh Doan
  1 sibling, 0 replies; 89+ messages in thread
From: brian m. carlson @ 2019-10-31 20:02 UTC (permalink / raw)
  To: Jeff King; +Cc: Doan Tran Cong Danh, git

[-- Attachment #1: Type: text/plain, Size: 1616 bytes --]

On 2019-10-31 at 18:11:16, Jeff King wrote:
> On Thu, Oct 31, 2019 at 04:26:17PM +0700, Doan Tran Cong Danh wrote:
> > diff --git a/configure.ac b/configure.ac
> > index a43b476402..790b53bbdc 100644
> > --- a/configure.ac
> > +++ b/configure.ac
> > @@ -690,6 +690,28 @@ fi
> >  
> >  fi
> >  
> > +#
> > +# Define ICONV_OMITS_BOM if you are on a system which
> > +# iconv omits bom for utf-{16,32}
> > +if test -z "$NO_ICONV"; then
> > +AC_CACHE_CHECK([whether iconv omits bom for utf-16 and utf-32],
> > + [ac_cv_iconv_omits_bom],
> > +[
> > +if test "x$cross_compiling" = xyes; then
> > +	AC_MSG_FAILURE([please provide ac_cv_iconv_omits_bom])
> > +elif test `printf a | iconv -f utf-8 -t utf-16 | wc -c` = 2; then
> 
> The ICONV_OMITS_BOM flag is about the libc iconv that Git will be linked
> against. But this is checking the iconv tool. For a system that is using
> musl across the board, that would work. But it might not always be the
> case (in particular, I don't know if people statically link some
> binaries against musl; certainly I've seen people do it with dietlibc).

Yeah, I think we need to do that, and not only for musl.  There are
folks for whom the iconv in libc is missing or inadequate and they use
an additional iconv(3) implementation which may not be the same as
iconv(1) (or vice versa).

Granted, as far as we know this option is only needed for musl, but that
doesn't mean there aren't other environments where this is a problem,
only that we don't yet know about them.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 1/3] t0028: eliminate non-standard usage of printf
  2019-10-31 17:41   ` Jeff King
@ 2019-11-01  1:33     ` Danh Doan
  0 siblings, 0 replies; 89+ messages in thread
From: Danh Doan @ 2019-11-01  1:33 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On 2019-10-31 13:41:18 -0400, Jeff King wrote:
> So I think your patch is the right thing to do (hex escapes in shell
> printf are definitely not portable), but we might want to note something
> like:
> 
>   This wasn't caught by most people running the tests, even though
>   common shells like dash don't handle hex escapes, because their
>   systems don't trigger the NO_UTF16_BOM prereq. But systems with musl
>   libc do; when combined with dash, the test fails.

Make sense. I will take this note in the reroll.

-- 
Danh

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 2/3] configure.ac: define ICONV_OMITS_BOM if necessary
  2019-10-31 18:11   ` Jeff King
  2019-10-31 20:02     ` brian m. carlson
@ 2019-11-01  1:40     ` Danh Doan
  1 sibling, 0 replies; 89+ messages in thread
From: Danh Doan @ 2019-11-01  1:40 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On 2019-10-31 14:11:16 -0400, Jeff King wrote:
> The ICONV_OMITS_BOM flag is about the libc iconv that Git will be linked
> against. But this is checking the iconv tool. For a system that is using
> musl across the board, that would work. But it might not always be the
> case (in particular, I don't know if people statically link some
> binaries against musl; certainly I've seen people do it with dietlibc).
> 
> I think we should be test-compiling a small program, similar to the way
> OLD_ICONV works (though I guess we may even need to run the result to
> see what happens).

Originally, I wrote a C program for AC_RUN_IFELSE, but I found out the
complicated of OLD_ICONV and NEEDS_LIBICONV, then I switch to this
approach.

But, your reasoning makes sense. I think people could build a static-linked
git binary for a clean-room build system.

I'll pursue the C program in the re-roll.

-- 
Danh

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-10-31 19:26     ` Jeff King
@ 2019-11-01  4:49       ` Danh Doan
  0 siblings, 0 replies; 89+ messages in thread
From: Danh Doan @ 2019-11-01  4:49 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Schindelin, git

On 2019-10-31 15:26:50 -0400, Jeff King wrote:
> I'm confused about a few things here, though. I agree with you that the
> subjects here are only used for finding the fixup/squash relationships.
> But I don't understand the musl connection.

You're right.

Because of musl's iconv implementation, the problem is being shown up
earlier.

> Wouldn't failure to reencode here always be a problem? E.g., if I do:
> 
>   for encoding in utf-8 iso-8859-1; do
>     # commit using the encoding
>     echo $encoding >file && git add file
>     echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
>       git -c i18n.commitEncoding=$encoding commit -F -
>     # and then fixup without it
>     echo "$encoding fixed" >file && git add file
>     git commit --fixup HEAD
>   done
>   
>   GIT_EDITOR='echo; grep -v ^#' git rebase -i --root --autosquash
> 
> then the resulting todo-list output (on my glibc system) is:
> 
>   pick 3a5bace éñcödèd with utf-8
>   fixup aa9f09c fixup! éñcödèd with utf-8
>   pick 6e85d32 éñcödèd with iso-8859-1
>   pick 3ceac05 fixup! éñcödèd with iso-8859-1
> 
> I.e., we don't actually match up the second pair, and I think we
> probably ought to.

Yes, we ought to match up the second pair, and after changing
get_commit_buffer to logmsg_reencode, we do.

> 
> I guess the test in t3900 is less exotic; it uses the same encoding for
> both commits. And it's just that "foo" and "!fixup foo" can (and do in
> musl) end up with different encodings (because of the specific language,
> and the vagaries of each iconv implementation).
> 
> Would we have similar problems in all of the other functions which use
> get_commit_buffer() without reencoding? For instance if I do this:
> 
>   echo base >file && git add file && git commit -m base
>   for encoding in utf-8 iso-8859-1; do
>     echo $encoding >file && git add file
>     echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
>       git -c i18n.commitEncoding=$encoding commit -F -
>   done
>   git checkout -b side HEAD~2
>   git cherry-pick master master^
>   cat .git/sequencer/todo
> 
> then the resulting todo file has a mix of iso-8859-1 and utf-8.
> 
> It seems to me that we should always be working with the subjects in a
> single encoding internally,

I'm in favour of this idea.

> and likewise outputting in that format
> (which should probably be git_log_output_encoding(), for the instances
> where we show it to the user).

This is git's current behaviour but it's get_log_output_encoding()
instead of git_log_output_encoding().

> I.e., we should always call logmsg_reencode() instead of
> get_commit_buffer().

-- 
Danh

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v2 0/3] Linux with musl libc improvement
  2019-10-31  9:26 [PATCH 0/3] Linux with musl libc improvement Doan Tran Cong Danh
                   ` (2 preceding siblings ...)
  2019-10-31  9:26 ` [PATCH 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
@ 2019-11-01  8:25 ` Doan Tran Cong Danh
  2019-11-01  8:25   ` [PATCH v2 1/3] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
                     ` (2 more replies)
  2019-11-06  9:19 ` [PATCH v3 0/8] Correct internal working and output encoding Doan Tran Cong Danh
                   ` (3 subsequent siblings)
  7 siblings, 3 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-01  8:25 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

This series of patch tries to ease the life of musl libc user
and clear those last failure on Linux with musl libc.

The second patch is a complete rewrite.

Doan Tran Cong Danh (3):
  t0028: eliminate non-standard usage of printf
  configure.ac: define ICONV_OMITS_BOM if necessary
  sequencer: reencode to utf-8 before arrange rebase's todo list

 configure.ac                     | 49 ++++++++++++++++++++++++++++++++
 sequencer.c                      |  2 +-
 t/t0028-working-tree-encoding.sh |  4 +--
 3 files changed, 52 insertions(+), 3 deletions(-)

Range-diff against v1:
1:  840c40cea1 ! 1:  8b30028425 t0028: eliminate non-standard usage of printf
    @@ Commit message
            one, two, or three-digit octal number, shall be written as a byte
            with the numeric value specified by the octal number.
     
    -    printf '\xfe\xff' in an extension of some libc.
    +    printf '\xfe\xff' in an extension of some shell.
    +    Dash, a popular yet simple shell, do not implement this extension.
     
    -    With dash:
    -    $ printf '\xfe\xff' | xxd
    -    00000000: 5c78 6665 5c78 6666                      \xfe\xff
    +    This wasn't caught by most people running the tests, even though
    +    common shells like dash don't handle hex escapes, because their
    +    systems don't trigger the NO_UTF16_BOM prereq. But systems with musl
    +    libc do; when combined with dash, the test fails.
     
    -    Correct its usage.
    +    Correct it.
     
         Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
    -    Despite that dash's printf doesn't accept \x escape sequence.
    -
    -    My glibc box (with sh linked to dash) can run the test just fine.
    -    But my musl box couldn't run the test, (because the header).
     
      ## t/t0028-working-tree-encoding.sh ##
     @@ t/t0028-working-tree-encoding.sh: test_lazy_prereq NO_UTF32_BOM '
2:  4a28ee7ef6 < -:  ---------- configure.ac: define ICONV_OMITS_BOM if necessary
-:  ---------- > 2:  7c2c6f0603 configure.ac: define ICONV_OMITS_BOM if necessary
3:  c8da3990e5 ! 3:  b7927b2723 sequencer: reencode to utf-8 before arrange rebase's todo list
    @@ Commit message
     
         Thus, t3900 is failing on Linux with musl libc.
     
    +    This problem wasn't specific to musl libc. On Linux with glibc, this
    +    problem can be observed by:
    +
    +    for encoding in utf-8 iso-8859-1; do
    +            # commit using the encoding
    +            echo $encoding >file && git add file
    +            echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
    +              git -c i18n.commitEncoding=$encoding commit -F -
    +            # and then fixup without it
    +            echo "$encoding fixed" >file && git add file
    +            git commit --fixup HEAD
    +    done
    +    git rebase -i --autosquash --root
    +
         Reencode to utf-8 before arranging rebase's todo list.
     
         Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
    -    The todo list shown to user has already been reencoded by sequencer_make_script,
    -    without this patch it looks like this:
    -
    -    $ head -3 .git/rebase-merge/git-rebase-todo | xxd
    -    00000000: 7069 636b 2065 6633 3961 3033 201b 2442  pick ef39a03 .$B
    -    00000010: 244f 1b28 421b 2442 246c 1b28 421b 2442  $O.(B.$B$l.(B.$B
    -    00000020: 2452 1b28 421b 2442 245b 1b28 421b 2442  $R.(B.$B$[.(B.$B
    -    00000030: 2455 1b28 420a 7069 636b 2062 3832 3931  $U.(B.pick b8291
    -    00000040: 3336 2073 7175 6173 6821 201b 2442 244f  36 squash! .$B$O
    -    00000050: 1b28 421b 2442 246c 1b28 421b 2442 2452  .(B.$B$l.(B.$B$R
    -    00000060: 1b28 421b 2442 245b 1b28 421b 2442 2455  .(B.$B$[.(B.$B$U
    -    00000070: 1b28 420a 7069 636b 2062 3532 3132 6437  .(B.pick b5212d7
    -    00000080: 2069 6e74 6572 6d65 6469 6174 6520 636f   intermediate co
    -    00000090: 6d6d 6974 0a                             mmit.
     
      ## sequencer.c ##
     @@ sequencer.c: int todo_list_rearrange_squash(struct todo_list *todo_list)
-- 
2.24.0.rc2.296.geaf699fcc3


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v2 1/3] t0028: eliminate non-standard usage of printf
  2019-11-01  8:25 ` [PATCH v2 0/3] Linux with musl libc improvement Doan Tran Cong Danh
@ 2019-11-01  8:25   ` Doan Tran Cong Danh
  2019-11-01 16:54     ` Jeff King
  2019-11-01  8:25   ` [PATCH v2 2/3] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
  2019-11-01  8:25   ` [PATCH v2 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
  2 siblings, 1 reply; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-01  8:25 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

man 1p printf:
   In addition to the escape sequences shown in the Base Definitions
   volume of POSIX.1‐2008, Chapter 5, File Format Notation ('\\',
   '\a', '\b', '\f', '\n', '\r', '\t', '\v'), "\ddd", where ddd is a
   one, two, or three-digit octal number, shall be written as a byte
   with the numeric value specified by the octal number.

printf '\xfe\xff' in an extension of some shell.
Dash, a popular yet simple shell, do not implement this extension.

This wasn't caught by most people running the tests, even though
common shells like dash don't handle hex escapes, because their
systems don't trigger the NO_UTF16_BOM prereq. But systems with musl
libc do; when combined with dash, the test fails.

Correct it.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 t/t0028-working-tree-encoding.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 7aa0945d8d..bfc4fb9af5 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -17,7 +17,7 @@ test_lazy_prereq NO_UTF32_BOM '
 write_utf16 () {
 	if test_have_prereq NO_UTF16_BOM
 	then
-		printf '\xfe\xff'
+		printf '\376\377'
 	fi &&
 	iconv -f UTF-8 -t UTF-16
 }
@@ -25,7 +25,7 @@ write_utf16 () {
 write_utf32 () {
 	if test_have_prereq NO_UTF32_BOM
 	then
-		printf '\x00\x00\xfe\xff'
+		printf '\0\0\376\377'
 	fi &&
 	iconv -f UTF-8 -t UTF-32
 }
-- 
2.24.0.rc2.296.geaf699fcc3


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v2 2/3] configure.ac: define ICONV_OMITS_BOM if necessary
  2019-11-01  8:25 ` [PATCH v2 0/3] Linux with musl libc improvement Doan Tran Cong Danh
  2019-11-01  8:25   ` [PATCH v2 1/3] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
@ 2019-11-01  8:25   ` Doan Tran Cong Danh
  2019-11-01 16:56     ` Jeff King
  2019-11-01  8:25   ` [PATCH v2 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
  2 siblings, 1 reply; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-01  8:25 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

From commit 79444c9294, ("utf8: handle systems that don't write BOM for
UTF-16", 2019-02-12), we're supporting those systems with iconv that
omits BOM with:

    make ICONV_OMITS_BOM=Yes

However, typing the flag all the time is cumbersome and error-prone.

Add a checking into configure script to detect this flag automatically.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 configure.ac | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/configure.ac b/configure.ac
index a43b476402..ecba7e6e51 100644
--- a/configure.ac
+++ b/configure.ac
@@ -844,12 +844,61 @@ AC_MSG_CHECKING([for old iconv()])
 AC_COMPILE_IFELSE([OLDICONVTEST_SRC],
 	[AC_MSG_RESULT([no])],
 	[AC_MSG_RESULT([yes])
+	AC_DEFINE(HAVE_OLD_ICONV, 1)
 	OLD_ICONV=UnfortunatelyYes])
 
 GIT_UNSTASH_FLAGS($ICONVDIR)
 
 GIT_CONF_SUBST([OLD_ICONV])
 
+#
+# Define ICONV_OMITS_BOM if you are on a system which
+# iconv omits bom for utf-{16,32}
+if test -z "$NO_ICONV"; then
+AC_CACHE_CHECK([whether iconv omits bom for utf-16 and utf-32],
+ [ac_cv_iconv_omits_bom],
+[
+old_LIBS="$LIBS"
+if test -n "$NEEDS_LIBICONV"; then
+	LIBS="$LIBS -liconv"
+fi
+
+AC_RUN_IFELSE(
+	[AC_LANG_PROGRAM([AC_INCLUDES_DEFAULT
+	#include <iconv.h>
+	#ifdef HAVE_OLD_ICONV
+	typedef const char *iconv_ibp;
+	#else
+	typedef char *iconv_ibp;
+	#endif
+	],
+	[[
+	int v;
+	iconv_t conv;
+	char in[] = "a"; iconv_ibp pin = in;
+	char out[20] = ""; char *pout = out;
+	size_t isz = sizeof in;
+	size_t osz = sizeof out;
+
+	conv = iconv_open("UTF-16", "UTF-8");
+	iconv(conv, &pin, &isz, &pout, &osz);
+	iconv_close(conv);
+	v = (unsigned char)(out[0]) + (unsigned char)(out[1]);
+	return v != 0xfe + 0xff;
+	]])],
+	[ac_cv_iconv_omits_bom=no],
+	[ac_cv_iconv_omits_bom=yes])
+
+LIBS="$old_LIBS"
+])
+if test "x$ac_cv_iconv_omits_bom" = xyes; then
+	ICONV_OMITS_BOM=Yes
+else
+	ICONV_OMITS_BOM=
+fi
+GIT_CONF_SUBST([ICONV_OMITS_BOM])
+fi
+
 ## Checks for typedefs, structures, and compiler characteristics.
 AC_MSG_NOTICE([CHECKS for typedefs, structures, and compiler characteristics])
 #
-- 
2.24.0.rc2.296.geaf699fcc3


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v2 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-11-01  8:25 ` [PATCH v2 0/3] Linux with musl libc improvement Doan Tran Cong Danh
  2019-11-01  8:25   ` [PATCH v2 1/3] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
  2019-11-01  8:25   ` [PATCH v2 2/3] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
@ 2019-11-01  8:25   ` Doan Tran Cong Danh
  2019-11-01 16:59     ` Jeff King
  2 siblings, 1 reply; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-01  8:25 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

On musl libc, ISO-2022-JP encoder is too eager to switch back to
1 byte encoding, musl's iconv always switch back after every combining
character. Comparing glibc and musl's output for this command
$ sed q t/t3900/ISO-2022-JP.txt| iconv -f ISO-2022-JP -t utf-8 |
	iconv -f utf-8 -t ISO-2022-JP | xxd

glibc:
00000000: 1b24 4224 4f24 6c24 5224 5b24 551b 2842  .$B$O$l$R$[$U.(B
00000010: 0a                                       .

musl:
00000000: 1b24 4224 4f1b 2842 1b24 4224 6c1b 2842  .$B$O.(B.$B$l.(B
00000010: 1b24 4224 521b 2842 1b24 4224 5b1b 2842  .$B$R.(B.$B$[.(B
00000020: 1b24 4224 551b 2842 0a                   .$B$U.(B.

Although musl iconv's output isn't optimal, it's still correct.

From commit 7d509878b8, ("pretty.c: format string with truncate respects
logOutputEncoding", 2014-05-21), we're encoding the message to utf-8
first, then format it and convert the message to the actual output
encoding on git commit --squash.

Thus, t3900 is failing on Linux with musl libc.

This problem wasn't specific to musl libc. On Linux with glibc, this
problem can be observed by:

for encoding in utf-8 iso-8859-1; do
	# commit using the encoding
	echo $encoding >file && git add file
	echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
	  git -c i18n.commitEncoding=$encoding commit -F -
	# and then fixup without it
	echo "$encoding fixed" >file && git add file
	git commit --fixup HEAD
done
git rebase -i --autosquash --root

Reencode to utf-8 before arranging rebase's todo list.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
The code that demonstrate the problem on Linux with glibc is written by Jeff.
But I don't know how to attribute him properly.

 sequencer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sequencer.c b/sequencer.c
index 9d5964fd81..69430fe23f 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -5169,7 +5169,7 @@ int todo_list_rearrange_squash(struct todo_list *todo_list)
 		*commit_todo_item_at(&commit_todo, item->commit) = item;
 
 		parse_commit(item->commit);
-		commit_buffer = get_commit_buffer(item->commit, NULL);
+		commit_buffer = logmsg_reencode(item->commit, NULL, "UTF-8");
 		find_commit_subject(commit_buffer, &subject);
 		format_subject(&buf, subject, " ");
 		subject = subjects[i] = strbuf_detach(&buf, &subject_len);
-- 
2.24.0.rc2.296.geaf699fcc3


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v2 1/3] t0028: eliminate non-standard usage of printf
  2019-11-01  8:25   ` [PATCH v2 1/3] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
@ 2019-11-01 16:54     ` Jeff King
  0 siblings, 0 replies; 89+ messages in thread
From: Jeff King @ 2019-11-01 16:54 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git

On Fri, Nov 01, 2019 at 03:25:09PM +0700, Doan Tran Cong Danh wrote:

> man 1p printf:
>    In addition to the escape sequences shown in the Base Definitions
>    volume of POSIX.1‐2008, Chapter 5, File Format Notation ('\\',
>    '\a', '\b', '\f', '\n', '\r', '\t', '\v'), "\ddd", where ddd is a
>    one, two, or three-digit octal number, shall be written as a byte
>    with the numeric value specified by the octal number.
> 
> printf '\xfe\xff' in an extension of some shell.

Typo: s/in/is/

Other than that, this looks great to me. Thanks.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v2 2/3] configure.ac: define ICONV_OMITS_BOM if necessary
  2019-11-01  8:25   ` [PATCH v2 2/3] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
@ 2019-11-01 16:56     ` Jeff King
  2019-11-02  0:43       ` Danh Doan
  0 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2019-11-01 16:56 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git

On Fri, Nov 01, 2019 at 03:25:10PM +0700, Doan Tran Cong Danh wrote:

> From commit 79444c9294, ("utf8: handle systems that don't write BOM for
> UTF-16", 2019-02-12), we're supporting those systems with iconv that
> omits BOM with:
> 
>     make ICONV_OMITS_BOM=Yes
> 
> However, typing the flag all the time is cumbersome and error-prone.
> 
> Add a checking into configure script to detect this flag automatically.

s/checking/check/

The patch looks good to me, though my knowledge of both iconv and
autoconf is relatively lacking. I'll assume it works on your system,
though. :)

One thing I noticed:

> +	v = (unsigned char)(out[0]) + (unsigned char)(out[1]);
> +	return v != 0xfe + 0xff;

I wondered at first why not just:

  return v[0] != 0xfe || v[1] != 0xff;

but the answer is that you are catching BOMs of either endianness. We
might at some point need to distinguish the two, but I think for the
purposes of OMITS_BOM, we don't have to care. So we can worry about that
later (if ever).

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v2 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-11-01  8:25   ` [PATCH v2 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
@ 2019-11-01 16:59     ` Jeff King
  2019-11-02  1:02       ` Danh Doan
  0 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2019-11-01 16:59 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git

On Fri, Nov 01, 2019 at 03:25:11PM +0700, Doan Tran Cong Danh wrote:

> This problem wasn't specific to musl libc. On Linux with glibc, this
> problem can be observed by:
> 
> for encoding in utf-8 iso-8859-1; do
> 	# commit using the encoding
> 	echo $encoding >file && git add file
> 	echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
> 	  git -c i18n.commitEncoding=$encoding commit -F -
> 	# and then fixup without it
> 	echo "$encoding fixed" >file && git add file
> 	git commit --fixup HEAD
> done
> git rebase -i --autosquash --root

Is it worth adding this as a test in t3900?

> ---
> The code that demonstrate the problem on Linux with glibc is written by Jeff.
> But I don't know how to attribute him properly.

I'm OK with or without attribution, but people sometimes add a
"Helped-by" trailer if they want to acknowledge someone.

> diff --git a/sequencer.c b/sequencer.c
> index 9d5964fd81..69430fe23f 100644
> --- a/sequencer.c
> +++ b/sequencer.c
> @@ -5169,7 +5169,7 @@ int todo_list_rearrange_squash(struct todo_list *todo_list)
>  		*commit_todo_item_at(&commit_todo, item->commit) = item;
>  
>  		parse_commit(item->commit);
> -		commit_buffer = get_commit_buffer(item->commit, NULL);
> +		commit_buffer = logmsg_reencode(item->commit, NULL, "UTF-8");

I think there are several other spots in this file that could use the
same treatment. But I can live with it if you want to just fix the one
that's bugging you and move on. It's still a strict improvement.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v2 2/3] configure.ac: define ICONV_OMITS_BOM if necessary
  2019-11-01 16:56     ` Jeff King
@ 2019-11-02  0:43       ` Danh Doan
  0 siblings, 0 replies; 89+ messages in thread
From: Danh Doan @ 2019-11-02  0:43 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On 2019-11-01 12:56:57 -0400, Jeff King wrote:
> On Fri, Nov 01, 2019 at 03:25:10PM +0700, Doan Tran Cong Danh wrote:
> > +	v = (unsigned char)(out[0]) + (unsigned char)(out[1]);
> > +	return v != 0xfe + 0xff;
> 
> I wondered at first why not just:
> 
>   return v[0] != 0xfe || v[1] != 0xff;
> 
> but the answer is that you are catching BOMs of either endianness. We
> might at some point need to distinguish the two,

Yes, it's about catching BOM of either endianess. To distinguish
between LE and BE, it'll be better to use AC_C_BIGENDIAN instead (it's
friendlier for cross-compiling).

Anyway, we couldn't compare out[0] and out[1] with 0xf{e,f} directly
because char could be either signed (-2 and -1) or unsigned (0xfe, 0xff).
We must cast it to unsigned char (required to be 8 bits by POSIX),
then let the integer promotion promote it to int (at least 16 bits).

-- 
Danh

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v2 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-11-01 16:59     ` Jeff King
@ 2019-11-02  1:02       ` Danh Doan
  2019-11-02 12:20         ` Danh Doan
  2019-11-05  8:00         ` Jeff King
  0 siblings, 2 replies; 89+ messages in thread
From: Danh Doan @ 2019-11-02  1:02 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On 2019-11-01 12:59:21 -0400, Jeff King wrote:
> On Fri, Nov 01, 2019 at 03:25:11PM +0700, Doan Tran Cong Danh wrote:
> 
> > for encoding in utf-8 iso-8859-1; do
> > 	# commit using the encoding
> > 	echo $encoding >file && git add file
> > 	echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
> > 	  git -c i18n.commitEncoding=$encoding commit -F -
> > 	# and then fixup without it
> > 	echo "$encoding fixed" >file && git add file
> > 	git commit --fixup HEAD
> > done
> > git rebase -i --autosquash --root
> 
> Is it worth adding this as a test in t3900?

I think yes, but with a little more work.
I'll make it as a separated patch in a re-roll.

> >  		parse_commit(item->commit);
> > -		commit_buffer = get_commit_buffer(item->commit, NULL);
> > +		commit_buffer = logmsg_reencode(item->commit, NULL, "UTF-8");
> 
> I think there are several other spots in this file that could use the
> same treatment. But I can live with it if you want to just fix the one
> that's bugging you and move on. It's still a strict improvement.

There're 6 more occurence of get_commit_buffer in sequencer.c, and 13
occurences in other C source files. I'll try to figure out if it's
safe to change.

Anyway, if we're going to working with a single encoding internally,
can we take other extreme approach: reencode the commit message to
utf-8 before writing the commit object? (Is there any codepoint in
other encoding that can't be reencoded to utf-8?)
Since git-log and friends are doing 2 steps conversion for commit
message for now (reencode to utf-8 first, then reencode again to
get_log_output_encoding()). With this new approach, first step is
likely a noop (but must be kept for backward compatible).

-- 
Danh

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v2 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-11-02  1:02       ` Danh Doan
@ 2019-11-02 12:20         ` Danh Doan
  2019-11-05  8:00         ` Jeff King
  1 sibling, 0 replies; 89+ messages in thread
From: Danh Doan @ 2019-11-02 12:20 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On 2019-11-02 08:02:15 +0700, Danh Doan wrote:
> Anyway, if we're going to working with a single encoding internally,
> can we take other extreme approach: reencode the commit message to
> utf-8 before writing the commit object? (Is there any codepoint in
> other encoding that can't be reencoded to utf-8?)

With this test (added into t3900):

    git checkout -b fixup-ISO-2022-JP-ISO-8859-1 C0 &&
    git config i18n.commitencoding ISO-2022-JP &&
    echo ISO-2022-JP >>F &&
    git commit -a -F "$TEST_DIRECTORY"/t3900/ISO-2022-JP.txt &&
    test_tick &&
    echo intermediate stuff >>G &&
    git add G &&
    git commit -a -m "intermediate commit" &&
    test_tick &&
    git config i18n.commitencoding ISO-8859-1 &&
    echo ISO-8859-1-fixup >>F &&
    git commit -a --fixup HEAD^ &&
    git config --unset-all i18n.commitencoding &&
    git rebase --autosquash -i HEAD^^^ &&
    git rev-list HEAD >actual &&
    test_line_count = 3 actual

reencode the commit message to utf-8 before writing the commit object
is (likely) the most simple option to fix it.

At the very least, the commit message for fixup/squash-ing commit must
be encoded in either utf-8 or the target-commit's encode.

-- 
Danh

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v2 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-11-02  1:02       ` Danh Doan
  2019-11-02 12:20         ` Danh Doan
@ 2019-11-05  8:00         ` Jeff King
  2019-11-06  1:30           ` Junio C Hamano
  1 sibling, 1 reply; 89+ messages in thread
From: Jeff King @ 2019-11-05  8:00 UTC (permalink / raw)
  To: Danh Doan; +Cc: git

On Sat, Nov 02, 2019 at 08:02:15AM +0700, Danh Doan wrote:

> Anyway, if we're going to working with a single encoding internally,
> can we take other extreme approach: reencode the commit message to
> utf-8 before writing the commit object? (Is there any codepoint in
> other encoding that can't be reencoded to utf-8?)

That's normally what we do. The only cases we're covering here are when
somebody has explicitly asked that the commit object be stored in
another encoding. Presumably they'd also be using a matching
i18n.logOutputEncoding in that case, in which case logmsg_reencode()
would be a noop. I think the only reasons to do that are:

  1. You're stuck on some legacy encoding for your terminal. But in that
     case, I think you'd still be better off storing utf-8 and
     translating on the fly, since whatever encoding you do store is
     baked into your objects for all time (so accept some slowness now,
     but eventually move to utf-8).

  2. Your preferred language is bigger in utf-8 than in some specific
     encoding, and you'd rather save some bytes. I'm not sure how big a
     deal this is, given that commit messages don't tend to be that big
     in the first place (compared to trees and blobs). And the zlib
     deflation on the result might help remove some of the redundancy,
     too.

So I'd actually expect very few people to be using this feature at all
these days (which is part of why I would not be all that broken up if we
just fix the test and move on, if nobody is reporting real-world
problems).

> Since git-log and friends are doing 2 steps conversion for commit
> message for now (reencode to utf-8 first, then reencode again to
> get_log_output_encoding()). With this new approach, first step is
> likely a noop (but must be kept for backward compatible).

Interesting. Traditionally we did a single step conversion to the output
format, and it looks like most output formats still do that (i.e.,
everything in pretty_print_commit() except FMT_USERFORMAT, which is what
powers "--pretty=format:%s", etc).

The two-part user-format thing goes back to 7e77df39bf (pretty: two
phase conversion for non utf-8 commits, 2013-04-19). It does seem like
it would be cheaper to convert the format string into the output
encoding (it would need to be an ascii superset, but that's already the
case, since we expect to parse "author", etc out of the re-encoded
commit object). But again, I have trouble caring too much about the
performance of this case, as I consider it to be mostly legacy at this
point. But I also don't write in (say) Japanese, so maybe I'm being too
narrow-minded about whether people really want to avoid utf-8.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v2 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-11-05  8:00         ` Jeff King
@ 2019-11-06  1:30           ` Junio C Hamano
  2019-11-06  4:03             ` Jeff King
  0 siblings, 1 reply; 89+ messages in thread
From: Junio C Hamano @ 2019-11-06  1:30 UTC (permalink / raw)
  To: Jeff King; +Cc: Danh Doan, git

Jeff King <peff@peff.net> writes:

> That's normally what we do. The only cases we're covering here are when
> somebody has explicitly asked that the commit object be stored in
> another encoding. Presumably they'd also be using a matching
> i18n.logOutputEncoding in that case, in which case logmsg_reencode()
> would be a noop. I think the only reasons to do that are:
>
>   1. You're stuck on some legacy encoding for your terminal. But in that
>      case, I think you'd still be better off storing utf-8 and
>      translating on the fly, since whatever encoding you do store is
>      baked into your objects for all time (so accept some slowness now,
>      but eventually move to utf-8).
>
>   2. Your preferred language is bigger in utf-8 than in some specific
>      encoding, and you'd rather save some bytes. I'm not sure how big a
>      deal this is, given that commit messages don't tend to be that big
>      in the first place (compared to trees and blobs). And the zlib
>      deflation on the result might help remove some of the redundancy,
>      too.

Perhaps add

    3. You are dealing with a project originated on and migrated
       from a foreign SCM, and older parts of the history is stored
       in a non-utf-8, even though recent history is in utf-8

to the mix?

> The two-part user-format thing goes back to 7e77df39bf (pretty: two
> phase conversion for non utf-8 commits, 2013-04-19). It does seem like
> it would be cheaper to convert the format string into the output
> encoding (it would need to be an ascii superset, but that's already the
> case, since we expect to parse "author", etc out of the re-encoded
> commit object). But again, I have trouble caring too much about the
> performance of this case, as I consider it to be mostly legacy at this
> point. But I also don't write in (say) Japanese, so maybe I'm being too
> narrow-minded about whether people really want to avoid utf-8.

I suspect even the heavy Windows/Mac users in Japan have migrated
out of legacy (the suspicion comes from an anecdote that is offtopic
here).

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v2 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-11-06  1:30           ` Junio C Hamano
@ 2019-11-06  4:03             ` Jeff King
  2019-11-06 10:03               ` Danh Doan
  0 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2019-11-06  4:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Danh Doan, git

On Wed, Nov 06, 2019 at 10:30:34AM +0900, Junio C Hamano wrote:

> > That's normally what we do. The only cases we're covering here are when
> > somebody has explicitly asked that the commit object be stored in
> > another encoding. Presumably they'd also be using a matching
> > i18n.logOutputEncoding in that case, in which case logmsg_reencode()
> > would be a noop. I think the only reasons to do that are:
> >
> >   1. You're stuck on some legacy encoding for your terminal. But in that
> >      case, I think you'd still be better off storing utf-8 and
> >      translating on the fly, since whatever encoding you do store is
> >      baked into your objects for all time (so accept some slowness now,
> >      but eventually move to utf-8).
> >
> >   2. Your preferred language is bigger in utf-8 than in some specific
> >      encoding, and you'd rather save some bytes. I'm not sure how big a
> >      deal this is, given that commit messages don't tend to be that big
> >      in the first place (compared to trees and blobs). And the zlib
> >      deflation on the result might help remove some of the redundancy,
> >      too.
> 
> Perhaps add
> 
>     3. You are dealing with a project originated on and migrated
>        from a foreign SCM, and older parts of the history is stored
>        in a non-utf-8, even though recent history is in utf-8
> 
> to the mix?

I would think you'd want to convert to utf-8 as you do the migration in
that case, since you're writing new hashes anyway. But I think a similar
case would just be an old Git repository, where for some reason you
thought i18n.commitEncoding was a good idea back then (perhaps because
you were in situation (1) then, but now you aren't).

In either case, though, I don't think it's a compelling motivation for
optimization, if only because those old commits will be shown less and
less (and even without modern optimizations like commit-graph, we'd
generally avoid reencoding those old commits unless we're actually going
to _show_ them).

> I suspect even the heavy Windows/Mac users in Japan have migrated
> out of legacy (the suspicion comes from an anecdote that is offtopic
> here).

Thanks for the data point. All of this is very far from my personal
experience, so I mostly go on scraps of hearsay I pick up reading this
or that. :)

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v3 0/8] Correct internal working and output encoding
  2019-10-31  9:26 [PATCH 0/3] Linux with musl libc improvement Doan Tran Cong Danh
                   ` (3 preceding siblings ...)
  2019-11-01  8:25 ` [PATCH v2 0/3] Linux with musl libc improvement Doan Tran Cong Danh
@ 2019-11-06  9:19 ` Doan Tran Cong Danh
  2019-11-06  9:19   ` [PATCH v3 1/8] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
                     ` (7 more replies)
  2019-11-07  2:56 ` [PATCH v4 0/8] Correct internal working and output encoding Doan Tran Cong Danh
                   ` (2 subsequent siblings)
  7 siblings, 8 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-06  9:19 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

The series is shifting from fixing test that failed on musl based Linux to
correct the internal working encoding and output encoding of git-am
git-cherry-pick git-rebase and git-revert.

Doan Tran Cong Danh (8):
  t0028: eliminate non-standard usage of printf
  configure.ac: define ICONV_OMITS_BOM if necessary
  t3900: demonstrate git-rebase problem with multi encoding
  sequencer: reencode to utf-8 before arrange rebase's todo list
  sequencer: reencode revert/cherry-pick's todo list
  sequencer: reencode squashing commit's message
  sequencer: reencode old merge-commit message
  sequencer: reencode commit message for am/rebase --show-current-patch

 configure.ac                     | 49 ++++++++++++++++++++++++++++++++
 sequencer.c                      | 21 +++++++++-----
 t/t0028-working-tree-encoding.sh |  4 +--
 t/t3900-i18n-commit.sh           | 42 +++++++++++++++++++++++++++
 4 files changed, 107 insertions(+), 9 deletions(-)

Range-diff against v2:
1:  8b30028425 ! 1:  b3d6c4e720 t0028: eliminate non-standard usage of printf
    @@ Commit message
            one, two, or three-digit octal number, shall be written as a byte
            with the numeric value specified by the octal number.
     
    -    printf '\xfe\xff' in an extension of some shell.
    +    printf '\xfe\xff' is an extension of some shell.
         Dash, a popular yet simple shell, do not implement this extension.
     
         This wasn't caught by most people running the tests, even though
2:  7c2c6f0603 ! 2:  f07566c60c configure.ac: define ICONV_OMITS_BOM if necessary
    @@ Commit message
     
         However, typing the flag all the time is cumbersome and error-prone.
     
    -    Add a checking into configure script to detect this flag automatically.
    +    Add a check into configure script to detect this flag automatically.
     
         Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
     
-:  ---------- > 3:  662e5bd545 t3900: demonstrate git-rebase problem with multi encoding
3:  b7927b2723 ! 4:  6a51fdd29c sequencer: reencode to utf-8 before arrange rebase's todo list
    @@ Commit message
         first, then format it and convert the message to the actual output
         encoding on git commit --squash.
     
    -    Thus, t3900 is failing on Linux with musl libc.
    -
    -    This problem wasn't specific to musl libc. On Linux with glibc, this
    -    problem can be observed by:
    -
    -    for encoding in utf-8 iso-8859-1; do
    -            # commit using the encoding
    -            echo $encoding >file && git add file
    -            echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
    -              git -c i18n.commitEncoding=$encoding commit -F -
    -            # and then fixup without it
    -            echo "$encoding fixed" >file && git add file
    -            git commit --fixup HEAD
    -    done
    -    git rebase -i --autosquash --root
    +    Thus, t3900::test_commit_autosquash_flags is failing on musl libc.
     
         Reencode to utf-8 before arranging rebase's todo list.
    +    By doing this, we also remove a breakage introduced in the previous
    +    commit.
     
         Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
     
    @@ sequencer.c: int todo_list_rearrange_squash(struct todo_list *todo_list)
      		find_commit_subject(commit_buffer, &subject);
      		format_subject(&buf, subject, " ");
      		subject = subjects[i] = strbuf_detach(&buf, &subject_len);
    +
    + ## t/t3900-i18n-commit.sh ##
    +@@ t/t3900-i18n-commit.sh: test_commit_autosquash_multi_encoding () {
    + 	old=$2
    + 	new=$3
    + 	msg=$4
    +-	test_expect_failure "commit --$flag into $old from $new" '
    ++	test_expect_success "commit --$flag into $old from $new" '
    + 		git checkout -b '$flag-$old-$new' C0 &&
    + 		git config i18n.commitencoding '$old' &&
    + 		echo '$old' >>F &&
-:  ---------- > 5:  d382e35e4e sequencer: reencode revert/cherry-pick's todo list
-:  ---------- > 6:  340902eb67 sequencer: reencode squashing commit's message
-:  ---------- > 7:  7f0df0f685 sequencer: reencode old merge-commit message
-:  ---------- > 8:  69ec40bb1d sequencer: reencode commit message for am/rebase --show-current-patch
-- 
2.24.0.4.g6a51fdd29c


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v3 1/8] t0028: eliminate non-standard usage of printf
  2019-11-06  9:19 ` [PATCH v3 0/8] Correct internal working and output encoding Doan Tran Cong Danh
@ 2019-11-06  9:19   ` Doan Tran Cong Danh
  2019-11-06  9:20   ` [PATCH v3 2/8] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-06  9:19 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

man 1p printf:
   In addition to the escape sequences shown in the Base Definitions
   volume of POSIX.1‐2008, Chapter 5, File Format Notation ('\\',
   '\a', '\b', '\f', '\n', '\r', '\t', '\v'), "\ddd", where ddd is a
   one, two, or three-digit octal number, shall be written as a byte
   with the numeric value specified by the octal number.

printf '\xfe\xff' is an extension of some shell.
Dash, a popular yet simple shell, do not implement this extension.

This wasn't caught by most people running the tests, even though
common shells like dash don't handle hex escapes, because their
systems don't trigger the NO_UTF16_BOM prereq. But systems with musl
libc do; when combined with dash, the test fails.

Correct it.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 t/t0028-working-tree-encoding.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 7aa0945d8d..bfc4fb9af5 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -17,7 +17,7 @@ test_lazy_prereq NO_UTF32_BOM '
 write_utf16 () {
 	if test_have_prereq NO_UTF16_BOM
 	then
-		printf '\xfe\xff'
+		printf '\376\377'
 	fi &&
 	iconv -f UTF-8 -t UTF-16
 }
@@ -25,7 +25,7 @@ write_utf16 () {
 write_utf32 () {
 	if test_have_prereq NO_UTF32_BOM
 	then
-		printf '\x00\x00\xfe\xff'
+		printf '\0\0\376\377'
 	fi &&
 	iconv -f UTF-8 -t UTF-32
 }
-- 
2.24.0.4.g6a51fdd29c


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v3 2/8] configure.ac: define ICONV_OMITS_BOM if necessary
  2019-11-06  9:19 ` [PATCH v3 0/8] Correct internal working and output encoding Doan Tran Cong Danh
  2019-11-06  9:19   ` [PATCH v3 1/8] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
@ 2019-11-06  9:20   ` Doan Tran Cong Danh
  2019-11-06  9:20   ` [PATCH v3 3/8] t3900: demonstrate git-rebase problem with multi encoding Doan Tran Cong Danh
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-06  9:20 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

From commit 79444c9294, ("utf8: handle systems that don't write BOM for
UTF-16", 2019-02-12), we're supporting those systems with iconv that
omits BOM with:

    make ICONV_OMITS_BOM=Yes

However, typing the flag all the time is cumbersome and error-prone.

Add a check into configure script to detect this flag automatically.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 configure.ac | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/configure.ac b/configure.ac
index a43b476402..ecba7e6e51 100644
--- a/configure.ac
+++ b/configure.ac
@@ -844,12 +844,61 @@ AC_MSG_CHECKING([for old iconv()])
 AC_COMPILE_IFELSE([OLDICONVTEST_SRC],
 	[AC_MSG_RESULT([no])],
 	[AC_MSG_RESULT([yes])
+	AC_DEFINE(HAVE_OLD_ICONV, 1)
 	OLD_ICONV=UnfortunatelyYes])
 
 GIT_UNSTASH_FLAGS($ICONVDIR)
 
 GIT_CONF_SUBST([OLD_ICONV])
 
+#
+# Define ICONV_OMITS_BOM if you are on a system which
+# iconv omits bom for utf-{16,32}
+if test -z "$NO_ICONV"; then
+AC_CACHE_CHECK([whether iconv omits bom for utf-16 and utf-32],
+ [ac_cv_iconv_omits_bom],
+[
+old_LIBS="$LIBS"
+if test -n "$NEEDS_LIBICONV"; then
+	LIBS="$LIBS -liconv"
+fi
+
+AC_RUN_IFELSE(
+	[AC_LANG_PROGRAM([AC_INCLUDES_DEFAULT
+	#include <iconv.h>
+	#ifdef HAVE_OLD_ICONV
+	typedef const char *iconv_ibp;
+	#else
+	typedef char *iconv_ibp;
+	#endif
+	],
+	[[
+	int v;
+	iconv_t conv;
+	char in[] = "a"; iconv_ibp pin = in;
+	char out[20] = ""; char *pout = out;
+	size_t isz = sizeof in;
+	size_t osz = sizeof out;
+
+	conv = iconv_open("UTF-16", "UTF-8");
+	iconv(conv, &pin, &isz, &pout, &osz);
+	iconv_close(conv);
+	v = (unsigned char)(out[0]) + (unsigned char)(out[1]);
+	return v != 0xfe + 0xff;
+	]])],
+	[ac_cv_iconv_omits_bom=no],
+	[ac_cv_iconv_omits_bom=yes])
+
+LIBS="$old_LIBS"
+])
+if test "x$ac_cv_iconv_omits_bom" = xyes; then
+	ICONV_OMITS_BOM=Yes
+else
+	ICONV_OMITS_BOM=
+fi
+GIT_CONF_SUBST([ICONV_OMITS_BOM])
+fi
+
 ## Checks for typedefs, structures, and compiler characteristics.
 AC_MSG_NOTICE([CHECKS for typedefs, structures, and compiler characteristics])
 #
-- 
2.24.0.4.g6a51fdd29c


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v3 3/8] t3900: demonstrate git-rebase problem with multi encoding
  2019-11-06  9:19 ` [PATCH v3 0/8] Correct internal working and output encoding Doan Tran Cong Danh
  2019-11-06  9:19   ` [PATCH v3 1/8] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
  2019-11-06  9:20   ` [PATCH v3 2/8] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
@ 2019-11-06  9:20   ` Doan Tran Cong Danh
  2019-11-06  9:20   ` [PATCH v3 4/8] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-06  9:20 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh, Jeff King

We're using fixup!/squash! <subject> to mark if current commit will be
used to be fixed up or squashed to a previous commit.

However, if we're changing i18n.commitencoding after making the
original commit but before making the fixing up, we couldn't find the
original commit to do the fixup/squash.

Add a test to demonstrate that problem.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 t/t3900-i18n-commit.sh | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index b92ff95977..85cac6e2dc 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -204,4 +204,34 @@ test_commit_autosquash_flags eucJP fixup
 
 test_commit_autosquash_flags ISO-2022-JP squash
 
+test_commit_autosquash_multi_encoding () {
+	flag=$1
+	old=$2
+	new=$3
+	msg=$4
+	test_expect_failure "commit --$flag into $old from $new" '
+		git checkout -b '$flag-$old-$new' C0 &&
+		git config i18n.commitencoding '$old' &&
+		echo '$old' >>F &&
+		git commit -a -F "$TEST_DIRECTORY/t3900/'$msg'" &&
+		test_tick &&
+		echo intermediate stuff >>G &&
+		git add G &&
+		git commit -a -m "intermediate commit" &&
+		test_tick &&
+		git config i18n.commitencoding '$new' &&
+		echo '$new-$flag' >>F &&
+		git commit -a --'$flag' HEAD^ &&
+		git config --unset-all i18n.commitencoding &&
+		git rebase --autosquash -i HEAD^^^ &&
+		git rev-list HEAD >actual &&
+		test_line_count = 3 actual
+	'
+}
+
+test_commit_autosquash_multi_encoding fixup UTF-8 ISO-8859-1 1-UTF-8.txt
+test_commit_autosquash_multi_encoding squash ISO-8859-1 UTF-8 ISO8859-1.txt
+test_commit_autosquash_multi_encoding squash eucJP ISO-2022-JP eucJP.txt
+test_commit_autosquash_multi_encoding fixup ISO-2022-JP UTF-8 ISO-2022-JP.txt
+
 test_done
-- 
2.24.0.4.g6a51fdd29c


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v3 4/8] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-11-06  9:19 ` [PATCH v3 0/8] Correct internal working and output encoding Doan Tran Cong Danh
                     ` (2 preceding siblings ...)
  2019-11-06  9:20   ` [PATCH v3 3/8] t3900: demonstrate git-rebase problem with multi encoding Doan Tran Cong Danh
@ 2019-11-06  9:20   ` Doan Tran Cong Danh
  2019-11-06  9:20   ` [PATCH v3 5/8] sequencer: reencode revert/cherry-pick's " Doan Tran Cong Danh
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-06  9:20 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

On musl libc, ISO-2022-JP encoder is too eager to switch back to
1 byte encoding, musl's iconv always switch back after every combining
character. Comparing glibc and musl's output for this command
$ sed q t/t3900/ISO-2022-JP.txt| iconv -f ISO-2022-JP -t utf-8 |
	iconv -f utf-8 -t ISO-2022-JP | xxd

glibc:
00000000: 1b24 4224 4f24 6c24 5224 5b24 551b 2842  .$B$O$l$R$[$U.(B
00000010: 0a                                       .

musl:
00000000: 1b24 4224 4f1b 2842 1b24 4224 6c1b 2842  .$B$O.(B.$B$l.(B
00000010: 1b24 4224 521b 2842 1b24 4224 5b1b 2842  .$B$R.(B.$B$[.(B
00000020: 1b24 4224 551b 2842 0a                   .$B$U.(B.

Although musl iconv's output isn't optimal, it's still correct.

From commit 7d509878b8, ("pretty.c: format string with truncate respects
logOutputEncoding", 2014-05-21), we're encoding the message to utf-8
first, then format it and convert the message to the actual output
encoding on git commit --squash.

Thus, t3900::test_commit_autosquash_flags is failing on musl libc.

Reencode to utf-8 before arranging rebase's todo list.
By doing this, we also remove a breakage introduced in the previous
commit.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c            | 2 +-
 t/t3900-i18n-commit.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/sequencer.c b/sequencer.c
index 9d5964fd81..69430fe23f 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -5169,7 +5169,7 @@ int todo_list_rearrange_squash(struct todo_list *todo_list)
 		*commit_todo_item_at(&commit_todo, item->commit) = item;
 
 		parse_commit(item->commit);
-		commit_buffer = get_commit_buffer(item->commit, NULL);
+		commit_buffer = logmsg_reencode(item->commit, NULL, "UTF-8");
 		find_commit_subject(commit_buffer, &subject);
 		format_subject(&buf, subject, " ");
 		subject = subjects[i] = strbuf_detach(&buf, &subject_len);
diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index 85cac6e2dc..7501555c52 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -209,7 +209,7 @@ test_commit_autosquash_multi_encoding () {
 	old=$2
 	new=$3
 	msg=$4
-	test_expect_failure "commit --$flag into $old from $new" '
+	test_expect_success "commit --$flag into $old from $new" '
 		git checkout -b '$flag-$old-$new' C0 &&
 		git config i18n.commitencoding '$old' &&
 		echo '$old' >>F &&
-- 
2.24.0.4.g6a51fdd29c


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v3 5/8] sequencer: reencode revert/cherry-pick's todo list
  2019-11-06  9:19 ` [PATCH v3 0/8] Correct internal working and output encoding Doan Tran Cong Danh
                     ` (3 preceding siblings ...)
  2019-11-06  9:20   ` [PATCH v3 4/8] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
@ 2019-11-06  9:20   ` Doan Tran Cong Danh
  2019-11-06  9:20   ` [PATCH v3 6/8] sequencer: reencode squashing commit's message Doan Tran Cong Danh
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-06  9:20 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

Keep revert/cherry-pick's todo list in line with rebase todo list.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/sequencer.c b/sequencer.c
index 69430fe23f..a19954f2bf 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -2564,14 +2564,17 @@ static int walk_revs_populate_todo(struct todo_list *todo_list,
 	enum todo_command command = opts->action == REPLAY_PICK ?
 		TODO_PICK : TODO_REVERT;
 	const char *command_string = todo_command_info[command].str;
+	const char *encoding;
 	struct commit *commit;
 
 	if (prepare_revs(opts))
 		return -1;
 
+	encoding = get_log_output_encoding();
+
 	while ((commit = get_revision(opts->revs))) {
 		struct todo_item *item = append_new_todo(todo_list);
-		const char *commit_buffer = get_commit_buffer(commit, NULL);
+		const char *commit_buffer = logmsg_reencode(commit, NULL, encoding);
 		const char *subject;
 		int subject_len;
 
-- 
2.24.0.4.g6a51fdd29c


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v3 6/8] sequencer: reencode squashing commit's message
  2019-11-06  9:19 ` [PATCH v3 0/8] Correct internal working and output encoding Doan Tran Cong Danh
                     ` (4 preceding siblings ...)
  2019-11-06  9:20   ` [PATCH v3 5/8] sequencer: reencode revert/cherry-pick's " Doan Tran Cong Danh
@ 2019-11-06  9:20   ` Doan Tran Cong Danh
  2019-11-06  9:20   ` [PATCH v3 7/8] sequencer: reencode old merge-commit message Doan Tran Cong Danh
  2019-11-06  9:20   ` [PATCH v3 8/8] sequencer: reencode commit message for am/rebase --show-current-patch Doan Tran Cong Danh
  7 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-06  9:20 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

On fixup/squash-ing rebase, git will create new commit in
i18n.commitencoding, reencode the commit message to that said encode.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c            |  8 +++++---
 t/t3900-i18n-commit.sh | 14 +++++++++++++-
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/sequencer.c b/sequencer.c
index a19954f2bf..833a928929 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -1576,6 +1576,7 @@ static int update_squash_messages(struct repository *r,
 	struct strbuf buf = STRBUF_INIT;
 	int res;
 	const char *message, *body;
+	const char *encoding = get_commit_output_encoding();
 
 	if (opts->current_fixup_count > 0) {
 		struct strbuf header = STRBUF_INIT;
@@ -1602,7 +1603,7 @@ static int update_squash_messages(struct repository *r,
 			return error(_("need a HEAD to fixup"));
 		if (!(head_commit = lookup_commit_reference(r, &head)))
 			return error(_("could not read HEAD"));
-		if (!(head_message = get_commit_buffer(head_commit, NULL)))
+		if (!(head_message = logmsg_reencode(head_commit, NULL, encoding)))
 			return error(_("could not read HEAD's commit message"));
 
 		find_commit_subject(head_message, &body);
@@ -1623,7 +1624,7 @@ static int update_squash_messages(struct repository *r,
 		unuse_commit_buffer(head_commit, head_message);
 	}
 
-	if (!(message = get_commit_buffer(commit, NULL)))
+	if (!(message = logmsg_reencode(commit, NULL, encoding)))
 		return error(_("could not read commit message of %s"),
 			     oid_to_hex(&commit->object.oid));
 	find_commit_subject(message, &body);
@@ -4154,9 +4155,10 @@ static int commit_staged_changes(struct repository *r,
 				 */
 				struct commit *commit;
 				const char *path = rebase_path_squash_msg();
+				const char *encoding = get_commit_output_encoding();
 
 				if (parse_head(r, &commit) ||
-				    !(p = get_commit_buffer(commit, NULL)) ||
+				    !(p = logmsg_reencode(commit, NULL, encoding)) ||
 				    write_message(p, strlen(p), path, 0)) {
 					unuse_commit_buffer(commit, p);
 					return error(_("could not write file: "
diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index 7501555c52..241bda1099 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -209,6 +209,13 @@ test_commit_autosquash_multi_encoding () {
 	old=$2
 	new=$3
 	msg=$4
+	squash_msg=
+	if test $flag = squash; then
+		squash_msg='
+		subject="squash! $(head -1 expect)" &&
+		printf "\n%s\n" "$subject" >> expect &&
+		'
+	fi
 	test_expect_success "commit --$flag into $old from $new" '
 		git checkout -b '$flag-$old-$new' C0 &&
 		git config i18n.commitencoding '$old' &&
@@ -225,7 +232,12 @@ test_commit_autosquash_multi_encoding () {
 		git config --unset-all i18n.commitencoding &&
 		git rebase --autosquash -i HEAD^^^ &&
 		git rev-list HEAD >actual &&
-		test_line_count = 3 actual
+		test_line_count = 3 actual &&
+		iconv -f '$old' -t utf-8 "$TEST_DIRECTORY/t3900/'$msg'" >expect &&
+		'"$squash_msg"'
+		git cat-file commit HEAD^ >raw &&
+		(sed "1,/^$/d" raw | iconv -f '$new' -t utf-8) >actual &&
+		test_cmp expect actual
 	'
 }
 
-- 
2.24.0.4.g6a51fdd29c


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v3 7/8] sequencer: reencode old merge-commit message
  2019-11-06  9:19 ` [PATCH v3 0/8] Correct internal working and output encoding Doan Tran Cong Danh
                     ` (5 preceding siblings ...)
  2019-11-06  9:20   ` [PATCH v3 6/8] sequencer: reencode squashing commit's message Doan Tran Cong Danh
@ 2019-11-06  9:20   ` Doan Tran Cong Danh
  2019-11-06 15:39     ` Eric Sunshine
  2019-11-06  9:20   ` [PATCH v3 8/8] sequencer: reencode commit message for am/rebase --show-current-patch Doan Tran Cong Danh
  7 siblings, 1 reply; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-06  9:20 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

During rebasing, old merge's message (encoded in old encoding)
will be used as message for new merge commit (created by rebase).

In case of the value of i18n.commitencoding has been changed after the
old merge time. We will receive an usable message for this new merge.

Correct it.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sequencer.c b/sequencer.c
index 833a928929..d735d09f98 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -3374,7 +3374,8 @@ static int do_merge(struct repository *r,
 	}
 
 	if (commit) {
-		const char *message = get_commit_buffer(commit, NULL);
+		const char *encoding = get_commit_output_encoding();
+		const char *message = logmsg_reencode(commit, NULL, encoding);
 		const char *body;
 		int len;
 
-- 
2.24.0.4.g6a51fdd29c


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v3 8/8] sequencer: reencode commit message for am/rebase --show-current-patch
  2019-11-06  9:19 ` [PATCH v3 0/8] Correct internal working and output encoding Doan Tran Cong Danh
                     ` (6 preceding siblings ...)
  2019-11-06  9:20   ` [PATCH v3 7/8] sequencer: reencode old merge-commit message Doan Tran Cong Danh
@ 2019-11-06  9:20   ` Doan Tran Cong Danh
  7 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-06  9:20 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

The message file will be used as commit message for the
git-{am,rebase} --continue.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sequencer.c b/sequencer.c
index d735d09f98..4c1ffad0f1 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -2972,7 +2972,8 @@ static int make_patch(struct repository *r,
 
 	strbuf_addf(&buf, "%s/message", get_dir(opts));
 	if (!file_exists(buf.buf)) {
-		const char *commit_buffer = get_commit_buffer(commit, NULL);
+		const char *encoding = get_commit_output_encoding()
+		const char *commit_buffer = logmsg_reencode(commit, NULL, encoding);
 		find_commit_subject(commit_buffer, &subject);
 		res |= write_message(subject, strlen(subject), buf.buf, 1);
 		unuse_commit_buffer(commit, commit_buffer);
-- 
2.24.0.4.g6a51fdd29c


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v2 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-11-06  4:03             ` Jeff King
@ 2019-11-06 10:03               ` Danh Doan
  2019-11-07  5:56                 ` Jeff King
  0 siblings, 1 reply; 89+ messages in thread
From: Danh Doan @ 2019-11-06 10:03 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

On 2019-11-05 23:03:14 -0500, Jeff King wrote:
> >     3. You are dealing with a project originated on and migrated
> >        from a foreign SCM, and older parts of the history is stored
> >        in a non-utf-8, even though recent history is in utf-8
> > 
> > to the mix?
> 
> I would think you'd want to convert to utf-8 as you do the migration in
> that case, since you're writing new hashes anyway.

Sorry but I'm confused.
If we're migrating from foreign SCM, we could make our commit in
utf-8 (convert their commit message to utf8).
Even if we need to synchronise history between the foreign SCM in
question with git, we could use i18n.logoutputencoding for the output
comestic.

> But I think a similar
> case would just be an old Git repository, where for some reason you
> thought i18n.commitEncoding was a good idea back then (perhaps because
> you were in situation (1) then, but now you aren't).
> 
> In either case, though, I don't think it's a compelling motivation for
> optimization, if only because those old commits will be shown less and
> less (and even without modern optimizations like commit-graph, we'd
> generally avoid reencoding those old commits unless we're actually going
> to _show_ them).

I'm not sure if we're misunderstood each other.
I've only suggested to encode _new_ commit from now on in utf-8.
Reencoding old history in utf-8 is definitely not in that suggestion.

-- 
Danh

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3 7/8] sequencer: reencode old merge-commit message
  2019-11-06  9:20   ` [PATCH v3 7/8] sequencer: reencode old merge-commit message Doan Tran Cong Danh
@ 2019-11-06 15:39     ` Eric Sunshine
  0 siblings, 0 replies; 89+ messages in thread
From: Eric Sunshine @ 2019-11-06 15:39 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: Git List

On Wed, Nov 6, 2019 at 4:21 AM Doan Tran Cong Danh <congdanhqx@gmail.com> wrote:
> During rebasing, old merge's message (encoded in old encoding)
> will be used as message for new merge commit (created by rebase).
>
> In case of the value of i18n.commitencoding has been changed after the
> old merge time. We will receive an usable message for this new merge.

Did you mean s/usable/unusable/ ?

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v4 0/8] Correct internal working and output encoding
  2019-10-31  9:26 [PATCH 0/3] Linux with musl libc improvement Doan Tran Cong Danh
                   ` (4 preceding siblings ...)
  2019-11-06  9:19 ` [PATCH v3 0/8] Correct internal working and output encoding Doan Tran Cong Danh
@ 2019-11-07  2:56 ` Doan Tran Cong Danh
  2019-11-07  2:56   ` [PATCH v4 1/8] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
                     ` (7 more replies)
  2019-11-08  9:43 ` [PATCH v5 0/9] Improve odd encoding integration Doan Tran Cong Danh
  2019-11-11  6:03 ` [PATCH v6 0/9] sequencer: handle other encoding better Doan Tran Cong Danh
  7 siblings, 8 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-07  2:56 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh, Eric Sunshine

The series is shifting from fixing test that failed on musl based Linux to
correct the internal working encoding and output encoding of git-am
git-cherry-pick git-rebase and git-revert.

Change from v3:
- Fix grammar and spelling in commit message.
- Add mising semicolon (I ran the test with dirty workspace right before
  making last change). Everything is OK this time, though.

Cc: Eric Sunshine <sunshine@sunshineco.com>

Apparantly, I couldn't spell unusable.


Doan Tran Cong Danh (8):
  t0028: eliminate non-standard usage of printf
  configure.ac: define ICONV_OMITS_BOM if necessary
  t3900: demonstrate git-rebase problem with multi encoding
  sequencer: reencode to utf-8 before arrange rebase's todo list
  sequencer: reencode revert/cherry-pick's todo list
  sequencer: reencode squashing commit's message
  sequencer: reencode old merge-commit message
  sequencer: reencode commit message for am/rebase --show-current-patch

 configure.ac                     | 49 ++++++++++++++++++++++++++++++++
 sequencer.c                      | 21 +++++++++-----
 t/t0028-working-tree-encoding.sh |  4 +--
 t/t3900-i18n-commit.sh           | 41 ++++++++++++++++++++++++++
 4 files changed, 106 insertions(+), 9 deletions(-)

Range-diff against v3:
1:  b3d6c4e720 = 1:  b3d6c4e720 t0028: eliminate non-standard usage of printf
2:  f07566c60c = 2:  f07566c60c configure.ac: define ICONV_OMITS_BOM if necessary
3:  662e5bd545 ! 3:  ca869cef57 t3900: demonstrate git-rebase problem with multi encoding
    @@ t/t3900-i18n-commit.sh: test_commit_autosquash_flags eucJP fixup
     +		git config i18n.commitencoding '$new' &&
     +		echo '$new-$flag' >>F &&
     +		git commit -a --'$flag' HEAD^ &&
    -+		git config --unset-all i18n.commitencoding &&
     +		git rebase --autosquash -i HEAD^^^ &&
     +		git rev-list HEAD >actual &&
     +		test_line_count = 3 actual
4:  6a51fdd29c = 4:  15c33fc245 sequencer: reencode to utf-8 before arrange rebase's todo list
5:  d382e35e4e = 5:  304ac6c289 sequencer: reencode revert/cherry-pick's todo list
6:  340902eb67 ! 6:  97ab88e5d8 sequencer: reencode squashing commit's message
    @@ t/t3900-i18n-commit.sh: test_commit_autosquash_multi_encoding () {
      		git checkout -b '$flag-$old-$new' C0 &&
      		git config i18n.commitencoding '$old' &&
     @@ t/t3900-i18n-commit.sh: test_commit_autosquash_multi_encoding () {
    - 		git config --unset-all i18n.commitencoding &&
    + 		git commit -a --'$flag' HEAD^ &&
      		git rebase --autosquash -i HEAD^^^ &&
      		git rev-list HEAD >actual &&
     -		test_line_count = 3 actual
7:  7f0df0f685 ! 7:  f295d32d7b sequencer: reencode old merge-commit message
    @@ Commit message
         will be used as message for new merge commit (created by rebase).
     
         In case of the value of i18n.commitencoding has been changed after the
    -    old merge time. We will receive an usable message for this new merge.
    +    old merge time. We will receive an unusable message for this new merge.
     
         Correct it.
     
8:  69ec40bb1d ! 8:  36796e2b67 sequencer: reencode commit message for am/rebase --show-current-patch
    @@ sequencer.c: static int make_patch(struct repository *r,
      	strbuf_addf(&buf, "%s/message", get_dir(opts));
      	if (!file_exists(buf.buf)) {
     -		const char *commit_buffer = get_commit_buffer(commit, NULL);
    -+		const char *encoding = get_commit_output_encoding()
    ++		const char *encoding = get_commit_output_encoding();
     +		const char *commit_buffer = logmsg_reencode(commit, NULL, encoding);
      		find_commit_subject(commit_buffer, &subject);
      		res |= write_message(subject, strlen(subject), buf.buf, 1);
-- 
2.24.0.8.g36796e2b67


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v4 1/8] t0028: eliminate non-standard usage of printf
  2019-11-07  2:56 ` [PATCH v4 0/8] Correct internal working and output encoding Doan Tran Cong Danh
@ 2019-11-07  2:56   ` Doan Tran Cong Danh
  2019-11-07  2:56   ` [PATCH v4 2/8] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-07  2:56 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

man 1p printf:
   In addition to the escape sequences shown in the Base Definitions
   volume of POSIX.1‐2008, Chapter 5, File Format Notation ('\\',
   '\a', '\b', '\f', '\n', '\r', '\t', '\v'), "\ddd", where ddd is a
   one, two, or three-digit octal number, shall be written as a byte
   with the numeric value specified by the octal number.

printf '\xfe\xff' is an extension of some shell.
Dash, a popular yet simple shell, do not implement this extension.

This wasn't caught by most people running the tests, even though
common shells like dash don't handle hex escapes, because their
systems don't trigger the NO_UTF16_BOM prereq. But systems with musl
libc do; when combined with dash, the test fails.

Correct it.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 t/t0028-working-tree-encoding.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 7aa0945d8d..bfc4fb9af5 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -17,7 +17,7 @@ test_lazy_prereq NO_UTF32_BOM '
 write_utf16 () {
 	if test_have_prereq NO_UTF16_BOM
 	then
-		printf '\xfe\xff'
+		printf '\376\377'
 	fi &&
 	iconv -f UTF-8 -t UTF-16
 }
@@ -25,7 +25,7 @@ write_utf16 () {
 write_utf32 () {
 	if test_have_prereq NO_UTF32_BOM
 	then
-		printf '\x00\x00\xfe\xff'
+		printf '\0\0\376\377'
 	fi &&
 	iconv -f UTF-8 -t UTF-32
 }
-- 
2.24.0.8.g36796e2b67


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v4 2/8] configure.ac: define ICONV_OMITS_BOM if necessary
  2019-11-07  2:56 ` [PATCH v4 0/8] Correct internal working and output encoding Doan Tran Cong Danh
  2019-11-07  2:56   ` [PATCH v4 1/8] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
@ 2019-11-07  2:56   ` Doan Tran Cong Danh
  2019-11-07  6:18     ` Junio C Hamano
  2019-11-07  2:56   ` [PATCH v4 3/8] t3900: demonstrate git-rebase problem with multi encoding Doan Tran Cong Danh
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-07  2:56 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

From commit 79444c9294, ("utf8: handle systems that don't write BOM for
UTF-16", 2019-02-12), we're supporting those systems with iconv that
omits BOM with:

    make ICONV_OMITS_BOM=Yes

However, typing the flag all the time is cumbersome and error-prone.

Add a check into configure script to detect this flag automatically.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 configure.ac | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/configure.ac b/configure.ac
index a43b476402..ecba7e6e51 100644
--- a/configure.ac
+++ b/configure.ac
@@ -844,12 +844,61 @@ AC_MSG_CHECKING([for old iconv()])
 AC_COMPILE_IFELSE([OLDICONVTEST_SRC],
 	[AC_MSG_RESULT([no])],
 	[AC_MSG_RESULT([yes])
+	AC_DEFINE(HAVE_OLD_ICONV, 1)
 	OLD_ICONV=UnfortunatelyYes])
 
 GIT_UNSTASH_FLAGS($ICONVDIR)
 
 GIT_CONF_SUBST([OLD_ICONV])
 
+#
+# Define ICONV_OMITS_BOM if you are on a system which
+# iconv omits bom for utf-{16,32}
+if test -z "$NO_ICONV"; then
+AC_CACHE_CHECK([whether iconv omits bom for utf-16 and utf-32],
+ [ac_cv_iconv_omits_bom],
+[
+old_LIBS="$LIBS"
+if test -n "$NEEDS_LIBICONV"; then
+	LIBS="$LIBS -liconv"
+fi
+
+AC_RUN_IFELSE(
+	[AC_LANG_PROGRAM([AC_INCLUDES_DEFAULT
+	#include <iconv.h>
+	#ifdef HAVE_OLD_ICONV
+	typedef const char *iconv_ibp;
+	#else
+	typedef char *iconv_ibp;
+	#endif
+	],
+	[[
+	int v;
+	iconv_t conv;
+	char in[] = "a"; iconv_ibp pin = in;
+	char out[20] = ""; char *pout = out;
+	size_t isz = sizeof in;
+	size_t osz = sizeof out;
+
+	conv = iconv_open("UTF-16", "UTF-8");
+	iconv(conv, &pin, &isz, &pout, &osz);
+	iconv_close(conv);
+	v = (unsigned char)(out[0]) + (unsigned char)(out[1]);
+	return v != 0xfe + 0xff;
+	]])],
+	[ac_cv_iconv_omits_bom=no],
+	[ac_cv_iconv_omits_bom=yes])
+
+LIBS="$old_LIBS"
+])
+if test "x$ac_cv_iconv_omits_bom" = xyes; then
+	ICONV_OMITS_BOM=Yes
+else
+	ICONV_OMITS_BOM=
+fi
+GIT_CONF_SUBST([ICONV_OMITS_BOM])
+fi
+
 ## Checks for typedefs, structures, and compiler characteristics.
 AC_MSG_NOTICE([CHECKS for typedefs, structures, and compiler characteristics])
 #
-- 
2.24.0.8.g36796e2b67


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v4 3/8] t3900: demonstrate git-rebase problem with multi encoding
  2019-11-07  2:56 ` [PATCH v4 0/8] Correct internal working and output encoding Doan Tran Cong Danh
  2019-11-07  2:56   ` [PATCH v4 1/8] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
  2019-11-07  2:56   ` [PATCH v4 2/8] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
@ 2019-11-07  2:56   ` Doan Tran Cong Danh
  2019-11-07  6:02     ` Jeff King
  2019-11-07  2:56   ` [PATCH v4 4/8] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-07  2:56 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh, Jeff King

We're using fixup!/squash! <subject> to mark if current commit will be
used to be fixed up or squashed to a previous commit.

However, if we're changing i18n.commitencoding after making the
original commit but before making the fixing up, we couldn't find the
original commit to do the fixup/squash.

Add a test to demonstrate that problem.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 t/t3900-i18n-commit.sh | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index b92ff95977..0f978bfde1 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -204,4 +204,33 @@ test_commit_autosquash_flags eucJP fixup
 
 test_commit_autosquash_flags ISO-2022-JP squash
 
+test_commit_autosquash_multi_encoding () {
+	flag=$1
+	old=$2
+	new=$3
+	msg=$4
+	test_expect_failure "commit --$flag into $old from $new" '
+		git checkout -b '$flag-$old-$new' C0 &&
+		git config i18n.commitencoding '$old' &&
+		echo '$old' >>F &&
+		git commit -a -F "$TEST_DIRECTORY/t3900/'$msg'" &&
+		test_tick &&
+		echo intermediate stuff >>G &&
+		git add G &&
+		git commit -a -m "intermediate commit" &&
+		test_tick &&
+		git config i18n.commitencoding '$new' &&
+		echo '$new-$flag' >>F &&
+		git commit -a --'$flag' HEAD^ &&
+		git rebase --autosquash -i HEAD^^^ &&
+		git rev-list HEAD >actual &&
+		test_line_count = 3 actual
+	'
+}
+
+test_commit_autosquash_multi_encoding fixup UTF-8 ISO-8859-1 1-UTF-8.txt
+test_commit_autosquash_multi_encoding squash ISO-8859-1 UTF-8 ISO8859-1.txt
+test_commit_autosquash_multi_encoding squash eucJP ISO-2022-JP eucJP.txt
+test_commit_autosquash_multi_encoding fixup ISO-2022-JP UTF-8 ISO-2022-JP.txt
+
 test_done
-- 
2.24.0.8.g36796e2b67


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v4 4/8] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-11-07  2:56 ` [PATCH v4 0/8] Correct internal working and output encoding Doan Tran Cong Danh
                     ` (2 preceding siblings ...)
  2019-11-07  2:56   ` [PATCH v4 3/8] t3900: demonstrate git-rebase problem with multi encoding Doan Tran Cong Danh
@ 2019-11-07  2:56   ` Doan Tran Cong Danh
  2019-11-07  6:04     ` Jeff King
  2019-11-07  2:56   ` [PATCH v4 5/8] sequencer: reencode revert/cherry-pick's " Doan Tran Cong Danh
                     ` (3 subsequent siblings)
  7 siblings, 1 reply; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-07  2:56 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

On musl libc, ISO-2022-JP encoder is too eager to switch back to
1 byte encoding, musl's iconv always switch back after every combining
character. Comparing glibc and musl's output for this command
$ sed q t/t3900/ISO-2022-JP.txt| iconv -f ISO-2022-JP -t utf-8 |
	iconv -f utf-8 -t ISO-2022-JP | xxd

glibc:
00000000: 1b24 4224 4f24 6c24 5224 5b24 551b 2842  .$B$O$l$R$[$U.(B
00000010: 0a                                       .

musl:
00000000: 1b24 4224 4f1b 2842 1b24 4224 6c1b 2842  .$B$O.(B.$B$l.(B
00000010: 1b24 4224 521b 2842 1b24 4224 5b1b 2842  .$B$R.(B.$B$[.(B
00000020: 1b24 4224 551b 2842 0a                   .$B$U.(B.

Although musl iconv's output isn't optimal, it's still correct.

From commit 7d509878b8, ("pretty.c: format string with truncate respects
logOutputEncoding", 2014-05-21), we're encoding the message to utf-8
first, then format it and convert the message to the actual output
encoding on git commit --squash.

Thus, t3900::test_commit_autosquash_flags is failing on musl libc.

Reencode to utf-8 before arranging rebase's todo list.
By doing this, we also remove a breakage introduced in the previous
commit.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c            | 2 +-
 t/t3900-i18n-commit.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/sequencer.c b/sequencer.c
index 9d5964fd81..69430fe23f 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -5169,7 +5169,7 @@ int todo_list_rearrange_squash(struct todo_list *todo_list)
 		*commit_todo_item_at(&commit_todo, item->commit) = item;
 
 		parse_commit(item->commit);
-		commit_buffer = get_commit_buffer(item->commit, NULL);
+		commit_buffer = logmsg_reencode(item->commit, NULL, "UTF-8");
 		find_commit_subject(commit_buffer, &subject);
 		format_subject(&buf, subject, " ");
 		subject = subjects[i] = strbuf_detach(&buf, &subject_len);
diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index 0f978bfde1..e8ce5323ee 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -209,7 +209,7 @@ test_commit_autosquash_multi_encoding () {
 	old=$2
 	new=$3
 	msg=$4
-	test_expect_failure "commit --$flag into $old from $new" '
+	test_expect_success "commit --$flag into $old from $new" '
 		git checkout -b '$flag-$old-$new' C0 &&
 		git config i18n.commitencoding '$old' &&
 		echo '$old' >>F &&
-- 
2.24.0.8.g36796e2b67


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v4 5/8] sequencer: reencode revert/cherry-pick's todo list
  2019-11-07  2:56 ` [PATCH v4 0/8] Correct internal working and output encoding Doan Tran Cong Danh
                     ` (3 preceding siblings ...)
  2019-11-07  2:56   ` [PATCH v4 4/8] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
@ 2019-11-07  2:56   ` Doan Tran Cong Danh
  2019-11-07  6:06     ` Jeff King
  2019-11-07  2:56   ` [PATCH v4 6/8] sequencer: reencode squashing commit's message Doan Tran Cong Danh
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-07  2:56 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

Keep revert/cherry-pick's todo list in line with rebase todo list.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/sequencer.c b/sequencer.c
index 69430fe23f..a19954f2bf 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -2564,14 +2564,17 @@ static int walk_revs_populate_todo(struct todo_list *todo_list,
 	enum todo_command command = opts->action == REPLAY_PICK ?
 		TODO_PICK : TODO_REVERT;
 	const char *command_string = todo_command_info[command].str;
+	const char *encoding;
 	struct commit *commit;
 
 	if (prepare_revs(opts))
 		return -1;
 
+	encoding = get_log_output_encoding();
+
 	while ((commit = get_revision(opts->revs))) {
 		struct todo_item *item = append_new_todo(todo_list);
-		const char *commit_buffer = get_commit_buffer(commit, NULL);
+		const char *commit_buffer = logmsg_reencode(commit, NULL, encoding);
 		const char *subject;
 		int subject_len;
 
-- 
2.24.0.8.g36796e2b67


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v4 6/8] sequencer: reencode squashing commit's message
  2019-11-07  2:56 ` [PATCH v4 0/8] Correct internal working and output encoding Doan Tran Cong Danh
                     ` (4 preceding siblings ...)
  2019-11-07  2:56   ` [PATCH v4 5/8] sequencer: reencode revert/cherry-pick's " Doan Tran Cong Danh
@ 2019-11-07  2:56   ` Doan Tran Cong Danh
  2019-11-07  6:15     ` Jeff King
  2019-11-07  2:56   ` [PATCH v4 7/8] sequencer: reencode old merge-commit message Doan Tran Cong Danh
  2019-11-07  2:56   ` [PATCH v4 8/8] sequencer: reencode commit message for am/rebase --show-current-patch Doan Tran Cong Danh
  7 siblings, 1 reply; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-07  2:56 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

On fixup/squash-ing rebase, git will create new commit in
i18n.commitencoding, reencode the commit message to that said encode.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c            |  8 +++++---
 t/t3900-i18n-commit.sh | 14 +++++++++++++-
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/sequencer.c b/sequencer.c
index a19954f2bf..833a928929 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -1576,6 +1576,7 @@ static int update_squash_messages(struct repository *r,
 	struct strbuf buf = STRBUF_INIT;
 	int res;
 	const char *message, *body;
+	const char *encoding = get_commit_output_encoding();
 
 	if (opts->current_fixup_count > 0) {
 		struct strbuf header = STRBUF_INIT;
@@ -1602,7 +1603,7 @@ static int update_squash_messages(struct repository *r,
 			return error(_("need a HEAD to fixup"));
 		if (!(head_commit = lookup_commit_reference(r, &head)))
 			return error(_("could not read HEAD"));
-		if (!(head_message = get_commit_buffer(head_commit, NULL)))
+		if (!(head_message = logmsg_reencode(head_commit, NULL, encoding)))
 			return error(_("could not read HEAD's commit message"));
 
 		find_commit_subject(head_message, &body);
@@ -1623,7 +1624,7 @@ static int update_squash_messages(struct repository *r,
 		unuse_commit_buffer(head_commit, head_message);
 	}
 
-	if (!(message = get_commit_buffer(commit, NULL)))
+	if (!(message = logmsg_reencode(commit, NULL, encoding)))
 		return error(_("could not read commit message of %s"),
 			     oid_to_hex(&commit->object.oid));
 	find_commit_subject(message, &body);
@@ -4154,9 +4155,10 @@ static int commit_staged_changes(struct repository *r,
 				 */
 				struct commit *commit;
 				const char *path = rebase_path_squash_msg();
+				const char *encoding = get_commit_output_encoding();
 
 				if (parse_head(r, &commit) ||
-				    !(p = get_commit_buffer(commit, NULL)) ||
+				    !(p = logmsg_reencode(commit, NULL, encoding)) ||
 				    write_message(p, strlen(p), path, 0)) {
 					unuse_commit_buffer(commit, p);
 					return error(_("could not write file: "
diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index e8ce5323ee..521d7bb927 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -209,6 +209,13 @@ test_commit_autosquash_multi_encoding () {
 	old=$2
 	new=$3
 	msg=$4
+	squash_msg=
+	if test $flag = squash; then
+		squash_msg='
+		subject="squash! $(head -1 expect)" &&
+		printf "\n%s\n" "$subject" >> expect &&
+		'
+	fi
 	test_expect_success "commit --$flag into $old from $new" '
 		git checkout -b '$flag-$old-$new' C0 &&
 		git config i18n.commitencoding '$old' &&
@@ -224,7 +231,12 @@ test_commit_autosquash_multi_encoding () {
 		git commit -a --'$flag' HEAD^ &&
 		git rebase --autosquash -i HEAD^^^ &&
 		git rev-list HEAD >actual &&
-		test_line_count = 3 actual
+		test_line_count = 3 actual &&
+		iconv -f '$old' -t utf-8 "$TEST_DIRECTORY/t3900/'$msg'" >expect &&
+		'"$squash_msg"'
+		git cat-file commit HEAD^ >raw &&
+		(sed "1,/^$/d" raw | iconv -f '$new' -t utf-8) >actual &&
+		test_cmp expect actual
 	'
 }
 
-- 
2.24.0.8.g36796e2b67


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v4 7/8] sequencer: reencode old merge-commit message
  2019-11-07  2:56 ` [PATCH v4 0/8] Correct internal working and output encoding Doan Tran Cong Danh
                     ` (5 preceding siblings ...)
  2019-11-07  2:56   ` [PATCH v4 6/8] sequencer: reencode squashing commit's message Doan Tran Cong Danh
@ 2019-11-07  2:56   ` Doan Tran Cong Danh
  2019-11-07  2:56   ` [PATCH v4 8/8] sequencer: reencode commit message for am/rebase --show-current-patch Doan Tran Cong Danh
  7 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-07  2:56 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

During rebasing, old merge's message (encoded in old encoding)
will be used as message for new merge commit (created by rebase).

In case of the value of i18n.commitencoding has been changed after the
old merge time. We will receive an unusable message for this new merge.

Correct it.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sequencer.c b/sequencer.c
index 833a928929..d735d09f98 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -3374,7 +3374,8 @@ static int do_merge(struct repository *r,
 	}
 
 	if (commit) {
-		const char *message = get_commit_buffer(commit, NULL);
+		const char *encoding = get_commit_output_encoding();
+		const char *message = logmsg_reencode(commit, NULL, encoding);
 		const char *body;
 		int len;
 
-- 
2.24.0.8.g36796e2b67


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v4 8/8] sequencer: reencode commit message for am/rebase --show-current-patch
  2019-11-07  2:56 ` [PATCH v4 0/8] Correct internal working and output encoding Doan Tran Cong Danh
                     ` (6 preceding siblings ...)
  2019-11-07  2:56   ` [PATCH v4 7/8] sequencer: reencode old merge-commit message Doan Tran Cong Danh
@ 2019-11-07  2:56   ` Doan Tran Cong Danh
  2019-11-07  6:32     ` Jeff King
  7 siblings, 1 reply; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-07  2:56 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

The message file will be used as commit message for the
git-{am,rebase} --continue.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sequencer.c b/sequencer.c
index d735d09f98..4d12ad3cc6 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -2972,7 +2972,8 @@ static int make_patch(struct repository *r,
 
 	strbuf_addf(&buf, "%s/message", get_dir(opts));
 	if (!file_exists(buf.buf)) {
-		const char *commit_buffer = get_commit_buffer(commit, NULL);
+		const char *encoding = get_commit_output_encoding();
+		const char *commit_buffer = logmsg_reencode(commit, NULL, encoding);
 		find_commit_subject(commit_buffer, &subject);
 		res |= write_message(subject, strlen(subject), buf.buf, 1);
 		unuse_commit_buffer(commit, commit_buffer);
-- 
2.24.0.8.g36796e2b67


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v2 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-11-06 10:03               ` Danh Doan
@ 2019-11-07  5:56                 ` Jeff King
  0 siblings, 0 replies; 89+ messages in thread
From: Jeff King @ 2019-11-07  5:56 UTC (permalink / raw)
  To: Danh Doan; +Cc: Junio C Hamano, git

On Wed, Nov 06, 2019 at 05:03:22PM +0700, Danh Doan wrote:

> On 2019-11-05 23:03:14 -0500, Jeff King wrote:
> > >     3. You are dealing with a project originated on and migrated
> > >        from a foreign SCM, and older parts of the history is stored
> > >        in a non-utf-8, even though recent history is in utf-8
> > > 
> > > to the mix?
> > 
> > I would think you'd want to convert to utf-8 as you do the migration in
> > that case, since you're writing new hashes anyway.
> 
> Sorry but I'm confused.
> If we're migrating from foreign SCM, we could make our commit in
> utf-8 (convert their commit message to utf8).
> Even if we need to synchronise history between the foreign SCM in
> question with git, we could use i18n.logoutputencoding for the output
> comestic.

Right, that's the same thing I'm suggesting.

> > But I think a similar
> > case would just be an old Git repository, where for some reason you
> > thought i18n.commitEncoding was a good idea back then (perhaps because
> > you were in situation (1) then, but now you aren't).
> > 
> > In either case, though, I don't think it's a compelling motivation for
> > optimization, if only because those old commits will be shown less and
> > less (and even without modern optimizations like commit-graph, we'd
> > generally avoid reencoding those old commits unless we're actually going
> > to _show_ them).
> 
> I'm not sure if we're misunderstood each other.
> I've only suggested to encode _new_ commit from now on in utf-8.
> Reencoding old history in utf-8 is definitely not in that suggestion.

Yes. My point was that's _already_ the default behavior, unless you
explicitly set some config asking for non-utf8 commit objects. And I
don't think there's any good reason to set that these days.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v4 3/8] t3900: demonstrate git-rebase problem with multi encoding
  2019-11-07  2:56   ` [PATCH v4 3/8] t3900: demonstrate git-rebase problem with multi encoding Doan Tran Cong Danh
@ 2019-11-07  6:02     ` Jeff King
  2019-11-07  6:48       ` Danh Doan
  0 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2019-11-07  6:02 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git

On Thu, Nov 07, 2019 at 09:56:14AM +0700, Doan Tran Cong Danh wrote:

> We're using fixup!/squash! <subject> to mark if current commit will be
> used to be fixed up or squashed to a previous commit.
> 
> However, if we're changing i18n.commitencoding after making the
> original commit but before making the fixing up, we couldn't find the
> original commit to do the fixup/squash.
> 
> Add a test to demonstrate that problem.

OK, this makes sense to do.

I'm not sure if we need to test so many combinations, as the problem is
apparent even on some vanilla ones. But I guess this is just following
the lead of the rest of the script.

> +test_commit_autosquash_multi_encoding () {
> +	flag=$1
> +	old=$2
> +	new=$3
> +	msg=$4
> +	test_expect_failure "commit --$flag into $old from $new" '
> +		git checkout -b '$flag-$old-$new' C0 &&

These single quotes are funny; they close the test-snippet string, so
these variables are outside of any quoting (and thus subject to
whitespace splitting).

The test snippets are run as an eval, so they have access to the
variables you set above. I.e., just:

  git checkout -b $flag-$old-$new C0

would work. Or:

  git checkout -b "$flag-$old-$new" C0

if you wanted to be more careful inside the snippet.

> +		git config i18n.commitencoding '$old' &&
> +		echo '$old' >>F &&
> +		git commit -a -F "$TEST_DIRECTORY/t3900/'$msg'" &&

Likewise for all these other bits of the script.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v4 4/8] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-11-07  2:56   ` [PATCH v4 4/8] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
@ 2019-11-07  6:04     ` Jeff King
  0 siblings, 0 replies; 89+ messages in thread
From: Jeff King @ 2019-11-07  6:04 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git

On Thu, Nov 07, 2019 at 09:56:15AM +0700, Doan Tran Cong Danh wrote:

> On musl libc, ISO-2022-JP encoder is too eager to switch back to
> 1 byte encoding, musl's iconv always switch back after every combining
> character. Comparing glibc and musl's output for this command
> $ sed q t/t3900/ISO-2022-JP.txt| iconv -f ISO-2022-JP -t utf-8 |
> 	iconv -f utf-8 -t ISO-2022-JP | xxd
> 
> glibc:
> 00000000: 1b24 4224 4f24 6c24 5224 5b24 551b 2842  .$B$O$l$R$[$U.(B
> 00000010: 0a                                       .
> 
> musl:
> 00000000: 1b24 4224 4f1b 2842 1b24 4224 6c1b 2842  .$B$O.(B.$B$l.(B
> 00000010: 1b24 4224 521b 2842 1b24 4224 5b1b 2842  .$B$R.(B.$B$[.(B
> 00000020: 1b24 4224 551b 2842 0a                   .$B$U.(B.
> 
> Although musl iconv's output isn't optimal, it's still correct.
> 
> From commit 7d509878b8, ("pretty.c: format string with truncate respects
> logOutputEncoding", 2014-05-21), we're encoding the message to utf-8
> first, then format it and convert the message to the actual output
> encoding on git commit --squash.
> 
> Thus, t3900::test_commit_autosquash_flags is failing on musl libc.
> 
> Reencode to utf-8 before arranging rebase's todo list.

OK, that makes sense.

> By doing this, we also remove a breakage introduced in the previous
> commit.

We'd usually say that a commit "introduced a breakage" if it was the
source of the bug. Perhaps:

  ...we also remove a breakage noticed by a test added in the previous
  commit.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v4 5/8] sequencer: reencode revert/cherry-pick's todo list
  2019-11-07  2:56   ` [PATCH v4 5/8] sequencer: reencode revert/cherry-pick's " Doan Tran Cong Danh
@ 2019-11-07  6:06     ` Jeff King
  0 siblings, 0 replies; 89+ messages in thread
From: Jeff King @ 2019-11-07  6:06 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git

On Thu, Nov 07, 2019 at 09:56:16AM +0700, Doan Tran Cong Danh wrote:

> Keep revert/cherry-pick's todo list in line with rebase todo list.

I think this is the right thing to do, but it might bear a little more
explanation. Maybe:

  The user may see this todo list in their editor, so we should output
  it in the encoding they generally expect to see, not whatever encoding
  the commit was originally in. This also matches the behavior of the
  rebase todo list.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v4 6/8] sequencer: reencode squashing commit's message
  2019-11-07  2:56   ` [PATCH v4 6/8] sequencer: reencode squashing commit's message Doan Tran Cong Danh
@ 2019-11-07  6:15     ` Jeff King
  0 siblings, 0 replies; 89+ messages in thread
From: Jeff King @ 2019-11-07  6:15 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git

On Thu, Nov 07, 2019 at 09:56:17AM +0700, Doan Tran Cong Danh wrote:

> On fixup/squash-ing rebase, git will create new commit in
> i18n.commitencoding, reencode the commit message to that said encode.

That makes sense (and I agree this is logically distinct from the
previous ones, which were about _showing_ commits, not generating them).

I wondered who is responsible for setting the "encoding" header in the
resulting object. It looks like we just call out to a separate "git
commit", feeding it some content we wrote out to a file. So before this
patch, I think we probably are writing out "encoding iso8859-1" or
whatever in the commit object, but actually outputting whatever the
original commit happened to have in it.

So your approach here is right: we just need to make sure what we write
out for git-commit to read back in is in i18n.commitEncoding.

> diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
> index e8ce5323ee..521d7bb927 100755
> --- a/t/t3900-i18n-commit.sh
> +++ b/t/t3900-i18n-commit.sh
> @@ -209,6 +209,13 @@ test_commit_autosquash_multi_encoding () {
>  	old=$2
>  	new=$3
>  	msg=$4
> +	squash_msg=
> +	if test $flag = squash; then
> +		squash_msg='
> +		subject="squash! $(head -1 expect)" &&
> +		printf "\n%s\n" "$subject" >> expect &&
> +		'
> +	fi

Now what's going on here? This is a snippet of code we man to evaluate
later:

> -		test_line_count = 3 actual
> +		test_line_count = 3 actual &&
> +		iconv -f '$old' -t utf-8 "$TEST_DIRECTORY/t3900/'$msg'" >expect &&
> +		'"$squash_msg"'

I assume this is part of the same confusion that caused the
single-quotes in the earlier patch. You can just include those lines
inline in the quoted test snippet.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v4 2/8] configure.ac: define ICONV_OMITS_BOM if necessary
  2019-11-07  2:56   ` [PATCH v4 2/8] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
@ 2019-11-07  6:18     ` Junio C Hamano
  0 siblings, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2019-11-07  6:18 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git

Doan Tran Cong Danh <congdanhqx@gmail.com> writes:

> From commit 79444c9294, ("utf8: handle systems that don't write BOM for
> UTF-16", 2019-02-12), we're supporting those systems with iconv that
> omits BOM with:
>
>     make ICONV_OMITS_BOM=Yes
>
> However, typing the flag all the time is cumbersome and error-prone.

Drop the last sentence.  Not having to give the flags is an already
solved problem and the solution is to use config.mak.

> Add a check into configure script to detect this flag automatically.

This is good.  "Makefile supports this knob, configure script was
not aware of it.  Teach configure to do so" flows the thought
perfectly well (and "typing all the time" does not have to come into
picture).

Thanks.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v4 8/8] sequencer: reencode commit message for am/rebase --show-current-patch
  2019-11-07  2:56   ` [PATCH v4 8/8] sequencer: reencode commit message for am/rebase --show-current-patch Doan Tran Cong Danh
@ 2019-11-07  6:32     ` Jeff King
  2019-11-07  7:48       ` Danh Doan
  0 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2019-11-07  6:32 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git

On Thu, Nov 07, 2019 at 09:56:19AM +0700, Doan Tran Cong Danh wrote:

> The message file will be used as commit message for the
> git-{am,rebase} --continue.
>
> [...]
>  	strbuf_addf(&buf, "%s/message", get_dir(opts));
>  	if (!file_exists(buf.buf)) {
> -		const char *commit_buffer = get_commit_buffer(commit, NULL);
> +		const char *encoding = get_commit_output_encoding();
> +		const char *commit_buffer = logmsg_reencode(commit, NULL, encoding);

That makes sense, though it's hard to understand the flow of this data
through multiple sequencer invocations. I _think_ this would be fixing a
case like this:

-- >8 --
git init repo
cd repo

# some commits to build off of
echo base >file
git add file
git commit -m base

echo side >file
git add file
git commit -m side

# now make a commit in iso8859-1
git checkout -b side HEAD^
echo iso8859-1 >file
git add file
iconv -f utf8 -t iso8859-1 <<-\EOF |
súbject

bödy
EOF
git -c i18n.commitEncoding=iso8859-1 commit -F -

# and rebase it with the merge strategy, which will fail;
# now .git/rebase-merge/message has iso8859-1 in it
git rebase -m master

# and if we resolve and commit, presumably we'd get a broken commit,
# with iso8859-1 and no encoding header
echo resolved >file
git add file
GIT_EDITOR=: git rebase --continue
-- 8< --

But somehow it all seems to work. The resulting commit has real utf8 in
it. I'm not sure if we pull it from the original commit via "commit -c",
or if it's in one of the other files. But it's not clear to me how
this "message" file is being used.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v4 3/8] t3900: demonstrate git-rebase problem with multi encoding
  2019-11-07  6:02     ` Jeff King
@ 2019-11-07  6:48       ` Danh Doan
  2019-11-07  8:02         ` Jeff King
  0 siblings, 1 reply; 89+ messages in thread
From: Danh Doan @ 2019-11-07  6:48 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On 2019-11-07 01:02:33 -0500, Jeff King wrote:
> On Thu, Nov 07, 2019 at 09:56:14AM +0700, Doan Tran Cong Danh wrote:
> 
> > +test_commit_autosquash_multi_encoding () {
> > +	flag=$1
> > +	old=$2
> > +	new=$3
> > +	msg=$4
> > +	test_expect_failure "commit --$flag into $old from $new" '
> > +		git checkout -b '$flag-$old-$new' C0 &&
> 
> These single quotes are funny; they close the test-snippet string, so
> these variables are outside of any quoting (and thus subject to
> whitespace splitting).

Yes, those quotes are funny, and I'm also aware that
they will be subjected to whitespace spliting.
But those quotes were intentional, they're there in order to have
better log with:

    ./t3900-i18n-commit.sh -v

With those funny quotes, we will see this (verbose) log:

    expecting success of 3900.38 'commit --fixup into ISO-2022-JP from UTF-8':
                git checkout -b fixup-ISO-2022-JP-UTF-8 C0 &&
                git config i18n.commitencoding ISO-2022-JP &&
                echo ISO-2022-JP >>F &&
                git commit -a -F "$TEST_DIRECTORY/t3900/ISO-2022-JP.txt" &&
                test_tick &&
                echo intermediate stuff >>G &&
                git add G &&
                git commit -a -m "intermediate commit" &&
                test_tick &&
                git config i18n.commitencoding UTF-8 &&
                echo UTF-8-fixup >>F &&
                git commit -a --fixup HEAD^ &&
                git rebase --autosquash -i HEAD^^^ &&
                git rev-list HEAD >actual &&
                test_line_count = 3 actual &&
                iconv -f ISO-2022-JP -t utf-8 "$TEST_DIRECTORY/t3900/ISO-2022-JP.txt" >expect &&
                git cat-file commit HEAD^ >raw &&
                (sed "1,/^$/d" raw | iconv -f UTF-8 -t utf-8) >actual &&
                test_cmp expect actual

I think it's easier to manual run the test step with this log.

-- 
Danh

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v4 8/8] sequencer: reencode commit message for am/rebase --show-current-patch
  2019-11-07  6:32     ` Jeff King
@ 2019-11-07  7:48       ` Danh Doan
  2019-11-07  8:03         ` Jeff King
  0 siblings, 1 reply; 89+ messages in thread
From: Danh Doan @ 2019-11-07  7:48 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On 2019-11-07 01:32:23 -0500, Jeff King wrote:
> On Thu, Nov 07, 2019 at 09:56:19AM +0700, Doan Tran Cong Danh wrote:
> 
> > The message file will be used as commit message for the
> > git-{am,rebase} --continue.
> >
> > [...]
> >  	strbuf_addf(&buf, "%s/message", get_dir(opts));
> >  	if (!file_exists(buf.buf)) {
> > -		const char *commit_buffer = get_commit_buffer(commit, NULL);
> > +		const char *encoding = get_commit_output_encoding();
> > +		const char *commit_buffer = logmsg_reencode(commit, NULL, encoding);
> 
> That makes sense, though it's hard to understand the flow of this data
> through multiple sequencer invocations. I _think_ this would be fixing a
> case like this:
> 
> -- >8 --
> git init repo
> cd repo
> 
> # some commits to build off of
> echo base >file
> git add file
> git commit -m base
> 
> echo side >file
> git add file
> git commit -m side
> 
> # now make a commit in iso8859-1
> git checkout -b side HEAD^
> echo iso8859-1 >file
> git add file
> iconv -f utf8 -t iso8859-1 <<-\EOF |
> súbject
> 
> bödy
> EOF
> git -c i18n.commitEncoding=iso8859-1 commit -F -
> 
> # and rebase it with the merge strategy, which will fail;
> # now .git/rebase-merge/message has iso8859-1 in it
> git rebase -m master
> 
> # and if we resolve and commit, presumably we'd get a broken commit,
> # with iso8859-1 and no encoding header
> echo resolved >file
> git add file
> GIT_EDITOR=: git rebase --continue
> -- 8< --
> 
> But somehow it all seems to work. The resulting commit has real utf8 in
> it. I'm not sure if we pull it from the original commit via "commit -c",

Yes, somehow it worked. But, without this patch, git also warns:

    % GIT_EDITOR=: git rebase --continue
    Warning: commit message did not conform to UTF-8.
    You may want to amend it after fixing the message, or set the config
    variable i18n.commitencoding to the encoding your project uses.

Checking with strace (on glibc, musl strace can't trace execve):

> [pid 12848] execve("/home/danh/workspace/git/git", ["/home/danh/workspace/git/git", "commit", "-n", "-F", ".git/rebase-merge/message", "-e", "--allow-empty"], 0x558fb02e8240 /* 51 vars */) = 0

Turn out, it's because of: commit.c::verify_utf8

    /*
     * This verifies that the buffer is in proper utf8 format.
     *
     * If it isn't, it assumes any non-utf8 characters are Latin1,
     * and does the conversion.
     */
    static int verify_utf8(struct strbuf *buf)

Hence, your test is just pure luck (because it's in latin1).

-- 
Danh

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v4 3/8] t3900: demonstrate git-rebase problem with multi encoding
  2019-11-07  6:48       ` Danh Doan
@ 2019-11-07  8:02         ` Jeff King
  2019-11-07 10:51           ` Danh Doan
  0 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2019-11-07  8:02 UTC (permalink / raw)
  To: Danh Doan; +Cc: git

On Thu, Nov 07, 2019 at 01:48:54PM +0700, Danh Doan wrote:

> On 2019-11-07 01:02:33 -0500, Jeff King wrote:
> > On Thu, Nov 07, 2019 at 09:56:14AM +0700, Doan Tran Cong Danh wrote:
> > 
> > > +test_commit_autosquash_multi_encoding () {
> > > +	flag=$1
> > > +	old=$2
> > > +	new=$3
> > > +	msg=$4
> > > +	test_expect_failure "commit --$flag into $old from $new" '
> > > +		git checkout -b '$flag-$old-$new' C0 &&
> > 
> > These single quotes are funny; they close the test-snippet string, so
> > these variables are outside of any quoting (and thus subject to
> > whitespace splitting).
> 
> Yes, those quotes are funny, and I'm also aware that
> they will be subjected to whitespace spliting.
> But those quotes were intentional, they're there in order to have
> better log with:
> 
>     ./t3900-i18n-commit.sh -v
> 
> With those funny quotes, we will see this (verbose) log:
> 
>     expecting success of 3900.38 'commit --fixup into ISO-2022-JP from UTF-8':
>                 git checkout -b fixup-ISO-2022-JP-UTF-8 C0 &&

Yes, it's true you get the expanded version here, but...

>                 git config i18n.commitencoding ISO-2022-JP &&
>                 echo ISO-2022-JP >>F &&
>                 git commit -a -F "$TEST_DIRECTORY/t3900/ISO-2022-JP.txt" &&

...you still can't just run this manually because of other lines like
this one.

It's also weirdly unlike all of the other tests, which creates confusion
for people reading the code. IMHO the tradeoff isn't worth it.

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v4 8/8] sequencer: reencode commit message for am/rebase --show-current-patch
  2019-11-07  7:48       ` Danh Doan
@ 2019-11-07  8:03         ` Jeff King
  2019-11-07 16:32           ` Danh Doan
  0 siblings, 1 reply; 89+ messages in thread
From: Jeff King @ 2019-11-07  8:03 UTC (permalink / raw)
  To: Danh Doan; +Cc: git

On Thu, Nov 07, 2019 at 02:48:58PM +0700, Danh Doan wrote:

> > # and if we resolve and commit, presumably we'd get a broken commit,
> > # with iso8859-1 and no encoding header
> > echo resolved >file
> > git add file
> > GIT_EDITOR=: git rebase --continue
> > -- 8< --
> > 
> > But somehow it all seems to work. The resulting commit has real utf8 in
> > it. I'm not sure if we pull it from the original commit via "commit -c",
> 
> Yes, somehow it worked. But, without this patch, git also warns:
> 
>     % GIT_EDITOR=: git rebase --continue
>     Warning: commit message did not conform to UTF-8.
>     You may want to amend it after fixing the message, or set the config
>     variable i18n.commitencoding to the encoding your project uses.
> 
> Checking with strace (on glibc, musl strace can't trace execve):
> 
> > [pid 12848] execve("/home/danh/workspace/git/git", ["/home/danh/workspace/git/git", "commit", "-n", "-F", ".git/rebase-merge/message", "-e", "--allow-empty"], 0x558fb02e8240 /* 51 vars */) = 0
> 
> Turn out, it's because of: commit.c::verify_utf8
> 
>     /*
>      * This verifies that the buffer is in proper utf8 format.
>      *
>      * If it isn't, it assumes any non-utf8 characters are Latin1,
>      * and does the conversion.
>      */
>     static int verify_utf8(struct strbuf *buf)
> 
> Hence, your test is just pure luck (because it's in latin1).

Ah, thanks for resolving that mystery. Is it worth turning the scenario
above into a test?

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v4 3/8] t3900: demonstrate git-rebase problem with multi encoding
  2019-11-07  8:02         ` Jeff King
@ 2019-11-07 10:51           ` Danh Doan
  2019-11-11  8:22             ` Jeff King
  0 siblings, 1 reply; 89+ messages in thread
From: Danh Doan @ 2019-11-07 10:51 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On 2019-11-07 03:02:18 -0500, Jeff King wrote:
> >                 git config i18n.commitencoding ISO-2022-JP &&
> >                 echo ISO-2022-JP >>F &&
> >                 git commit -a -F "$TEST_DIRECTORY/t3900/ISO-2022-JP.txt" &&
> 
> ...you still can't just run this manually because of other lines like
> this one.

Except we can with a little effort:

    export TEST_DIRECTORY=..

> It's also weirdly unlike all of the other tests, which creates confusion
> for people reading the code. IMHO the tradeoff isn't worth it.

Hm, I think it's the test_commit_autosquash_flag is the one that is weird
in this file. Most of other sets of tests (line 89-176) use the same quote.

test_commit_autosquash_flag is the inconsistent one,
for the most part, it doesn't employ the funny quote,
but it uses the funny one in the `git cat-file` line.

-- 
Danh

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v4 8/8] sequencer: reencode commit message for am/rebase --show-current-patch
  2019-11-07  8:03         ` Jeff King
@ 2019-11-07 16:32           ` Danh Doan
  0 siblings, 0 replies; 89+ messages in thread
From: Danh Doan @ 2019-11-07 16:32 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On 2019-11-07 03:03:07 -0500, Jeff King wrote:
> On Thu, Nov 07, 2019 at 02:48:58PM +0700, Danh Doan wrote:
> 
> > > # and if we resolve and commit, presumably we'd get a broken commit,
> > > # with iso8859-1 and no encoding header
> > > echo resolved >file
> > > git add file
> > > GIT_EDITOR=: git rebase --continue
> > > -- 8< --
> > > 
> > > But somehow it all seems to work. The resulting commit has real utf8 in
> > > it. I'm not sure if we pull it from the original commit via "commit -c",
> > 
> > Yes, somehow it worked. But, without this patch, git also warns:
> > 
> >     % GIT_EDITOR=: git rebase --continue
> >     Warning: commit message did not conform to UTF-8.
> >     You may want to amend it after fixing the message, or set the config
> >     variable i18n.commitencoding to the encoding your project uses.
> > 
> > Checking with strace (on glibc, musl strace can't trace execve):
> > 
> > > [pid 12848] execve("/home/danh/workspace/git/git", ["/home/danh/workspace/git/git", "commit", "-n", "-F", ".git/rebase-merge/message", "-e", "--allow-empty"], 0x558fb02e8240 /* 51 vars */) = 0
> > 
> > Turn out, it's because of: commit.c::verify_utf8
> > 
> >     /*
> >      * This verifies that the buffer is in proper utf8 format.
> >      *
> >      * If it isn't, it assumes any non-utf8 characters are Latin1,
> >      * and does the conversion.
> >      */
> >     static int verify_utf8(struct strbuf *buf)
> > 
> > Hence, your test is just pure luck (because it's in latin1).
> 
> Ah, thanks for resolving that mystery. Is it worth turning the scenario
> above into a test?

Yes, it's worth to have a test.
In fact, I found another breakage (rebase with encoding) while writing
this test. I'll delay the re-roll a bit to include that breakage.

-- 
Danh

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v5 0/9] Improve odd encoding integration
  2019-10-31  9:26 [PATCH 0/3] Linux with musl libc improvement Doan Tran Cong Danh
                   ` (5 preceding siblings ...)
  2019-11-07  2:56 ` [PATCH v4 0/8] Correct internal working and output encoding Doan Tran Cong Danh
@ 2019-11-08  9:43 ` Doan Tran Cong Danh
  2019-11-08  9:43   ` [PATCH v5 1/9] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
                     ` (10 more replies)
  2019-11-11  6:03 ` [PATCH v6 0/9] sequencer: handle other encoding better Doan Tran Cong Danh
  7 siblings, 11 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-08  9:43 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

The series is shifting from fixing test that failed on musl based Linux to
correct the internal working encoding and output encoding of git-am
git-cherry-pick git-rebase and git-revert.

Change from v4:
- Update commit message per review
- Add test for last 2 patches
- Notice a breakage with git rebase --rebase-merges (see patch 7)


Doan Tran Cong Danh (9):
  t0028: eliminate non-standard usage of printf
  configure.ac: define ICONV_OMITS_BOM if necessary
  t3900: demonstrate git-rebase problem with multi encoding
  sequencer: reencode to utf-8 before arrange rebase's todo list
  sequencer: reencode revert/cherry-pick's todo list
  sequencer: reencode squashing commit's message
  sequencer: reencode old merge-commit message
  sequencer: reencode commit message for am/rebase --show-current-patch
  sequencer: fallback to sane label in making rebase todo list

 configure.ac                     |  49 ++++++++++++++++++
 sequencer.c                      |  32 ++++++++----
 t/t0028-working-tree-encoding.sh |   4 +-
 t/t3433-rebase-i18n.sh           |  84 +++++++++++++++++++++++++++++++
 t/t3433/ISO8859-1.txt            | Bin 0 -> 15 bytes
 t/t3433/eucJP.txt                | Bin 0 -> 68 bytes
 t/t3900-i18n-commit.sh           |  37 ++++++++++++++
 7 files changed, 195 insertions(+), 11 deletions(-)
 create mode 100755 t/t3433-rebase-i18n.sh
 create mode 100644 t/t3433/ISO8859-1.txt
 create mode 100644 t/t3433/eucJP.txt

Range-diff against v4:
 1:  daa0c27d28 =  1:  b3d6c4e720 t0028: eliminate non-standard usage of printf
 2:  c50964f413 !  2:  fe63a6bc44 configure.ac: define ICONV_OMITS_BOM if necessary
    @@ Commit message
     
             make ICONV_OMITS_BOM=Yes
     
    -    However, typing the flag all the time is cumbersome and error-prone.
    +    However, configure script wasn't taught to detect those systems.
     
    -    Add a check into configure script to detect this flag automatically.
    +    Teach configure to do so.
     
         Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
     
 3:  47888f236c !  3:  30f15075c4 t3900: demonstrate git-rebase problem with multi encoding
    @@ t/t3900-i18n-commit.sh: test_commit_autosquash_flags eucJP fixup
     +	new=$3
     +	msg=$4
     +	test_expect_failure "commit --$flag into $old from $new" '
    -+		git checkout -b '$flag-$old-$new' C0 &&
    -+		git config i18n.commitencoding '$old' &&
    -+		echo '$old' >>F &&
    -+		git commit -a -F "$TEST_DIRECTORY/t3900/'$msg'" &&
    ++		git checkout -b $flag-$old-$new C0 &&
    ++		git config i18n.commitencoding $old &&
    ++		echo $old >>F &&
    ++		git commit -a -F "$TEST_DIRECTORY"/t3900/$msg &&
     +		test_tick &&
     +		echo intermediate stuff >>G &&
     +		git add G &&
     +		git commit -a -m "intermediate commit" &&
     +		test_tick &&
    -+		git config i18n.commitencoding '$new' &&
    -+		echo '$new-$flag' >>F &&
    -+		git commit -a --'$flag' HEAD^ &&
    ++		git config i18n.commitencoding $new &&
    ++		echo $new-$flag >>F &&
    ++		git commit -a --$flag HEAD^ &&
     +		git rebase --autosquash -i HEAD^^^ &&
     +		git rev-list HEAD >actual &&
     +		test_line_count = 3 actual
 4:  42115f1e33 !  4:  17165b81e5 sequencer: reencode to utf-8 before arrange rebase's todo list
    @@ Commit message
         Thus, t3900::test_commit_autosquash_flags is failing on musl libc.
     
         Reencode to utf-8 before arranging rebase's todo list.
    -    By doing this, we also remove a breakage introduced in the previous
    -    commit.
    +
    +    By doing this, we also remove a breakage noticed by a test added in the
    +    previous commit.
     
         Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
     
    @@ t/t3900-i18n-commit.sh: test_commit_autosquash_multi_encoding () {
      	msg=$4
     -	test_expect_failure "commit --$flag into $old from $new" '
     +	test_expect_success "commit --$flag into $old from $new" '
    - 		git checkout -b '$flag-$old-$new' C0 &&
    - 		git config i18n.commitencoding '$old' &&
    - 		echo '$old' >>F &&
    + 		git checkout -b $flag-$old-$new C0 &&
    + 		git config i18n.commitencoding $old &&
    + 		echo $old >>F &&
 5:  5a871d7226 =  5:  40fa759492 sequencer: reencode revert/cherry-pick's todo list
 6:  1c6194a598 !  6:  ed6cfab5d2 sequencer: reencode squashing commit's message
    @@ sequencer.c: static int commit_staged_changes(struct repository *r,
     
      ## t/t3900-i18n-commit.sh ##
     @@ t/t3900-i18n-commit.sh: test_commit_autosquash_multi_encoding () {
    - 	old=$2
    - 	new=$3
    - 	msg=$4
    -+	squash_msg=
    -+	if test $flag = squash; then
    -+		squash_msg='
    -+		subject="squash! $(head -1 expect)" &&
    -+		printf "\n%s\n" "$subject" >> expect &&
    -+		'
    -+	fi
    - 	test_expect_success "commit --$flag into $old from $new" '
    - 		git checkout -b '$flag-$old-$new' C0 &&
    - 		git config i18n.commitencoding '$old' &&
    -@@ t/t3900-i18n-commit.sh: test_commit_autosquash_multi_encoding () {
    - 		git commit -a --'$flag' HEAD^ &&
    + 		git commit -a --$flag HEAD^ &&
      		git rebase --autosquash -i HEAD^^^ &&
      		git rev-list HEAD >actual &&
     -		test_line_count = 3 actual
     +		test_line_count = 3 actual &&
    -+		iconv -f '$old' -t utf-8 "$TEST_DIRECTORY/t3900/'$msg'" >expect &&
    -+		'"$squash_msg"'
    ++		iconv -f $old -t UTF-8 "$TEST_DIRECTORY"/t3900/$msg >expect &&
    ++		if test $flag = squash; then
    ++			subject="$(head -1 expect)" &&
    ++			printf "\nsquash! %s\n" "$subject" >>expect
    ++		fi &&
     +		git cat-file commit HEAD^ >raw &&
    -+		(sed "1,/^$/d" raw | iconv -f '$new' -t utf-8) >actual &&
    ++		(sed "1,/^$/d" raw | iconv -f $new -t utf-8) >actual &&
     +		test_cmp expect actual
      	'
      }
 7:  95df3cdadf <  -:  ---------- sequencer: reencode old merge-commit message
 8:  0606b2408d <  -:  ---------- sequencer: reencode commit message for am/rebase --show-current-patch
 -:  ---------- >  7:  def9adf97e sequencer: reencode old merge-commit message
 -:  ---------- >  8:  2e95ca57d2 sequencer: reencode commit message for am/rebase --show-current-patch
 -:  ---------- >  9:  860dee65f4 sequencer: fallback to sane label in making rebase todo list
-- 
2.24.0.8.g2e95ca57d2.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v5 1/9] t0028: eliminate non-standard usage of printf
  2019-11-08  9:43 ` [PATCH v5 0/9] Improve odd encoding integration Doan Tran Cong Danh
@ 2019-11-08  9:43   ` Doan Tran Cong Danh
  2019-11-08  9:43   ` [PATCH v5 2/9] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-08  9:43 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

man 1p printf:
   In addition to the escape sequences shown in the Base Definitions
   volume of POSIX.1‐2008, Chapter 5, File Format Notation ('\\',
   '\a', '\b', '\f', '\n', '\r', '\t', '\v'), "\ddd", where ddd is a
   one, two, or three-digit octal number, shall be written as a byte
   with the numeric value specified by the octal number.

printf '\xfe\xff' is an extension of some shell.
Dash, a popular yet simple shell, do not implement this extension.

This wasn't caught by most people running the tests, even though
common shells like dash don't handle hex escapes, because their
systems don't trigger the NO_UTF16_BOM prereq. But systems with musl
libc do; when combined with dash, the test fails.

Correct it.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 t/t0028-working-tree-encoding.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 7aa0945d8d..bfc4fb9af5 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -17,7 +17,7 @@ test_lazy_prereq NO_UTF32_BOM '
 write_utf16 () {
 	if test_have_prereq NO_UTF16_BOM
 	then
-		printf '\xfe\xff'
+		printf '\376\377'
 	fi &&
 	iconv -f UTF-8 -t UTF-16
 }
@@ -25,7 +25,7 @@ write_utf16 () {
 write_utf32 () {
 	if test_have_prereq NO_UTF32_BOM
 	then
-		printf '\x00\x00\xfe\xff'
+		printf '\0\0\376\377'
 	fi &&
 	iconv -f UTF-8 -t UTF-32
 }
-- 
2.24.0.8.g2e95ca57d2.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v5 2/9] configure.ac: define ICONV_OMITS_BOM if necessary
  2019-11-08  9:43 ` [PATCH v5 0/9] Improve odd encoding integration Doan Tran Cong Danh
  2019-11-08  9:43   ` [PATCH v5 1/9] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
@ 2019-11-08  9:43   ` Doan Tran Cong Danh
  2019-11-08  9:43   ` [PATCH v5 3/9] t3900: demonstrate git-rebase problem with multi encoding Doan Tran Cong Danh
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-08  9:43 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

From commit 79444c9294, ("utf8: handle systems that don't write BOM for
UTF-16", 2019-02-12), we're supporting those systems with iconv that
omits BOM with:

    make ICONV_OMITS_BOM=Yes

However, configure script wasn't taught to detect those systems.

Teach configure to do so.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 configure.ac | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/configure.ac b/configure.ac
index a43b476402..ecba7e6e51 100644
--- a/configure.ac
+++ b/configure.ac
@@ -844,12 +844,61 @@ AC_MSG_CHECKING([for old iconv()])
 AC_COMPILE_IFELSE([OLDICONVTEST_SRC],
 	[AC_MSG_RESULT([no])],
 	[AC_MSG_RESULT([yes])
+	AC_DEFINE(HAVE_OLD_ICONV, 1)
 	OLD_ICONV=UnfortunatelyYes])
 
 GIT_UNSTASH_FLAGS($ICONVDIR)
 
 GIT_CONF_SUBST([OLD_ICONV])
 
+#
+# Define ICONV_OMITS_BOM if you are on a system which
+# iconv omits bom for utf-{16,32}
+if test -z "$NO_ICONV"; then
+AC_CACHE_CHECK([whether iconv omits bom for utf-16 and utf-32],
+ [ac_cv_iconv_omits_bom],
+[
+old_LIBS="$LIBS"
+if test -n "$NEEDS_LIBICONV"; then
+	LIBS="$LIBS -liconv"
+fi
+
+AC_RUN_IFELSE(
+	[AC_LANG_PROGRAM([AC_INCLUDES_DEFAULT
+	#include <iconv.h>
+	#ifdef HAVE_OLD_ICONV
+	typedef const char *iconv_ibp;
+	#else
+	typedef char *iconv_ibp;
+	#endif
+	],
+	[[
+	int v;
+	iconv_t conv;
+	char in[] = "a"; iconv_ibp pin = in;
+	char out[20] = ""; char *pout = out;
+	size_t isz = sizeof in;
+	size_t osz = sizeof out;
+
+	conv = iconv_open("UTF-16", "UTF-8");
+	iconv(conv, &pin, &isz, &pout, &osz);
+	iconv_close(conv);
+	v = (unsigned char)(out[0]) + (unsigned char)(out[1]);
+	return v != 0xfe + 0xff;
+	]])],
+	[ac_cv_iconv_omits_bom=no],
+	[ac_cv_iconv_omits_bom=yes])
+
+LIBS="$old_LIBS"
+])
+if test "x$ac_cv_iconv_omits_bom" = xyes; then
+	ICONV_OMITS_BOM=Yes
+else
+	ICONV_OMITS_BOM=
+fi
+GIT_CONF_SUBST([ICONV_OMITS_BOM])
+fi
+
 ## Checks for typedefs, structures, and compiler characteristics.
 AC_MSG_NOTICE([CHECKS for typedefs, structures, and compiler characteristics])
 #
-- 
2.24.0.8.g2e95ca57d2.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v5 3/9] t3900: demonstrate git-rebase problem with multi encoding
  2019-11-08  9:43 ` [PATCH v5 0/9] Improve odd encoding integration Doan Tran Cong Danh
  2019-11-08  9:43   ` [PATCH v5 1/9] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
  2019-11-08  9:43   ` [PATCH v5 2/9] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
@ 2019-11-08  9:43   ` Doan Tran Cong Danh
  2019-11-08  9:43   ` [PATCH v5 4/9] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-08  9:43 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh, Jeff King

We're using fixup!/squash! <subject> to mark if current commit will be
used to be fixed up or squashed to a previous commit.

However, if we're changing i18n.commitencoding after making the
original commit but before making the fixing up, we couldn't find the
original commit to do the fixup/squash.

Add a test to demonstrate that problem.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 t/t3900-i18n-commit.sh | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index b92ff95977..dd56384b93 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -204,4 +204,33 @@ test_commit_autosquash_flags eucJP fixup
 
 test_commit_autosquash_flags ISO-2022-JP squash
 
+test_commit_autosquash_multi_encoding () {
+	flag=$1
+	old=$2
+	new=$3
+	msg=$4
+	test_expect_failure "commit --$flag into $old from $new" '
+		git checkout -b $flag-$old-$new C0 &&
+		git config i18n.commitencoding $old &&
+		echo $old >>F &&
+		git commit -a -F "$TEST_DIRECTORY"/t3900/$msg &&
+		test_tick &&
+		echo intermediate stuff >>G &&
+		git add G &&
+		git commit -a -m "intermediate commit" &&
+		test_tick &&
+		git config i18n.commitencoding $new &&
+		echo $new-$flag >>F &&
+		git commit -a --$flag HEAD^ &&
+		git rebase --autosquash -i HEAD^^^ &&
+		git rev-list HEAD >actual &&
+		test_line_count = 3 actual
+	'
+}
+
+test_commit_autosquash_multi_encoding fixup UTF-8 ISO-8859-1 1-UTF-8.txt
+test_commit_autosquash_multi_encoding squash ISO-8859-1 UTF-8 ISO8859-1.txt
+test_commit_autosquash_multi_encoding squash eucJP ISO-2022-JP eucJP.txt
+test_commit_autosquash_multi_encoding fixup ISO-2022-JP UTF-8 ISO-2022-JP.txt
+
 test_done
-- 
2.24.0.8.g2e95ca57d2.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v5 4/9] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-11-08  9:43 ` [PATCH v5 0/9] Improve odd encoding integration Doan Tran Cong Danh
                     ` (2 preceding siblings ...)
  2019-11-08  9:43   ` [PATCH v5 3/9] t3900: demonstrate git-rebase problem with multi encoding Doan Tran Cong Danh
@ 2019-11-08  9:43   ` Doan Tran Cong Danh
  2019-11-08  9:43   ` [PATCH v5 5/9] sequencer: reencode revert/cherry-pick's " Doan Tran Cong Danh
                     ` (6 subsequent siblings)
  10 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-08  9:43 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

On musl libc, ISO-2022-JP encoder is too eager to switch back to
1 byte encoding, musl's iconv always switch back after every combining
character. Comparing glibc and musl's output for this command
$ sed q t/t3900/ISO-2022-JP.txt| iconv -f ISO-2022-JP -t utf-8 |
	iconv -f utf-8 -t ISO-2022-JP | xxd

glibc:
00000000: 1b24 4224 4f24 6c24 5224 5b24 551b 2842  .$B$O$l$R$[$U.(B
00000010: 0a                                       .

musl:
00000000: 1b24 4224 4f1b 2842 1b24 4224 6c1b 2842  .$B$O.(B.$B$l.(B
00000010: 1b24 4224 521b 2842 1b24 4224 5b1b 2842  .$B$R.(B.$B$[.(B
00000020: 1b24 4224 551b 2842 0a                   .$B$U.(B.

Although musl iconv's output isn't optimal, it's still correct.

From commit 7d509878b8, ("pretty.c: format string with truncate respects
logOutputEncoding", 2014-05-21), we're encoding the message to utf-8
first, then format it and convert the message to the actual output
encoding on git commit --squash.

Thus, t3900::test_commit_autosquash_flags is failing on musl libc.

Reencode to utf-8 before arranging rebase's todo list.

By doing this, we also remove a breakage noticed by a test added in the
previous commit.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c            | 2 +-
 t/t3900-i18n-commit.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/sequencer.c b/sequencer.c
index 9d5964fd81..69430fe23f 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -5169,7 +5169,7 @@ int todo_list_rearrange_squash(struct todo_list *todo_list)
 		*commit_todo_item_at(&commit_todo, item->commit) = item;
 
 		parse_commit(item->commit);
-		commit_buffer = get_commit_buffer(item->commit, NULL);
+		commit_buffer = logmsg_reencode(item->commit, NULL, "UTF-8");
 		find_commit_subject(commit_buffer, &subject);
 		format_subject(&buf, subject, " ");
 		subject = subjects[i] = strbuf_detach(&buf, &subject_len);
diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index dd56384b93..a518281b04 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -209,7 +209,7 @@ test_commit_autosquash_multi_encoding () {
 	old=$2
 	new=$3
 	msg=$4
-	test_expect_failure "commit --$flag into $old from $new" '
+	test_expect_success "commit --$flag into $old from $new" '
 		git checkout -b $flag-$old-$new C0 &&
 		git config i18n.commitencoding $old &&
 		echo $old >>F &&
-- 
2.24.0.8.g2e95ca57d2.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v5 5/9] sequencer: reencode revert/cherry-pick's todo list
  2019-11-08  9:43 ` [PATCH v5 0/9] Improve odd encoding integration Doan Tran Cong Danh
                     ` (3 preceding siblings ...)
  2019-11-08  9:43   ` [PATCH v5 4/9] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
@ 2019-11-08  9:43   ` Doan Tran Cong Danh
  2019-11-08  9:43   ` [PATCH v5 6/9] sequencer: reencode squashing commit's message Doan Tran Cong Danh
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-08  9:43 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

Keep revert/cherry-pick's todo list in line with rebase todo list.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/sequencer.c b/sequencer.c
index 69430fe23f..a19954f2bf 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -2564,14 +2564,17 @@ static int walk_revs_populate_todo(struct todo_list *todo_list,
 	enum todo_command command = opts->action == REPLAY_PICK ?
 		TODO_PICK : TODO_REVERT;
 	const char *command_string = todo_command_info[command].str;
+	const char *encoding;
 	struct commit *commit;
 
 	if (prepare_revs(opts))
 		return -1;
 
+	encoding = get_log_output_encoding();
+
 	while ((commit = get_revision(opts->revs))) {
 		struct todo_item *item = append_new_todo(todo_list);
-		const char *commit_buffer = get_commit_buffer(commit, NULL);
+		const char *commit_buffer = logmsg_reencode(commit, NULL, encoding);
 		const char *subject;
 		int subject_len;
 
-- 
2.24.0.8.g2e95ca57d2.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v5 6/9] sequencer: reencode squashing commit's message
  2019-11-08  9:43 ` [PATCH v5 0/9] Improve odd encoding integration Doan Tran Cong Danh
                     ` (4 preceding siblings ...)
  2019-11-08  9:43   ` [PATCH v5 5/9] sequencer: reencode revert/cherry-pick's " Doan Tran Cong Danh
@ 2019-11-08  9:43   ` Doan Tran Cong Danh
  2019-11-08  9:43   ` [PATCH v5 7/9] sequencer: reencode old merge-commit message Doan Tran Cong Danh
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-08  9:43 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

On fixup/squash-ing rebase, git will create new commit in
i18n.commitencoding, reencode the commit message to that said encode.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c            |  8 +++++---
 t/t3900-i18n-commit.sh | 10 +++++++++-
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/sequencer.c b/sequencer.c
index a19954f2bf..833a928929 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -1576,6 +1576,7 @@ static int update_squash_messages(struct repository *r,
 	struct strbuf buf = STRBUF_INIT;
 	int res;
 	const char *message, *body;
+	const char *encoding = get_commit_output_encoding();
 
 	if (opts->current_fixup_count > 0) {
 		struct strbuf header = STRBUF_INIT;
@@ -1602,7 +1603,7 @@ static int update_squash_messages(struct repository *r,
 			return error(_("need a HEAD to fixup"));
 		if (!(head_commit = lookup_commit_reference(r, &head)))
 			return error(_("could not read HEAD"));
-		if (!(head_message = get_commit_buffer(head_commit, NULL)))
+		if (!(head_message = logmsg_reencode(head_commit, NULL, encoding)))
 			return error(_("could not read HEAD's commit message"));
 
 		find_commit_subject(head_message, &body);
@@ -1623,7 +1624,7 @@ static int update_squash_messages(struct repository *r,
 		unuse_commit_buffer(head_commit, head_message);
 	}
 
-	if (!(message = get_commit_buffer(commit, NULL)))
+	if (!(message = logmsg_reencode(commit, NULL, encoding)))
 		return error(_("could not read commit message of %s"),
 			     oid_to_hex(&commit->object.oid));
 	find_commit_subject(message, &body);
@@ -4154,9 +4155,10 @@ static int commit_staged_changes(struct repository *r,
 				 */
 				struct commit *commit;
 				const char *path = rebase_path_squash_msg();
+				const char *encoding = get_commit_output_encoding();
 
 				if (parse_head(r, &commit) ||
-				    !(p = get_commit_buffer(commit, NULL)) ||
+				    !(p = logmsg_reencode(commit, NULL, encoding)) ||
 				    write_message(p, strlen(p), path, 0)) {
 					unuse_commit_buffer(commit, p);
 					return error(_("could not write file: "
diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index a518281b04..d277a9f4b7 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -224,7 +224,15 @@ test_commit_autosquash_multi_encoding () {
 		git commit -a --$flag HEAD^ &&
 		git rebase --autosquash -i HEAD^^^ &&
 		git rev-list HEAD >actual &&
-		test_line_count = 3 actual
+		test_line_count = 3 actual &&
+		iconv -f $old -t UTF-8 "$TEST_DIRECTORY"/t3900/$msg >expect &&
+		if test $flag = squash; then
+			subject="$(head -1 expect)" &&
+			printf "\nsquash! %s\n" "$subject" >>expect
+		fi &&
+		git cat-file commit HEAD^ >raw &&
+		(sed "1,/^$/d" raw | iconv -f $new -t utf-8) >actual &&
+		test_cmp expect actual
 	'
 }
 
-- 
2.24.0.8.g2e95ca57d2.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v5 7/9] sequencer: reencode old merge-commit message
  2019-11-08  9:43 ` [PATCH v5 0/9] Improve odd encoding integration Doan Tran Cong Danh
                     ` (5 preceding siblings ...)
  2019-11-08  9:43   ` [PATCH v5 6/9] sequencer: reencode squashing commit's message Doan Tran Cong Danh
@ 2019-11-08  9:43   ` Doan Tran Cong Danh
  2019-11-08  9:43   ` [PATCH v5 8/9] sequencer: reencode commit message for am/rebase --show-current-patch Doan Tran Cong Danh
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-08  9:43 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

During rebasing, old merge's message (encoded in old encoding)
will be used as message for new merge commit (created by rebase).

In case of the value of i18n.commitencoding has been changed after the
old merge time. We will receive an unusable message for this new merge.

Correct it.

This change also notice a breakage with git-rebase label system.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---

Notes:
	eucJP.txt is copied from t3900

 sequencer.c            |   3 ++-
 t/t3433-rebase-i18n.sh |  57 +++++++++++++++++++++++++++++++++++++++++
 t/t3433/eucJP.txt      | Bin 0 -> 68 bytes
 3 files changed, 59 insertions(+), 1 deletion(-)
 create mode 100755 t/t3433-rebase-i18n.sh
 create mode 100644 t/t3433/eucJP.txt

diff --git a/sequencer.c b/sequencer.c
index 833a928929..d735d09f98 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -3374,7 +3374,8 @@ static int do_merge(struct repository *r,
 	}
 
 	if (commit) {
-		const char *message = get_commit_buffer(commit, NULL);
+		const char *encoding = get_commit_output_encoding();
+		const char *message = logmsg_reencode(commit, NULL, encoding);
 		const char *body;
 		int len;
 
diff --git a/t/t3433-rebase-i18n.sh b/t/t3433-rebase-i18n.sh
new file mode 100755
index 0000000000..bff5a68d25
--- /dev/null
+++ b/t/t3433-rebase-i18n.sh
@@ -0,0 +1,57 @@
+#!/bin/sh
+#
+# Copyright (c) 2019 Doan Tran Cong Danh
+#
+
+test_description='rebase with changing encoding
+
+Initial setup:
+
+1 - 2              master
+ \
+  3 - 4            first
+   \
+    5 - 6          second
+'
+
+. ./test-lib.sh
+
+compare_msg () {
+	iconv -f "$2" -t "$3" "$TEST_DIRECTORY/t3433/$1" >expect &&
+	git cat-file commit HEAD >raw &&
+	sed "1,/^$/d" raw >actual &&
+	test_cmp expect actual
+}
+
+test_expect_success setup '
+	test_commit one &&
+	git branch first &&
+	test_commit two &&
+	git switch first &&
+	test_commit three &&
+	git branch second &&
+	test_commit four &&
+	git switch second &&
+	test_commit five &&
+	test_commit six
+'
+
+test_expect_success 'rebase --rebase-merges update encoding eucJP to UTF-8' '
+	git switch -c merge-eucJP-UTF-8 first &&
+	git config i18n.commitencoding eucJP &&
+	git merge -F "$TEST_DIRECTORY/t3433/eucJP.txt" second &&
+	git config i18n.commitencoding UTF-8 &&
+	git rebase --rebase-merges master &&
+	compare_msg eucJP.txt eucJP UTF-8
+'
+
+test_expect_failure 'rebase --rebase-merges update encoding eucJP to ISO-2022-JP' '
+	git switch -c merge-eucJP-ISO-2022-JP first &&
+	git config i18n.commitencoding eucJP &&
+	git merge -F "$TEST_DIRECTORY/t3433/eucJP.txt" second &&
+	git config i18n.commitencoding ISO-2022-JP &&
+	git rebase --rebase-merges master &&
+	compare_msg eucJP.txt eucJP ISO-2022-JP
+'
+
+test_done
diff --git a/t/t3433/eucJP.txt b/t/t3433/eucJP.txt
new file mode 100644
index 0000000000000000000000000000000000000000..546f2aac01b67e39d19de601f5586097b34a8325
GIT binary patch
literal 68
zcmZ2-e#x69mzLaKa+Ql~$@V43mMmHFddayZYZfkovW_oY%ys|3$+JKuZ<btN@@mOl
SAboGi<s}<{?6*s90HpysEiSzP

literal 0
HcmV?d00001

-- 
2.24.0.8.g2e95ca57d2.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v5 8/9] sequencer: reencode commit message for am/rebase --show-current-patch
  2019-11-08  9:43 ` [PATCH v5 0/9] Improve odd encoding integration Doan Tran Cong Danh
                     ` (6 preceding siblings ...)
  2019-11-08  9:43   ` [PATCH v5 7/9] sequencer: reencode old merge-commit message Doan Tran Cong Danh
@ 2019-11-08  9:43   ` Doan Tran Cong Danh
  2019-11-08  9:43   ` [PATCH v5 9/9] sequencer: fallback to sane label in making rebase todo list Doan Tran Cong Danh
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-08  9:43 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

The message file will be used as commit message for the
git-{am,rebase} --continue.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---

Notes:
	ISO8859-1.txt is copied from t3900

 sequencer.c            |   3 ++-
 t/t3433-rebase-i18n.sh |  27 +++++++++++++++++++++++++++
 t/t3433/ISO8859-1.txt  | Bin 0 -> 15 bytes
 3 files changed, 29 insertions(+), 1 deletion(-)
 create mode 100644 t/t3433/ISO8859-1.txt

diff --git a/sequencer.c b/sequencer.c
index d735d09f98..4d12ad3cc6 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -2972,7 +2972,8 @@ static int make_patch(struct repository *r,
 
 	strbuf_addf(&buf, "%s/message", get_dir(opts));
 	if (!file_exists(buf.buf)) {
-		const char *commit_buffer = get_commit_buffer(commit, NULL);
+		const char *encoding = get_commit_output_encoding();
+		const char *commit_buffer = logmsg_reencode(commit, NULL, encoding);
 		find_commit_subject(commit_buffer, &subject);
 		res |= write_message(subject, strlen(subject), buf.buf, 1);
 		unuse_commit_buffer(commit, commit_buffer);
diff --git a/t/t3433-rebase-i18n.sh b/t/t3433-rebase-i18n.sh
index bff5a68d25..537d18c330 100755
--- a/t/t3433-rebase-i18n.sh
+++ b/t/t3433-rebase-i18n.sh
@@ -54,4 +54,31 @@ test_expect_failure 'rebase --rebase-merges update encoding eucJP to ISO-2022-JP
 	compare_msg eucJP.txt eucJP ISO-2022-JP
 '
 
+test_rebase_continue_update_encode () {
+	old=$1
+	new=$2
+	msgfile=$3
+	test_expect_success "rebase --continue update from $old to $new" '
+		(git rebase --abort || : abort current git-rebase failure) &&
+		git switch -c conflict-$old-$new one &&
+		echo for-conflict >two.t &&
+		git add two.t &&
+		git config i18n.commitencoding $old &&
+		git commit -F "$TEST_DIRECTORY/t3433/$msgfile" &&
+		git config i18n.commitencoding $new &&
+		test_must_fail git rebase -m master &&
+		test -f .git/rebase-merge/message &&
+		git stripspace <.git/rebase-merge/message >two.t &&
+		git add two.t &&
+		git rebase --continue &&
+		compare_msg $msgfile $old $new &&
+		: git-commit assume invalid utf-8 is latin1 &&
+		test_cmp expect two.t
+	'
+}
+
+test_rebase_continue_update_encode ISO-8859-1 UTF-8 ISO8859-1.txt
+test_rebase_continue_update_encode eucJP UTF-8 eucJP.txt
+test_rebase_continue_update_encode eucJP ISO-2022-JP eucJP.txt
+
 test_done
diff --git a/t/t3433/ISO8859-1.txt b/t/t3433/ISO8859-1.txt
new file mode 100644
index 0000000000000000000000000000000000000000..7cbef0ee6f430c611134a06a6dd6c12fbea589d5
GIT binary patch
literal 15
XcmX?d`r`R(TwDi}o~OJ>OXmUrN~8!!

literal 0
HcmV?d00001

-- 
2.24.0.8.g2e95ca57d2.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v5 9/9] sequencer: fallback to sane label in making rebase todo list
  2019-11-08  9:43 ` [PATCH v5 0/9] Improve odd encoding integration Doan Tran Cong Danh
                     ` (7 preceding siblings ...)
  2019-11-08  9:43   ` [PATCH v5 8/9] sequencer: reencode commit message for am/rebase --show-current-patch Doan Tran Cong Danh
@ 2019-11-08  9:43   ` Doan Tran Cong Danh
  2019-11-11  1:22   ` [PATCH v5 0/9] Improve odd encoding integration Junio C Hamano
  2019-11-11  4:02   ` Junio C Hamano
  10 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-08  9:43 UTC (permalink / raw)
  To: git; +Cc: Doan Tran Cong Danh

In later stage of rebasing, we will make a ref in
refs/rewritten/<label>, this label is extracted from the subject of
current merge commit.

If that subject has forbidden character for refname, git couldn't create
the rewritten ref. One such case is the merge commit subject has
a multibyte character encoded in ISO-2022-JP.

Provide a sane fallback in order to help git-rebase works in such case

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---

I don't know if this is the _right_ way to do.
We may discard this patch and fix the test instead.

 sequencer.c            | 11 +++++++++--
 t/t3433-rebase-i18n.sh |  2 +-
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/sequencer.c b/sequencer.c
index 4d12ad3cc6..5a00941205 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -4562,6 +4562,7 @@ static int make_script_with_merges(struct pretty_print_context *pp,
 	while ((commit = get_revision(revs))) {
 		struct commit_list *to_merge;
 		const char *p1, *p2;
+		const char *hex_oid;
 		struct object_id *oid;
 		int is_empty;
 
@@ -4609,9 +4610,15 @@ static int make_script_with_merges(struct pretty_print_context *pp,
 			if (isspace(*p1))
 				*(char *)p1 = '-';
 
+		hex_oid = oid_to_hex(&commit->object.oid);
+
+		if (check_refname_format(label.buf, REFNAME_ALLOW_ONELEVEL) < 0) {
+			strbuf_reset(&label);
+			strbuf_addf(&label, "label-%s", hex_oid);
+		}
+
 		strbuf_reset(&buf);
-		strbuf_addf(&buf, "%s -C %s",
-			    cmd_merge, oid_to_hex(&commit->object.oid));
+		strbuf_addf(&buf, "%s -C %s", cmd_merge, hex_oid);
 
 		/* label the tips of merged branches */
 		for (; to_merge; to_merge = to_merge->next) {
diff --git a/t/t3433-rebase-i18n.sh b/t/t3433-rebase-i18n.sh
index 537d18c330..93e5402849 100755
--- a/t/t3433-rebase-i18n.sh
+++ b/t/t3433-rebase-i18n.sh
@@ -45,7 +45,7 @@ test_expect_success 'rebase --rebase-merges update encoding eucJP to UTF-8' '
 	compare_msg eucJP.txt eucJP UTF-8
 '
 
-test_expect_failure 'rebase --rebase-merges update encoding eucJP to ISO-2022-JP' '
+test_expect_success 'rebase --rebase-merges update encoding eucJP to ISO-2022-JP' '
 	git switch -c merge-eucJP-ISO-2022-JP first &&
 	git config i18n.commitencoding eucJP &&
 	git merge -F "$TEST_DIRECTORY/t3433/eucJP.txt" second &&
-- 
2.24.0.8.g2e95ca57d2.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v5 0/9] Improve odd encoding integration
  2019-11-08  9:43 ` [PATCH v5 0/9] Improve odd encoding integration Doan Tran Cong Danh
                     ` (8 preceding siblings ...)
  2019-11-08  9:43   ` [PATCH v5 9/9] sequencer: fallback to sane label in making rebase todo list Doan Tran Cong Danh
@ 2019-11-11  1:22   ` Junio C Hamano
  2019-11-11  4:02   ` Junio C Hamano
  10 siblings, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2019-11-11  1:22 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git

Doan Tran Cong Danh <congdanhqx@gmail.com> writes:

> The series is shifting from fixing test that failed on musl based Linux to
> correct the internal working encoding and output encoding of git-am
> git-cherry-pick git-rebase and git-revert.
>
> Change from v4:
> - Update commit message per review
> - Add test for last 2 patches
> - Notice a breakage with git rebase --rebase-merges (see patch 7)
>

Re: Patch 3, it indeed is a bit sad that out test framework does not
let us "show" what exactly is being executed in the debug/verbose
mode when we use parameterized test helpers. We designed the test
helpers like test_expect_success and test_eval_ so that the eval'ed
block can access the outer variables safely, i.e. "$flag-$old-$new"
can and should be written inside dq for safety, but the variable
names would be all we see in the verbose log because of that.  But
that's not the end of the world ;-)

I do not have a strong opinion on Patch 9 (I agree we need to fall
back on something sensible and without any change, the system is
quite broken---I do not have a strong opinion on what the fallback
should be); I'd love to hear what sequencer folks think.

Overall this round looks quite good.

Thanks, will queue.




>
> Doan Tran Cong Danh (9):
>   t0028: eliminate non-standard usage of printf
>   configure.ac: define ICONV_OMITS_BOM if necessary
>   t3900: demonstrate git-rebase problem with multi encoding
>   sequencer: reencode to utf-8 before arrange rebase's todo list
>   sequencer: reencode revert/cherry-pick's todo list
>   sequencer: reencode squashing commit's message
>   sequencer: reencode old merge-commit message
>   sequencer: reencode commit message for am/rebase --show-current-patch
>   sequencer: fallback to sane label in making rebase todo list
>
>  configure.ac                     |  49 ++++++++++++++++++
>  sequencer.c                      |  32 ++++++++----
>  t/t0028-working-tree-encoding.sh |   4 +-
>  t/t3433-rebase-i18n.sh           |  84 +++++++++++++++++++++++++++++++
>  t/t3433/ISO8859-1.txt            | Bin 0 -> 15 bytes
>  t/t3433/eucJP.txt                | Bin 0 -> 68 bytes
>  t/t3900-i18n-commit.sh           |  37 ++++++++++++++
>  7 files changed, 195 insertions(+), 11 deletions(-)
>  create mode 100755 t/t3433-rebase-i18n.sh
>  create mode 100644 t/t3433/ISO8859-1.txt
>  create mode 100644 t/t3433/eucJP.txt
>
> Range-diff against v4:
>  1:  daa0c27d28 =  1:  b3d6c4e720 t0028: eliminate non-standard usage of printf
>  2:  c50964f413 !  2:  fe63a6bc44 configure.ac: define ICONV_OMITS_BOM if necessary
>     @@ Commit message
>      
>              make ICONV_OMITS_BOM=Yes
>      
>     -    However, typing the flag all the time is cumbersome and error-prone.
>     +    However, configure script wasn't taught to detect those systems.
>      
>     -    Add a check into configure script to detect this flag automatically.
>     +    Teach configure to do so.
>      
>          Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
>      
>  3:  47888f236c !  3:  30f15075c4 t3900: demonstrate git-rebase problem with multi encoding
>     @@ t/t3900-i18n-commit.sh: test_commit_autosquash_flags eucJP fixup
>      +	new=$3
>      +	msg=$4
>      +	test_expect_failure "commit --$flag into $old from $new" '
>     -+		git checkout -b '$flag-$old-$new' C0 &&
>     -+		git config i18n.commitencoding '$old' &&
>     -+		echo '$old' >>F &&
>     -+		git commit -a -F "$TEST_DIRECTORY/t3900/'$msg'" &&
>     ++		git checkout -b $flag-$old-$new C0 &&
>     ++		git config i18n.commitencoding $old &&
>     ++		echo $old >>F &&
>     ++		git commit -a -F "$TEST_DIRECTORY"/t3900/$msg &&
>      +		test_tick &&
>      +		echo intermediate stuff >>G &&
>      +		git add G &&
>      +		git commit -a -m "intermediate commit" &&
>      +		test_tick &&
>     -+		git config i18n.commitencoding '$new' &&
>     -+		echo '$new-$flag' >>F &&
>     -+		git commit -a --'$flag' HEAD^ &&
>     ++		git config i18n.commitencoding $new &&
>     ++		echo $new-$flag >>F &&
>     ++		git commit -a --$flag HEAD^ &&
>      +		git rebase --autosquash -i HEAD^^^ &&
>      +		git rev-list HEAD >actual &&
>      +		test_line_count = 3 actual
>  4:  42115f1e33 !  4:  17165b81e5 sequencer: reencode to utf-8 before arrange rebase's todo list
>     @@ Commit message
>          Thus, t3900::test_commit_autosquash_flags is failing on musl libc.
>      
>          Reencode to utf-8 before arranging rebase's todo list.
>     -    By doing this, we also remove a breakage introduced in the previous
>     -    commit.
>     +
>     +    By doing this, we also remove a breakage noticed by a test added in the
>     +    previous commit.
>      
>          Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
>      
>     @@ t/t3900-i18n-commit.sh: test_commit_autosquash_multi_encoding () {
>       	msg=$4
>      -	test_expect_failure "commit --$flag into $old from $new" '
>      +	test_expect_success "commit --$flag into $old from $new" '
>     - 		git checkout -b '$flag-$old-$new' C0 &&
>     - 		git config i18n.commitencoding '$old' &&
>     - 		echo '$old' >>F &&
>     + 		git checkout -b $flag-$old-$new C0 &&
>     + 		git config i18n.commitencoding $old &&
>     + 		echo $old >>F &&
>  5:  5a871d7226 =  5:  40fa759492 sequencer: reencode revert/cherry-pick's todo list
>  6:  1c6194a598 !  6:  ed6cfab5d2 sequencer: reencode squashing commit's message
>     @@ sequencer.c: static int commit_staged_changes(struct repository *r,
>      
>       ## t/t3900-i18n-commit.sh ##
>      @@ t/t3900-i18n-commit.sh: test_commit_autosquash_multi_encoding () {
>     - 	old=$2
>     - 	new=$3
>     - 	msg=$4
>     -+	squash_msg=
>     -+	if test $flag = squash; then
>     -+		squash_msg='
>     -+		subject="squash! $(head -1 expect)" &&
>     -+		printf "\n%s\n" "$subject" >> expect &&
>     -+		'
>     -+	fi
>     - 	test_expect_success "commit --$flag into $old from $new" '
>     - 		git checkout -b '$flag-$old-$new' C0 &&
>     - 		git config i18n.commitencoding '$old' &&
>     -@@ t/t3900-i18n-commit.sh: test_commit_autosquash_multi_encoding () {
>     - 		git commit -a --'$flag' HEAD^ &&
>     + 		git commit -a --$flag HEAD^ &&
>       		git rebase --autosquash -i HEAD^^^ &&
>       		git rev-list HEAD >actual &&
>      -		test_line_count = 3 actual
>      +		test_line_count = 3 actual &&
>     -+		iconv -f '$old' -t utf-8 "$TEST_DIRECTORY/t3900/'$msg'" >expect &&
>     -+		'"$squash_msg"'
>     ++		iconv -f $old -t UTF-8 "$TEST_DIRECTORY"/t3900/$msg >expect &&
>     ++		if test $flag = squash; then
>     ++			subject="$(head -1 expect)" &&
>     ++			printf "\nsquash! %s\n" "$subject" >>expect
>     ++		fi &&
>      +		git cat-file commit HEAD^ >raw &&
>     -+		(sed "1,/^$/d" raw | iconv -f '$new' -t utf-8) >actual &&
>     ++		(sed "1,/^$/d" raw | iconv -f $new -t utf-8) >actual &&
>      +		test_cmp expect actual
>       	'
>       }
>  7:  95df3cdadf <  -:  ---------- sequencer: reencode old merge-commit message
>  8:  0606b2408d <  -:  ---------- sequencer: reencode commit message for am/rebase --show-current-patch
>  -:  ---------- >  7:  def9adf97e sequencer: reencode old merge-commit message
>  -:  ---------- >  8:  2e95ca57d2 sequencer: reencode commit message for am/rebase --show-current-patch
>  -:  ---------- >  9:  860dee65f4 sequencer: fallback to sane label in making rebase todo list

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v5 0/9] Improve odd encoding integration
  2019-11-08  9:43 ` [PATCH v5 0/9] Improve odd encoding integration Doan Tran Cong Danh
                     ` (9 preceding siblings ...)
  2019-11-11  1:22   ` [PATCH v5 0/9] Improve odd encoding integration Junio C Hamano
@ 2019-11-11  4:02   ` Junio C Hamano
  2019-11-11  4:43     ` Danh Doan
  2019-11-11  6:14     ` Junio C Hamano
  10 siblings, 2 replies; 89+ messages in thread
From: Junio C Hamano @ 2019-11-11  4:02 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git

Doan Tran Cong Danh <congdanhqx@gmail.com> writes:

>  t/t3433-rebase-i18n.sh           |  84 +++++++++++++++++++++++++++++++
>  t/t3433/ISO8859-1.txt            | Bin 0 -> 15 bytes
>  t/t3433/eucJP.txt                | Bin 0 -> 68 bytes
>  t/t3900-i18n-commit.sh           |  37 ++++++++++++++
>  7 files changed, 195 insertions(+), 11 deletions(-)
>  create mode 100755 t/t3433-rebase-i18n.sh

Doesn't "make test" barf with test-lint failure with this series
merged to 'pu'?

In other words, isn't 3433 already taken by another series?  

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v5 0/9] Improve odd encoding integration
  2019-11-11  4:02   ` Junio C Hamano
@ 2019-11-11  4:43     ` Danh Doan
  2019-11-11  6:14     ` Junio C Hamano
  1 sibling, 0 replies; 89+ messages in thread
From: Danh Doan @ 2019-11-11  4:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On 2019-11-11 13:02:37 +0900, Junio C Hamano wrote:
> Doan Tran Cong Danh <congdanhqx@gmail.com> writes:
> 
> >  t/t3433-rebase-i18n.sh           |  84 +++++++++++++++++++++++++++++++
> >  t/t3433/ISO8859-1.txt            | Bin 0 -> 15 bytes
> >  t/t3433/eucJP.txt                | Bin 0 -> 68 bytes
> >  t/t3900-i18n-commit.sh           |  37 ++++++++++++++
> >  7 files changed, 195 insertions(+), 11 deletions(-)
> >  create mode 100755 t/t3433-rebase-i18n.sh
> 
> Doesn't "make test" barf with test-lint failure with this series
> merged to 'pu'?
> 
> In other words, isn't 3433 already taken by another series?  

It's taken in pu but not master.

I branched out from master and I forgot to check pu hence I hadn't
noticed this barf earlier.

Should I send a re-roll to rename to t3434?

-- 
Danh

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v6 0/9] sequencer: handle other encoding better
  2019-10-31  9:26 [PATCH 0/3] Linux with musl libc improvement Doan Tran Cong Danh
                   ` (6 preceding siblings ...)
  2019-11-08  9:43 ` [PATCH v5 0/9] Improve odd encoding integration Doan Tran Cong Danh
@ 2019-11-11  6:03 ` Doan Tran Cong Danh
  2019-11-11  6:03   ` [PATCH v6 1/9] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
                     ` (9 more replies)
  7 siblings, 10 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-11  6:03 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, Doan Tran Cong Danh

Change from v5:
- claim t3434 instead of t3433

Doan Tran Cong Danh (9):
  t0028: eliminate non-standard usage of printf
  configure.ac: define ICONV_OMITS_BOM if necessary
  t3900: demonstrate git-rebase problem with multi encoding
  sequencer: reencode to utf-8 before arrange rebase's todo list
  sequencer: reencode revert/cherry-pick's todo list
  sequencer: reencode squashing commit's message
  sequencer: reencode old merge-commit message
  sequencer: reencode commit message for am/rebase --show-current-patch
  sequencer: fallback to sane label in making rebase todo list

 configure.ac                     |  49 ++++++++++++++++++
 sequencer.c                      |  32 ++++++++----
 t/t0028-working-tree-encoding.sh |   4 +-
 t/t3434-rebase-i18n.sh           |  84 +++++++++++++++++++++++++++++++
 t/t3434/ISO8859-1.txt            | Bin 0 -> 15 bytes
 t/t3434/eucJP.txt                | Bin 0 -> 68 bytes
 t/t3900-i18n-commit.sh           |  37 ++++++++++++++
 7 files changed, 195 insertions(+), 11 deletions(-)
 create mode 100755 t/t3434-rebase-i18n.sh
 create mode 100644 t/t3434/ISO8859-1.txt
 create mode 100644 t/t3434/eucJP.txt

Range-diff against v5:
 1:  b3d6c4e720 =  1:  9f83d4533b t0028: eliminate non-standard usage of printf
 2:  fe63a6bc44 =  2:  a9adb3d061 configure.ac: define ICONV_OMITS_BOM if necessary
 3:  30f15075c4 =  3:  c41046e717 t3900: demonstrate git-rebase problem with multi encoding
 4:  17165b81e5 =  4:  0ab92e7999 sequencer: reencode to utf-8 before arrange rebase's todo list
 5:  40fa759492 =  5:  23f9de4527 sequencer: reencode revert/cherry-pick's todo list
 6:  ed6cfab5d2 =  6:  f04a9d1698 sequencer: reencode squashing commit's message
 7:  def9adf97e !  7:  4dfdd4b83e sequencer: reencode old merge-commit message
    @@ sequencer.c: static int do_merge(struct repository *r,
      		int len;
      
     
    - ## t/t3433-rebase-i18n.sh (new) ##
    + ## t/t3434-rebase-i18n.sh (new) ##
     @@
     +#!/bin/sh
     +#
    @@ t/t3433-rebase-i18n.sh (new)
     +. ./test-lib.sh
     +
     +compare_msg () {
    -+	iconv -f "$2" -t "$3" "$TEST_DIRECTORY/t3433/$1" >expect &&
    ++	iconv -f "$2" -t "$3" "$TEST_DIRECTORY/t3434/$1" >expect &&
     +	git cat-file commit HEAD >raw &&
     +	sed "1,/^$/d" raw >actual &&
     +	test_cmp expect actual
    @@ t/t3433-rebase-i18n.sh (new)
     +test_expect_success 'rebase --rebase-merges update encoding eucJP to UTF-8' '
     +	git switch -c merge-eucJP-UTF-8 first &&
     +	git config i18n.commitencoding eucJP &&
    -+	git merge -F "$TEST_DIRECTORY/t3433/eucJP.txt" second &&
    ++	git merge -F "$TEST_DIRECTORY/t3434/eucJP.txt" second &&
     +	git config i18n.commitencoding UTF-8 &&
     +	git rebase --rebase-merges master &&
     +	compare_msg eucJP.txt eucJP UTF-8
    @@ t/t3433-rebase-i18n.sh (new)
     +test_expect_failure 'rebase --rebase-merges update encoding eucJP to ISO-2022-JP' '
     +	git switch -c merge-eucJP-ISO-2022-JP first &&
     +	git config i18n.commitencoding eucJP &&
    -+	git merge -F "$TEST_DIRECTORY/t3433/eucJP.txt" second &&
    ++	git merge -F "$TEST_DIRECTORY/t3434/eucJP.txt" second &&
     +	git config i18n.commitencoding ISO-2022-JP &&
     +	git rebase --rebase-merges master &&
     +	compare_msg eucJP.txt eucJP ISO-2022-JP
    @@ t/t3433-rebase-i18n.sh (new)
     +
     +test_done
     
    - ## t/t3433/eucJP.txt (new) ##
    - Binary files /dev/null and t/t3433/eucJP.txt differ
    + ## t/t3434/eucJP.txt (new) ##
    + Binary files /dev/null and t/t3434/eucJP.txt differ
 8:  2e95ca57d2 !  8:  28e82d6394 sequencer: reencode commit message for am/rebase --show-current-patch
    @@ sequencer.c: static int make_patch(struct repository *r,
      		res |= write_message(subject, strlen(subject), buf.buf, 1);
      		unuse_commit_buffer(commit, commit_buffer);
     
    - ## t/t3433-rebase-i18n.sh ##
    -@@ t/t3433-rebase-i18n.sh: test_expect_failure 'rebase --rebase-merges update encoding eucJP to ISO-2022-JP
    + ## t/t3434-rebase-i18n.sh ##
    +@@ t/t3434-rebase-i18n.sh: test_expect_failure 'rebase --rebase-merges update encoding eucJP to ISO-2022-JP
      	compare_msg eucJP.txt eucJP ISO-2022-JP
      '
      
    @@ t/t3433-rebase-i18n.sh: test_expect_failure 'rebase --rebase-merges update encod
     +		echo for-conflict >two.t &&
     +		git add two.t &&
     +		git config i18n.commitencoding $old &&
    -+		git commit -F "$TEST_DIRECTORY/t3433/$msgfile" &&
    ++		git commit -F "$TEST_DIRECTORY/t3434/$msgfile" &&
     +		git config i18n.commitencoding $new &&
     +		test_must_fail git rebase -m master &&
     +		test -f .git/rebase-merge/message &&
    @@ t/t3433-rebase-i18n.sh: test_expect_failure 'rebase --rebase-merges update encod
     +
      test_done
     
    - ## t/t3433/ISO8859-1.txt (new) ##
    - Binary files /dev/null and t/t3433/ISO8859-1.txt differ
    + ## t/t3434/ISO8859-1.txt (new) ##
    + Binary files /dev/null and t/t3434/ISO8859-1.txt differ
 9:  860dee65f4 !  9:  78daf050de sequencer: fallback to sane label in making rebase todo list
    @@ sequencer.c: static int make_script_with_merges(struct pretty_print_context *pp,
      		/* label the tips of merged branches */
      		for (; to_merge; to_merge = to_merge->next) {
     
    - ## t/t3433-rebase-i18n.sh ##
    -@@ t/t3433-rebase-i18n.sh: test_expect_success 'rebase --rebase-merges update encoding eucJP to UTF-8' '
    + ## t/t3434-rebase-i18n.sh ##
    +@@ t/t3434-rebase-i18n.sh: test_expect_success 'rebase --rebase-merges update encoding eucJP to UTF-8' '
      	compare_msg eucJP.txt eucJP UTF-8
      '
      
    @@ t/t3433-rebase-i18n.sh: test_expect_success 'rebase --rebase-merges update encod
     +test_expect_success 'rebase --rebase-merges update encoding eucJP to ISO-2022-JP' '
      	git switch -c merge-eucJP-ISO-2022-JP first &&
      	git config i18n.commitencoding eucJP &&
    - 	git merge -F "$TEST_DIRECTORY/t3433/eucJP.txt" second &&
    + 	git merge -F "$TEST_DIRECTORY/t3434/eucJP.txt" second &&
-- 
2.24.0.164.g78daf050de.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v6 1/9] t0028: eliminate non-standard usage of printf
  2019-11-11  6:03 ` [PATCH v6 0/9] sequencer: handle other encoding better Doan Tran Cong Danh
@ 2019-11-11  6:03   ` Doan Tran Cong Danh
  2019-11-11  6:03   ` [PATCH v6 2/9] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-11  6:03 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, Doan Tran Cong Danh

man 1p printf:
   In addition to the escape sequences shown in the Base Definitions
   volume of POSIX.1‐2008, Chapter 5, File Format Notation ('\\',
   '\a', '\b', '\f', '\n', '\r', '\t', '\v'), "\ddd", where ddd is a
   one, two, or three-digit octal number, shall be written as a byte
   with the numeric value specified by the octal number.

printf '\xfe\xff' is an extension of some shell.
Dash, a popular yet simple shell, do not implement this extension.

This wasn't caught by most people running the tests, even though
common shells like dash don't handle hex escapes, because their
systems don't trigger the NO_UTF16_BOM prereq. But systems with musl
libc do; when combined with dash, the test fails.

Correct it.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 t/t0028-working-tree-encoding.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 7aa0945d8d..bfc4fb9af5 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -17,7 +17,7 @@ test_lazy_prereq NO_UTF32_BOM '
 write_utf16 () {
 	if test_have_prereq NO_UTF16_BOM
 	then
-		printf '\xfe\xff'
+		printf '\376\377'
 	fi &&
 	iconv -f UTF-8 -t UTF-16
 }
@@ -25,7 +25,7 @@ write_utf16 () {
 write_utf32 () {
 	if test_have_prereq NO_UTF32_BOM
 	then
-		printf '\x00\x00\xfe\xff'
+		printf '\0\0\376\377'
 	fi &&
 	iconv -f UTF-8 -t UTF-32
 }
-- 
2.24.0.164.g78daf050de.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v6 2/9] configure.ac: define ICONV_OMITS_BOM if necessary
  2019-11-11  6:03 ` [PATCH v6 0/9] sequencer: handle other encoding better Doan Tran Cong Danh
  2019-11-11  6:03   ` [PATCH v6 1/9] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
@ 2019-11-11  6:03   ` Doan Tran Cong Danh
  2019-11-11  6:03   ` [PATCH v6 3/9] t3900: demonstrate git-rebase problem with multi encoding Doan Tran Cong Danh
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-11  6:03 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, Doan Tran Cong Danh

From commit 79444c9294, ("utf8: handle systems that don't write BOM for
UTF-16", 2019-02-12), we're supporting those systems with iconv that
omits BOM with:

    make ICONV_OMITS_BOM=Yes

However, configure script wasn't taught to detect those systems.

Teach configure to do so.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 configure.ac | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/configure.ac b/configure.ac
index a43b476402..ecba7e6e51 100644
--- a/configure.ac
+++ b/configure.ac
@@ -844,12 +844,61 @@ AC_MSG_CHECKING([for old iconv()])
 AC_COMPILE_IFELSE([OLDICONVTEST_SRC],
 	[AC_MSG_RESULT([no])],
 	[AC_MSG_RESULT([yes])
+	AC_DEFINE(HAVE_OLD_ICONV, 1)
 	OLD_ICONV=UnfortunatelyYes])
 
 GIT_UNSTASH_FLAGS($ICONVDIR)
 
 GIT_CONF_SUBST([OLD_ICONV])
 
+#
+# Define ICONV_OMITS_BOM if you are on a system which
+# iconv omits bom for utf-{16,32}
+if test -z "$NO_ICONV"; then
+AC_CACHE_CHECK([whether iconv omits bom for utf-16 and utf-32],
+ [ac_cv_iconv_omits_bom],
+[
+old_LIBS="$LIBS"
+if test -n "$NEEDS_LIBICONV"; then
+	LIBS="$LIBS -liconv"
+fi
+
+AC_RUN_IFELSE(
+	[AC_LANG_PROGRAM([AC_INCLUDES_DEFAULT
+	#include <iconv.h>
+	#ifdef HAVE_OLD_ICONV
+	typedef const char *iconv_ibp;
+	#else
+	typedef char *iconv_ibp;
+	#endif
+	],
+	[[
+	int v;
+	iconv_t conv;
+	char in[] = "a"; iconv_ibp pin = in;
+	char out[20] = ""; char *pout = out;
+	size_t isz = sizeof in;
+	size_t osz = sizeof out;
+
+	conv = iconv_open("UTF-16", "UTF-8");
+	iconv(conv, &pin, &isz, &pout, &osz);
+	iconv_close(conv);
+	v = (unsigned char)(out[0]) + (unsigned char)(out[1]);
+	return v != 0xfe + 0xff;
+	]])],
+	[ac_cv_iconv_omits_bom=no],
+	[ac_cv_iconv_omits_bom=yes])
+
+LIBS="$old_LIBS"
+])
+if test "x$ac_cv_iconv_omits_bom" = xyes; then
+	ICONV_OMITS_BOM=Yes
+else
+	ICONV_OMITS_BOM=
+fi
+GIT_CONF_SUBST([ICONV_OMITS_BOM])
+fi
+
 ## Checks for typedefs, structures, and compiler characteristics.
 AC_MSG_NOTICE([CHECKS for typedefs, structures, and compiler characteristics])
 #
-- 
2.24.0.164.g78daf050de.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v6 3/9] t3900: demonstrate git-rebase problem with multi encoding
  2019-11-11  6:03 ` [PATCH v6 0/9] sequencer: handle other encoding better Doan Tran Cong Danh
  2019-11-11  6:03   ` [PATCH v6 1/9] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
  2019-11-11  6:03   ` [PATCH v6 2/9] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
@ 2019-11-11  6:03   ` Doan Tran Cong Danh
  2019-11-11  6:03   ` [PATCH v6 4/9] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-11  6:03 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, Doan Tran Cong Danh

We're using fixup!/squash! <subject> to mark if current commit will be
used to be fixed up or squashed to a previous commit.

However, if we're changing i18n.commitencoding after making the
original commit but before making the fixing up, we couldn't find the
original commit to do the fixup/squash.

Add a test to demonstrate that problem.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 t/t3900-i18n-commit.sh | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index b92ff95977..dd56384b93 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -204,4 +204,33 @@ test_commit_autosquash_flags eucJP fixup
 
 test_commit_autosquash_flags ISO-2022-JP squash
 
+test_commit_autosquash_multi_encoding () {
+	flag=$1
+	old=$2
+	new=$3
+	msg=$4
+	test_expect_failure "commit --$flag into $old from $new" '
+		git checkout -b $flag-$old-$new C0 &&
+		git config i18n.commitencoding $old &&
+		echo $old >>F &&
+		git commit -a -F "$TEST_DIRECTORY"/t3900/$msg &&
+		test_tick &&
+		echo intermediate stuff >>G &&
+		git add G &&
+		git commit -a -m "intermediate commit" &&
+		test_tick &&
+		git config i18n.commitencoding $new &&
+		echo $new-$flag >>F &&
+		git commit -a --$flag HEAD^ &&
+		git rebase --autosquash -i HEAD^^^ &&
+		git rev-list HEAD >actual &&
+		test_line_count = 3 actual
+	'
+}
+
+test_commit_autosquash_multi_encoding fixup UTF-8 ISO-8859-1 1-UTF-8.txt
+test_commit_autosquash_multi_encoding squash ISO-8859-1 UTF-8 ISO8859-1.txt
+test_commit_autosquash_multi_encoding squash eucJP ISO-2022-JP eucJP.txt
+test_commit_autosquash_multi_encoding fixup ISO-2022-JP UTF-8 ISO-2022-JP.txt
+
 test_done
-- 
2.24.0.164.g78daf050de.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v6 4/9] sequencer: reencode to utf-8 before arrange rebase's todo list
  2019-11-11  6:03 ` [PATCH v6 0/9] sequencer: handle other encoding better Doan Tran Cong Danh
                     ` (2 preceding siblings ...)
  2019-11-11  6:03   ` [PATCH v6 3/9] t3900: demonstrate git-rebase problem with multi encoding Doan Tran Cong Danh
@ 2019-11-11  6:03   ` Doan Tran Cong Danh
  2019-11-11  6:03   ` [PATCH v6 5/9] sequencer: reencode revert/cherry-pick's " Doan Tran Cong Danh
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-11  6:03 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, Doan Tran Cong Danh

On musl libc, ISO-2022-JP encoder is too eager to switch back to
1 byte encoding, musl's iconv always switch back after every combining
character. Comparing glibc and musl's output for this command
$ sed q t/t3900/ISO-2022-JP.txt| iconv -f ISO-2022-JP -t utf-8 |
	iconv -f utf-8 -t ISO-2022-JP | xxd

glibc:
00000000: 1b24 4224 4f24 6c24 5224 5b24 551b 2842  .$B$O$l$R$[$U.(B
00000010: 0a                                       .

musl:
00000000: 1b24 4224 4f1b 2842 1b24 4224 6c1b 2842  .$B$O.(B.$B$l.(B
00000010: 1b24 4224 521b 2842 1b24 4224 5b1b 2842  .$B$R.(B.$B$[.(B
00000020: 1b24 4224 551b 2842 0a                   .$B$U.(B.

Although musl iconv's output isn't optimal, it's still correct.

From commit 7d509878b8, ("pretty.c: format string with truncate respects
logOutputEncoding", 2014-05-21), we're encoding the message to utf-8
first, then format it and convert the message to the actual output
encoding on git commit --squash.

Thus, t3900::test_commit_autosquash_flags is failing on musl libc.

Reencode to utf-8 before arranging rebase's todo list.

By doing this, we also remove a breakage noticed by a test added in the
previous commit.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c            | 2 +-
 t/t3900-i18n-commit.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/sequencer.c b/sequencer.c
index 8952cfa89b..05403a9005 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -5167,7 +5167,7 @@ int todo_list_rearrange_squash(struct todo_list *todo_list)
 		*commit_todo_item_at(&commit_todo, item->commit) = item;
 
 		parse_commit(item->commit);
-		commit_buffer = get_commit_buffer(item->commit, NULL);
+		commit_buffer = logmsg_reencode(item->commit, NULL, "UTF-8");
 		find_commit_subject(commit_buffer, &subject);
 		format_subject(&buf, subject, " ");
 		subject = subjects[i] = strbuf_detach(&buf, &subject_len);
diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index dd56384b93..a518281b04 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -209,7 +209,7 @@ test_commit_autosquash_multi_encoding () {
 	old=$2
 	new=$3
 	msg=$4
-	test_expect_failure "commit --$flag into $old from $new" '
+	test_expect_success "commit --$flag into $old from $new" '
 		git checkout -b $flag-$old-$new C0 &&
 		git config i18n.commitencoding $old &&
 		echo $old >>F &&
-- 
2.24.0.164.g78daf050de.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v6 5/9] sequencer: reencode revert/cherry-pick's todo list
  2019-11-11  6:03 ` [PATCH v6 0/9] sequencer: handle other encoding better Doan Tran Cong Danh
                     ` (3 preceding siblings ...)
  2019-11-11  6:03   ` [PATCH v6 4/9] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
@ 2019-11-11  6:03   ` Doan Tran Cong Danh
  2019-11-11  6:03   ` [PATCH v6 6/9] sequencer: reencode squashing commit's message Doan Tran Cong Danh
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-11  6:03 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, Doan Tran Cong Danh

Keep revert/cherry-pick's todo list in line with rebase todo list.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/sequencer.c b/sequencer.c
index 05403a9005..6ab1bba39d 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -2562,14 +2562,17 @@ static int walk_revs_populate_todo(struct todo_list *todo_list,
 	enum todo_command command = opts->action == REPLAY_PICK ?
 		TODO_PICK : TODO_REVERT;
 	const char *command_string = todo_command_info[command].str;
+	const char *encoding;
 	struct commit *commit;
 
 	if (prepare_revs(opts))
 		return -1;
 
+	encoding = get_log_output_encoding();
+
 	while ((commit = get_revision(opts->revs))) {
 		struct todo_item *item = append_new_todo(todo_list);
-		const char *commit_buffer = get_commit_buffer(commit, NULL);
+		const char *commit_buffer = logmsg_reencode(commit, NULL, encoding);
 		const char *subject;
 		int subject_len;
 
-- 
2.24.0.164.g78daf050de.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v6 6/9] sequencer: reencode squashing commit's message
  2019-11-11  6:03 ` [PATCH v6 0/9] sequencer: handle other encoding better Doan Tran Cong Danh
                     ` (4 preceding siblings ...)
  2019-11-11  6:03   ` [PATCH v6 5/9] sequencer: reencode revert/cherry-pick's " Doan Tran Cong Danh
@ 2019-11-11  6:03   ` Doan Tran Cong Danh
  2019-11-11  6:03   ` [PATCH v6 7/9] sequencer: reencode old merge-commit message Doan Tran Cong Danh
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-11  6:03 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, Doan Tran Cong Danh

On fixup/squash-ing rebase, git will create new commit in
i18n.commitencoding, reencode the commit message to that said encode.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c            |  8 +++++---
 t/t3900-i18n-commit.sh | 10 +++++++++-
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/sequencer.c b/sequencer.c
index 6ab1bba39d..b5712e59d9 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -1574,6 +1574,7 @@ static int update_squash_messages(struct repository *r,
 	struct strbuf buf = STRBUF_INIT;
 	int res;
 	const char *message, *body;
+	const char *encoding = get_commit_output_encoding();
 
 	if (opts->current_fixup_count > 0) {
 		struct strbuf header = STRBUF_INIT;
@@ -1600,7 +1601,7 @@ static int update_squash_messages(struct repository *r,
 			return error(_("need a HEAD to fixup"));
 		if (!(head_commit = lookup_commit_reference(r, &head)))
 			return error(_("could not read HEAD"));
-		if (!(head_message = get_commit_buffer(head_commit, NULL)))
+		if (!(head_message = logmsg_reencode(head_commit, NULL, encoding)))
 			return error(_("could not read HEAD's commit message"));
 
 		find_commit_subject(head_message, &body);
@@ -1621,7 +1622,7 @@ static int update_squash_messages(struct repository *r,
 		unuse_commit_buffer(head_commit, head_message);
 	}
 
-	if (!(message = get_commit_buffer(commit, NULL)))
+	if (!(message = logmsg_reencode(commit, NULL, encoding)))
 		return error(_("could not read commit message of %s"),
 			     oid_to_hex(&commit->object.oid));
 	find_commit_subject(message, &body);
@@ -4152,9 +4153,10 @@ static int commit_staged_changes(struct repository *r,
 				 */
 				struct commit *commit;
 				const char *path = rebase_path_squash_msg();
+				const char *encoding = get_commit_output_encoding();
 
 				if (parse_head(r, &commit) ||
-				    !(p = get_commit_buffer(commit, NULL)) ||
+				    !(p = logmsg_reencode(commit, NULL, encoding)) ||
 				    write_message(p, strlen(p), path, 0)) {
 					unuse_commit_buffer(commit, p);
 					return error(_("could not write file: "
diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index a518281b04..d277a9f4b7 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -224,7 +224,15 @@ test_commit_autosquash_multi_encoding () {
 		git commit -a --$flag HEAD^ &&
 		git rebase --autosquash -i HEAD^^^ &&
 		git rev-list HEAD >actual &&
-		test_line_count = 3 actual
+		test_line_count = 3 actual &&
+		iconv -f $old -t UTF-8 "$TEST_DIRECTORY"/t3900/$msg >expect &&
+		if test $flag = squash; then
+			subject="$(head -1 expect)" &&
+			printf "\nsquash! %s\n" "$subject" >>expect
+		fi &&
+		git cat-file commit HEAD^ >raw &&
+		(sed "1,/^$/d" raw | iconv -f $new -t utf-8) >actual &&
+		test_cmp expect actual
 	'
 }
 
-- 
2.24.0.164.g78daf050de.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v6 7/9] sequencer: reencode old merge-commit message
  2019-11-11  6:03 ` [PATCH v6 0/9] sequencer: handle other encoding better Doan Tran Cong Danh
                     ` (5 preceding siblings ...)
  2019-11-11  6:03   ` [PATCH v6 6/9] sequencer: reencode squashing commit's message Doan Tran Cong Danh
@ 2019-11-11  6:03   ` Doan Tran Cong Danh
  2019-11-11  6:03   ` [PATCH v6 8/9] sequencer: reencode commit message for am/rebase --show-current-patch Doan Tran Cong Danh
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-11  6:03 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, Doan Tran Cong Danh

During rebasing, old merge's message (encoded in old encoding)
will be used as message for new merge commit (created by rebase).

In case of the value of i18n.commitencoding has been changed after the
old merge time. We will receive an unusable message for this new merge.

Correct it.

This change also notice a breakage with git-rebase label system.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c            |   3 ++-
 t/t3434-rebase-i18n.sh |  57 +++++++++++++++++++++++++++++++++++++++++
 t/t3434/eucJP.txt      | Bin 0 -> 68 bytes
 3 files changed, 59 insertions(+), 1 deletion(-)
 create mode 100755 t/t3434-rebase-i18n.sh
 create mode 100644 t/t3434/eucJP.txt

diff --git a/sequencer.c b/sequencer.c
index b5712e59d9..18bc4d515d 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -3372,7 +3372,8 @@ static int do_merge(struct repository *r,
 	}
 
 	if (commit) {
-		const char *message = get_commit_buffer(commit, NULL);
+		const char *encoding = get_commit_output_encoding();
+		const char *message = logmsg_reencode(commit, NULL, encoding);
 		const char *body;
 		int len;
 
diff --git a/t/t3434-rebase-i18n.sh b/t/t3434-rebase-i18n.sh
new file mode 100755
index 0000000000..c6c16373eb
--- /dev/null
+++ b/t/t3434-rebase-i18n.sh
@@ -0,0 +1,57 @@
+#!/bin/sh
+#
+# Copyright (c) 2019 Doan Tran Cong Danh
+#
+
+test_description='rebase with changing encoding
+
+Initial setup:
+
+1 - 2              master
+ \
+  3 - 4            first
+   \
+    5 - 6          second
+'
+
+. ./test-lib.sh
+
+compare_msg () {
+	iconv -f "$2" -t "$3" "$TEST_DIRECTORY/t3434/$1" >expect &&
+	git cat-file commit HEAD >raw &&
+	sed "1,/^$/d" raw >actual &&
+	test_cmp expect actual
+}
+
+test_expect_success setup '
+	test_commit one &&
+	git branch first &&
+	test_commit two &&
+	git switch first &&
+	test_commit three &&
+	git branch second &&
+	test_commit four &&
+	git switch second &&
+	test_commit five &&
+	test_commit six
+'
+
+test_expect_success 'rebase --rebase-merges update encoding eucJP to UTF-8' '
+	git switch -c merge-eucJP-UTF-8 first &&
+	git config i18n.commitencoding eucJP &&
+	git merge -F "$TEST_DIRECTORY/t3434/eucJP.txt" second &&
+	git config i18n.commitencoding UTF-8 &&
+	git rebase --rebase-merges master &&
+	compare_msg eucJP.txt eucJP UTF-8
+'
+
+test_expect_failure 'rebase --rebase-merges update encoding eucJP to ISO-2022-JP' '
+	git switch -c merge-eucJP-ISO-2022-JP first &&
+	git config i18n.commitencoding eucJP &&
+	git merge -F "$TEST_DIRECTORY/t3434/eucJP.txt" second &&
+	git config i18n.commitencoding ISO-2022-JP &&
+	git rebase --rebase-merges master &&
+	compare_msg eucJP.txt eucJP ISO-2022-JP
+'
+
+test_done
diff --git a/t/t3434/eucJP.txt b/t/t3434/eucJP.txt
new file mode 100644
index 0000000000000000000000000000000000000000..546f2aac01b67e39d19de601f5586097b34a8325
GIT binary patch
literal 68
zcmZ2-e#x69mzLaKa+Ql~$@V43mMmHFddayZYZfkovW_oY%ys|3$+JKuZ<btN@@mOl
SAboGi<s}<{?6*s90HpysEiSzP

literal 0
HcmV?d00001

-- 
2.24.0.164.g78daf050de.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v6 8/9] sequencer: reencode commit message for am/rebase --show-current-patch
  2019-11-11  6:03 ` [PATCH v6 0/9] sequencer: handle other encoding better Doan Tran Cong Danh
                     ` (6 preceding siblings ...)
  2019-11-11  6:03   ` [PATCH v6 7/9] sequencer: reencode old merge-commit message Doan Tran Cong Danh
@ 2019-11-11  6:03   ` Doan Tran Cong Danh
  2019-11-11  6:03   ` [PATCH v6 9/9] sequencer: fallback to sane label in making rebase todo list Doan Tran Cong Danh
  2019-11-11  8:40   ` [PATCH v6 0/9] sequencer: handle other encoding better Jeff King
  9 siblings, 0 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-11  6:03 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, Doan Tran Cong Danh

The message file will be used as commit message for the
git-{am,rebase} --continue.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c            |   3 ++-
 t/t3434-rebase-i18n.sh |  27 +++++++++++++++++++++++++++
 t/t3434/ISO8859-1.txt  | Bin 0 -> 15 bytes
 3 files changed, 29 insertions(+), 1 deletion(-)
 create mode 100644 t/t3434/ISO8859-1.txt

diff --git a/sequencer.c b/sequencer.c
index 18bc4d515d..2f32b44f52 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -2970,7 +2970,8 @@ static int make_patch(struct repository *r,
 
 	strbuf_addf(&buf, "%s/message", get_dir(opts));
 	if (!file_exists(buf.buf)) {
-		const char *commit_buffer = get_commit_buffer(commit, NULL);
+		const char *encoding = get_commit_output_encoding();
+		const char *commit_buffer = logmsg_reencode(commit, NULL, encoding);
 		find_commit_subject(commit_buffer, &subject);
 		res |= write_message(subject, strlen(subject), buf.buf, 1);
 		unuse_commit_buffer(commit, commit_buffer);
diff --git a/t/t3434-rebase-i18n.sh b/t/t3434-rebase-i18n.sh
index c6c16373eb..4b5b128cd6 100755
--- a/t/t3434-rebase-i18n.sh
+++ b/t/t3434-rebase-i18n.sh
@@ -54,4 +54,31 @@ test_expect_failure 'rebase --rebase-merges update encoding eucJP to ISO-2022-JP
 	compare_msg eucJP.txt eucJP ISO-2022-JP
 '
 
+test_rebase_continue_update_encode () {
+	old=$1
+	new=$2
+	msgfile=$3
+	test_expect_success "rebase --continue update from $old to $new" '
+		(git rebase --abort || : abort current git-rebase failure) &&
+		git switch -c conflict-$old-$new one &&
+		echo for-conflict >two.t &&
+		git add two.t &&
+		git config i18n.commitencoding $old &&
+		git commit -F "$TEST_DIRECTORY/t3434/$msgfile" &&
+		git config i18n.commitencoding $new &&
+		test_must_fail git rebase -m master &&
+		test -f .git/rebase-merge/message &&
+		git stripspace <.git/rebase-merge/message >two.t &&
+		git add two.t &&
+		git rebase --continue &&
+		compare_msg $msgfile $old $new &&
+		: git-commit assume invalid utf-8 is latin1 &&
+		test_cmp expect two.t
+	'
+}
+
+test_rebase_continue_update_encode ISO-8859-1 UTF-8 ISO8859-1.txt
+test_rebase_continue_update_encode eucJP UTF-8 eucJP.txt
+test_rebase_continue_update_encode eucJP ISO-2022-JP eucJP.txt
+
 test_done
diff --git a/t/t3434/ISO8859-1.txt b/t/t3434/ISO8859-1.txt
new file mode 100644
index 0000000000000000000000000000000000000000..7cbef0ee6f430c611134a06a6dd6c12fbea589d5
GIT binary patch
literal 15
XcmX?d`r`R(TwDi}o~OJ>OXmUrN~8!!

literal 0
HcmV?d00001

-- 
2.24.0.164.g78daf050de.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v6 9/9] sequencer: fallback to sane label in making rebase todo list
  2019-11-11  6:03 ` [PATCH v6 0/9] sequencer: handle other encoding better Doan Tran Cong Danh
                     ` (7 preceding siblings ...)
  2019-11-11  6:03   ` [PATCH v6 8/9] sequencer: reencode commit message for am/rebase --show-current-patch Doan Tran Cong Danh
@ 2019-11-11  6:03   ` Doan Tran Cong Danh
  2019-11-11  8:39     ` Jeff King
  2019-11-11 18:26     ` Johannes Schindelin
  2019-11-11  8:40   ` [PATCH v6 0/9] sequencer: handle other encoding better Jeff King
  9 siblings, 2 replies; 89+ messages in thread
From: Doan Tran Cong Danh @ 2019-11-11  6:03 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, Doan Tran Cong Danh

In later stage of rebasing, we will make a ref in
refs/rewritten/<label>, this label is extracted from the subject of
current merge commit.

If that subject has forbidden character for refname, git couldn't create
the rewritten ref. One such case is the merge commit subject has
a multibyte character encoded in ISO-2022-JP.

Provide a sane fallback in order to help git-rebase works in such case

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c            | 11 +++++++++--
 t/t3434-rebase-i18n.sh |  2 +-
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/sequencer.c b/sequencer.c
index 2f32b44f52..f664ba727e 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -4560,6 +4560,7 @@ static int make_script_with_merges(struct pretty_print_context *pp,
 	while ((commit = get_revision(revs))) {
 		struct commit_list *to_merge;
 		const char *p1, *p2;
+		const char *hex_oid;
 		struct object_id *oid;
 		int is_empty;
 
@@ -4607,9 +4608,15 @@ static int make_script_with_merges(struct pretty_print_context *pp,
 			if (isspace(*p1))
 				*(char *)p1 = '-';
 
+		hex_oid = oid_to_hex(&commit->object.oid);
+
+		if (check_refname_format(label.buf, REFNAME_ALLOW_ONELEVEL) < 0) {
+			strbuf_reset(&label);
+			strbuf_addf(&label, "label-%s", hex_oid);
+		}
+
 		strbuf_reset(&buf);
-		strbuf_addf(&buf, "%s -C %s",
-			    cmd_merge, oid_to_hex(&commit->object.oid));
+		strbuf_addf(&buf, "%s -C %s", cmd_merge, hex_oid);
 
 		/* label the tips of merged branches */
 		for (; to_merge; to_merge = to_merge->next) {
diff --git a/t/t3434-rebase-i18n.sh b/t/t3434-rebase-i18n.sh
index 4b5b128cd6..c7c835cde9 100755
--- a/t/t3434-rebase-i18n.sh
+++ b/t/t3434-rebase-i18n.sh
@@ -45,7 +45,7 @@ test_expect_success 'rebase --rebase-merges update encoding eucJP to UTF-8' '
 	compare_msg eucJP.txt eucJP UTF-8
 '
 
-test_expect_failure 'rebase --rebase-merges update encoding eucJP to ISO-2022-JP' '
+test_expect_success 'rebase --rebase-merges update encoding eucJP to ISO-2022-JP' '
 	git switch -c merge-eucJP-ISO-2022-JP first &&
 	git config i18n.commitencoding eucJP &&
 	git merge -F "$TEST_DIRECTORY/t3434/eucJP.txt" second &&
-- 
2.24.0.164.g78daf050de.dirty


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v5 0/9] Improve odd encoding integration
  2019-11-11  4:02   ` Junio C Hamano
  2019-11-11  4:43     ` Danh Doan
@ 2019-11-11  6:14     ` Junio C Hamano
  1 sibling, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2019-11-11  6:14 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> Doan Tran Cong Danh <congdanhqx@gmail.com> writes:
>
>>  t/t3433-rebase-i18n.sh           |  84 +++++++++++++++++++++++++++++++
>>  t/t3433/ISO8859-1.txt            | Bin 0 -> 15 bytes
>>  t/t3433/eucJP.txt                | Bin 0 -> 68 bytes
>>  t/t3900-i18n-commit.sh           |  37 ++++++++++++++
>>  7 files changed, 195 insertions(+), 11 deletions(-)
>>  create mode 100755 t/t3433-rebase-i18n.sh
>
> Doesn't "make test" barf with test-lint failure with this series
> merged to 'pu'?
>
> In other words, isn't 3433 already taken by another series?  

In the meantime, here is what I added to the tip of the series.
Notice that t3434 itself does not have to repeat its own name over
and over this way.

Thanks.

 t/{t3433-rebase-i18n.sh => t3434-rebase-i18n.sh} | 10 ++++++----
 t/{t3433 => t3434}/ISO8859-1.txt                 |  0
 t/{t3433 => t3434}/eucJP.txt                     |  0
 3 files changed, 6 insertions(+), 4 deletions(-)
 rename t/{t3433-rebase-i18n.sh => t3434-rebase-i18n.sh} (89%)
 rename t/{t3433 => t3434}/ISO8859-1.txt (100%)
 rename t/{t3433 => t3434}/eucJP.txt (100%)

diff --git a/t/t3433-rebase-i18n.sh b/t/t3434-rebase-i18n.sh
similarity index 89%
rename from t/t3433-rebase-i18n.sh
rename to t/t3434-rebase-i18n.sh
index 93e5402849..f693a7f4a0 100755
--- a/t/t3433-rebase-i18n.sh
+++ b/t/t3434-rebase-i18n.sh
@@ -16,8 +16,10 @@ Initial setup:
 
 . ./test-lib.sh
 
+test_vector="$TEST_DIRECTORY/t3434"
+
 compare_msg () {
-	iconv -f "$2" -t "$3" "$TEST_DIRECTORY/t3433/$1" >expect &&
+	iconv -f "$2" -t "$3" "$test_vector/$1" >expect &&
 	git cat-file commit HEAD >raw &&
 	sed "1,/^$/d" raw >actual &&
 	test_cmp expect actual
@@ -39,7 +41,7 @@ test_expect_success setup '
 test_expect_success 'rebase --rebase-merges update encoding eucJP to UTF-8' '
 	git switch -c merge-eucJP-UTF-8 first &&
 	git config i18n.commitencoding eucJP &&
-	git merge -F "$TEST_DIRECTORY/t3433/eucJP.txt" second &&
+	git merge -F "$test_vector/eucJP.txt" second &&
 	git config i18n.commitencoding UTF-8 &&
 	git rebase --rebase-merges master &&
 	compare_msg eucJP.txt eucJP UTF-8
@@ -48,7 +50,7 @@ test_expect_success 'rebase --rebase-merges update encoding eucJP to UTF-8' '
 test_expect_success 'rebase --rebase-merges update encoding eucJP to ISO-2022-JP' '
 	git switch -c merge-eucJP-ISO-2022-JP first &&
 	git config i18n.commitencoding eucJP &&
-	git merge -F "$TEST_DIRECTORY/t3433/eucJP.txt" second &&
+	git merge -F "$test_vector/eucJP.txt" second &&
 	git config i18n.commitencoding ISO-2022-JP &&
 	git rebase --rebase-merges master &&
 	compare_msg eucJP.txt eucJP ISO-2022-JP
@@ -64,7 +66,7 @@ test_rebase_continue_update_encode () {
 		echo for-conflict >two.t &&
 		git add two.t &&
 		git config i18n.commitencoding $old &&
-		git commit -F "$TEST_DIRECTORY/t3433/$msgfile" &&
+		git commit -F "$test_vector/$msgfile" &&
 		git config i18n.commitencoding $new &&
 		test_must_fail git rebase -m master &&
 		test -f .git/rebase-merge/message &&
diff --git a/t/t3433/ISO8859-1.txt b/t/t3434/ISO8859-1.txt
similarity index 100%
rename from t/t3433/ISO8859-1.txt
rename to t/t3434/ISO8859-1.txt
diff --git a/t/t3433/eucJP.txt b/t/t3434/eucJP.txt
similarity index 100%
rename from t/t3433/eucJP.txt
rename to t/t3434/eucJP.txt
-- 
2.24.0-309-ga5a95dfdb6


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v4 3/8] t3900: demonstrate git-rebase problem with multi encoding
  2019-11-07 10:51           ` Danh Doan
@ 2019-11-11  8:22             ` Jeff King
  0 siblings, 0 replies; 89+ messages in thread
From: Jeff King @ 2019-11-11  8:22 UTC (permalink / raw)
  To: Danh Doan; +Cc: git

On Thu, Nov 07, 2019 at 05:51:09PM +0700, Danh Doan wrote:

> On 2019-11-07 03:02:18 -0500, Jeff King wrote:
> > >                 git config i18n.commitencoding ISO-2022-JP &&
> > >                 echo ISO-2022-JP >>F &&
> > >                 git commit -a -F "$TEST_DIRECTORY/t3900/ISO-2022-JP.txt" &&
> > 
> > ...you still can't just run this manually because of other lines like
> > this one.
> 
> Except we can with a little effort:
> 
>     export TEST_DIRECTORY=..

Sure, but if you allow setting variables, you could do the same for
"$msg", etc. :)

> > It's also weirdly unlike all of the other tests, which creates confusion
> > for people reading the code. IMHO the tradeoff isn't worth it.
> 
> Hm, I think it's the test_commit_autosquash_flag is the one that is weird
> in this file. Most of other sets of tests (line 89-176) use the same quote.

Yeah, you're right. I did look at the other tests to see if it was an
existing style, but of course that was the exact one I looked at. ;)

IMHO it's still a bad style (and is unlike most of the rest of our
tests), but as it's following the existing style in the file, I can live
with it (and we can think about changing it all as a separate step
later).

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 9/9] sequencer: fallback to sane label in making rebase todo list
  2019-11-11  6:03   ` [PATCH v6 9/9] sequencer: fallback to sane label in making rebase todo list Doan Tran Cong Danh
@ 2019-11-11  8:39     ` Jeff King
  2019-11-11 16:22       ` Phillip Wood
  2019-11-11 18:26     ` Johannes Schindelin
  1 sibling, 1 reply; 89+ messages in thread
From: Jeff King @ 2019-11-11  8:39 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git, gitster

On Mon, Nov 11, 2019 at 01:03:42PM +0700, Doan Tran Cong Danh wrote:

> In later stage of rebasing, we will make a ref in
> refs/rewritten/<label>, this label is extracted from the subject of
> current merge commit.
> 
> If that subject has forbidden character for refname, git couldn't create
> the rewritten ref. One such case is the merge commit subject has
> a multibyte character encoded in ISO-2022-JP.
> 
> Provide a sane fallback in order to help git-rebase works in such case

Good find. Not having worked much with this part of the sequencer code,
I don't have a strong opinion on the fallback label. But it seems better
than the current behavior would be.

I suspect there may be other subtle problems, too, for filesystems that
can't represent some part of the subject (e.g., I wonder if some of your
ISO-2022-JP tests might already have trouble on HFS+, which excepts all
paths to be UTF-8).

> @@ -4607,9 +4608,15 @@ static int make_script_with_merges(struct pretty_print_context *pp,
>  			if (isspace(*p1))
>  				*(char *)p1 = '-';
>  
> +		hex_oid = oid_to_hex(&commit->object.oid);
> +
> +		if (check_refname_format(label.buf, REFNAME_ALLOW_ONELEVEL) < 0) {
> +			strbuf_reset(&label);
> +			strbuf_addf(&label, "label-%s", hex_oid);
> +		}
> +
>  		strbuf_reset(&buf);
> -		strbuf_addf(&buf, "%s -C %s",
> -			    cmd_merge, oid_to_hex(&commit->object.oid));
> +		strbuf_addf(&buf, "%s -C %s", cmd_merge, hex_oid);

A minor nit, but I think it's better here to just call oid_to_hex()
twice, rather than assigning to the hex_oid variable. It's returning a
pointer to a static buffer, so holding onto the pointers for too long is
dangerous. What you have here is definitely OK, because nothing
interesting happens in between, but seeing any assignment of the result
of "oid_to_hex" makes auditing harder.

And it's not like it's that expensive, or that this is
performance-critical code (and if it were, we'd do better to use
oid_to_hex_r() directly into the strbuf anyway).

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 0/9] sequencer: handle other encoding better
  2019-11-11  6:03 ` [PATCH v6 0/9] sequencer: handle other encoding better Doan Tran Cong Danh
                     ` (8 preceding siblings ...)
  2019-11-11  6:03   ` [PATCH v6 9/9] sequencer: fallback to sane label in making rebase todo list Doan Tran Cong Danh
@ 2019-11-11  8:40   ` Jeff King
  9 siblings, 0 replies; 89+ messages in thread
From: Jeff King @ 2019-11-11  8:40 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git, gitster

On Mon, Nov 11, 2019 at 01:03:33PM +0700, Doan Tran Cong Danh wrote:

> Change from v5:
> - claim t3434 instead of t3433
> 
> Doan Tran Cong Danh (9):
>   t0028: eliminate non-standard usage of printf
>   configure.ac: define ICONV_OMITS_BOM if necessary
>   t3900: demonstrate git-rebase problem with multi encoding
>   sequencer: reencode to utf-8 before arrange rebase's todo list
>   sequencer: reencode revert/cherry-pick's todo list
>   sequencer: reencode squashing commit's message
>   sequencer: reencode old merge-commit message
>   sequencer: reencode commit message for am/rebase --show-current-patch
>   sequencer: fallback to sane label in making rebase todo list

Thanks, these all look pretty sensible to me. I left a few small
comments on patch 9.

(I see you took my suggestion regarding the test quoting, too. Thanks!)

-Peff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 9/9] sequencer: fallback to sane label in making rebase todo list
  2019-11-11  8:39     ` Jeff King
@ 2019-11-11 16:22       ` Phillip Wood
  0 siblings, 0 replies; 89+ messages in thread
From: Phillip Wood @ 2019-11-11 16:22 UTC (permalink / raw)
  To: Jeff King, Doan Tran Cong Danh; +Cc: git, gitster

Hi Doan & Peff

On 11/11/2019 08:39, Jeff King wrote:
> On Mon, Nov 11, 2019 at 01:03:42PM +0700, Doan Tran Cong Danh wrote:
> 
>> In later stage of rebasing, we will make a ref in
>> refs/rewritten/<label>, this label is extracted from the subject of
>> current merge commit.
>>
>> If that subject has forbidden character for refname, git couldn't create
>> the rewritten ref. One such case is the merge commit subject has
>> a multibyte character encoded in ISO-2022-JP.
>>
>> Provide a sane fallback in order to help git-rebase works in such case
> 
> Good find. Not having worked much with this part of the sequencer code,
> I don't have a strong opinion on the fallback label. But it seems better
> than the current behavior would be.

There's been some discussion about sanitizing the subject recently 
[1,2]. It would be good to co-ordinate this effort with that one

Best Wishes

Phillip

[1] 
https://public-inbox.org/git/4a02c38442dd8a4c0381adc8db0dce81c253da09.1567432900.git.gitgitgadget@gmail.com
[2] 
https://public-inbox.org/git/nycvar.QRO.7.76.6.1910251508100.46@tvgsbejvaqbjf.bet

> I suspect there may be other subtle problems, too, for filesystems that
> can't represent some part of the subject (e.g., I wonder if some of your
> ISO-2022-JP tests might already have trouble on HFS+, which excepts all
> paths to be UTF-8).
> 
>> @@ -4607,9 +4608,15 @@ static int make_script_with_merges(struct pretty_print_context *pp,
>>   			if (isspace(*p1))
>>   				*(char *)p1 = '-';
>>   
>> +		hex_oid = oid_to_hex(&commit->object.oid);
>> +
>> +		if (check_refname_format(label.buf, REFNAME_ALLOW_ONELEVEL) < 0) {
>> +			strbuf_reset(&label);
>> +			strbuf_addf(&label, "label-%s", hex_oid);
>> +		}
>> +
>>   		strbuf_reset(&buf);
>> -		strbuf_addf(&buf, "%s -C %s",
>> -			    cmd_merge, oid_to_hex(&commit->object.oid));
>> +		strbuf_addf(&buf, "%s -C %s", cmd_merge, hex_oid);
> 
> A minor nit, but I think it's better here to just call oid_to_hex()
> twice, rather than assigning to the hex_oid variable. It's returning a
> pointer to a static buffer, so holding onto the pointers for too long is
> dangerous. What you have here is definitely OK, because nothing
> interesting happens in between, but seeing any assignment of the result
> of "oid_to_hex" makes auditing harder.
> 
> And it's not like it's that expensive, or that this is
> performance-critical code (and if it were, we'd do better to use
> oid_to_hex_r() directly into the strbuf anyway).
> 
> -Peff
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 9/9] sequencer: fallback to sane label in making rebase todo list
  2019-11-11  6:03   ` [PATCH v6 9/9] sequencer: fallback to sane label in making rebase todo list Doan Tran Cong Danh
  2019-11-11  8:39     ` Jeff King
@ 2019-11-11 18:26     ` Johannes Schindelin
  2019-11-12  4:17       ` Junio C Hamano
  1 sibling, 1 reply; 89+ messages in thread
From: Johannes Schindelin @ 2019-11-11 18:26 UTC (permalink / raw)
  To: Doan Tran Cong Danh; +Cc: git, peff, gitster

Hi,

On Mon, 11 Nov 2019, Doan Tran Cong Danh wrote:

> In later stage of rebasing, we will make a ref in
> refs/rewritten/<label>, this label is extracted from the subject of
> current merge commit.
>
> If that subject has forbidden character for refname, git couldn't create
> the rewritten ref. One such case is the merge commit subject has
> a multibyte character encoded in ISO-2022-JP.
>
> Provide a sane fallback in order to help git-rebase works in such case
>
> Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
> ---

There was an alternative patch under discussion, one that also addressed
problem like e.g. characters that are illegal in Windows file names:
https://github.com/gitgitgadget/git/pull/327

Unfortunately, the reviews asked for an extensive revision for which I have
not been able to find the time to implement them.

Having said that, I think that the solution proposed over there is more
complete, and maybe even more robust (nothing in your patch prevents
`label-<hex-oid>` to _already_ having been taken by another label).

Ciao,
Dscho

>  sequencer.c            | 11 +++++++++--
>  t/t3434-rebase-i18n.sh |  2 +-
>  2 files changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/sequencer.c b/sequencer.c
> index 2f32b44f52..f664ba727e 100644
> --- a/sequencer.c
> +++ b/sequencer.c
> @@ -4560,6 +4560,7 @@ static int make_script_with_merges(struct pretty_print_context *pp,
>  	while ((commit = get_revision(revs))) {
>  		struct commit_list *to_merge;
>  		const char *p1, *p2;
> +		const char *hex_oid;
>  		struct object_id *oid;
>  		int is_empty;
>
> @@ -4607,9 +4608,15 @@ static int make_script_with_merges(struct pretty_print_context *pp,
>  			if (isspace(*p1))
>  				*(char *)p1 = '-';
>
> +		hex_oid = oid_to_hex(&commit->object.oid);
> +
> +		if (check_refname_format(label.buf, REFNAME_ALLOW_ONELEVEL) < 0) {
> +			strbuf_reset(&label);
> +			strbuf_addf(&label, "label-%s", hex_oid);
> +		}
> +
>  		strbuf_reset(&buf);
> -		strbuf_addf(&buf, "%s -C %s",
> -			    cmd_merge, oid_to_hex(&commit->object.oid));
> +		strbuf_addf(&buf, "%s -C %s", cmd_merge, hex_oid);
>
>  		/* label the tips of merged branches */
>  		for (; to_merge; to_merge = to_merge->next) {
> diff --git a/t/t3434-rebase-i18n.sh b/t/t3434-rebase-i18n.sh
> index 4b5b128cd6..c7c835cde9 100755
> --- a/t/t3434-rebase-i18n.sh
> +++ b/t/t3434-rebase-i18n.sh
> @@ -45,7 +45,7 @@ test_expect_success 'rebase --rebase-merges update encoding eucJP to UTF-8' '
>  	compare_msg eucJP.txt eucJP UTF-8
>  '
>
> -test_expect_failure 'rebase --rebase-merges update encoding eucJP to ISO-2022-JP' '
> +test_expect_success 'rebase --rebase-merges update encoding eucJP to ISO-2022-JP' '
>  	git switch -c merge-eucJP-ISO-2022-JP first &&
>  	git config i18n.commitencoding eucJP &&
>  	git merge -F "$TEST_DIRECTORY/t3434/eucJP.txt" second &&
> --
> 2.24.0.164.g78daf050de.dirty
>
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v6 9/9] sequencer: fallback to sane label in making rebase todo list
  2019-11-11 18:26     ` Johannes Schindelin
@ 2019-11-12  4:17       ` Junio C Hamano
  0 siblings, 0 replies; 89+ messages in thread
From: Junio C Hamano @ 2019-11-12  4:17 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Doan Tran Cong Danh, git, peff

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Having said that, I think that the solution proposed over there is more
> complete, and maybe even more robust (nothing in your patch prevents
> `label-<hex-oid>` to _already_ having been taken by another label).

Good thinking.  Let me drop this last one from the series in the
meantime.

Thanks.

^ permalink raw reply	[flat|nested] 89+ messages in thread

end of thread, other threads:[~2019-11-12  4:17 UTC | newest]

Thread overview: 89+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-31  9:26 [PATCH 0/3] Linux with musl libc improvement Doan Tran Cong Danh
2019-10-31  9:26 ` [PATCH 1/3] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
2019-10-31 17:41   ` Jeff King
2019-11-01  1:33     ` Danh Doan
2019-10-31 19:50   ` brian m. carlson
2019-10-31  9:26 ` [PATCH 2/3] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
2019-10-31 18:11   ` Jeff King
2019-10-31 20:02     ` brian m. carlson
2019-11-01  1:40     ` Danh Doan
2019-10-31  9:26 ` [PATCH 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
2019-10-31 10:38   ` Johannes Schindelin
2019-10-31 19:26     ` Jeff King
2019-11-01  4:49       ` Danh Doan
2019-11-01  8:25 ` [PATCH v2 0/3] Linux with musl libc improvement Doan Tran Cong Danh
2019-11-01  8:25   ` [PATCH v2 1/3] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
2019-11-01 16:54     ` Jeff King
2019-11-01  8:25   ` [PATCH v2 2/3] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
2019-11-01 16:56     ` Jeff King
2019-11-02  0:43       ` Danh Doan
2019-11-01  8:25   ` [PATCH v2 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
2019-11-01 16:59     ` Jeff King
2019-11-02  1:02       ` Danh Doan
2019-11-02 12:20         ` Danh Doan
2019-11-05  8:00         ` Jeff King
2019-11-06  1:30           ` Junio C Hamano
2019-11-06  4:03             ` Jeff King
2019-11-06 10:03               ` Danh Doan
2019-11-07  5:56                 ` Jeff King
2019-11-06  9:19 ` [PATCH v3 0/8] Correct internal working and output encoding Doan Tran Cong Danh
2019-11-06  9:19   ` [PATCH v3 1/8] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
2019-11-06  9:20   ` [PATCH v3 2/8] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
2019-11-06  9:20   ` [PATCH v3 3/8] t3900: demonstrate git-rebase problem with multi encoding Doan Tran Cong Danh
2019-11-06  9:20   ` [PATCH v3 4/8] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
2019-11-06  9:20   ` [PATCH v3 5/8] sequencer: reencode revert/cherry-pick's " Doan Tran Cong Danh
2019-11-06  9:20   ` [PATCH v3 6/8] sequencer: reencode squashing commit's message Doan Tran Cong Danh
2019-11-06  9:20   ` [PATCH v3 7/8] sequencer: reencode old merge-commit message Doan Tran Cong Danh
2019-11-06 15:39     ` Eric Sunshine
2019-11-06  9:20   ` [PATCH v3 8/8] sequencer: reencode commit message for am/rebase --show-current-patch Doan Tran Cong Danh
2019-11-07  2:56 ` [PATCH v4 0/8] Correct internal working and output encoding Doan Tran Cong Danh
2019-11-07  2:56   ` [PATCH v4 1/8] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
2019-11-07  2:56   ` [PATCH v4 2/8] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
2019-11-07  6:18     ` Junio C Hamano
2019-11-07  2:56   ` [PATCH v4 3/8] t3900: demonstrate git-rebase problem with multi encoding Doan Tran Cong Danh
2019-11-07  6:02     ` Jeff King
2019-11-07  6:48       ` Danh Doan
2019-11-07  8:02         ` Jeff King
2019-11-07 10:51           ` Danh Doan
2019-11-11  8:22             ` Jeff King
2019-11-07  2:56   ` [PATCH v4 4/8] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
2019-11-07  6:04     ` Jeff King
2019-11-07  2:56   ` [PATCH v4 5/8] sequencer: reencode revert/cherry-pick's " Doan Tran Cong Danh
2019-11-07  6:06     ` Jeff King
2019-11-07  2:56   ` [PATCH v4 6/8] sequencer: reencode squashing commit's message Doan Tran Cong Danh
2019-11-07  6:15     ` Jeff King
2019-11-07  2:56   ` [PATCH v4 7/8] sequencer: reencode old merge-commit message Doan Tran Cong Danh
2019-11-07  2:56   ` [PATCH v4 8/8] sequencer: reencode commit message for am/rebase --show-current-patch Doan Tran Cong Danh
2019-11-07  6:32     ` Jeff King
2019-11-07  7:48       ` Danh Doan
2019-11-07  8:03         ` Jeff King
2019-11-07 16:32           ` Danh Doan
2019-11-08  9:43 ` [PATCH v5 0/9] Improve odd encoding integration Doan Tran Cong Danh
2019-11-08  9:43   ` [PATCH v5 1/9] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
2019-11-08  9:43   ` [PATCH v5 2/9] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
2019-11-08  9:43   ` [PATCH v5 3/9] t3900: demonstrate git-rebase problem with multi encoding Doan Tran Cong Danh
2019-11-08  9:43   ` [PATCH v5 4/9] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
2019-11-08  9:43   ` [PATCH v5 5/9] sequencer: reencode revert/cherry-pick's " Doan Tran Cong Danh
2019-11-08  9:43   ` [PATCH v5 6/9] sequencer: reencode squashing commit's message Doan Tran Cong Danh
2019-11-08  9:43   ` [PATCH v5 7/9] sequencer: reencode old merge-commit message Doan Tran Cong Danh
2019-11-08  9:43   ` [PATCH v5 8/9] sequencer: reencode commit message for am/rebase --show-current-patch Doan Tran Cong Danh
2019-11-08  9:43   ` [PATCH v5 9/9] sequencer: fallback to sane label in making rebase todo list Doan Tran Cong Danh
2019-11-11  1:22   ` [PATCH v5 0/9] Improve odd encoding integration Junio C Hamano
2019-11-11  4:02   ` Junio C Hamano
2019-11-11  4:43     ` Danh Doan
2019-11-11  6:14     ` Junio C Hamano
2019-11-11  6:03 ` [PATCH v6 0/9] sequencer: handle other encoding better Doan Tran Cong Danh
2019-11-11  6:03   ` [PATCH v6 1/9] t0028: eliminate non-standard usage of printf Doan Tran Cong Danh
2019-11-11  6:03   ` [PATCH v6 2/9] configure.ac: define ICONV_OMITS_BOM if necessary Doan Tran Cong Danh
2019-11-11  6:03   ` [PATCH v6 3/9] t3900: demonstrate git-rebase problem with multi encoding Doan Tran Cong Danh
2019-11-11  6:03   ` [PATCH v6 4/9] sequencer: reencode to utf-8 before arrange rebase's todo list Doan Tran Cong Danh
2019-11-11  6:03   ` [PATCH v6 5/9] sequencer: reencode revert/cherry-pick's " Doan Tran Cong Danh
2019-11-11  6:03   ` [PATCH v6 6/9] sequencer: reencode squashing commit's message Doan Tran Cong Danh
2019-11-11  6:03   ` [PATCH v6 7/9] sequencer: reencode old merge-commit message Doan Tran Cong Danh
2019-11-11  6:03   ` [PATCH v6 8/9] sequencer: reencode commit message for am/rebase --show-current-patch Doan Tran Cong Danh
2019-11-11  6:03   ` [PATCH v6 9/9] sequencer: fallback to sane label in making rebase todo list Doan Tran Cong Danh
2019-11-11  8:39     ` Jeff King
2019-11-11 16:22       ` Phillip Wood
2019-11-11 18:26     ` Johannes Schindelin
2019-11-12  4:17       ` Junio C Hamano
2019-11-11  8:40   ` [PATCH v6 0/9] sequencer: handle other encoding better Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).