From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: "Đoàn Trần Công Danh" <congdanhqx@gmail.com>
Cc: Matheus Tavares <matheus.bernardino@usp.br>,
	gitster@pobox.com, git@vger.kernel.org,
	"brian m . carlson" <sandals@crustytoothpaste.net>
Subject: Re: [PATCH] t2080: fix cp invocation to copy symlinks instead of following them
Date: Wed, 02 Jun 2021 15:36:57 +0200
Message-ID: <87mts875d3.fsf@evledraar.gmail.com>
In-Reply-To: <YLdqDn9vCBc7sPDN@danh.dev>


On Wed, Jun 02 2021, Đoàn Trần Công Danh wrote:

> On 2021-06-02 12:50:53+0200, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>> 
>> On Wed, Jun 02 2021, Đoàn Trần Công Danh wrote:
>> 
>> > On 2021-05-31 16:01:01+0200, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>> >> 
>> >> On Thu, May 27 2021, Ævar Arnfjörð Bjarmason wrote:
>> >> 
>> >> > On Wed, May 26 2021, Matheus Tavares wrote:
>> >> >
>> >> >> t2080 makes a few copies of a test repository and later performs a
>> >> >> branch switch on each one of the copies to verify that parallel checkout
>> >> >> and sequential checkout produce the same results. However, the
>> >> >> repository is copied with `cp -R` which, on some systems, defaults to
>> >> >> following symlinks on the directory hierarchy and copying their target
>> >> >> files instead of copying the symlinks themselves. AIX is one example of
>> >> >> a system where this happens. Because the symlinks are not preserved, the
>> >> >> copied repositories have paths that do not match what is in the index,
>> >> >> causing git to abort the checkout operation that we want to test. This
>> >> >> makes the test fail on these systems.
>> >> >>
>> >> >> Fix this by copying the repository with the POSIX flag '-P', which
>> >> >> forces cp to copy the symlinks instead of following them. Note that we
>> >> >> already use this flag for other cp invocations in our test suite (see
>> >> >> t7001). With this change, t2080 now passes on AIX.
>> >> >>
>> >> >> Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>> >> >> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
>> >> >> ---
>> >> >>  t/t2080-parallel-checkout-basics.sh | 2 +-
>> >> >>  1 file changed, 1 insertion(+), 1 deletion(-)
>> >> >>
>> >> >> diff --git a/t/t2080-parallel-checkout-basics.sh b/t/t2080-parallel-checkout-basics.sh
>> >> >> index 7087818550..3e0f8c675f 100755
>> >> >> --- a/t/t2080-parallel-checkout-basics.sh
>> >> >> +++ b/t/t2080-parallel-checkout-basics.sh
>> >> >> @@ -114,7 +114,7 @@ do
>> >> >>  
>> >> >>  	test_expect_success "$mode checkout" '
>> >> >>  		repo=various_$mode &&
>> >> >> -		cp -R various $repo &&
>> >> >> +		cp -R -P various $repo &&
>> >> >>  
>> >> >>  		# The just copied files have more recent timestamps than their
>> >> >>  		# associated index entries. So refresh the cached timestamps
>> >> >
>> >> > Thanks for the quick fix, I can confirm that this makes the test pass on
>> >> > AIX 7.2.
>> >> 
>> >> There's still a failure[1] in t2082-parallel-checkout-attributes.sh
>> >> though, which is new in 2.32.0-rc*. The difference is in an unexpected
>> >> BOM:
>> >>     
>> >>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/A.internal 
>> >>     efbbbf74657874
>> >>     avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/utf8-text  
>> >>     74657874
>> >> 
>> >> I.e. the A.internal starts with 0xefbbbf. The 2nd test of t0028*.sh also
>> >> fails similarly[2], so perhaps it's some old/iconv/whatever issue not
>> >> per-se related to any change of yours.
>> >
>> > The 0xefbbbf looks interesting, it's BOM for utf-8.
>> >
>> >> I tried compiling with both NO_ICONV=Y and ICONV_OMITS_BOM=Y, both have
>> >> the same failure.
>> >
>> > I didn't check the code-path for NO_ICONV=Y but ICONV_OMITS_BOM=Y only
>> > affects output of converting *to* utf-16 and utf-32.
>> >
>> > So, I think the AIX iconv implementation automatically adds a BOM to utf-8?
>> >
>> > Perhaps we need to call skip_utf8_bom somewhere?
>> 
>> I debugged this a bit more, it's probably *also* an issue in our use of
>> libiconv, but it goes wrong just with our test setup with
>> iconv(1). I.e. on my boring linux box:
>>     
>>     echo x | iconv -f UTF-8 -t UTF-16 | perl -0777 -MData::Dumper -ne 'my @a = map { sprintf "0x%x", $_ } unpack "C*"; print Dumper \@a'
>>     $VAR1 = [
>>               '0xff',
>>               '0xfe',
>>               '0x78',
>>               '0x0',
>>               '0xa',
>>               '0x0'
>>             ];
>> 
>> 
>> On the AIX box to get the same I need to do that as:
>> 
>>     (printf '\376\377'; echo x | iconv -f UTF-8 -t UTF-16LE) | [...]
>
> FWIW, on my Linux box with musl-libc it also needs to be done like this.
>
>> I.e. we omit the BOM *and* AIX's idea of our UTF-16 is little-endian
>> UTF-16, a plain UTF-16 gives you the big-endian version.
>
> Per spec, plain UTF-16 *is* big-endian. [1]
>
> 	In the table <BOM> indicates that the byte order is determined
> 	by a byte order mark, if present at the beginning of the data
> 	stream, otherwise it is big-endian.
>
>> To make things
>> worse the same is true of UTF-32, except "iconv -l" lists no UTF-32LE
>> version. So it seems we can't get the same result at all for that one.
>
> Ditto for UTF-32
>
>> So from the outset the code added around 79444c92943 (utf8: handle
>> systems that don't write BOM for UTF-16, 2019-02-12) needs to be more
>> careful (although this looked broken before), i.e. we should test exact
>> known-good bytes and see if UTF-16 is really what we think it is,
>> etc. This is likely broken on any big-endian non-GNUish iconv
>> implementation.
>
> Linux with musl-libc on little endian also thinks UTF-16 without a BOM is UTF-16-BE.
>
> I still think we should strip the UTF-8 BOM after reencode_string_len,
> i.e. something like the patch below. I can't test it, though, since I
> don't have an AIX box, and my Linux with musl-libc doesn't output a BOM
> for utf-8. It doesn't write a BOM for utf-16be and utf-32be, anyway.
>
> -----8<----
> diff --git a/utf8.c b/utf8.c
> index de4ce5c0e6..73631632bd 100644
> --- a/utf8.c
> +++ b/utf8.c
> @@ -8,6 +8,7 @@ static const char utf16_be_bom[] = {'\xFE', '\xFF'};
>  static const char utf16_le_bom[] = {'\xFF', '\xFE'};
>  static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
>  static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
> +const char utf8_bom[] = "\357\273\277";
>  
>  struct interval {
>  	ucs_char_t first;
> @@ -28,6 +29,12 @@ size_t display_mode_esc_sequence_len(const char *s)
>  	return p - s;
>  }
>  
> +static int has_utf8_bom(const char *text, size_t len)
> +{
> +	return len >= strlen(utf8_bom) &&
> +		memcmp(text, utf8_bom, strlen(utf8_bom)) == 0;
> +}
> +
>  /* auxiliary function for binary search in interval table */
>  static int bisearch(ucs_char_t ucs, const struct interval *table, int max)
>  {
> @@ -539,12 +546,13 @@ static const char *fallback_encoding(const char *name)
>  
>  char *reencode_string_len(const char *in, size_t insz,
>  			  const char *out_encoding, const char *in_encoding,
> -			  size_t *outsz)
> +			  size_t *outsz_p)
>  {
>  	iconv_t conv;
>  	char *out;
>  	const char *bom_str = NULL;
>  	size_t bom_len = 0;
> +	size_t outsz = 0;
>  
>  	if (!in_encoding)
>  		return NULL;
> @@ -590,10 +598,16 @@ char *reencode_string_len(const char *in, size_t insz,
>  		if (conv == (iconv_t) -1)
>  			return NULL;
>  	}
> -	out = reencode_string_iconv(in, insz, conv, bom_len, outsz);
> +	out = reencode_string_iconv(in, insz, conv, bom_len, &outsz);
>  	iconv_close(conv);
>  	if (out && bom_str && bom_len)
>  		memcpy(out, bom_str, bom_len);
> +	if (is_encoding_utf8(out_encoding) && has_utf8_bom(out, outsz)) {
> +		outsz -= strlen(utf8_bom);
> +		memmove(out, out + strlen(utf8_bom), outsz + 1);
> +	}
> +	if (outsz_p)
> +		*outsz_p = outsz;
>  	return out;
>  }
>  #endif
> @@ -782,12 +796,9 @@ int is_hfs_dotmailmap(const char *path)
>  	return is_hfs_dot_str(path, "mailmap");
>  }
>  
> -const char utf8_bom[] = "\357\273\277";
> -
>  int skip_utf8_bom(char **text, size_t len)
>  {
> -	if (len < strlen(utf8_bom) ||
> -	    memcmp(*text, utf8_bom, strlen(utf8_bom)))
> +	if (!has_utf8_bom(*text, len))
>  		return 0;
>  	*text += strlen(utf8_bom);
>  	return 1;
> ---->8------
>
> 1: https://unicode.org/faq/utf_bom.html
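
To spell out what the stripping step in the quoted patch does to the
output buffer, here is a minimal standalone sketch. The names
sketch_utf8_bom and strip_utf8_bom_inplace are hypothetical and do not
exist in utf8.c; the sketch assumes the buffer is NUL-terminated and
holds len + 1 bytes, as the memmove() in the patch also assumes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The three bytes of U+FEFF encoded as UTF-8 (0xEF 0xBB 0xBF). */
static const char sketch_utf8_bom[] = "\357\273\277";

/*
 * Hypothetical helper, not part of git: if buf starts with a UTF-8
 * BOM, shift the payload and its trailing NUL to the front of the
 * buffer. Returns the new length (unchanged if there was no BOM).
 */
static size_t strip_utf8_bom_inplace(char *buf, size_t len)
{
	const size_t bom_len = sizeof(sketch_utf8_bom) - 1;

	assert(buf != NULL);
	if (len < bom_len || memcmp(buf, sketch_utf8_bom, bom_len) != 0)
		return len;
	len -= bom_len;
	/* len + 1 so that the terminating NUL moves along with the data */
	memmove(buf, buf + bom_len, len + 1);
	return len;
}
```

Called on the bytes of A.internal shown earlier (ef bb bf 74 65 78 74),
this would leave "text" with a length of 4, which is what utf8-text
contains.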

That's getting us there. Now we don't fail on the 2nd test, but we do
start failing on the 3rd one, "re-encode to UTF-16 on checkout", and on
other "checkout" tests.

The "test_cmp" at the end of that 3rd test shows that the difference
between test.utf16.raw and test.utf16 is now that the "raw" one has the
BOM, but the "test.utf16" file does not.
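
For comparing the first bytes of two such files, a small check along
these lines may help. It is only a sketch: bom_kind and utf16_bom_kind
are made-up names, not anything in git, and it only inspects the first
two bytes, so it cannot tell a UTF-16LE BOM from the start of a UTF-32LE
BOM (ff fe 00 00).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of the first two bytes of a buffer. */
enum bom_kind { BOM_NONE, BOM_UTF16_BE, BOM_UTF16_LE };

/*
 * Hypothetical helper, not part of git: report which UTF-16 BOM, if
 * any, the buffer starts with. FE FF is the big-endian BOM and FF FE
 * the little-endian one.
 */
static enum bom_kind utf16_bom_kind(const unsigned char *buf, size_t len)
{
	assert(buf != NULL || len == 0);
	if (len < 2)
		return BOM_NONE;
	if (buf[0] == 0xFE && buf[1] == 0xFF)
		return BOM_UTF16_BE;
	if (buf[0] == 0xFF && buf[1] == 0xFE)
		return BOM_UTF16_LE;
	return BOM_NONE;
}
```

Fed the six bytes from the Linux iconv output quoted above (ff fe 78 00
0a 00) it reports BOM_UTF16_LE; a file without the leading pair, like
the "test.utf16" here, reports BOM_NONE.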
