From: "Đoàn Trần Công Danh" <congdanhqx@gmail.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: Matheus Tavares <matheus.bernardino@usp.br>,
gitster@pobox.com, git@vger.kernel.org,
"brian m . carlson" <sandals@crustytoothpaste.net>
Subject: Re: [PATCH] t2080: fix cp invocation to copy symlinks instead of following them
Date: Wed, 2 Jun 2021 18:22:54 +0700
Message-ID: <YLdqDn9vCBc7sPDN@danh.dev>
In-Reply-To: <87pmx47cs9.fsf@evledraar.gmail.com>
On 2021-06-02 12:50:53+0200, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> On Wed, Jun 02 2021, Đoàn Trần Công Danh wrote:
>
> > On 2021-05-31 16:01:01+0200, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> >>
> >> On Thu, May 27 2021, Ævar Arnfjörð Bjarmason wrote:
> >>
> >> > On Wed, May 26 2021, Matheus Tavares wrote:
> >> >
> >> >> t2080 makes a few copies of a test repository and later performs a
> >> >> branch switch on each one of the copies to verify that parallel checkout
> >> >> and sequential checkout produce the same results. However, the
> >> >> repository is copied with `cp -R` which, on some systems, defaults to
> >> >> following symlinks on the directory hierarchy and copying their target
> >> >> files instead of copying the symlinks themselves. AIX is one example of
> >> >> system where this happens. Because the symlinks are not preserved, the
> >> >> copied repositories have paths that do not match what is in the index,
> >> >> causing git to abort the checkout operation that we want to test. This
> >> >> makes the test fail on these systems.
> >> >>
> >> >> Fix this by copying the repository with the POSIX flag '-P', which
> >> >> forces cp to copy the symlinks instead of following them. Note that we
> >> >> already use this flag for other cp invocations in our test suite (see
> >> >> t7001). With this change, t2080 now passes on AIX.
> >> >>
> >> >> Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> >> >> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> >> >> ---
> >> >> t/t2080-parallel-checkout-basics.sh | 2 +-
> >> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >> >>
> >> >> diff --git a/t/t2080-parallel-checkout-basics.sh b/t/t2080-parallel-checkout-basics.sh
> >> >> index 7087818550..3e0f8c675f 100755
> >> >> --- a/t/t2080-parallel-checkout-basics.sh
> >> >> +++ b/t/t2080-parallel-checkout-basics.sh
> >> >> @@ -114,7 +114,7 @@ do
> >> >>
> >> >>  	test_expect_success "$mode checkout" '
> >> >>  		repo=various_$mode &&
> >> >> -		cp -R various $repo &&
> >> >> +		cp -R -P various $repo &&
> >> >>
> >> >>  		# The just copied files have more recent timestamps than their
> >> >>  		# associated index entries. So refresh the cached timestamps
> >> >
> >> > Thanks for the quick fix, I can confirm that this makes the test pass on
> >> > AIX 7.2.
> >>
> >> There's still a failure[1] in t2082-parallel-checkout-attributes.sh
> >> though, which is new in 2.32.0-rc*. The difference is in an unexpected
> >> BOM:
> >>
> >> avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/A.internal
> >> efbbbf74657874
> >> avar@gcc119:[/scratch/avar/git/t]perl -nle 'print unpack "H*"' trash\ directory.t2082-parallel-checkout-attributes/encoding/utf8-text
> >> 74657874
> >>
> >> I.e. the A.internal starts with 0xefbbbf. The 2nd test of t0028*.sh also
> >> fails similarly[2], so perhaps it's some old/iconv/whatever issue not
> >> per-se related to any change of yours.
> >
> > The 0xefbbbf looks interesting, it's BOM for utf-8.
> >
> >> I tried compiling with both NO_ICONV=Y and ICONV_OMITS_BOM=Y, both have
> >> the same failure.
> >
> > I didn't check the code-path for NO_ICONV=Y but ICONV_OMITS_BOM=Y only
> > affects output of converting *to* utf-16 and utf-32.
> >
> > So, I think the AIX iconv implementation automatically adds a BOM to utf-8?
> >
> > Perhaps we need to call skip_utf8_bom somewhere?
>
> I debugged this a bit more, it's probably *also* an issue in our use of
> libiconv, but it goes wrong just with our test setup with
> iconv(1). I.e. on my boring linux box:
>
> echo x | iconv -f UTF-8 -t UTF-16 | perl -0777 -MData::Dumper -ne 'my @a = map { sprintf "0x%x", $_ } unpack "C*"; print Dumper \@a'
> $VAR1 = [
> '0xff',
> '0xfe',
> '0x78',
> '0x0',
> '0xa',
> '0x0'
> ];
>
>
> On the AIX box to get the same I need to do that as:
>
> (printf '\376\377'; echo x | iconv -f UTF-8 -t UTF-16LE) | [...]

FWIW, on my Linux box with musl-libc it also needs to be done like this.

> I.e. we omit the BOM *and* AIX's idea of our UTF-16 is little-endian
> UTF-16, a plain UTF-16 gives you the big-endian version.

Per spec, plain UTF-16 without a BOM *is* big-endian. [1]

    In the table <BOM> indicates that the byte order is determined
    by a byte order mark, if present at the beginning of the data
    stream, otherwise it is big-endian.

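As a rough illustration of that rule, a reader of plain UTF-16 could pick the byte order as below. This is only a sketch: utf16_byte_order() and the enum are names made up for this mail, nothing like them exists in utf8.c.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum utf16_order { UTF16_BE, UTF16_LE };

/*
 * Decide the byte order of a stream labelled plain "UTF-16":
 * a leading BOM wins, otherwise fall back to big-endian.
 * On return, *skip is the number of BOM bytes to skip (0 or 2).
 */
static enum utf16_order utf16_byte_order(const unsigned char *buf,
					 size_t len, size_t *skip)
{
	*skip = 0;
	if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
		*skip = 2;
		return UTF16_LE;
	}
	if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
		*skip = 2;
	/* Either an explicit big-endian BOM or no BOM at all. */
	return UTF16_BE;
}
```

With the bytes from the Linux example quoted above (ff fe 78 00 ...) this reports little-endian and skips 2 bytes; for a BOM-less 00 78 it falls back to big-endian.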
> To make things
> worse the same is true of UTF-32, except "iconv -l" lists no UTF-32LE
> version. So it seems we can't get the same result at all for that one.

Ditto for UTF-32: without a BOM it is big-endian.

> So from the outset the code added around 79444c92943 (utf8: handle
> systems that don't write BOM for UTF-16, 2019-02-12) needs to be more
> careful (although this looked broken before), i.e. we should test exact
> known-good bytes and see if UTF-16 is really what we think it is,
> etc. This is likely broken on any big-endian non-GNUish iconv
> implementation.

Linux with musl-libc on a little-endian machine also treats UTF-16
without a BOM as UTF-16BE.
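
To see what a given iconv really emits for plain "UTF-16", a probe along these lines could be used. It is only a sketch: probe_iconv_utf16() and the enum are hypothetical names, and it assumes the POSIX iconv(3) prototype (systems that need OLD_ICONV want a const char ** input pointer instead).

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical classification of what iconv writes for "x". */
enum utf16_flavor {
	UTF16_PROBE_FAILED = -1,
	UTF16_BOM_LE,	/* ff fe 78 00 */
	UTF16_BOM_BE,	/* fe ff 00 78 */
	UTF16_NOBOM_LE,	/* 78 00 */
	UTF16_NOBOM_BE,	/* 00 78 */
	UTF16_UNKNOWN
};

/* Convert the single character "x" to "UTF-16" and inspect the bytes. */
static enum utf16_flavor probe_iconv_utf16(void)
{
	char in[] = "x";
	char out[16];
	char *inp = in, *outp = out;
	size_t inleft = 1, outleft = sizeof(out), n;
	const unsigned char *o = (const unsigned char *)out;
	iconv_t cd = iconv_open("UTF-16", "UTF-8");

	if (cd == (iconv_t) -1)
		return UTF16_PROBE_FAILED;
	n = iconv(cd, &inp, &inleft, &outp, &outleft);
	iconv_close(cd);
	if (n == (size_t) -1)
		return UTF16_PROBE_FAILED;

	n = sizeof(out) - outleft;	/* number of bytes written */
	if (n == 4 && o[0] == 0xFF && o[1] == 0xFE && o[2] == 0x78 && !o[3])
		return UTF16_BOM_LE;
	if (n == 4 && o[0] == 0xFE && o[1] == 0xFF && !o[2] && o[3] == 0x78)
		return UTF16_BOM_BE;
	if (n == 2 && o[0] == 0x78 && !o[1])
		return UTF16_NOBOM_LE;
	if (n == 2 && !o[0] && o[1] == 0x78)
		return UTF16_NOBOM_BE;
	return UTF16_UNKNOWN;
}
```

The Linux output quoted above (ff fe 78 00) would classify as UTF16_BOM_LE; a BOM-less big-endian result would come back as UTF16_NOBOM_BE.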

I still think we should strip the UTF-8 BOM in reencode_string_len once
the conversion is done, i.e. something like the patch below. I can't
test it, though, since I don't have access to an AIX box, and my Linux
with musl-libc doesn't output a BOM for utf-8. It doesn't write a BOM
for utf-16be and utf-32be either.

-----8<----
diff --git a/utf8.c b/utf8.c
index de4ce5c0e6..73631632bd 100644
--- a/utf8.c
+++ b/utf8.c
@@ -8,6 +8,7 @@ static const char utf16_be_bom[] = {'\xFE', '\xFF'};
 static const char utf16_le_bom[] = {'\xFF', '\xFE'};
 static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
 static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
+const char utf8_bom[] = "\357\273\277";

 struct interval {
 	ucs_char_t first;
@@ -28,6 +29,12 @@ size_t display_mode_esc_sequence_len(const char *s)
 	return p - s;
 }

+static int has_utf8_bom(const char *text, size_t len)
+{
+	return len >= strlen(utf8_bom) &&
+		memcmp(text, utf8_bom, strlen(utf8_bom)) == 0;
+}
+
 /* auxiliary function for binary search in interval table */
 static int bisearch(ucs_char_t ucs, const struct interval *table, int max)
 {
@@ -539,12 +546,13 @@ static const char *fallback_encoding(const char *name)

 char *reencode_string_len(const char *in, size_t insz,
 			  const char *out_encoding, const char *in_encoding,
-			  size_t *outsz)
+			  size_t *outsz_p)
 {
 	iconv_t conv;
 	char *out;
 	const char *bom_str = NULL;
 	size_t bom_len = 0;
+	size_t outsz = 0;

 	if (!in_encoding)
 		return NULL;
@@ -590,10 +598,16 @@ char *reencode_string_len(const char *in, size_t insz,
 		if (conv == (iconv_t) -1)
 			return NULL;
 	}
-	out = reencode_string_iconv(in, insz, conv, bom_len, outsz);
+	out = reencode_string_iconv(in, insz, conv, bom_len, &outsz);
 	iconv_close(conv);
 	if (out && bom_str && bom_len)
 		memcpy(out, bom_str, bom_len);
+	if (is_encoding_utf8(out_encoding) && has_utf8_bom(out, outsz)) {
+		outsz -= strlen(utf8_bom);
+		memmove(out, out + strlen(utf8_bom), outsz + 1);
+	}
+	if (outsz_p)
+		*outsz_p = outsz;
 	return out;
 }
 #endif
@@ -782,12 +796,9 @@ int is_hfs_dotmailmap(const char *path)
 	return is_hfs_dot_str(path, "mailmap");
 }

-const char utf8_bom[] = "\357\273\277";
-
 int skip_utf8_bom(char **text, size_t len)
 {
-	if (len < strlen(utf8_bom) ||
-	    memcmp(*text, utf8_bom, strlen(utf8_bom)))
+	if (!has_utf8_bom(*text, len))
 		return 0;
 	*text += strlen(utf8_bom);
 	return 1;
---->8------
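
The BOM-stripping step can be tried outside of git with a stand-alone sketch like the one below. strip_utf8_bom() is a hypothetical helper for illustration only, and the NULL check in has_utf8_bom() is an extra guard that is not in the patch, which instead relies on outsz staying 0 when out is NULL.

```c
#include <stddef.h>
#include <string.h>

static const char utf8_bom[] = "\357\273\277";

/* Does the buffer start with the three-byte UTF-8 BOM (ef bb bf)? */
static int has_utf8_bom(const char *text, size_t len)
{
	return text && len >= strlen(utf8_bom) &&
	       memcmp(text, utf8_bom, strlen(utf8_bom)) == 0;
}

/*
 * Drop a leading UTF-8 BOM from a NUL-terminated buffer holding `len`
 * bytes (not counting the NUL) and return the new length.
 */
static size_t strip_utf8_bom(char *buf, size_t len)
{
	if (!has_utf8_bom(buf, len))
		return len;
	len -= strlen(utf8_bom);
	/* len + 1 so that the terminating NUL moves along with the text */
	memmove(buf, buf + strlen(utf8_bom), len + 1);
	return len;
}
```

Feeding it the A.internal bytes from t2082 (ef bb bf 74 65 78 74) leaves "text" with a length of 4; a buffer without a BOM is returned unchanged.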

[1]: https://unicode.org/faq/utf_bom.html

--
Danh