From: "brian m. carlson" <sandals@crustytoothpaste.net>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: "Đoàn Trần Công Danh" <congdanhqx@gmail.com>,
	"Matheus Tavares" <matheus.bernardino@usp.br>,
	gitster@pobox.com, git@vger.kernel.org
Subject: Re: [PATCH] t2080: fix cp invocation to copy symlinks instead of following them
Date: Thu, 3 Jun 2021 00:07:15 +0000
Message-ID: <YLgdM7i1FkM3f5PN@camp.crustytoothpaste.net>
In-Reply-To: <87pmx47cs9.fsf@evledraar.gmail.com>

On 2021-06-02 at 10:50:53, Ævar Arnfjörð Bjarmason wrote:
> I debugged this a bit more, it's probably *also* an issue in our use of
> libiconv, but it goes wrong just with our test setup with
> iconv(1). I.e. on my boring linux box:
>     
>     echo x | iconv -f UTF-8 -t UTF-16 | perl -0777 -MData::Dumper -ne 'my @a = map { sprintf "0x%x", $_ } unpack "C*"; print Dumper \@a'
>     $VAR1 = [
>               '0xff',
>               '0xfe',
>               '0x78',
>               '0x0',
>               '0xa',
>               '0x0'
>             ];
> 

This is a little-endian encoding of UTF-16 with a BOM.  The BOM is
required here since the default, if no BOM is provided, is big endian.
However, as I alluded to in 79444c92943, while the standard permits the
BOM to be omitted, doing so is generally improvident because that leads
to breakage when interoperating with Windows machines, where many
programs assume little endian.

I mean, I don't use Windows and I think those programs are broken and
their authors rightfully should have known better, but practically,
using a BOM solves the problem easily, and if we can be slightly nicer
to the poor, hapless users of those programs, why not?
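
To make the byte-order point concrete, here is a minimal sketch in Python
rather than iconv(1), since the iconv behavior is exactly what varies between
systems in this thread. It builds the same six bytes as the quoted dump and
shows what a reader assuming the wrong byte order would see without a BOM.
The variable names are only illustrative.

```python
import codecs

text = "x\n"

# Little-endian UTF-16 with a BOM: the same six bytes as in the quoted dump.
le_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
print([hex(b) for b in le_with_bom])

# With a BOM, a decoder can recover the text whichever byte order was used.
be_with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
assert le_with_bom.decode("utf-16") == text
assert be_with_bom.decode("utf-16") == text

# Without a BOM, a reader that assumes the other byte order silently
# produces different characters instead of failing.
le_no_bom = text.encode("utf-16-le")
misread = le_no_bom.decode("utf-16-be")
assert misread != text
```

The last case is the interoperability problem described above: nothing errors
out, the text just comes back wrong.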

> On the AIX box to get the same I need to do that as:
> 
>     (printf '\376\377'; echo x | iconv -f UTF-8 -t UTF-16LE) | [...]
> 
> I.e. we omit the BOM *and* AIX's idea of our UTF-16 is little-endian
> UTF-16, a plain UTF-16 gives you the big-endian version. To make things
> worse the same is true of UTF-32, except "iconv -l" lists no UTF-32LE
> version. So it seems we can't get the same result at all for that one.

But what do you get if you just use UTF-16?  Is it little endian with
BOM, big endian with BOM, or big endian without BOM?  If it's big endian
without BOM, did you set ICONV_OMITS_BOM when building?
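
One way to answer that question mechanically is to compare the converter's
output for a known input against each candidate encoding. The following is
only a sketch under that assumption; classify_utf16 is a hypothetical helper,
not something in Git, and it compares exact bytes for the input "x\n" used in
the quoted example.

```python
import codecs


def classify_utf16(data: bytes, text: str = "x\n") -> str:
    """Describe how `data` encodes `text` as UTF-16 (hypothetical helper)."""
    if data == codecs.BOM_UTF16_LE + text.encode("utf-16-le"):
        return "little endian with BOM"
    if data == codecs.BOM_UTF16_BE + text.encode("utf-16-be"):
        return "big endian with BOM"
    if data == text.encode("utf-16-be"):
        return "big endian without BOM"
    if data == text.encode("utf-16-le"):
        return "little endian without BOM"
    return "unrecognized"


# The bytes from the quoted Linux dump.
linux_bytes = bytes([0xFF, 0xFE, 0x78, 0x00, 0x0A, 0x00])
print(classify_utf16(linux_bytes))
```

On a real system the input would be the raw output of
"echo x | iconv -f UTF-8 -t UTF-16", for example read from
sys.stdin.buffer.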

> So from the outset the code added around 79444c92943 (utf8: handle
> systems that don't write BOM for UTF-16, 2019-02-12) needs to be more
> careful (although this looked broken before), i.e. we should test exact
> known-good bytes and see if UTF-16 is really what we think it is,
> etc. This is likely broken on any big-endian non-GNUish iconv
> implementation.

We probably could have been more careful here.  Part of the problem is
that I don't have access to any affected systems here, so it's not in
general easy for me to write a test (or even a patch) for this case.

We also did use iconv(1) before that, but I _think_ it's possible to
remove it.  The thing that's tricky is the use of SHIFT-JIS, which has
known round-tripping problems, but I don't think we rely on using the
system iconv(3) there and encoding any valid SHIFT-JIS sequence is
probably fine.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US
