Re: [PATCH] utf8: handle systems that don't write BOM for UTF-16

From: Eric Sunshine <sunshine@sunshineco.com>
To: "brian m. carlson" <sandals@crustytoothpaste.net>
Cc: Git List <git@vger.kernel.org>,
	Lars Schneider <larsxschneider@gmail.com>,
	Rich Felker <dalias@libc.org>, Junio C Hamano <gitster@pobox.com>,
	Kevin Daudt <me@ikke.info>
Subject: Re: [PATCH] utf8: handle systems that don't write BOM for UTF-16
Date: Sat, 9 Feb 2019 20:45:16 -0500	[thread overview]
Message-ID: <CAPig+cRyzZMOM19ztgR_wqvk68P_1eNNVBBj5pbY=MhQm08WAw@mail.gmail.com> (raw)
In-Reply-To: <20190209200802.277139-1-sandals@crustytoothpaste.net>

On Sat, Feb 9, 2019 at 3:08 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> [...]
> Add a Makefile and #define knob, ICONV_NEEDS_BOM, that can be set if the
> iconv implementation has this behavior. When set, Git will write a BOM
> manually for UTF-16 and UTF-32 and then force the data to be written in
> UTF-16BE or UTF-32BE. We choose big-endian behavior here because the
> tests use the raw "UTF-16" encoding, which will be big-endian when the
> implementation requires this knob to be set.

The name ICONV_NEEDS_BOM makes it sound as if we must feed a BOM
_into_ 'iconv', which is quite confusing since the actual intention is
that 'iconv' doesn't emit a BOM and we need to make up for the
deficiency. Using a name such as ICONV_OMITS_BOM or ICONV_NEGLECTS_BOM
makes it somewhat clearer that there is some deficiency with which we
need to deal.

> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
> diff --git a/Makefile b/Makefile
> @@ -259,6 +259,9 @@ all::
> +# Define ICONV_NEEDS_BOM if your iconv implementation does not write a
> +# byte-order mark (BOM) when writing UTF-16 or UTF-32.

Not a big deal, but I wonder if it would be helpful to tack on "...,
in which case it outputs big-endian unconditionally." or something.

> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> @@ -6,6 +6,25 @@ test_description='working-tree-encoding conversion via gitattributes'
> +test_lazy_prereq NO_UTF16_BOM '
> +       test $(printf abc | iconv -f UTF-8 -t UTF-16 | wc -c) = 6
> +'
> +
> +test_lazy_prereq NO_UTF32_BOM '
> +       test $(printf abc | iconv -f UTF-8 -t UTF-32 | wc -c) = 12
> +'
> +
> +write_utf16 () {
> +       test_have_prereq NO_UTF16_BOM && printf '\xfe\xff'
> +       iconv -f UTF-8 -t UTF-16
> +
> +}

Stray blank line before the closing brace.

> +
> +write_utf32 () {
> +       test_have_prereq NO_UTF32_BOM && printf '\x00\x00\xfe\xff'
> +       iconv -f UTF-8 -t UTF-32
> +}

It's probably doesn't matter much with these two tiny functions, but I
was wondering if it would make sense to maintain the &&-chain, perhaps
like this:

    if test test_have_prereq NO_UTF32_BOM
    then
        printf '\x00\x00\xfe\xff'
    fi &&
    iconv -f UTF-8 -t UTF-32