Re: [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute

From: Lars Schneider <larsxschneider@gmail.com>
To: Eric Sunshine <sunshine@sunshineco.com>
Cc: "Lars Schneider" <lars.schneider@autodesk.com>,
	"Git List" <git@vger.kernel.org>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Torsten Bögershausen" <tboegi@web.de>,
	"Johannes Sixt" <j6t@kdbg.org>, "Jeff King" <peff@peff.net>,
	"Ramsay Jones" <ramsay@ramsayjones.plus.com>,
	"Johannes Schindelin" <Johannes.Schindelin@gmx.de>
Subject: Re: [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute
Date: Tue, 27 Feb 2018 12:16:56 +0100	[thread overview]
Message-ID: <3A3B1BD6-642C-467F-B80C-82853410C235@gmail.com> (raw)
In-Reply-To: <CAPig+cRtp5g0iDd3pHybZXd_VkGG2-a4CdCZdiqN55Cr1T5Zyg@mail.gmail.com>

> On 25 Feb 2018, at 08:15, Eric Sunshine <sunshine@sunshineco.com> wrote:
> 
> On Sat, Feb 24, 2018 at 11:27 AM,  <lars.schneider@autodesk.com> wrote:
>> Git recognizes files encoded with ASCII or one of its supersets (e.g.
>> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
>> interpreted as binary and consequently built-in Git text processing
>> tools (e.g. 'git diff') as well as most Git web front ends do not
>> visualize the content.
>> 
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>> reverse the conversion.
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
>> @@ -272,6 +272,84 @@ few exceptions.  Even though...
>> +Please note that using the `working-tree-encoding` attribute may have a
>> +number of pitfalls:
>> +
>> +- Alternative Git implementations (e.g. JGit or libgit2) and older Git
>> +  versions (as of March 2018) do not support the `working-tree-encoding`
>> +  attribute. If you decide to use the `working-tree-encoding` attribute
>> +  in your repository, then it is strongly recommended to ensure that all
>> +  clients working with the repository support it.
>> +
>> +  If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
>> +  `working-tree-encoding` enabled Git client, then `foo.proj` will be
>> +  stored as UTF-8 internally. A client without `working-tree-encoding`
>> +  support will checkout `foo.proj` as UTF-8 encoded file. This will
>> +  typically cause trouble for the users of this file.
> 
> The above paragraph is giving an example of the scenario described in
> the paragraph above it. It might, therefore, be clearer to start this
> paragraph with "For example,...". Also, as an aid to non-Windows
> developers, it might be helpful to introduce ".proj" files as
> "Microsoft Visual Studio `*.proj` files...".

Agreed. How about this?

  For example, Microsoft Visual Studio resources files (`*.rc`) or
  PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16. 
  If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with 
  a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
  stored as UTF-8 internally. A client without `working-tree-encoding`
  support will checkout `foo.ps1` as UTF-8 encoded file. This will
  typically cause trouble for the users of this file.

>> diff --git a/convert.c b/convert.c
>> @@ -265,6 +266,110 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>> +static struct encoding {
>> +       const char *name;
>> +       struct encoding *next;
>> +} *encoding, **encoding_tail;
> 
> Why does this struct need to be a linked-list since it contains only a
> name? In fact, why do the struct and linked list exist at all? Back in
> v1 when the struct held more members, it made sense to place them in a
> collection so they could be re-used, but today it just seems an
> unnecessary artifact which buys you nothing.
> 
> Is the plan that some future patch series will add members to the
> struct? If not, then it seems that it should be retired altogether.

Absolutely! Thank you!

>> +static const char *default_encoding = "UTF-8";
>> +
>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>> +                        struct strbuf *buf, struct encoding *enc, int conv_flags)
>> +{
>> +       [...]
>> +       if (has_prohibited_utf_bom(enc->name, src, src_len)) {
>> +               const char *error_msg = _(
>> +                       "BOM is prohibited in '%s' if encoded as %s");
>> +               /*
>> +                * This advise is shown for UTF-??BE and UTF-??LE encodings.
> 
> s/advise/advice/

Agreed!

>> +                * We truncate the encoding name to 6 chars with %.6s to cut
>> +                * off the last two "byte order" characters.
>> +                */
>> +               const char *advise_msg = _(
>> +                       "The file '%s' contains a byte order mark (BOM). "
>> +                       "Please use %.6s as working-tree-encoding.");
>> +               advise(advise_msg, path, enc->name);
>> +               if (conv_flags & CONV_WRITE_OBJECT)
>> +                       die(error_msg, path, enc->name);
>> +               else
>> +                       error(error_msg, path, enc->name);
> 
> What is the intended behavior in the non-die() case? As implemented,
> this is still going to attempt the conversion even though it threw an
> error. Should it be returning 0 here instead? Same question for the
> couple additional cases below.

Good question. My intention was to print an error and then let iconv try
the conversion anyways. I agree that this is bogus. Let's return in case
of an error right away.

>> +
>> +       } else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
>> +               const char *error_msg = _(
>> +                       "BOM is required in '%s' if encoded as %s");
>> +               const char *advise_msg = _(
>> +                       "The file '%s' is missing a byte order mark (BOM). "
>> +                       "Please use %sBE or %sLE (depending on the byte order) "
>> +                       "as working-tree-encoding.");
>> +               advise(advise_msg, path, enc->name, enc->name);
>> +               if (conv_flags & CONV_WRITE_OBJECT)
>> +                       die(error_msg, path, enc->name);
>> +               else
>> +                       error(error_msg, path, enc->name);
>> +       }
> 
> For a function which presumably is agnostic about the working-tree
> encoding (supporting whatever iconv does), it nevertheless has
> unusually intimate knowledge about UTF BOMs, in general, and the
> internals of has_prohibited_utf_bom() and
> is_missing_required_utf_bom(), in particular. It might be cleaner, and
> would simplify this function, if all this UTF BOM-specific gunk was
> moved elsewhere and generalized into a "validate conversion" function.

Agreed! How about this?

	/*
	 * If we convert to an UTF encoding, then check for common BOM errors.
	 */
	if (!memcmp("UTF-", enc, 4))
		if (has_utf_bom_error(path, enc, src, src_len, die_on_error))
			return 0;

>> +       dst = reencode_string_len(src, src_len, default_encoding, enc->name,
>> +                                 &dst_len);
>> +       if (!dst) {
>> +               /*
>> +                * We could add the blob "as-is" to Git. However, on checkout
>> +                * we would try to reencode to the original encoding. This
>> +                * would fail and we would leave the user with a messed-up
>> +                * working tree. Let's try to avoid this by screaming loud.
>> +                */
>> +               const char* msg = _("failed to encode '%s' from %s to %s");
>> +               if (conv_flags & CONV_WRITE_OBJECT)
>> +                       die(msg, path, enc->name, default_encoding);
>> +               else
>> +                       error(msg, path, enc->name, default_encoding);
>> +       }
>> +
>> +       strbuf_attach(buf, dst, dst_len, dst_len + 1);
>> +       return 1;
>> +}
>> +
>> +static int encode_to_worktree(const char *path, const char *src, size_t src_len,
>> +                             struct strbuf *buf, struct encoding *enc)
>> +{
>> +       [...]
>> +       dst = reencode_string_len(src, src_len, enc->name, default_encoding,
>> +                                 &dst_len);
>> +       if (!dst) {
>> +               error("failed to encode '%s' from %s to %s",
>> +                       path, enc->name, default_encoding);
> 
> The last two arguments need to be swapped.

Nice catch!

>> +               return 0;
>> +       }
>> +
>> +       strbuf_attach(buf, dst, dst_len, dst_len + 1);
>> +       return 1;
>> +}
>> @@ -978,6 +1083,35 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
>> +static struct encoding *git_path_check_encoding(struct attr_check_item *check)
>> +{
>> +       if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
>> +           !strlen(value))
>> +               return NULL;
>> +
>> +       for (enc = encoding; enc; enc = enc->next)
>> +               if (!strcasecmp(value, enc->name))
>> +                       return enc;
>> +
>> +       /* Don't encode to the default encoding */
>> +       if (!strcasecmp(value, default_encoding))
>> +               return NULL;
> 
> You could handle this easy-to-detect case before attempting the more
> expensive loop just above. (Not necessarily worth a re-roll.)

True, but I delete the loop anyways when I removed the "encoding 
linked list" as suggested by your comment above.

>> +       enc = xcalloc(1, sizeof(*enc));
>> +       /*
>> +        * Ensure encoding names are always upper case (e.g. UTF-8) to
>> +        * simplify subsequent string comparisons.
>> +        */
>> +       enc->name = xstrdup_toupper(value);
>> +       *encoding_tail = enc;
>> +       encoding_tail = &(enc->next);
>> +
>> +       return enc;
>> +}
>> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
>> @@ -0,0 +1,226 @@
>> +test_expect_success 'ensure UTF-8 is stored in Git' '
>> +       git cat-file -p :test.utf16 >test.utf16.git &&
>> +       test_cmp_bin test.utf8.raw test.utf16.git &&
>> +
>> +       # cleanup
>> +       rm test.utf16.git
>> +'
> 
> This cleanup won't take place if test_cmp_bin() fails. Instead,
> cleanups are typically handled by test_when_finished() to ensure that
> the cleanup action occurs even if the test fails.
> 
>    test_expect_success 'ensure UTF-8 is stored in Git' '
>        test_when_finished "rm -f  test.utf16.git" &&
>        ...
> 
> Same comment for remaining tests.

Agreed! That's the proper way to cleanup the tests. I'll fix my tests.

>> +test_expect_success 'check prohibited UTF BOM' '
>> +       printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
>> +       printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
>> +       printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
>> +       printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
>> +
>> +       printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
>> +       printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
>> +       printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
>> +       printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
> 
> These resources seem to be used by multiple tests. Perhaps create them
> in a distinct "setup" test instead of bundling them in this test?

Good idea!

>> +       echo "*.utf16be text working-tree-encoding=utf-16be" >>.gitattributes &&
>> +       echo "*.utf16le text working-tree-encoding=utf-16le" >>.gitattributes &&
>> +       echo "*.utf32be text working-tree-encoding=utf-32be" >>.gitattributes &&
>> +       echo "*.utf32le text working-tree-encoding=utf-32le" >>.gitattributes &&
>> +
>> +       # Here we add a UTF-16 files with BOM (big-endian and little-endian)
>> +       # but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
>> +       # the BOM is prohibited.
>> +       cp bebom.utf16be.raw bebom.utf16be &&
>> +       test_must_fail git add bebom.utf16be 2>err.out &&
>> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
>> +
>> +       cp lebom.utf16le.raw lebom.utf16be &&
>> +       test_must_fail git add lebom.utf16be 2>err.out &&
>> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
>> +
>> +       cp bebom.utf16be.raw bebom.utf16le &&
>> +       test_must_fail git add bebom.utf16le 2>err.out &&
>> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
>> +
>> +       cp lebom.utf16le.raw lebom.utf16le &&
>> +       test_must_fail git add lebom.utf16le 2>err.out &&
>> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
>> +
>> +       # ... and the same for UTF-32
>> +       cp bebom.utf32be.raw bebom.utf32be &&
>> +       test_must_fail git add bebom.utf32be 2>err.out &&
>> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
>> +
>> +       cp lebom.utf32le.raw lebom.utf32be &&
>> +       test_must_fail git add lebom.utf32be 2>err.out &&
>> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
>> +
>> +       cp bebom.utf32be.raw bebom.utf32le &&
>> +       test_must_fail git add bebom.utf32le 2>err.out &&
>> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
>> +
>> +       cp lebom.utf32le.raw lebom.utf32le &&
>> +       test_must_fail git add lebom.utf32le 2>err.out &&
>> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
>> +
>> +       # cleanup
>> +       git reset --hard HEAD
>> +'
> 
> Clumping all these checks into the same test makes it difficult to
> determine which one failed (if one does fail). Also, the amount of
> copy/paste code is easy to get wrong and a bit onerous to review.
> Automating via for-loops would address these concerns. For instance,
> to check all combinations (not just 8 combinations tested above):
> 
>    for i in 16be 16le 32be 32le
>    do
>        for j in 16be 16le 32be 32le
>        do
>            test_expect_success "check prohibited UTF BOM utf$i/utf$j" '
>                test_when_finished "git reset --hard HEAD" &&
>                cp bebom.utf$i.raw bebom.utf$j &&
>                test_must_fail git add bebom.utf$j 2>err.out &&
>                J=$(echo $j | tr bel BEL) &&
>                test_i18ngrep "fatal: BOM is prohibited .* UTF-$J" err.out
>            '
>        done
>    done
> 
> Alternately, the test could be moved to a function and called for each
> combination:
> 
>    check_prohibited_bom () {
>        i=$1
>        j=$2
>        test_expect_success "check prohibited UTF BOM utf$i/utf$j" '
>            ...
>        '
>    }
>    check_prohibited_bom 16be 16be
>    check_prohibited_bom 16be 16le
>    ...
> 
> Same comment regarding copy/paste in tests below.

In general I share your concern about code duplication.
However, I don't want be "too clever" in testing code
as it becomes hard to read it. Therefore, I implemented
a loop over "16 32" which is a good compromise I think.

Thanks you very much for this thorough review,
Lars