From: Lars Schneider <larsxschneider@gmail.com>
To: Eric Sunshine <sunshine@sunshineco.com>
Cc: "Lars Schneider" <lars.schneider@autodesk.com>,
"Git List" <git@vger.kernel.org>,
"Junio C Hamano" <gitster@pobox.com>,
"Torsten Bögershausen" <tboegi@web.de>,
"Johannes Sixt" <j6t@kdbg.org>, "Jeff King" <peff@peff.net>,
"Ramsay Jones" <ramsay@ramsayjones.plus.com>,
"Johannes Schindelin" <Johannes.Schindelin@gmx.de>
Subject: Re: [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute
Date: Tue, 27 Feb 2018 12:16:56 +0100 [thread overview]
Message-ID: <3A3B1BD6-642C-467F-B80C-82853410C235@gmail.com> (raw)
In-Reply-To: <CAPig+cRtp5g0iDd3pHybZXd_VkGG2-a4CdCZdiqN55Cr1T5Zyg@mail.gmail.com>
> On 25 Feb 2018, at 08:15, Eric Sunshine <sunshine@sunshineco.com> wrote:
>
> On Sat, Feb 24, 2018 at 11:27 AM, <lars.schneider@autodesk.com> wrote:
>> Git recognizes files encoded with ASCII or one of its supersets (e.g.
>> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
>> interpreted as binary and consequently built-in Git text processing
>> tools (e.g. 'git diff') as well as most Git web front ends do not
>> visualize the content.
>>
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>> reverse the conversion.
>>
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> ---
>> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
>> @@ -272,6 +272,84 @@ few exceptions. Even though...
>> +Please note that using the `working-tree-encoding` attribute may have a
>> +number of pitfalls:
>> +
>> +- Alternative Git implementations (e.g. JGit or libgit2) and older Git
>> + versions (as of March 2018) do not support the `working-tree-encoding`
>> + attribute. If you decide to use the `working-tree-encoding` attribute
>> + in your repository, then it is strongly recommended to ensure that all
>> + clients working with the repository support it.
>> +
>> + If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
>> + `working-tree-encoding` enabled Git client, then `foo.proj` will be
>> + stored as UTF-8 internally. A client without `working-tree-encoding`
>> + support will checkout `foo.proj` as UTF-8 encoded file. This will
>> + typically cause trouble for the users of this file.
>
> The above paragraph is giving an example of the scenario described in
> the paragraph above it. It might, therefore, be clearer to start this
> paragraph with "For example,...". Also, as an aid to non-Windows
> developers, it might be helpful to introduce ".proj" files as
> "Microsoft Visual Studio `*.proj` files...".
Agreed. How about this?
For example, Microsoft Visual Studio resources files (`*.rc`) or
PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16.
If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with
a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
stored as UTF-8 internally. A client without `working-tree-encoding`
support will checkout `foo.ps1` as UTF-8 encoded file. This will
typically cause trouble for the users of this file.
>> diff --git a/convert.c b/convert.c
>> @@ -265,6 +266,110 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>> +static struct encoding {
>> + const char *name;
>> + struct encoding *next;
>> +} *encoding, **encoding_tail;
>
> Why does this struct need to be a linked-list since it contains only a
> name? In fact, why do the struct and linked list exist at all? Back in
> v1 when the struct held more members, it made sense to place them in a
> collection so they could be re-used, but today it just seems an
> unnecessary artifact which buys you nothing.
>
> Is the plan that some future patch series will add members to the
> struct? If not, then it seems that it should be retired altogether.
Absolutely! Thank you!
>> +static const char *default_encoding = "UTF-8";
>> +
>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>> + struct strbuf *buf, struct encoding *enc, int conv_flags)
>> +{
>> + [...]
>> + if (has_prohibited_utf_bom(enc->name, src, src_len)) {
>> + const char *error_msg = _(
>> + "BOM is prohibited in '%s' if encoded as %s");
>> + /*
>> + * This advise is shown for UTF-??BE and UTF-??LE encodings.
>
> s/advise/advice/
Agreed!
>> + * We truncate the encoding name to 6 chars with %.6s to cut
>> + * off the last two "byte order" characters.
>> + */
>> + const char *advise_msg = _(
>> + "The file '%s' contains a byte order mark (BOM). "
>> + "Please use %.6s as working-tree-encoding.");
>> + advise(advise_msg, path, enc->name);
>> + if (conv_flags & CONV_WRITE_OBJECT)
>> + die(error_msg, path, enc->name);
>> + else
>> + error(error_msg, path, enc->name);
>
> What is the intended behavior in the non-die() case? As implemented,
> this is still going to attempt the conversion even though it threw an
> error. Should it be returning 0 here instead? Same question for the
> couple additional cases below.
Good question. My intention was to print an error and then let iconv try
the conversion anyways. I agree that this is bogus. Let's return in case
of an error right away.
>> +
>> + } else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
>> + const char *error_msg = _(
>> + "BOM is required in '%s' if encoded as %s");
>> + const char *advise_msg = _(
>> + "The file '%s' is missing a byte order mark (BOM). "
>> + "Please use %sBE or %sLE (depending on the byte order) "
>> + "as working-tree-encoding.");
>> + advise(advise_msg, path, enc->name, enc->name);
>> + if (conv_flags & CONV_WRITE_OBJECT)
>> + die(error_msg, path, enc->name);
>> + else
>> + error(error_msg, path, enc->name);
>> + }
>
> For a function which presumably is agnostic about the working-tree
> encoding (supporting whatever iconv does), it nevertheless has
> unusually intimate knowledge about UTF BOMs, in general, and the
> internals of has_prohibited_utf_bom() and
> is_missing_required_utf_bom(), in particular. It might be cleaner, and
> would simplify this function, if all this UTF BOM-specific gunk was
> moved elsewhere and generalized into a "validate conversion" function.
Agreed! How about this?
/*
* If we convert to an UTF encoding, then check for common BOM errors.
*/
if (!memcmp("UTF-", enc, 4))
if (has_utf_bom_error(path, enc, src, src_len, die_on_error))
return 0;
>> + dst = reencode_string_len(src, src_len, default_encoding, enc->name,
>> + &dst_len);
>> + if (!dst) {
>> + /*
>> + * We could add the blob "as-is" to Git. However, on checkout
>> + * we would try to reencode to the original encoding. This
>> + * would fail and we would leave the user with a messed-up
>> + * working tree. Let's try to avoid this by screaming loud.
>> + */
>> + const char* msg = _("failed to encode '%s' from %s to %s");
>> + if (conv_flags & CONV_WRITE_OBJECT)
>> + die(msg, path, enc->name, default_encoding);
>> + else
>> + error(msg, path, enc->name, default_encoding);
>> + }
>> +
>> + strbuf_attach(buf, dst, dst_len, dst_len + 1);
>> + return 1;
>> +}
>> +
>> +static int encode_to_worktree(const char *path, const char *src, size_t src_len,
>> + struct strbuf *buf, struct encoding *enc)
>> +{
>> + [...]
>> + dst = reencode_string_len(src, src_len, enc->name, default_encoding,
>> + &dst_len);
>> + if (!dst) {
>> + error("failed to encode '%s' from %s to %s",
>> + path, enc->name, default_encoding);
>
> The last two arguments need to be swapped.
Nice catch!
>> + return 0;
>> + }
>> +
>> + strbuf_attach(buf, dst, dst_len, dst_len + 1);
>> + return 1;
>> +}
>> @@ -978,6 +1083,35 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
>> +static struct encoding *git_path_check_encoding(struct attr_check_item *check)
>> +{
>> + if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
>> + !strlen(value))
>> + return NULL;
>> +
>> + for (enc = encoding; enc; enc = enc->next)
>> + if (!strcasecmp(value, enc->name))
>> + return enc;
>> +
>> + /* Don't encode to the default encoding */
>> + if (!strcasecmp(value, default_encoding))
>> + return NULL;
>
> You could handle this easy-to-detect case before attempting the more
> expensive loop just above. (Not necessarily worth a re-roll.)
True, but I delete the loop anyways when I removed the "encoding
linked list" as suggested by your comment above.
>> + enc = xcalloc(1, sizeof(*enc));
>> + /*
>> + * Ensure encoding names are always upper case (e.g. UTF-8) to
>> + * simplify subsequent string comparisons.
>> + */
>> + enc->name = xstrdup_toupper(value);
>> + *encoding_tail = enc;
>> + encoding_tail = &(enc->next);
>> +
>> + return enc;
>> +}
>> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
>> @@ -0,0 +1,226 @@
>> +test_expect_success 'ensure UTF-8 is stored in Git' '
>> + git cat-file -p :test.utf16 >test.utf16.git &&
>> + test_cmp_bin test.utf8.raw test.utf16.git &&
>> +
>> + # cleanup
>> + rm test.utf16.git
>> +'
>
> This cleanup won't take place if test_cmp_bin() fails. Instead,
> cleanups are typically handled by test_when_finished() to ensure that
> the cleanup action occurs even if the test fails.
>
> test_expect_success 'ensure UTF-8 is stored in Git' '
> test_when_finished "rm -f test.utf16.git" &&
> ...
>
> Same comment for remaining tests.
Agreed! That's the proper way to cleanup the tests. I'll fix my tests.
>> +test_expect_success 'check prohibited UTF BOM' '
>> + printf "\0a\0b\0c" >nobom.utf16be.raw &&
>> + printf "a\0b\0c\0" >nobom.utf16le.raw &&
>> + printf "\376\777\0a\0b\0c" >bebom.utf16be.raw &&
>> + printf "\777\376a\0b\0c\0" >lebom.utf16le.raw &&
>> +
>> + printf "\0\0\0a\0\0\0b\0\0\0c" >nobom.utf32be.raw &&
>> + printf "a\0\0\0b\0\0\0c\0\0\0" >nobom.utf32le.raw &&
>> + printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
>> + printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
>
> These resources seem to be used by multiple tests. Perhaps create them
> in a distinct "setup" test instead of bundling them in this test?
Good idea!
>> + echo "*.utf16be text working-tree-encoding=utf-16be" >>.gitattributes &&
>> + echo "*.utf16le text working-tree-encoding=utf-16le" >>.gitattributes &&
>> + echo "*.utf32be text working-tree-encoding=utf-32be" >>.gitattributes &&
>> + echo "*.utf32le text working-tree-encoding=utf-32le" >>.gitattributes &&
>> +
>> + # Here we add a UTF-16 files with BOM (big-endian and little-endian)
>> + # but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
>> + # the BOM is prohibited.
>> + cp bebom.utf16be.raw bebom.utf16be &&
>> + test_must_fail git add bebom.utf16be 2>err.out &&
>> + test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
>> +
>> + cp lebom.utf16le.raw lebom.utf16be &&
>> + test_must_fail git add lebom.utf16be 2>err.out &&
>> + test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
>> +
>> + cp bebom.utf16be.raw bebom.utf16le &&
>> + test_must_fail git add bebom.utf16le 2>err.out &&
>> + test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
>> +
>> + cp lebom.utf16le.raw lebom.utf16le &&
>> + test_must_fail git add lebom.utf16le 2>err.out &&
>> + test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
>> +
>> + # ... and the same for UTF-32
>> + cp bebom.utf32be.raw bebom.utf32be &&
>> + test_must_fail git add bebom.utf32be 2>err.out &&
>> + test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
>> +
>> + cp lebom.utf32le.raw lebom.utf32be &&
>> + test_must_fail git add lebom.utf32be 2>err.out &&
>> + test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
>> +
>> + cp bebom.utf32be.raw bebom.utf32le &&
>> + test_must_fail git add bebom.utf32le 2>err.out &&
>> + test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
>> +
>> + cp lebom.utf32le.raw lebom.utf32le &&
>> + test_must_fail git add lebom.utf32le 2>err.out &&
>> + test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
>> +
>> + # cleanup
>> + git reset --hard HEAD
>> +'
>
> Clumping all these checks into the same test makes it difficult to
> determine which one failed (if one does fail). Also, the amount of
> copy/paste code is easy to get wrong and a bit onerous to review.
> Automating via for-loops would address these concerns. For instance,
> to check all combinations (not just 8 combinations tested above):
>
> for i in 16be 16le 32be 32le
> do
> for j in 16be 16le 32be 32le
> do
> test_expect_success "check prohibited UTF BOM utf$i/utf$j" '
> test_when_finished "git reset --hard HEAD" &&
> cp bebom.utf$i.raw bebom.utf$j &&
> test_must_fail git add bebom.utf$j 2>err.out &&
> J=$(echo $j | tr bel BEL) &&
> test_i18ngrep "fatal: BOM is prohibited .* UTF-$J" err.out
> '
> done
> done
>
> Alternately, the test could be moved to a function and called for each
> combination:
>
> check_prohibited_bom () {
> i=$1
> j=$2
> test_expect_success "check prohibited UTF BOM utf$i/utf$j" '
> ...
> '
> }
> check_prohibited_bom 16be 16be
> check_prohibited_bom 16be 16le
> ...
>
> Same comment regarding copy/paste in tests below.
In general I share your concern about code duplication.
However, I don't want be "too clever" in testing code
as it becomes hard to read it. Therefore, I implemented
a loop over "16 32" which is a good compromise I think.
Thanks you very much for this thorough review,
Lars
next prev parent reply other threads:[~2018-02-27 11:58 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-02-24 16:27 [PATCH v8 0/7] convert: add support for different encodings lars.schneider
2018-02-24 16:27 ` [PATCH v8 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
2018-02-24 16:27 ` [PATCH v8 2/7] strbuf: add xstrdup_toupper() lars.schneider
2018-02-24 16:27 ` [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
2018-02-25 3:41 ` Eric Sunshine
2018-02-25 11:35 ` Lars Schneider
2018-02-27 5:17 ` Eric Sunshine
2018-02-28 21:34 ` Lars Schneider
2018-02-24 16:27 ` [PATCH v8 4/7] utf8: add function to detect a missing " lars.schneider
2018-02-25 3:52 ` Eric Sunshine
2018-02-25 11:41 ` Lars Schneider
2018-02-24 16:27 ` [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute lars.schneider
2018-02-25 7:15 ` Eric Sunshine
2018-02-27 11:16 ` Lars Schneider [this message]
2018-02-28 21:20 ` Eric Sunshine
2018-02-24 16:28 ` [PATCH v8 6/7] convert: add tracing for " lars.schneider
2018-02-24 16:28 ` [PATCH v8 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
2018-02-25 19:50 ` Eric Sunshine
2018-03-04 19:08 ` Lars Schneider
2018-03-04 19:58 ` Eric Sunshine
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=3A3B1BD6-642C-467F-B80C-82853410C235@gmail.com \
--to=larsxschneider@gmail.com \
--cc=Johannes.Schindelin@gmx.de \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=j6t@kdbg.org \
--cc=lars.schneider@autodesk.com \
--cc=peff@peff.net \
--cc=ramsay@ramsayjones.plus.com \
--cc=sunshine@sunshineco.com \
--cc=tboegi@web.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).