All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Sunshine <sunshine@sunshineco.com>
To: Lars Schneider <lars.schneider@autodesk.com>
Cc: "Git List" <git@vger.kernel.org>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Torsten Bögershausen" <tboegi@web.de>,
	"Johannes Sixt" <j6t@kdbg.org>, "Jeff King" <peff@peff.net>,
	"Ramsay Jones" <ramsay@ramsayjones.plus.com>,
	"Johannes Schindelin" <Johannes.Schindelin@gmx.de>,
	"Lars Schneider" <larsxschneider@gmail.com>
Subject: Re: [PATCH v8 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding'
Date: Sun, 25 Feb 2018 14:50:56 -0500	[thread overview]
Message-ID: <CAPig+cR81J3fTOtrgAumAs=RC5hqYFfSmeb-ru-Yf_ahFuBiew@mail.gmail.com> (raw)
In-Reply-To: <20180224162801.98860-8-lars.schneider@autodesk.com>

On Sat, Feb 24, 2018 at 11:28 AM,  <lars.schneider@autodesk.com> wrote:
> UTF supports lossless conversion round tripping and conversions between
> UTF and other encodings are mostly round trip safe as Unicode aims to be
> a superset of all other character encodings. However, certain encodings
> (e.g. SHIFT-JIS) are known to have round trip issues [1].
>
> Add 'core.checkRoundtripEncoding', which contains a comma separated
> list of encodings, to define for what encodings Git should check the
> conversion round trip if they are used in the 'working-tree-encoding'
> attribute.
>
> Set SHIFT-JIS as default value for 'core.checkRoundtripEncoding'.
>
> [1] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> diff --git a/convert.c b/convert.c
> @@ -289,6 +289,39 @@ static void trace_encoding(const char *context, const char *path,
> +static int check_roundtrip(const char* enc_name)
> +{
> +       /*
> +        * check_roundtrip_encoding contains a string of space and/or
> +        * comma separated encodings (eg. "UTF-16, ASCII, CP1125").
> +        * Search for the given encoding in that string.
> +        */
> +       const char *found = strcasestr(check_roundtrip_encoding, enc_name);
> +       const char *next = found + strlen(enc_name);

Is this pointer arithmetic undefined behavior (according to the C
standard) if NULL is returned by strcasestr()? If so, how comfortable
are we with this? Perhaps if you add an 'if' into the mix, you can
avoid it:

    if (found) {
        const char *next = found + strlen(enc_name);
        return ...long complicated expression...;
    }
    return false;

> +       int len = strlen(check_roundtrip_encoding);
> +       return (found && (
> +                       /*
> +                        * check that the found encoding is at the
> +                        * beginning of check_roundtrip_encoding or
> +                        * that it is prefixed with a space or comma
> +                        */
> +                       found == check_roundtrip_encoding || (
> +                               found > check_roundtrip_encoding &&
> +                               (*(found-1) == ' ' || *(found-1) == ',')

Although the 'found > check_roundtrip_encoding' expression is
effectively dead code in this case, it does document that the
programmer didn't just blindly use '*(found-1)' without taking
boundary conditions into consideration.

(It's dead code because, at this point, we know that strcasestr()
didn't return NULL and we know that 'found' doesn't equal
'check_roundtrip_encoding', therefore it _must_ be greater than
'check_roundtrip_encoding'.)

Just an observation; not a request to remove the dead code since it
has some documentation value.

> +                       )
> +               ) && (
> +                       /*
> +                        * check that the found encoding is at the
> +                        * end of check_roundtrip_encoding or
> +                        * that it is suffixed with a space or comma
> +                        */
> +                       next == check_roundtrip_encoding + len || (
> +                               next < check_roundtrip_encoding + len &&
> +                               (*next == ' ' || *next == ',')
> +                       )
> +               ));
> +}
> @@ -366,6 +399,47 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
> +       if ((conv_flags & CONV_WRITE_OBJECT) && check_roundtrip(enc->name)) {
> +               re_src = reencode_string_len(dst, dst_len,
> +                                            enc->name, default_encoding,
> +                                            &re_src_len);
> +
> +               trace_printf("Checking roundtrip encoding for %s...\n", enc->name);
> +               trace_encoding("reencoded source", path, enc->name,
> +                              re_src, re_src_len);
> +
> +               if (!re_src || src_len != re_src_len ||
> +                   memcmp(src, re_src, src_len)) {
> +                       const char* msg = _("encoding '%s' from %s to %s and "
> +                                           "back is not the same");
> +                       die(msg, path, enc->name, default_encoding);

Last two arguments need to be swapped.

> +               }
> +
> +               free(re_src);
> +       }
> +
>         strbuf_attach(buf, dst, dst_len, dst_len + 1);
>         return 1;
>  }
> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> @@ -225,4 +225,45 @@ test_expect_success 'error if encoding garbage is already in Git' '
> +test_expect_success 'check roundtrip encoding' '
> +       text="hallo there!\nroundtrip test here!" &&
> +       printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
> +       printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
> +       echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
> +
> +       # SHIFT-JIS encoded files are round-trip checked by default...
> +       GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
> +               grep "Checking roundtrip encoding for SHIFT-JIS" &&

Why redirect to /dev/null? If something does go wrong somewhere, the
more output available for debugging the problem, the better, so
throwing it away unnecessarily seems contraindicated.

> +       git reset &&
> +
> +       # ... unless we overwrite the Git config!
> +       test_config core.checkRoundtripEncoding "garbage" &&
> +       ! GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
> +               grep "Checking roundtrip encoding for SHIFT-JIS" &&
> +       test_unconfig core.checkRoundtripEncoding &&

The "unconfig" won't take place if the test fails. Instead of
test_config/test_unconfig, you could use '-c' to set the config
transiently for the git-add operation:

    ! GIT_TRACE=1 git -c core.checkRoundtripEncoding=garbage add ...

> +       git reset &&
> +
> +       # UTF-16 encoded files should not be round-trip checked by default...
> +       ! GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
> +               grep "Checking roundtrip encoding for UTF-16" &&
> +       git reset &&
> +
> +       # ... unless we tell Git to check it!
> +       test_config_global core.checkRoundtripEncoding "UTF-16, UTF-32" &&
> +       GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
> +               grep "Checking roundtrip encoding for UTF-16" &&
> +       git reset &&
> +
> +       # ... unless we tell Git to check it!
> +       # (here we also check that the casing of the encoding is irrelevant)
> +       test_config_global core.checkRoundtripEncoding "UTF-32, utf-16" &&
> +       GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
> +               grep "Checking roundtrip encoding for UTF-16" &&
> +       git reset &&
> +
> +       # cleanup
> +       rm roundtrip.shift roundtrip.utf16 &&
> +       git reset --hard HEAD
> +'

Same comment as for patch 5/7: This cleanup won't happen if the test
fails. Instead, use test_when_finished() earlier in the test:

    test_expect_success 'check roundtrip encoding' '
        test_when_finished "rm -f roundtrip.shift roundtrip.utf16; git
reset --hard HEAD" &&
        ...

  reply	other threads:[~2018-02-25 19:51 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-24 16:27 [PATCH v8 0/7] convert: add support for different encodings lars.schneider
2018-02-24 16:27 ` [PATCH v8 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
2018-02-24 16:27 ` [PATCH v8 2/7] strbuf: add xstrdup_toupper() lars.schneider
2018-02-24 16:27 ` [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
2018-02-25  3:41   ` Eric Sunshine
2018-02-25 11:35     ` Lars Schneider
2018-02-27  5:17       ` Eric Sunshine
2018-02-28 21:34         ` Lars Schneider
2018-02-24 16:27 ` [PATCH v8 4/7] utf8: add function to detect a missing " lars.schneider
2018-02-25  3:52   ` Eric Sunshine
2018-02-25 11:41     ` Lars Schneider
2018-02-24 16:27 ` [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute lars.schneider
2018-02-25  7:15   ` Eric Sunshine
2018-02-27 11:16     ` Lars Schneider
2018-02-28 21:20       ` Eric Sunshine
2018-02-24 16:28 ` [PATCH v8 6/7] convert: add tracing for " lars.schneider
2018-02-24 16:28 ` [PATCH v8 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
2018-02-25 19:50   ` Eric Sunshine [this message]
2018-03-04 19:08     ` Lars Schneider
2018-03-04 19:58       ` Eric Sunshine

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAPig+cR81J3fTOtrgAumAs=RC5hqYFfSmeb-ru-Yf_ahFuBiew@mail.gmail.com' \
    --to=sunshine@sunshineco.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=j6t@kdbg.org \
    --cc=lars.schneider@autodesk.com \
    --cc=larsxschneider@gmail.com \
    --cc=peff@peff.net \
    --cc=ramsay@ramsayjones.plus.com \
    --cc=tboegi@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.