git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v6 0/7] convert: add support for different encodings
@ 2018-02-09 13:28 lars.schneider
  2018-02-09 13:28 ` [PATCH v6 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
                   ` (7 more replies)
  0 siblings, 8 replies; 14+ messages in thread
From: lars.schneider @ 2018-02-09 13:28 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Hi,

Patches 1-4, 6 are preparation and helper functions.
Patch 5,7 are the actual change.

This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is already
in next.

Changes since v5:

* added 'core.checkRoundtripEncoding' to let the end user define a list
  of encodings for which Git performs conversion round trip checks
  (Junio's feedback)
* add a test case for round trip conversions
* move all round trip related changes into a new patch
* rename has_missing_utf_bom() to is_missing_required_utf_bom() (Junio)
* remove dead code in conversion round trip check (Junio)
* use *.proj files instead of *.txt files as example (Torsten's feedback)
* recommend explicitly setting the eol attribute if working-tree-encoding
  attribute is used (Torsten)
* improve test case and check that 'git ls-files --eol' works as expected
* rename variables from checkout_encoding to working_tree_encoding


Thanks,
Lars

   RFC: https://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@gmail.com/
    v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schneider@autodesk.com/
    v2: https://public-inbox.org/git/20171229152222.39680-1-lars.schneider@autodesk.com/
    v3: https://public-inbox.org/git/20180106004808.77513-1-lars.schneider@autodesk.com/
    v4: https://public-inbox.org/git/20180120152418.52859-1-lars.schneider@autodesk.com/
    v5: https://public-inbox.org/git/20180129201855.9182-1-tboegi@web.de/


Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/9e31832f3c
Checkout: git fetch https://github.com/larsxschneider/git encoding-v6 && git checkout 9e31832f3c


### Interdiff (v5..v6):

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..d7a56054a5 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -530,6 +530,12 @@ core.autocrlf::
    This variable can be set to 'input',
    in which case no output conversion is performed.

+core.checkRoundtripEncoding::
+   A comma separated list of encodings that Git performs UTF-8 round
+   trip checks on if they are used in an `working-tree-encoding`
+   attribute (see linkgit:gitattributes[5]). The default value is
+   `SHIFT-JIS`.
+
 core.symlinks::
    If false, symbolic links are checked out as small plain files that
    contain the link text. linkgit:git-update-index[1] and
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index a8dbf4be30..ea5a9509c6 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -283,10 +283,10 @@ visualize the content.

 In these cases you can tell Git the encoding of a file in the working
 directory with the `working-tree-encoding` attribute. If a file with this
-attributes is added to Git, then Git reencodes the content from the
-specified encoding to UTF-8 and stores the result in its internal data
-structure (called "the index"). On checkout the content is encoded
-back to the specified encoding.
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
+content in its internal data structure (called "the index"). On checkout
+the content is reencoded back to the specified encoding.

 Please note that using the `working-tree-encoding` attribute may have a
 number of pitfalls:
@@ -296,32 +296,38 @@ number of pitfalls:
   expected encoding. Consequently, these files will appear different
   which typically causes trouble. This is in particular the case for
   older Git versions and alternative Git implementations such as JGit
-  or libgit2 (as of January 2018).
+  or libgit2 (as of February 2018).

-- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
-  errors as the conversion might not be round trip safe.
+- Reencoding content to non-UTF encodings can cause errors as the
+  conversion might not be UTF-8 round trip safe. If you suspect your
+  encoding to not be round trip safe, then add it to `core.checkRoundtripEncoding`
+  to make Git check the round trip encoding (see linkgit:git-config[1]).
+  SHIFT-JIS (Japanese character set) is known to have round trip issues
+  with UTF-8 and is checked by default.

 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').

-Use the `working-tree-encoding` attribute only if you cannot store a file in
-UTF-8 encoding and if you want Git to be able to process the content as
-text.
+Use the `working-tree-encoding` attribute only if you cannot store a file
+in UTF-8 encoding and if you want Git to be able to process the content
+as text.

-Use the following attributes if your '*.txt' files are UTF-16 encoded
-with byte order mark (BOM) and you want Git to perform automatic line
-ending conversion based on your platform.
+As an example, use the following attributes if your '*.proj' files are
+UTF-16 encoded with byte order mark (BOM) and you want Git to perform
+automatic line ending conversion based on your platform.

 ------------------------
-*.txt      text working-tree-encoding=UTF-16
+*.proj     text working-tree-encoding=UTF-16
 ------------------------

-Use the following attributes if your '*.txt' files are UTF-16 little
+Use the following attributes if your '*.proj' files are UTF-16 little
 endian encoded without BOM and you want Git to use Windows line endings
-in the working directory.
+in the working directory. Please note, it is highly recommended to
+explicitly define the line endings with `eol` if the `working-tree-encoding`
+attribute is used to avoid ambiguity.

 ------------------------
-*.txt      working-tree-encoding=UTF-16LE text eol=CRLF
+*.proj         working-tree-encoding=UTF-16LE text eol=CRLF
 ------------------------

 You can get a list of all available encodings on your platform with the
@@ -331,6 +337,13 @@ following command:
 iconv --list
 ------------------------

+If you do not know the encoding of a file, then you can use the `file`
+command to guess the encoding:
+
+------------------------
+file foo.proj
+------------------------
+

 `ident`
 ^^^^^^^
diff --git a/config.c b/config.c
index 1f003fbb90..d0ada9fcd4 100644
--- a/config.c
+++ b/config.c
@@ -1172,6 +1172,11 @@ static int git_default_core_config(const char *var, const char *value)
        return 0;
    }

+   if (!strcmp(var, "core.checkroundtripencoding")) {
+       check_roundtrip_encoding = xstrdup(value);
+       return 0;
+   }
+
    if (!strcmp(var, "core.notesref")) {
        notes_ref_name = xstrdup(value);
        return 0;
diff --git a/convert.c b/convert.c
index 13fad490ce..71dffc7167 100644
--- a/convert.c
+++ b/convert.c
@@ -269,7 +269,7 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 static void trace_encoding(const char *context, const char *path,
               const char *encoding, const char *buf, size_t len)
 {
-   static struct trace_key coe = TRACE_KEY_INIT(CHECKOUT_ENCODING);
+   static struct trace_key coe = TRACE_KEY_INIT(WORKING_TREE_ENCODING);
    struct strbuf trace = STRBUF_INIT;
    int i;

@@ -289,6 +289,39 @@ static void trace_encoding(const char *context, const char *path,
    strbuf_release(&trace);
 }

+static int check_roundtrip(const char* enc_name)
+{
+   /*
+    * check_roundtrip_encoding contains a string of space and/or
+    * comma separated encodings (eg. "UTF-16, ASCII, CP1125").
+    * Search for the given encoding in that string.
+    */
+   const char *found = strcasestr(check_roundtrip_encoding, enc_name);
+   const char *next = found + strlen(enc_name);
+   int len = strlen(check_roundtrip_encoding);
+   return (found && (
+           /*
+            * check that the found encoding is at the
+            * beginning of check_roundtrip_encoding or
+            * that it is prefixed with a space or comma
+            */
+           found == check_roundtrip_encoding || (
+               found > check_roundtrip_encoding &&
+               (*(found-1) == ' ' || *(found-1) == ',')
+           )
+       ) && (
+           /*
+            * check that the found encoding is at the
+            * end of check_roundtrip_encoding or
+            * that it is suffixed with a space or comma
+            */
+           next == check_roundtrip_encoding + len || (
+               next < check_roundtrip_encoding + len &&
+               (*next == ' ' || *next == ',')
+           )
+       ));
+}
+
 static struct encoding {
    const char *name;
    struct encoding *next;
@@ -333,7 +366,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
            error(error_msg, path, enc->name);


-   } else if (has_missing_utf_bom(enc->name, src, src_len)) {
+   } else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
        const char *error_msg = _(
            "BOM is required for '%s' if encoded as %s");
        const char *advise_msg = _(
@@ -367,22 +400,25 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
    trace_encoding("destination", path, default_encoding, dst, dst_len);

    /*
-    * UTF supports lossless round tripping [1]. UTF to other encoding are
-    * mostly round trip safe as Unicode aims to be a superset of all other
-    * character encodings. However, the SHIFT-JIS (Japanese character set)
-    * is an exception as some codes are not round trip safe [2].
+    * UTF supports lossless conversion round tripping [1] and conversions
+    * between UTF and other encodings are mostly round trip safe as
+    * Unicode aims to be a superset of all other character encodings.
+    * However, certain encodings (e.g. SHIFT-JIS) are known to have round
+    * trip issues [2]. Check the round trip conversion for all encodings
+    * listed in core.checkRoundTripEncoding.
     *
-    * Reverse the transformation of 'dst' and check the result with 'src'
-    * if content is written to Git. This ensures no information is lost
-    * during conversion to/from UTF-8.
+    * The round trip check is only performed if content is written to Git.
+    * This ensures that no information is lost during conversion to/from
+    * the internal UTF-8 representation.
     *
     * Please note, the code below is not tested because I was not able to
-    * generate a faulty round trip without iconv error.
+    * generate a faulty round trip without an iconv error. Iconv errors
+    * are already caught above.
     *
     * [1] http://unicode.org/faq/utf_bom.html#gen2
     * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
     */
-   if ((conv_flags & CONV_WRITE_OBJECT) && !strcmp(enc->name, "SHIFT-JIS")) {
+   if ((conv_flags & CONV_WRITE_OBJECT) && check_roundtrip(enc->name)) {
        char *re_src;
        int re_src_len;

@@ -390,6 +426,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
                         enc->name, default_encoding,
                         &re_src_len);

+       trace_printf("Checking roundtrip encoding for %s...\n", enc->name);
        trace_encoding("reencoded source", path, enc->name,
                   re_src, re_src_len);

@@ -397,10 +434,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
            memcmp(src, re_src, src_len)) {
            const char* msg = _("encoding '%s' from %s to %s and "
                        "back is not the same");
-           if (conv_flags & CONV_WRITE_OBJECT)
-               die(msg, path, enc->name, default_encoding);
-           else
-               error(msg, path, enc->name, default_encoding);
+           die(msg, path, enc->name, default_encoding);
        }

        free(re_src);
@@ -1165,8 +1199,12 @@ static struct encoding *git_path_check_encoding(struct attr_check_item *check)
    if (!strcasecmp(value, default_encoding))
        return NULL;

-   enc = xcalloc(1, sizeof(struct convert_driver));
-   enc->name = xstrdup_toupper(value);  /* aways use upper case names! */
+   enc = xcalloc(1, sizeof(*enc));
+   /*
+    * Ensure encoding names are always upper case (e.g. UTF-8) to
+    * simplify subsequent string comparisons.
+    */
+   enc->name = xstrdup_toupper(value);
    *encoding_tail = enc;
    encoding_tail = &(enc->next);

@@ -1228,7 +1266,7 @@ struct conv_attrs {
    enum crlf_action attr_action; /* What attr says */
    enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
    int ident;
-   struct encoding *checkout_encoding; /* Supported encoding or default encoding if NULL */
+   struct encoding *working_tree_encoding; /* Supported encoding or default encoding if NULL */
 };

 static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1262,7 +1300,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
            else if (eol_attr == EOL_CRLF)
                ca->crlf_action = CRLF_TEXT_CRLF;
        }
-       ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
+       ca->working_tree_encoding = git_path_check_encoding(ccheck + 5);
    } else {
        ca->drv = NULL;
        ca->crlf_action = CRLF_UNDEFINED;
@@ -1344,7 +1382,7 @@ int convert_to_git(const struct index_state *istate,
        len = dst->len;
    }

-   ret |= encode_to_git(path, src, len, dst, ca.checkout_encoding, conv_flags);
+   ret |= encode_to_git(path, src, len, dst, ca.working_tree_encoding, conv_flags);
    if (ret && dst) {
        src = dst->buf;
        len = dst->len;
@@ -1373,7 +1411,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
    if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
        die("%s: clean filter '%s' failed", path, ca.drv->name);

-   encode_to_git(path, dst->buf, dst->len, dst, ca.checkout_encoding, conv_flags);
+   encode_to_git(path, dst->buf, dst->len, dst, ca.working_tree_encoding, conv_flags);
    crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, conv_flags);
    ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }
@@ -1405,7 +1443,7 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
        }
    }

-   ret |= encode_to_worktree(path, src, len, dst, ca.checkout_encoding);
+   ret |= encode_to_worktree(path, src, len, dst, ca.working_tree_encoding);
    if (ret) {
        src = dst->buf;
        len = dst->len;
@@ -1877,7 +1915,7 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
    if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
        return NULL;

-   if (ca.checkout_encoding)
+   if (ca.working_tree_encoding)
        return NULL;

    if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
diff --git a/convert.h b/convert.h
index 1d9539ed0b..765abfbd60 100644
--- a/convert.h
+++ b/convert.h
@@ -56,6 +56,7 @@ struct delayed_checkout {
 };

 extern enum eol core_eol;
+extern char *check_roundtrip_encoding;
 extern const char *get_cached_convert_stats_ascii(const struct index_state *istate,
                          const char *path);
 extern const char *get_wt_convert_stats_ascii(const char *path);
diff --git a/environment.c b/environment.c
index 10a32c20ac..5bae9131ad 100644
--- a/environment.c
+++ b/environment.c
@@ -50,6 +50,7 @@ int check_replace_refs = 1;
 char *git_replace_ref_base;
 enum eol core_eol = EOL_UNSET;
 int global_conv_flags_eol = CONV_EOL_RNDTRP_WARN;
+char *check_roundtrip_encoding = "SHIFT-JIS";
 unsigned whitespace_rule_cfg = WS_DEFAULT_RULE;
 enum branch_track git_branch_track = BRANCH_TRACK_REMOTE;
 enum rebase_setup_type autorebase = AUTOREBASE_NEVER;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 0f36d4990a..5dcdd5f899 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -4,7 +4,7 @@ test_description='working-tree-encoding conversion via gitattributes'

 . ./test-lib.sh

-GIT_TRACE_CHECKOUT_ENCODING=1 && export GIT_TRACE_CHECKOUT_ENCODING
+GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING

 test_expect_success 'setup test repo' '
    git config core.eol lf &&
@@ -22,6 +22,8 @@ test_expect_success 'setup test repo' '
 test_expect_success 'ensure UTF-8 is stored in Git' '
    git cat-file -p :test.utf16 >test.utf16.git &&
    test_cmp_bin test.utf8.raw test.utf16.git &&
+
+   # cleanup
    rm test.utf8.raw test.utf16.git
 '

@@ -122,6 +124,10 @@ test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '
    cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-16 >crlf.utf16.raw &&
    cp crlf.utf16.raw eol.utf16 &&

+   cat >expectIndexLF <<-\EOF &&
+       i/lf    w/-text attr/text               eol.utf16
+   EOF
+
    git add eol.utf16 &&
    git commit -m eol &&

@@ -130,11 +136,19 @@ test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '
    git -c core.eol=crlf checkout eol.utf16 &&
    test_cmp_bin crlf.utf16.raw eol.utf16 &&

+   # Although the file has CRLF in the working tree, ensure LF in the index
+   git ls-files --eol eol.utf16 >actual &&
+   test_cmp expectIndexLF actual &&
+
    # UTF-16 with LF (Unix line endings)
    rm eol.utf16 &&
    git -c core.eol=lf checkout eol.utf16 &&
    test_cmp_bin lf.utf16.raw eol.utf16 &&

+   # The file LF in the working tree, ensure LF in the index
+   git ls-files --eol eol.utf16 >actual &&
+   test_cmp expectIndexLF actual&&
+
    rm crlf.utf16.raw crlf.utf8.raw lf.utf16.raw lf.utf8.raw &&

    # cleanup
@@ -195,4 +209,45 @@ test_expect_success 'error if encoding garbage is already in Git' '
    git reset --hard $BEFORE_STATE
 '

+test_expect_success 'check roundtrip encoding' '
+   text="hallo there!\nroundtrip test here!" &&
+   printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
+   printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
+   echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
+
+   # SHIFT-JIS encoded files are round-trip checked by default...
+   GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
+       grep "Checking roundtrip encoding for SHIFT-JIS" &&
+   git reset &&
+
+   # ... unless we overwrite the Git config!
+   test_config core.checkRoundTripEncoding "garbage" &&
+   ! GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
+       grep "Checking roundtrip encoding for SHIFT-JIS" &&
+   test_unconfig core.checkRoundTripEncoding &&
+   git reset &&
+
+   # UTF-16 encoded files should not be round-trip checked by default...
+   ! GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+       grep "Checking roundtrip encoding for UTF-16" &&
+   git reset &&
+
+   # ... unless we tell Git to check it!
+   test_config_global core.checkRoundTripEncoding "UTF-16, UTF-32" &&
+   GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+       grep "Checking roundtrip encoding for UTF-16" &&
+   git reset &&
+
+   # ... unless we tell Git to check it!
+   # (here we also check that the casing of the encoding is irrelevant)
+   test_config_global core.checkRoundTripEncoding "UTF-32, utf-16" &&
+   GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+       grep "Checking roundtrip encoding for UTF-16" &&
+   git reset &&
+
+   # cleanup
+   rm roundtrip.shift roundtrip.utf16 &&
+   git reset --hard HEAD
+'
+
 test_done
diff --git a/utf8.c b/utf8.c
index f033fec1c2..5113d26e56 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,7 +562,7 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
    );
 }

-int has_missing_utf_bom(const char *enc, const char *data, size_t len)
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
 {
    return (
       !strcmp(enc, "UTF-16") &&
diff --git a/utf8.h b/utf8.h
index 26b5e91852..62f86fba64 100644
--- a/utf8.h
+++ b/utf8.h
@@ -93,6 +93,6 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
  *     Section 3.10, D98, page 132
  * [3] https://encoding.spec.whatwg.org/#utf-16le
  */
-int has_missing_utf_bom(const char *enc, const char *data, size_t len);
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);

 #endif



### Patches

Lars Schneider (7):
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  strbuf: add xstrdup_toupper()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add 'working-tree-encoding' attribute
  convert: add tracing for 'working-tree-encoding' attribute
  convert: add round trip check based on 'core.checkRoundtripEncoding'

 Documentation/config.txt         |   6 +
 Documentation/gitattributes.txt  |  73 +++++++++++
 config.c                         |   5 +
 convert.c                        | 256 ++++++++++++++++++++++++++++++++++++++-
 convert.h                        |   2 +
 environment.c                    |   1 +
 sha1_file.c                      |   2 +-
 strbuf.c                         |  13 +-
 strbuf.h                         |   1 +
 t/t0028-working-tree-encoding.sh | 253 ++++++++++++++++++++++++++++++++++++++
 utf8.c                           |  37 ++++++
 utf8.h                           |  25 ++++
 12 files changed, 671 insertions(+), 3 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh


base-commit: 8a2f0888555ce46ac87452b194dec5cb66fb1417
--
2.16.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v6 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  2018-02-09 13:28 [PATCH v6 0/7] convert: add support for different encodings lars.schneider
@ 2018-02-09 13:28 ` lars.schneider
  2018-02-09 13:28 ` [PATCH v6 2/7] strbuf: add xstrdup_toupper() lars.schneider
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: lars.schneider @ 2018-02-09 13:28 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Since 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we
allocate the buffer for the lower case string with xmallocz(). This
already ensures a NUL at the end of the allocated buffer.

Remove the unnecessary assignment.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 strbuf.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/strbuf.c b/strbuf.c
index 1df674e919..55b7daeb35 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -781,7 +781,6 @@ char *xstrdup_tolower(const char *string)
 	result = xmallocz(len);
 	for (i = 0; i < len; i++)
 		result[i] = tolower(string[i]);
-	result[i] = '\0';
 	return result;
 }
 
-- 
2.16.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v6 2/7] strbuf: add xstrdup_toupper()
  2018-02-09 13:28 [PATCH v6 0/7] convert: add support for different encodings lars.schneider
  2018-02-09 13:28 ` [PATCH v6 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
@ 2018-02-09 13:28 ` lars.schneider
  2018-02-09 13:28 ` [PATCH v6 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: lars.schneider @ 2018-02-09 13:28 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Create a copy of an existing string and make all characters upper case.
Similar xstrdup_tolower().

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 strbuf.c | 12 ++++++++++++
 strbuf.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index 55b7daeb35..b635f0bdc4 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -784,6 +784,18 @@ char *xstrdup_tolower(const char *string)
 	return result;
 }
 
+char *xstrdup_toupper(const char *string)
+{
+	char *result;
+	size_t len, i;
+
+	len = strlen(string);
+	result = xmallocz(len);
+	for (i = 0; i < len; i++)
+		result[i] = toupper(string[i]);
+	return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
 	struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 14c8c10d66..df7ced53ed 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -607,6 +607,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
 
 /**
  * Create a newly allocated string using printf format. You can do this easily
-- 
2.16.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v6 3/7] utf8: add function to detect prohibited UTF-16/32 BOM
  2018-02-09 13:28 [PATCH v6 0/7] convert: add support for different encodings lars.schneider
  2018-02-09 13:28 ` [PATCH v6 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
  2018-02-09 13:28 ` [PATCH v6 2/7] strbuf: add xstrdup_toupper() lars.schneider
@ 2018-02-09 13:28 ` lars.schneider
  2018-02-09 13:28 ` [PATCH v6 4/7] utf8: add function to detect a missing " lars.schneider
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 14+ messages in thread
From: lars.schneider @ 2018-02-09 13:28 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 utf8.c | 24 ++++++++++++++++++++++++
 utf8.h |  9 +++++++++
 2 files changed, 33 insertions(+)

diff --git a/utf8.c b/utf8.c
index 2c27ce0137..914881cd1f 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,30 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+			  const char *bom, size_t bom_len)
+{
+	return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {0xFE, 0xFF};
+static const char utf16_le_bom[] = {0xFF, 0xFE};
+static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	  (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
+	  (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	   has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	  (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
+	  (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	   has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..4711429af9 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int width,
 		       const char *s);
 
+/*
+ * Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
+ * or UTF-32LE a BOM must not be used [1]. The function returns true if
+ * this is the case.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v6 4/7] utf8: add function to detect a missing UTF-16/32 BOM
  2018-02-09 13:28 [PATCH v6 0/7] convert: add support for different encodings lars.schneider
                   ` (2 preceding siblings ...)
  2018-02-09 13:28 ` [PATCH v6 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
@ 2018-02-09 13:28 ` lars.schneider
  2018-02-09 19:28   ` Junio C Hamano
  2018-02-09 13:28 ` [PATCH v6 5/7] convert: add 'working-tree-encoding' attribute lars.schneider
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 14+ messages in thread
From: lars.schneider @ 2018-02-09 13:28 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
is_missing_required_utf_bom() function returns true if a required BOM
is missing.

The Unicode standard instructs to assume big-endian if there in no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
     Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>

utf
---
 utf8.c | 13 +++++++++++++
 utf8.h | 16 ++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/utf8.c b/utf8.c
index 914881cd1f..5113d26e56 100644
--- a/utf8.c
+++ b/utf8.c
@@ -562,6 +562,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 	);
 }
 
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	   !strcmp(enc, "UTF-16") &&
+	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	   !strcmp(enc, "UTF-32") &&
+	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 4711429af9..62f86fba64 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,20 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there
+ * in no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG
+ * encoding standard used in HTML5 recommends to assume
+ * little-endian to "deal with deployed content" [3].
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ *     Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v6 5/7] convert: add 'working-tree-encoding' attribute
  2018-02-09 13:28 [PATCH v6 0/7] convert: add support for different encodings lars.schneider
                   ` (3 preceding siblings ...)
  2018-02-09 13:28 ` [PATCH v6 4/7] utf8: add function to detect a missing " lars.schneider
@ 2018-02-09 13:28 ` lars.schneider
  2018-02-10  9:48   ` Torsten Bögershausen
  2018-02-09 13:28 ` [PATCH v6 6/7] convert: add tracing for " lars.schneider
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 14+ messages in thread
From: lars.schneider @ 2018-02-09 13:28 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/gitattributes.txt  |  66 ++++++++++++
 convert.c                        | 157 ++++++++++++++++++++++++++++-
 convert.h                        |   1 +
 sha1_file.c                      |   2 +-
 t/t0028-working-tree-encoding.sh | 210 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 434 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..4ecdcd4859 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,72 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Git recognizes files encoded with ASCII or one of its supersets (e.g.
+UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
+interpreted as binary and consequently built-in Git text processing
+tools (e.g. 'git diff') as well as most Git web front ends do not
+visualize the content.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
+content in its internal data structure (called "the index"). On checkout
+the content is reencoded back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Git clients that do not support the `working-tree-encoding` attribute
+  will checkout the respective files UTF-8 encoded and not in the
+  expected encoding. Consequently, these files will appear different
+  which typically causes trouble. This is in particular the case for
+  older Git versions and alternative Git implementations such as JGit
+  or libgit2 (as of February 2018).
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file
+in UTF-8 encoding and if you want Git to be able to process the content
+as text.
+
+As an example, use the following attributes if your '*.proj' files are
+UTF-16 encoded with byte order mark (BOM) and you want Git to perform
+automatic line ending conversion based on your platform.
+
+------------------------
+*.proj		text working-tree-encoding=UTF-16
+------------------------
+
+Use the following attributes if your '*.proj' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory. Please note, it is highly recommended to
+explicitly define the line endings with `eol` if the `working-tree-encoding`
+attribute is used to avoid ambiguity.
+
+------------------------
+*.proj 		working-tree-encoding=UTF-16LE text eol=CRLF
+------------------------
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+------------------------
+iconv --list
+------------------------
+
+If you do not know the encoding of a file, then you can use the `file`
+command to guess the encoding:
+
+------------------------
+file foo.proj
+------------------------
+
+
 `ident`
 ^^^^^^^
 
diff --git a/convert.c b/convert.c
index b976eb968c..dc9e2db6b5 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -265,6 +266,110 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static struct encoding {
+	const char *name;
+	struct encoding *next;
+} *encoding, **encoding_tail;
+static const char *default_encoding = "UTF-8";
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+			 struct strbuf *buf, struct encoding *enc, int conv_flags)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	/*
+	 * Looks like we got called from "would_convert_to_git()".
+	 * This means Git wants to know if it would encode (= modify!)
+	 * the content. Let's answer with "yes", since an encoding was
+	 * specified.
+	 */
+	if (!buf && !src)
+		return 1;
+
+	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is prohibited for '%s' if encoded as %s");
+		const char *advise_msg = _(
+			"You told Git to treat '%s' as %s. A byte order mark "
+			"(BOM) is prohibited with this encoding. Either use "
+			"%.6s as working tree encoding or remove the BOM from the "
+			"file.");
+
+		advise(advise_msg, path, enc->name, enc->name, enc->name);
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
+
+
+	} else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
+		const char *error_msg = _(
+			"BOM is required for '%s' if encoded as %s");
+		const char *advise_msg = _(
+			"You told Git to treat '%s' as %s. A byte order mark "
+			"(BOM) is required with this encoding. Either use "
+			"%sBE/%sLE as working tree encoding or add a BOM to the "
+			"file.");
+		advise(advise_msg, path, enc->name, enc->name, enc->name);
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(error_msg, path, enc->name);
+		else
+			error(error_msg, path, enc->name);
+	}
+
+	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
+				  &dst_len);
+	if (!dst) {
+		/*
+		 * We could add the blob "as-is" to Git. However, on checkout
+		 * we would try to reencode to the original encoding. This
+		 * would fail and we would leave the user with a messed-up
+		 * working tree. Let's try to avoid this by screaming loud.
+		 */
+		const char* msg = _("failed to encode '%s' from %s to %s");
+		if (conv_flags & CONV_WRITE_OBJECT)
+			die(msg, path, enc->name, default_encoding);
+		else
+			error(msg, path, enc->name, default_encoding);
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
+static int encode_to_worktree(const char *path, const char *src, size_t src_len,
+			      struct strbuf *buf, struct encoding *enc)
+{
+	char *dst;
+	int dst_len;
+
+	/*
+	 * No encoding is specified or there is nothing to encode.
+	 * Tell the caller that the content was not modified.
+	 */
+	if (!enc || (src && !src_len))
+		return 0;
+
+	dst = reencode_string_len(src, src_len, enc->name, default_encoding,
+				  &dst_len);
+	if (!dst) {
+		error("failed to encode '%s' from %s to %s",
+			path, enc->name, default_encoding);
+		return 0;
+	}
+
+	strbuf_attach(buf, dst, dst_len, dst_len + 1);
+	return 1;
+}
+
 static int crlf_to_git(const struct index_state *istate,
 		       const char *path, const char *src, size_t len,
 		       struct strbuf *buf,
@@ -978,6 +1083,35 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
 	return 1;
 }
 
+static struct encoding *git_path_check_encoding(struct attr_check_item *check)
+{
+	const char *value = check->value;
+	struct encoding *enc;
+
+	if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
+	    !strlen(value))
+		return NULL;
+
+	for (enc = encoding; enc; enc = enc->next)
+		if (!strcasecmp(value, enc->name))
+			return enc;
+
+	/* Don't encode to the default encoding */
+	if (!strcasecmp(value, default_encoding))
+		return NULL;
+
+	enc = xcalloc(1, sizeof(*enc));
+	/*
+	 * Ensure encoding names are always upper case (e.g. UTF-8) to
+	 * simplify subsequent string comparisons.
+	 */
+	enc->name = xstrdup_toupper(value);
+	*encoding_tail = enc;
+	encoding_tail = &(enc->next);
+
+	return enc;
+}
+
 static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
 {
 	const char *value = check->value;
@@ -1033,6 +1167,7 @@ struct conv_attrs {
 	enum crlf_action attr_action; /* What attr says */
 	enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
 	int ident;
+	struct encoding *working_tree_encoding; /* Supported encoding or default encoding if NULL */
 };
 
 static void convert_attrs(struct conv_attrs *ca, const char *path)
@@ -1041,8 +1176,10 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 
 	if (!check) {
 		check = attr_check_initl("crlf", "ident", "filter",
-					 "eol", "text", NULL);
+					 "eol", "text", "working-tree-encoding",
+					 NULL);
 		user_convert_tail = &user_convert;
+		encoding_tail = &encoding;
 		git_config(read_convert_config, NULL);
 	}
 
@@ -1064,6 +1201,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 			else if (eol_attr == EOL_CRLF)
 				ca->crlf_action = CRLF_TEXT_CRLF;
 		}
+		ca->working_tree_encoding = git_path_check_encoding(ccheck + 5);
 	} else {
 		ca->drv = NULL;
 		ca->crlf_action = CRLF_UNDEFINED;
@@ -1144,6 +1282,13 @@ int convert_to_git(const struct index_state *istate,
 		src = dst->buf;
 		len = dst->len;
 	}
+
+	ret |= encode_to_git(path, src, len, dst, ca.working_tree_encoding, conv_flags);
+	if (ret && dst) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	if (!(conv_flags & CONV_EOL_KEEP_CRLF)) {
 		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, conv_flags);
 		if (ret && dst) {
@@ -1167,6 +1312,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
 	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
+	encode_to_git(path, dst->buf, dst->len, dst, ca.working_tree_encoding, conv_flags);
 	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, conv_flags);
 	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
 }
@@ -1198,6 +1344,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
 		}
 	}
 
+	ret |= encode_to_worktree(path, src, len, dst, ca.working_tree_encoding);
+	if (ret) {
+		src = dst->buf;
+		len = dst->len;
+	}
+
 	ret_filter = apply_filter(
 		path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco);
 	if (!ret_filter && ca.drv && ca.drv->required)
@@ -1664,6 +1816,9 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
 	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
 		return NULL;
 
+	if (ca.working_tree_encoding)
+		return NULL;
+
 	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
 		return NULL;
 
diff --git a/convert.h b/convert.h
index 65ab3e5167..1d9539ed0b 100644
--- a/convert.h
+++ b/convert.h
@@ -12,6 +12,7 @@ struct index_state;
 #define CONV_EOL_RNDTRP_WARN  (1<<1) /* Warn if CRLF to LF to CRLF is different */
 #define CONV_EOL_RENORMALIZE  (1<<2) /* Convert CRLF to LF */
 #define CONV_EOL_KEEP_CRLF    (1<<3) /* Keep CRLF line endings as is */
+#define CONV_WRITE_OBJECT     (1<<4) /* Content is written to the index */
 
 extern int global_conv_flags_eol;
 
diff --git a/sha1_file.c b/sha1_file.c
index 6bc7c6ada9..e2f319d677 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -138,7 +138,7 @@ static int get_conv_flags(unsigned flags)
 	if (flags & HASH_RENORMALIZE)
 		return CONV_EOL_RENORMALIZE;
 	else if (flags & HASH_WRITE_OBJECT)
-	  return global_conv_flags_eol;
+		return global_conv_flags_eol | CONV_WRITE_OBJECT;
 	else
 		return 0;
 }
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
new file mode 100755
index 0000000000..f9ce3e5ef5
--- /dev/null
+++ b/t/t0028-working-tree-encoding.sh
@@ -0,0 +1,210 @@
+#!/bin/sh
+
+test_description='working-tree-encoding conversion via gitattributes'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repo' '
+	git config core.eol lf &&
+
+	text="hallo there!\ncan you read me?" &&
+	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
+	printf "$text" >test.utf8.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+	cp test.utf16.raw test.utf16 &&
+
+	git add .gitattributes test.utf16 &&
+	git commit -m initial
+'
+
+test_expect_success 'ensure UTF-8 is stored in Git' '
+	git cat-file -p :test.utf16 >test.utf16.git &&
+	test_cmp_bin test.utf8.raw test.utf16.git &&
+
+	# cleanup
+	rm test.utf8.raw test.utf16.git
+'
+
+test_expect_success 're-encode to UTF-16 on checkout' '
+	rm test.utf16 &&
+	git checkout test.utf16 &&
+	test_cmp_bin test.utf16.raw test.utf16 &&
+
+	# cleanup
+	rm test.utf16.raw
+'
+
+test_expect_success 'check prohibited UTF BOM' '
+	printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
+	printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
+	printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
+	printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
+
+	printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
+	printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
+	printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
+	printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
+
+	echo "*.utf16be text working-tree-encoding=utf-16be" >>.gitattributes &&
+	echo "*.utf16le text working-tree-encoding=utf-16le" >>.gitattributes &&
+	echo "*.utf32be text working-tree-encoding=utf-32be" >>.gitattributes &&
+	echo "*.utf32le text working-tree-encoding=utf-32le" >>.gitattributes &&
+
+	# Here we add a UTF-16 files with BOM (big-endian and little-endian)
+	# but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
+	# the BOM is prohibited.
+	cp bebom.utf16be.raw bebom.utf16be &&
+	test_must_fail git add bebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16be &&
+	test_must_fail git add lebom.utf16be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
+
+	cp bebom.utf16be.raw bebom.utf16le &&
+	test_must_fail git add bebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	cp lebom.utf16le.raw lebom.utf16le &&
+	test_must_fail git add lebom.utf16le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
+
+	# ... and the same for UTF-32
+	cp bebom.utf32be.raw bebom.utf32be &&
+	test_must_fail git add bebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32be &&
+	test_must_fail git add lebom.utf32be 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
+
+	cp bebom.utf32be.raw bebom.utf32le &&
+	test_must_fail git add bebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	cp lebom.utf32le.raw lebom.utf32le &&
+	test_must_fail git add lebom.utf32le 2>err.out &&
+	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
+
+	# cleanup
+	git reset --hard HEAD
+'
+
+test_expect_success 'check required UTF BOM' '
+	echo "*.utf32 text working-tree-encoding=utf-32" >>.gitattributes &&
+
+	cp nobom.utf16be.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf16le.raw nobom.utf16 &&
+	test_must_fail git add nobom.utf16 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
+
+	cp nobom.utf32be.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	cp nobom.utf32le.raw nobom.utf32 &&
+	test_must_fail git add nobom.utf32 2>err.out &&
+	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
+
+	# cleanup
+	rm nobom.utf16 nobom.utf32 &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '
+	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
+	printf "one\r\ntwo\r\nthree\r\n" >crlf.utf8.raw &&
+
+	cat lf.utf8.raw | iconv -f UTF-8 -t UTF-16 >lf.utf16.raw &&
+	cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-16 >crlf.utf16.raw &&
+	cp crlf.utf16.raw eol.utf16 &&
+
+	cat >expectIndexLF <<-\EOF &&
+		i/lf    w/-text attr/text             	eol.utf16
+	EOF
+
+	git add eol.utf16 &&
+	git commit -m eol &&
+
+	# UTF-16 with CRLF (Windows line endings)
+	rm eol.utf16 &&
+	git -c core.eol=crlf checkout eol.utf16 &&
+	test_cmp_bin crlf.utf16.raw eol.utf16 &&
+
+	# Although the file has CRLF in the working tree, ensure LF in the index
+	git ls-files --eol eol.utf16 >actual &&
+	test_cmp expectIndexLF actual &&
+
+	# UTF-16 with LF (Unix line endings)
+	rm eol.utf16 &&
+	git -c core.eol=lf checkout eol.utf16 &&
+	test_cmp_bin lf.utf16.raw eol.utf16 &&
+
+	# The file LF in the working tree, ensure LF in the index
+	git ls-files --eol eol.utf16 >actual &&
+	test_cmp expectIndexLF actual&&
+
+	rm crlf.utf16.raw crlf.utf8.raw lf.utf16.raw lf.utf8.raw &&
+
+	# cleanup
+	git reset --hard HEAD^
+'
+
+test_expect_success 'check unsupported encodings' '
+
+	echo "*.nothing text working-tree-encoding=" >>.gitattributes &&
+	printf "nothing" >t.nothing &&
+	git add t.nothing &&
+
+	echo "*.garbage text working-tree-encoding=garbage" >>.gitattributes &&
+	printf "garbage" >t.garbage &&
+	test_must_fail git add t.garbage 2>err.out &&
+	test_i18ngrep "fatal: failed to encode" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard HEAD
+'
+
+test_expect_success 'error if encoding round trip is not the same during refresh' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no working tree encoding support
+	echo "hallo" >nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	test_must_fail git checkout HEAD^ 2>err.out &&
+	test_i18ngrep "error: .* overwritten by checkout:" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_expect_success 'error if encoding garbage is already in Git' '
+	BEFORE_STATE=$(git rev-parse HEAD) &&
+
+	# Skip the UTF-16 filter for the added file
+	# This simulates a Git version that has no checkoutEncoding support
+	cp nobom.utf16be.raw nonsense.utf16 &&
+	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
+	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
+	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
+	git update-ref refs/heads/master $COMMIT &&
+
+	git diff 2>err.out &&
+	test_i18ngrep "error: BOM is required" err.out &&
+
+	# cleanup
+	rm err.out &&
+	git reset --hard $BEFORE_STATE
+'
+
+test_done
-- 
2.16.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v6 6/7] convert: add tracing for 'working-tree-encoding' attribute
  2018-02-09 13:28 [PATCH v6 0/7] convert: add support for different encodings lars.schneider
                   ` (4 preceding siblings ...)
  2018-02-09 13:28 ` [PATCH v6 5/7] convert: add 'working-tree-encoding' attribute lars.schneider
@ 2018-02-09 13:28 ` lars.schneider
  2018-02-09 13:28 ` [PATCH v6 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
  2018-02-09 20:09 ` [PATCH v6 0/7] convert: add support for different encodings Junio C Hamano
  7 siblings, 0 replies; 14+ messages in thread
From: lars.schneider @ 2018-02-09 13:28 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

Add the GIT_TRACE_WORKING_TREE_ENCODING environment variable to enable
tracing for content that is reencoded with the 'working-tree-encoding'
attribute. This is useful to debug encoding issues.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 convert.c                        | 25 +++++++++++++++++++++++++
 t/t0028-working-tree-encoding.sh |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/convert.c b/convert.c
index dc9e2db6b5..5b49416ee1 100644
--- a/convert.c
+++ b/convert.c
@@ -266,6 +266,29 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 
 }
 
+static void trace_encoding(const char *context, const char *path,
+			   const char *encoding, const char *buf, size_t len)
+{
+	static struct trace_key coe = TRACE_KEY_INIT(WORKING_TREE_ENCODING);
+	struct strbuf trace = STRBUF_INIT;
+	int i;
+
+	strbuf_addf(&trace, "%s (%s, considered %s):\n", context, path, encoding);
+	for (i = 0; i < len && buf; ++i) {
+		strbuf_addf(
+			&trace,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+			i,
+			(unsigned char) buf[i],
+			(buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+			((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+		);
+	}
+	strbuf_addchars(&trace, '\n', 1);
+
+	trace_strbuf(&coe, &trace);
+	strbuf_release(&trace);
+}
+
 static struct encoding {
 	const char *name;
 	struct encoding *next;
@@ -325,6 +348,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 			error(error_msg, path, enc->name);
 	}
 
+	trace_encoding("source", path, enc->name, src, src_len);
 	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
 				  &dst_len);
 	if (!dst) {
@@ -340,6 +364,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 		else
 			error(msg, path, enc->name, default_encoding);
 	}
+	trace_encoding("destination", path, default_encoding, dst, dst_len);
 
 	strbuf_attach(buf, dst, dst_len, dst_len + 1);
 	return 1;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index f9ce3e5ef5..01789ae1b8 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -4,6 +4,8 @@ test_description='working-tree-encoding conversion via gitattributes'
 
 . ./test-lib.sh
 
+GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
+
 test_expect_success 'setup test repo' '
 	git config core.eol lf &&
 
-- 
2.16.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v6 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding'
  2018-02-09 13:28 [PATCH v6 0/7] convert: add support for different encodings lars.schneider
                   ` (5 preceding siblings ...)
  2018-02-09 13:28 ` [PATCH v6 6/7] convert: add tracing for " lars.schneider
@ 2018-02-09 13:28 ` lars.schneider
  2018-02-09 20:09 ` [PATCH v6 0/7] convert: add support for different encodings Junio C Hamano
  7 siblings, 0 replies; 14+ messages in thread
From: lars.schneider @ 2018-02-09 13:28 UTC (permalink / raw)
  To: git
  Cc: gitster, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

UTF supports lossless conversion round tripping and conversions between
UTF and other encodings are mostly round trip safe as Unicode aims to be
a superset of all other character encodings. However, certain encodings
(e.g. SHIFT-JIS) are known to have round trip issues [1].

Add 'core.checkRoundTripEncoding', which contains a comma separated
list of encodings, to define for what encodings Git should check the
conversion round trip if they are used in the 'working-tree-encoding'
attribute.

Set SHIFT-JIS as default value for 'core.checkRoundTripEncoding'.

[1] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---
 Documentation/config.txt         |  6 ++++
 Documentation/gitattributes.txt  |  7 ++++
 config.c                         |  5 +++
 convert.c                        | 74 ++++++++++++++++++++++++++++++++++++++++
 convert.h                        |  1 +
 environment.c                    |  1 +
 t/t0028-working-tree-encoding.sh | 41 ++++++++++++++++++++++
 7 files changed, 135 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..d7a56054a5 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -530,6 +530,12 @@ core.autocrlf::
 	This variable can be set to 'input',
 	in which case no output conversion is performed.
 
+core.checkRoundtripEncoding::
+	A comma separated list of encodings that Git performs UTF-8 round
+	trip checks on if they are used in an `working-tree-encoding`
+	attribute (see linkgit:gitattributes[5]). The default value is
+	`SHIFT-JIS`.
+
 core.symlinks::
 	If false, symbolic links are checked out as small plain files that
 	contain the link text. linkgit:git-update-index[1] and
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 4ecdcd4859..ea5a9509c6 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -298,6 +298,13 @@ number of pitfalls:
   older Git versions and alternative Git implementations such as JGit
   or libgit2 (as of February 2018).
 
+- Reencoding content to non-UTF encodings can cause errors as the
+  conversion might not be UTF-8 round trip safe. If you suspect your
+  encoding to not be round trip safe, then add it to `core.checkRoundtripEncoding`
+  to make Git check the round trip encoding (see linkgit:git-config[1]).
+  SHIFT-JIS (Japanese character set) is known to have round trip issues
+  with UTF-8 and is checked by default.
+
 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').
 
diff --git a/config.c b/config.c
index 1f003fbb90..d0ada9fcd4 100644
--- a/config.c
+++ b/config.c
@@ -1172,6 +1172,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.checkroundtripencoding")) {
+		check_roundtrip_encoding = xstrdup(value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.notesref")) {
 		notes_ref_name = xstrdup(value);
 		return 0;
diff --git a/convert.c b/convert.c
index 5b49416ee1..71dffc7167 100644
--- a/convert.c
+++ b/convert.c
@@ -289,6 +289,39 @@ static void trace_encoding(const char *context, const char *path,
 	strbuf_release(&trace);
 }
 
+static int check_roundtrip(const char* enc_name)
+{
+	/*
+	 * check_roundtrip_encoding contains a string of space and/or
+	 * comma separated encodings (eg. "UTF-16, ASCII, CP1125").
+	 * Search for the given encoding in that string.
+	 */
+	const char *found = strcasestr(check_roundtrip_encoding, enc_name);
+	const char *next = found + strlen(enc_name);
+	int len = strlen(check_roundtrip_encoding);
+	return (found && (
+			/*
+			 * check that the found encoding is at the
+			 * beginning of check_roundtrip_encoding or
+			 * that it is prefixed with a space or comma
+			 */
+			found == check_roundtrip_encoding || (
+				found > check_roundtrip_encoding &&
+				(*(found-1) == ' ' || *(found-1) == ',')
+			)
+		) && (
+			/*
+			 * check that the found encoding is at the
+			 * end of check_roundtrip_encoding or
+			 * that it is suffixed with a space or comma
+			 */
+			next == check_roundtrip_encoding + len || (
+				next < check_roundtrip_encoding + len &&
+				(*next == ' ' || *next == ',')
+			)
+		));
+}
+
 static struct encoding {
 	const char *name;
 	struct encoding *next;
@@ -366,6 +399,47 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 	}
 	trace_encoding("destination", path, default_encoding, dst, dst_len);
 
+	/*
+	 * UTF supports lossless conversion round tripping [1] and conversions
+	 * between UTF and other encodings are mostly round trip safe as
+	 * Unicode aims to be a superset of all other character encodings.
+	 * However, certain encodings (e.g. SHIFT-JIS) are known to have round
+	 * trip issues [2]. Check the round trip conversion for all encodings
+	 * listed in core.checkRoundTripEncoding.
+	 *
+	 * The round trip check is only performed if content is written to Git.
+	 * This ensures that no information is lost during conversion to/from
+	 * the internal UTF-8 representation.
+	 *
+	 * Please note, the code below is not tested because I was not able to
+	 * generate a faulty round trip without an iconv error. Iconv errors
+	 * are already caught above.
+	 *
+	 * [1] http://unicode.org/faq/utf_bom.html#gen2
+	 * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
+	 */
+	if ((conv_flags & CONV_WRITE_OBJECT) && check_roundtrip(enc->name)) {
+		char *re_src;
+		int re_src_len;
+
+		re_src = reencode_string_len(dst, dst_len,
+					     enc->name, default_encoding,
+					     &re_src_len);
+
+		trace_printf("Checking roundtrip encoding for %s...\n", enc->name);
+		trace_encoding("reencoded source", path, enc->name,
+			       re_src, re_src_len);
+
+		if (!re_src || src_len != re_src_len ||
+		    memcmp(src, re_src, src_len)) {
+			const char* msg = _("encoding '%s' from %s to %s and "
+					    "back is not the same");
+			die(msg, path, enc->name, default_encoding);
+		}
+
+		free(re_src);
+	}
+
 	strbuf_attach(buf, dst, dst_len, dst_len + 1);
 	return 1;
 }
diff --git a/convert.h b/convert.h
index 1d9539ed0b..765abfbd60 100644
--- a/convert.h
+++ b/convert.h
@@ -56,6 +56,7 @@ struct delayed_checkout {
 };
 
 extern enum eol core_eol;
+extern char *check_roundtrip_encoding;
 extern const char *get_cached_convert_stats_ascii(const struct index_state *istate,
 						  const char *path);
 extern const char *get_wt_convert_stats_ascii(const char *path);
diff --git a/environment.c b/environment.c
index 10a32c20ac..5bae9131ad 100644
--- a/environment.c
+++ b/environment.c
@@ -50,6 +50,7 @@ int check_replace_refs = 1;
 char *git_replace_ref_base;
 enum eol core_eol = EOL_UNSET;
 int global_conv_flags_eol = CONV_EOL_RNDTRP_WARN;
+char *check_roundtrip_encoding = "SHIFT-JIS";
 unsigned whitespace_rule_cfg = WS_DEFAULT_RULE;
 enum branch_track git_branch_track = BRANCH_TRACK_REMOTE;
 enum rebase_setup_type autorebase = AUTOREBASE_NEVER;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 01789ae1b8..5dcdd5f899 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -209,4 +209,45 @@ test_expect_success 'error if encoding garbage is already in Git' '
 	git reset --hard $BEFORE_STATE
 '
 
+test_expect_success 'check roundtrip encoding' '
+	text="hallo there!\nroundtrip test here!" &&
+	printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift &&
+	printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 &&
+	echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes &&
+
+	# SHIFT-JIS encoded files are round-trip checked by default...
+	GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
+		grep "Checking roundtrip encoding for SHIFT-JIS" &&
+	git reset &&
+
+	# ... unless we overwrite the Git config!
+	test_config core.checkRoundTripEncoding "garbage" &&
+	! GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
+		grep "Checking roundtrip encoding for SHIFT-JIS" &&
+	test_unconfig core.checkRoundTripEncoding &&
+	git reset &&
+
+	# UTF-16 encoded files should not be round-trip checked by default...
+	! GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+		grep "Checking roundtrip encoding for UTF-16" &&
+	git reset &&
+
+	# ... unless we tell Git to check it!
+	test_config_global core.checkRoundTripEncoding "UTF-16, UTF-32" &&
+	GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+		grep "Checking roundtrip encoding for UTF-16" &&
+	git reset &&
+
+	# ... unless we tell Git to check it!
+	# (here we also check that the casing of the encoding is irrelevant)
+	test_config_global core.checkRoundTripEncoding "UTF-32, utf-16" &&
+	GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
+		grep "Checking roundtrip encoding for UTF-16" &&
+	git reset &&
+
+	# cleanup
+	rm roundtrip.shift roundtrip.utf16 &&
+	git reset --hard HEAD
+'
+
 test_done
-- 
2.16.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 4/7] utf8: add function to detect a missing UTF-16/32 BOM
  2018-02-09 13:28 ` [PATCH v6 4/7] utf8: add function to detect a missing " lars.schneider
@ 2018-02-09 19:28   ` Junio C Hamano
  2018-02-09 19:47     ` Lars Schneider
  0 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2018-02-09 19:28 UTC (permalink / raw)
  To: lars.schneider
  Cc: git, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

lars.schneider@autodesk.com writes:

> From: Lars Schneider <larsxschneider@gmail.com>
>
> If the endianness is not defined in the encoding name, then let's
> ...
> [3] https://encoding.spec.whatwg.org/#utf-16le
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>
> utf
> ---

Huh?

>  utf8.c | 13 +++++++++++++
>  utf8.h | 16 ++++++++++++++++
>  2 files changed, 29 insertions(+)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 4/7] utf8: add function to detect a missing UTF-16/32 BOM
  2018-02-09 19:28   ` Junio C Hamano
@ 2018-02-09 19:47     ` Lars Schneider
  0 siblings, 0 replies; 14+ messages in thread
From: Lars Schneider @ 2018-02-09 19:47 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Lars Schneider, git, Torsten Bögershausen, Johannes Sixt,
	Eric Sunshine, Jeff King, ramsay, Johannes.Schindelin


> On 09 Feb 2018, at 20:28, Junio C Hamano <gitster@pobox.com> wrote:
> 
> lars.schneider@autodesk.com writes:
> 
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> If the endianness is not defined in the encoding name, then let's
>> ...
>> [3] https://encoding.spec.whatwg.org/#utf-16le
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
>> 
>> utf
>> ---
> 
> Huh?

That is garbage, sorry! :-(

Can you remove it for me?

Thanks,
Lars

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 0/7] convert: add support for different encodings
  2018-02-09 13:28 [PATCH v6 0/7] convert: add support for different encodings lars.schneider
                   ` (6 preceding siblings ...)
  2018-02-09 13:28 ` [PATCH v6 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
@ 2018-02-09 20:09 ` Junio C Hamano
  2018-02-10  1:04   ` Lars Schneider
  7 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2018-02-09 20:09 UTC (permalink / raw)
  To: lars.schneider
  Cc: git, tboegi, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

Documentation has core.checkRoundtripEncoding while t0028 and a
comment in convert.c capitalize it differently.  I suspect that it
would be more reader-friendly to update the documentation to match.



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 0/7] convert: add support for different encodings
  2018-02-09 20:09 ` [PATCH v6 0/7] convert: add support for different encodings Junio C Hamano
@ 2018-02-10  1:04   ` Lars Schneider
  0 siblings, 0 replies; 14+ messages in thread
From: Lars Schneider @ 2018-02-10  1:04 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: lars.schneider, git, tboegi, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin


> On 09 Feb 2018, at 21:09, Junio C Hamano <gitster@pobox.com> wrote:
> 
> Documentation has core.checkRoundtripEncoding while t0028 and a
> comment in convert.c capitalize it differently.  I suspect that it
> would be more reader-friendly to update the documentation to match.

Agreed. I will wait a little for additional feedback and then send 
a new round.

Thanks,
Lars

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 5/7] convert: add 'working-tree-encoding' attribute
  2018-02-09 13:28 ` [PATCH v6 5/7] convert: add 'working-tree-encoding' attribute lars.schneider
@ 2018-02-10  9:48   ` Torsten Bögershausen
  2018-02-14 13:22     ` Lars Schneider
  0 siblings, 1 reply; 14+ messages in thread
From: Torsten Bögershausen @ 2018-02-10  9:48 UTC (permalink / raw)
  To: lars.schneider
  Cc: git, gitster, j6t, sunshine, peff, ramsay, Johannes.Schindelin,
	Lars Schneider

On Fri, Feb 09, 2018 at 02:28:28PM +0100, lars.schneider@autodesk.com wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
> 
> Git recognizes files encoded with ASCII or one of its supersets (e.g.
> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
> interpreted as binary and consequently built-in Git text processing
> tools (e.g. 'git diff') as well as most Git web front ends do not
> visualize the content.
> 
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.
> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
>  Documentation/gitattributes.txt  |  66 ++++++++++++
>  convert.c                        | 157 ++++++++++++++++++++++++++++-
>  convert.h                        |   1 +
>  sha1_file.c                      |   2 +-
>  t/t0028-working-tree-encoding.sh | 210 +++++++++++++++++++++++++++++++++++++++
>  5 files changed, 434 insertions(+), 2 deletions(-)
>  create mode 100755 t/t0028-working-tree-encoding.sh
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 30687de81a..4ecdcd4859 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -272,6 +272,72 @@ few exceptions.  Even though...
>    catch potential problems early, safety triggers.
>  
>  
> +`working-tree-encoding`
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Git recognizes files encoded with ASCII or one of its supersets (e.g.
> +UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
> +interpreted as binary and consequently built-in Git text processing
> +tools (e.g. 'git diff') as well as most Git web front ends do not
> +visualize the content.
> +
> +In these cases you can tell Git the encoding of a file in the working
> +directory with the `working-tree-encoding` attribute. If a file with this
> +attribute is added to Git, then Git reencodes the content from the
> +specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
> +content in its internal data structure (called "the index"). On checkout
> +the content is reencoded back to the specified encoding.
> +
> +Please note that using the `working-tree-encoding` attribute may have a
> +number of pitfalls:
> +
> +- Git clients that do not support the `working-tree-encoding` attribute

A client to Git ?
Or may be "third party Git implementations"

> +  will checkout the respective files UTF-8 encoded and not in the
> +  expected encoding. Consequently, these files will appear different
> +  which typically causes trouble. This is in particular the case for
> +  older Git versions and alternative Git implementations such as JGit
> +  or libgit2 (as of February 2018).
> +
> +- Reencoding content requires resources that might slow down certain
> +  Git operations (e.g 'git checkout' or 'git add').
> +
> +Use the `working-tree-encoding` attribute only if you cannot store a file
> +in UTF-8 encoding and if you want Git to be able to process the content
> +as text.
> +
> +As an example, use the following attributes if your '*.proj' files are
> +UTF-16 encoded with byte order mark (BOM) and you want Git to perform
> +automatic line ending conversion based on your platform.
> +
> +------------------------
> +*.proj		text working-tree-encoding=UTF-16
> +------------------------
> +
> +Use the following attributes if your '*.proj' files are UTF-16 little
> +endian encoded without BOM and you want Git to use Windows line endings
> +in the working directory. Please note, it is highly recommended to
> +explicitly define the line endings with `eol` if the `working-tree-encoding`
> +attribute is used to avoid ambiguity.
> +
> +------------------------
> +*.proj 		working-tree-encoding=UTF-16LE text eol=CRLF
> +------------------------
> +
> +You can get a list of all available encodings on your platform with the
> +following command:

One question:
 +*.proj		text working-tree-encoding=UTF-16
vs
*.proj 		working-tree-encoding=UTF-16LE text eol=CRLF

Technically the order of attributes doesn't matter, but that is not what we
want to demonstrate here and now.
I would probably move the "text" attribute to the end of the line.
So that readers don't start to wonder if the order is important.

> +
> +------------------------
> +iconv --list
> +------------------------
> +
> +If you do not know the encoding of a file, then you can use the `file`
> +command to guess the encoding:
> +
> +------------------------
> +file foo.proj
> +------------------------
> +
> +
>  `ident`
>  ^^^^^^^
>  
> diff --git a/convert.c b/convert.c
> index b976eb968c..dc9e2db6b5 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -7,6 +7,7 @@
>  #include "sigchain.h"
>  #include "pkt-line.h"
>  #include "sub-process.h"
> +#include "utf8.h"
>  
>  /*
>   * convert.c - convert a file when checking it out and checking it in.
> @@ -265,6 +266,110 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>  
>  }
>  
> +static struct encoding {
> +	const char *name;
> +	struct encoding *next;
> +} *encoding, **encoding_tail;
> +static const char *default_encoding = "UTF-8";
> +
> +static int encode_to_git(const char *path, const char *src, size_t src_len,
> +			 struct strbuf *buf, struct encoding *enc, int conv_flags)
> +{
> +	char *dst;
> +	int dst_len;
> +
> +	/*
> +	 * No encoding is specified or there is nothing to encode.
> +	 * Tell the caller that the content was not modified.
> +	 */
> +	if (!enc || (src && !src_len))
> +		return 0;
> +
> +	/*
> +	 * Looks like we got called from "would_convert_to_git()".
> +	 * This means Git wants to know if it would encode (= modify!)
> +	 * the content. Let's answer with "yes", since an encoding was
> +	 * specified.
> +	 */
> +	if (!buf && !src)
> +		return 1;
> +
> +	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
> +		const char *error_msg = _(
> +			"BOM is prohibited for '%s' if encoded as %s");
> +		const char *advise_msg = _(
> +			"You told Git to treat '%s' as %s. A byte order mark "
> +			"(BOM) is prohibited with this encoding. Either use "
> +			"%.6s as working tree encoding or remove the BOM from the "
> +			"file.");

"You told Git" is probly right from Gits point of view, and advises are really helpfull.
But what should the user do about it ?
Could we give a better advise ?


"A byte order mark (BOM) is prohibited with %s.
Please remove the BOM from the file %s 
or use "%s as working-tree-encoding"

I would probably suspect that a tool wrote the BOM, and that is
good and can or should not be changed by a user.

So a simply message like this could be the preferred (and only)
solution for a user:
"A byte order mark (BOM) is prohibited with %s.
Please use "%s as working-tree-encoding"


(And why %.6s and not simply %s ?)

No more comments for now, I didn't review the test cases.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v6 5/7] convert: add 'working-tree-encoding' attribute
  2018-02-10  9:48   ` Torsten Bögershausen
@ 2018-02-14 13:22     ` Lars Schneider
  0 siblings, 0 replies; 14+ messages in thread
From: Lars Schneider @ 2018-02-14 13:22 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Lars Schneider, git, gitster, j6t, sunshine, peff, ramsay,
	Johannes.Schindelin


> On 10 Feb 2018, at 10:48, Torsten Bögershausen <tboegi@web.de> wrote:
> 
> On Fri, Feb 09, 2018 at 02:28:28PM +0100, lars.schneider@autodesk.com wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>> 
>> ...
>> 
>> +Please note that using the `working-tree-encoding` attribute may have a
>> +number of pitfalls:
>> +
>> +- Git clients that do not support the `working-tree-encoding` attribute
> 
> A client to Git ?
> Or may be "third party Git implementations"

OK, I'll go with "Third party Git implementations".


>> 
>> +As an example, use the following attributes if your '*.proj' files are
>> +UTF-16 encoded with byte order mark (BOM) and you want Git to perform
>> +automatic line ending conversion based on your platform.
>> +
>> +------------------------
>> +*.proj		text working-tree-encoding=UTF-16
>> +------------------------
>> +
>> +Use the following attributes if your '*.proj' files are UTF-16 little
>> +endian encoded without BOM and you want Git to use Windows line endings
>> +in the working directory. Please note, it is highly recommended to
>> +explicitly define the line endings with `eol` if the `working-tree-encoding`
>> +attribute is used to avoid ambiguity.
>> +
>> +------------------------
>> +*.proj 		working-tree-encoding=UTF-16LE text eol=CRLF
>> +------------------------
>> +
>> +You can get a list of all available encodings on your platform with the
>> +following command:
> 
> One question:
> +*.proj		text working-tree-encoding=UTF-16
> vs
> *.proj 		working-tree-encoding=UTF-16LE text eol=CRLF
> 
> Technically the order of attributes doesn't matter, but that is not what we
> want to demonstrate here and now.
> I would probably move the "text" attribute to the end of the line.
> So that readers don't start to wonder if the order is important.

I agree in general. However, I would move "text" to the beginning to be
consistent with the gitattribute pattern above. OK?


>> 
>> +	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
>> +		const char *error_msg = _(
>> +			"BOM is prohibited for '%s' if encoded as %s");
>> +		const char *advise_msg = _(
>> +			"You told Git to treat '%s' as %s. A byte order mark "
>> +			"(BOM) is prohibited with this encoding. Either use "
>> +			"%.6s as working tree encoding or remove the BOM from the "
>> +			"file.");
> 
> "You told Git" is probly right from Gits point of view, and advises are really helpfull.
> But what should the user do about it ?
> Could we give a better advise ?
> 
> 
> "A byte order mark (BOM) is prohibited with %s.
> Please remove the BOM from the file %s 
> or use "%s as working-tree-encoding"
> 
> I would probably suspect that a tool wrote the BOM, and that is
> good and can or should not be changed by a user.
> 
> So a simply message like this could be the preferred (and only)
> solution for a user:
> "A byte order mark (BOM) is prohibited with %s.
> Please use "%s as working-tree-encoding"

OK. I like the last one!


> (And why %.6s and not simply %s ?)

The encodings is UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE.
I just use the first 6 characters to print the encoding that
allows BOMs (UTF-16 or UTF-32). I'll add a comment to explain 
the trickery in the code!

Thanks,
Lars

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2018-02-14 13:22 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-09 13:28 [PATCH v6 0/7] convert: add support for different encodings lars.schneider
2018-02-09 13:28 ` [PATCH v6 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
2018-02-09 13:28 ` [PATCH v6 2/7] strbuf: add xstrdup_toupper() lars.schneider
2018-02-09 13:28 ` [PATCH v6 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
2018-02-09 13:28 ` [PATCH v6 4/7] utf8: add function to detect a missing " lars.schneider
2018-02-09 19:28   ` Junio C Hamano
2018-02-09 19:47     ` Lars Schneider
2018-02-09 13:28 ` [PATCH v6 5/7] convert: add 'working-tree-encoding' attribute lars.schneider
2018-02-10  9:48   ` Torsten Bögershausen
2018-02-14 13:22     ` Lars Schneider
2018-02-09 13:28 ` [PATCH v6 6/7] convert: add tracing for " lars.schneider
2018-02-09 13:28 ` [PATCH v6 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
2018-02-09 20:09 ` [PATCH v6 0/7] convert: add support for different encodings Junio C Hamano
2018-02-10  1:04   ` Lars Schneider

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).