All of lore.kernel.org
 help / color / mirror / Atom feed
From: Lars Schneider <larsxschneider@gmail.com>
To: "Torsten Bögershausen" <tboegi@web.de>
Cc: Lars Schneider <lars.schneider@autodesk.com>,
	Git List <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>,
	peff@peff.net, patrick@luehne.de
Subject: Re: [PATCH v1] convert: add support for 'encoding' attribute
Date: Fri, 29 Dec 2017 14:56:17 +0100	[thread overview]
Message-ID: <2ADD036C-6714-4DE5-A290-4C76EFACE166@gmail.com> (raw)
In-Reply-To: <20171229125954.GA9772@tor.lan>


> On 29 Dec 2017, at 13:59, Torsten Bögershausen <tboegi@web.de> wrote:
> 
> On 2017-12-28 17:14, Lars Schneider wrote:> 
>>> On 17 Dec 2017, at 18:14, Torsten Bögershausen <tboegi@web.de> wrote:
>>> 
>>> On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schneider@autodesk.com wrote:
>>>> From: Lars Schneider <larsxschneider@gmail.com>
>>>> 
>> 
>>>> +`encoding`
>>>> +^^^^^^^^^^
>>>> +
>>>> +By default Git assumes UTF-8 encoding for text files.  The `encoding`
>>> 
>>> This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8.
>> 
>> I am not sure I understand what you want to tell me here.
>> Do you think UTF-8 encoding is not correct in the sentence above?
> 
> Git itself was designed to handle source code in ASCII.
> Text files which are encoded in ISO-8859-1, UTF-8 or any
> super-set of ASCII are handled as well, and what exactly to do
> with bytes >0x80 is outside the responsibility of Git.
> E.g. the interpretation and rendering on the screen may be
> dependent on the locale or being guessed.
> All in all, saying that Git expects UTF-8 may be "overdriven".
> Any kind of file that uses an '\n' as an end of line
> and has no NUL bytes '\0' works as a text file in Git.
> (End of BlaBla ;-)
> 
> We can probably replace:
> "By default Git assumes UTF-8 encoding for text files"
> 
> with something like
> "Git handles UTF-8 encoded files as text files, but UTF-16 encoded
> files are handled as binary files"

Well, UTF-32 are handled as binary file, too :-)

How about this?

    Git recognizes files encoded with ASCII or one of its supersets
    (e.g. UTF-8 or ISO-8859-1) as text files.  All other...


> 
>>> 
>>>> +attribute sets the encoding to be used in the working directory.
>>>> +If the path is added to the index, then Git encodes the content to
>>>> +UTF-8.  On checkout the content is encoded back to the original
>>>> +encoding.  Consequently, you can use all built-in Git text processing
>>>> +tools (e.g. git diff, line ending conversions, etc.) with your
>>>> +non-UTF-8 encoded file.
>>>> +
>>>> +Please note that re-encoding content can cause errors and requires
>>>> +resources. Use the `encoding` attribute only if you cannot store
>>>> +a file in UTF-8 encoding and if you want Git to be able to process
>>>> +the text.
>>>> +
>>>> +------------------------
>>>> +*.txt		text encoding=UTF-16
>>>> +------------------------
>>> 
>>> I think that encoding (or worktree-encoding as said elsewhere) must imply text.
>>> (That is not done in convert.c, but assuming binary or even auto
>>> does not make much sense to me)
>> 
>> "text" only means "LF". "-text" means CRLF which would be perfectly fine
>> for UTF-16. Therefore I don't think we need to imply text.
>> Does this make sense?
> Yes and now.
> 
> "text" means convert CRLF at "git add" (or commit) into LF,
> And potentially LF into CRLF at checkout,
> depending on the EOL attribute (if set), core.autocrlf and/or core.eol.
> 
> "-text" means don't touch CRLF or LF at all. Files with CRLF are commited
> with CRLF.
> Running a Unix like "diff" tool will often show "^M" to indicate that there
> is a CR before the LF, which may be distracting or confusing.
> 
> If someone ever wants to handle the UTF-16 files on a Linux/Mac/Unix system,
> it makes sense to convert the line endings into LF before storing them
> into the index (at least this is my experience).
> 
> (Not specifying "text" or "-text" at all will trigger the auto-handling,
> which is not needed at all).
> So if we want to be sure that line endings are CRLF in the worktree we
> should ask the user to specify things like this:
> 
> *.ext worktree-encoding=UTF-16 text eol=CRLF
> 
> It may be enough to give this example in the documentation.
> and I would rather be over-careful here, to avoid future problems.

Agreed. I'll add that example!


>>>> +
>>>> +All `iconv` encodings with a stable round-trip conversion to and from
>>>> +UTF-8 are supported.  You can see a full list with the following command:
>>>> +
>>>> +------------------------
>>>> +iconv --list
>>>> +------------------------
>>>> +
>>>> `eol`
>>>> ^^^^^
>>>> 
>>>> diff --git a/convert.c b/convert.c
>>>> index 20d7ab67bd..ee19c17104 100644
>>>> --- a/convert.c
>>>> +++ b/convert.c
>>>> @@ -7,6 +7,7 @@
>>>> #include "sigchain.h"
>>>> #include "pkt-line.h"
>>>> #include "sub-process.h"
>>>> +#include "utf8.h"
>>>> 
>>>> /*
>>>> * convert.c - convert a file when checking it out and checking it in.
>>>> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>>>> 
>>>> }
>>> 
>>> I would avoid to use these #ifdefs here.
>>> All functions can be in strbuf.c (and may have #ifdefs there), but not here.
>> 
>> I'll try that. But wouldn't it make more sense to move the functions to utf.c?
> 
> May be.
> Originally utf8.c was about encoding and all kind of UTF-8 related stuff.
> Especially it didn't know anything about strbuf.
> I don't know why strbuf.h and other functions had been added here,
> 
> I once moved them into strbuf.c without any problems, but never send out
> a patch, because of possible merge conflicts in ongoing patches.
> 
> In any case, if it is about strbuf, I would try to put it into strbuf.c

I think I simplified the code. I reused "reencode_string_len".
I added just two BOM check functions to utf8.c ... I'll send out a new series
in a bit.

- Lars


  reply	other threads:[~2017-12-29 13:56 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-12-11 15:50 [PATCH v1] convert: add support for 'encoding' attribute lars.schneider
2017-12-11 18:39 ` Eric Sunshine
2017-12-11 23:47   ` Lars Schneider
2017-12-11 23:58     ` Eric Sunshine
2017-12-12 10:58       ` Lars Schneider
2017-12-11 20:47 ` Johannes Sixt
2017-12-11 23:42   ` Lars Schneider
2017-12-12  0:59     ` Junio C Hamano
2017-12-12  7:15       ` Johannes Sixt
2017-12-12 10:55         ` Lars Schneider
2017-12-12 19:31           ` Junio C Hamano
2017-12-13 17:57             ` Lars Schneider
2017-12-13 18:11               ` Junio C Hamano
2017-12-13 23:02                 ` Lars Schneider
2017-12-14 23:01                   ` Junio C Hamano
2017-12-12  7:09     ` Johannes Sixt
2017-12-18 10:13   ` Torsten Bögershausen
2017-12-18 13:12     ` Jeff King
2017-12-23  8:08       ` Torsten Bögershausen
2017-12-29 13:28       ` [PATCH/RFC 0/2] git diff --UTF-8 tboegi
2017-12-29 13:28       ` [PATCH/RFC 1/2] convert_to_git(): checksafe becomes an integer tboegi
2017-12-29 13:28       ` [PATCH/RFC 2/2] git diff: Allow to reencode into UTF-8 tboegi
2018-02-26 17:27       ` [PATCH/RFC 1/1] Auto diff of UTF-16 files in UTF-8 tboegi
2018-02-26 18:43         ` Peter Krefting
2018-02-27 22:39         ` Jeff King
2017-12-18 18:02     ` [PATCH v1] convert: add support for 'encoding' attribute Junio C Hamano
2017-12-18 21:55     ` Johannes Sixt
2017-12-15  9:58 ` Jeff King
2017-12-18 10:54   ` Lars Schneider
2017-12-18 12:59     ` Jeff King
2017-12-17 17:14 ` Torsten Bögershausen
2017-12-28 16:14   ` Lars Schneider
2017-12-29 12:59     ` Torsten Bögershausen
2017-12-29 13:56       ` Lars Schneider [this message]
2018-01-03 19:15       ` Junio C Hamano
2018-01-03 20:45         ` Lars Schneider

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2ADD036C-6714-4DE5-A290-4C76EFACE166@gmail.com \
    --to=larsxschneider@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=lars.schneider@autodesk.com \
    --cc=patrick@luehne.de \
    --cc=peff@peff.net \
    --cc=tboegi@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.