* [PATCH] git-p4: fix crlf handling for utf16 files on Windows
@ 2022-07-20  5:27 Moritz Baumann via GitGitGadget
  2022-07-20 16:08 ` Junio C Hamano
  2022-07-20 18:17 ` [PATCH v2] git-p4: fix CR LF handling for utf16 files Moritz Baumann via GitGitGadget
  0 siblings, 2 replies; 6+ messages in thread
From: Moritz Baumann via GitGitGadget @ 2022-07-20  5:27 UTC
  To: git; +Cc: Tao Klerks, Junio C Hamano, Moritz Baumann, Moritz Baumann

From: Moritz Baumann <moritz.baumann@sap.com>

Signed-off-by: Moritz Baumann <moritz.baumann@sap.com>
---
    git-p4: fix crlf handling for utf16 files on Windows
    
    Signed-off-by: Moritz Baumann <moritz.baumann@sap.com>

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1294%2Fmbs-c%2Ffix-crlf-conversion-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1294/mbs-c/fix-crlf-conversion-v1
Pull-Request: https://github.com/git/git/pull/1294

 git-p4.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/git-p4.py b/git-p4.py
index 8fbf6eb1fe3..0a9d7e2ed7c 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -3148,7 +3148,7 @@ class P4Sync(Command, P4UserMap):
                     raise e
             else:
                 if p4_version_string().find('/NT') >= 0:
-                    text = text.replace(b'\r\n', b'\n')
+                    text = text.replace(b'\x0d\x00\x0a\x00', b'\x0a\x00')
                 contents = [text]
 
         if type_base == "apple":

base-commit: bbea4dcf42b28eb7ce64a6306cdde875ae5d09ca
-- 
gitgitgadget


* Re: [PATCH] git-p4: fix crlf handling for utf16 files on Windows
  2022-07-20  5:27 [PATCH] git-p4: fix crlf handling for utf16 files on Windows Moritz Baumann via GitGitGadget
@ 2022-07-20 16:08 ` Junio C Hamano
  2022-07-20 16:32   ` Baumann, Moritz
  2022-07-20 18:17 ` [PATCH v2] git-p4: fix CR LF handling for utf16 files Moritz Baumann via GitGitGadget
  1 sibling, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2022-07-20 16:08 UTC
  To: Moritz Baumann via GitGitGadget; +Cc: git, Tao Klerks, Moritz Baumann

"Moritz Baumann via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Moritz Baumann <moritz.baumann@sap.com>

Can you describe briefly what problem is being solved and how the
change solves it in this place above your Sign-off?  The title says
"fix", without saying how the behaviour by the current code is
"broken", so that is one thing you can describe.  It talks about
"UTF-16 files on Windows", but does it mean git-p4 running on
Windows or git-p4 running anywhere that (over the wire) talks with
P4 running on Windows?  IOW, would the same problem trigger if you
are on macOS but the contents of the file you exchange with P4
happens to be in UTF-16?

These are the things you can describe to help those who are not you
(i.e. without access to an environment similar to what you saw the
problem on) understand the issue and help them convince themselves
that the patch they are seeing is a sensible solution.  Without any,
it is hard to evaluate.

> Signed-off-by: Moritz Baumann <moritz.baumann@sap.com>
> ---

> diff --git a/git-p4.py b/git-p4.py
> index 8fbf6eb1fe3..0a9d7e2ed7c 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -3148,7 +3148,7 @@ class P4Sync(Command, P4UserMap):
>                      raise e
>              else:
>                  if p4_version_string().find('/NT') >= 0:
> -                    text = text.replace(b'\r\n', b'\n')
> +                    text = text.replace(b'\x0d\x00\x0a\x00', b'\x0a\x00')
>                  contents = [text]
>  
>          if type_base == "apple":

OK, the part being touched is inside this context:

        if type_base == "utf16":
            # ...
            # But ascii text saved as -t utf16 is completely mangled.
            # Invoke print -o to get the real contents.
            #
            # On windows, the newlines will always be mangled by print, so put
            # them back too.  This is not needed to the cygwin windows version,
            # just the native "NT" type.
            #

            try:
                text = ...
            except Exception as e:
                ...
            else:
                if p4_version_string().find('/NT') >= 0:
                    text = text.replace(b'\r\n', b'\n')
                contents = [text]

So the intent of the existing code is "we know we are dealing with
UTF-16 text, and after successfully reading 'text' without
exception, we need to convert CRLF back to LF if we are on 'the
native NT type'".  Presumably 'text' that came from
p4_read_pipe(... raw=True) is not unicode string but just a bunch of
bytes, so each "char" is represented as two-byte sequence in UTF-16?
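
A minimal sketch of that point, assuming the bytes are UTF-16-LE; the
sample string is invented for illustration and is not taken from
git-p4 or from Perforce output.  It shows why the one-byte-per-char
pattern cannot find a CR LF pair in such data, while the four-byte
pattern can:

```python
# Illustration only: invented sample text, encoded the way a
# UTF-16-LE byte stream would carry it (two bytes per character).
sample = "line one\r\nline two\r\n"
raw = sample.encode("utf-16-le")

# Old pattern: CR and LF are separated by a NUL byte in the stream,
# so the two-byte sequence 0D 0A is not found and nothing changes.
old_result = raw.replace(b"\r\n", b"\n")

# New pattern: matches CR LF as two UTF-16-LE code units.
new_result = raw.replace(b"\x0d\x00\x0a\x00", b"\x0a\x00")

print(old_result == raw)                                         # True
print(new_result.decode("utf-16-le") == "line one\nline two\n")  # True
```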

With that (speculative) understanding, I can guess that the patch
makes sense, but the patch should not make readers guess.

Thanks.


* RE: [PATCH] git-p4: fix crlf handling for utf16 files on Windows
  2022-07-20 16:08 ` Junio C Hamano
@ 2022-07-20 16:32   ` Baumann, Moritz
  2022-07-20 17:18     ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Baumann, Moritz @ 2022-07-20 16:32 UTC
  To: Junio C Hamano, Moritz Baumann via GitGitGadget; +Cc: git, Tao Klerks

Hi Junio,

Thank you for your notes. I assumed the intent of the original code would be clear, in which case the fix should also be clear, but I am happy to elaborate.

> Can you describe briefly what problem is being solved and how the change
> solves it in this place above your Sign-off?  […] It talks about
> "UTF-16 files on Windows", but does it mean git-p4 running on Windows or
> git-p4 running anywhere that (over the wire) talks with
> P4 running on Windows?  IOW, would the same problem trigger if you are on
> macOS but the contents of the file you exchange with P4 happens to be in
> UTF-16?

The potential problem that the original code was trying to solve is the following: If a file is marked as utf16 in Perforce, and if the Perforce client is on Windows, then Perforce will replace all LF line endings with CRLF when the file is synced. This is different from git's autocrlf behavior, which ignores UTF-16 encoded files and always treats them as binary files. Without special handling, this can lead to git-p4 creating files with different hashes when run on Windows. (Which is how I stumbled upon this issue.)

Therefore, git-p4 checks the Perforce "file type" and tries to undo the line-ending changes.
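
To make the hash difference concrete, here is a small sketch. It assumes Git's default SHA-1 object format, where a blob ID is the SHA-1 of a "blob <size>\0" header followed by the content. The helper name git_blob_sha1 and the sample text are hypothetical and are not part of git-p4:

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Blob ID as Git computes it under the SHA-1 object format.

    Hypothetical helper for illustration; not a git-p4 function.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Invented sample: the same text with LF and with CR LF line endings.
lf_bytes = "a\nb\n".encode("utf-16-le")
crlf_bytes = "a\r\nb\r\n".encode("utf-16-le")

lf_hash = git_blob_sha1(lf_bytes)
crlf_hash = git_blob_sha1(crlf_bytes)
print(lf_hash != crlf_hash)  # True: different bytes, different blob

# Undoing the conversion at the UTF-16-LE level restores the LF blob.
restored = crlf_bytes.replace(b"\x0d\x00\x0a\x00", b"\x0a\x00")
print(git_blob_sha1(restored) == lf_hash)  # True
```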

> So the intent of the existing code is "we know we are dealing with
> UTF-16 text, and after successfully reading 'text' without exception, we need
> to convert CRLF back to LF if we are on 'the native NT type'".  Presumably
> 'text' that came from p4_read_pipe(... raw=True) is not unicode string but just
> a bunch of bytes, so each "char" is represented as two-byte sequence in
> UTF-16?

Exactly. The original code tried to do the right thing to ensure stable hashes that are independent of the operating system git-p4 is run on, but did not succeed. With my fix, I finally got deterministic hashes on my test repository.

> With that (speculative) understanding, I can guess that the patch makes sense,
> but the patch should not make readers guess.

Do you need me to resubmit the patch with an explanatory description? If so, I can try to summarize the above.

Best regards,
Moritz


* Re: [PATCH] git-p4: fix crlf handling for utf16 files on Windows
  2022-07-20 16:32   ` Baumann, Moritz
@ 2022-07-20 17:18     ` Junio C Hamano
  0 siblings, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2022-07-20 17:18 UTC
  To: Baumann, Moritz; +Cc: Moritz Baumann via GitGitGadget, git, Tao Klerks

"Baumann, Moritz" <moritz.baumann@sap.com> writes:

> Do you need me to resubmit the patch with an explanatory
> description? If so, I can try to summarize the above.

Yup.

Review comments are not requests for authors to explain their
patches to reviewers.  Their primary purpose is to point out the
issues that readers of the commit resulting from the proposed
patches may run into.  So answering in your response, to see if
your clarifications are understandable, is very good, but please
consider it preparation for writing a better version (i.e. [PATCH
v2]).

Thanks for working on this fix.



* [PATCH v2] git-p4: fix CR LF handling for utf16 files
  2022-07-20  5:27 [PATCH] git-p4: fix crlf handling for utf16 files on Windows Moritz Baumann via GitGitGadget
  2022-07-20 16:08 ` Junio C Hamano
@ 2022-07-20 18:17 ` Moritz Baumann via GitGitGadget
  2022-07-20 18:42   ` Junio C Hamano
  1 sibling, 1 reply; 6+ messages in thread
From: Moritz Baumann via GitGitGadget @ 2022-07-20 18:17 UTC
  To: git
  Cc: Tao Klerks, Junio C Hamano, Baumann, Moritz, Moritz Baumann,
	Moritz Baumann

From: Moritz Baumann <moritz.baumann@sap.com>

Perforce silently replaces LF with CR LF for "utf16" files if the client
is a native Windows client. Since git's autocrlf logic does not undo
this transformation for UTF-16 encoded files, git-p4 replaces CR LF with
LF during the sync if the file type "utf16" is detected and the Perforce
client platform indicates that this conversion is performed.

Windows only runs on little-endian architectures, therefore the encoding
of the byte stream received from the Perforce client is UTF-16-LE and
the relevant byte sequence is 0D 00 0A 00.

Signed-off-by: Moritz Baumann <moritz.baumann@sap.com>
---
    git-p4: fix crlf handling for utf16 files on Windows

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1294%2Fmbs-c%2Ffix-crlf-conversion-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1294/mbs-c/fix-crlf-conversion-v2
Pull-Request: https://github.com/git/git/pull/1294

Range-diff vs v1:

 1:  4a7a14eec28 ! 1:  4d0043712d3 git-p4: fix crlf handling for utf16 files on Windows
     @@ Metadata
      Author: Moritz Baumann <moritz.baumann@sap.com>
      
       ## Commit message ##
     -    git-p4: fix crlf handling for utf16 files on Windows
     +    git-p4: fix CR LF handling for utf16 files
     +
     +    Perforce silently replaces LF with CR LF for "utf16" files if the client
     +    is a native Windows client. Since git's autocrlf logic does not undo
     +    this transformation for UTF-16 encoded files, git-p4 replaces CR LF with
     +    LF during the sync if the file type "utf16" is detected and the Perforce
     +    client platform indicates that this conversion is performed.
     +
     +    Windows only runs on little-endian architectures, therefore the encoding
     +    of the byte stream received from the Perforce client is UTF-16-LE and
     +    the relevant byte sequence is 0D 00 0A 00.
      
          Signed-off-by: Moritz Baumann <moritz.baumann@sap.com>
      


 git-p4.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/git-p4.py b/git-p4.py
index 8fbf6eb1fe3..0a9d7e2ed7c 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -3148,7 +3148,7 @@ class P4Sync(Command, P4UserMap):
                     raise e
             else:
                 if p4_version_string().find('/NT') >= 0:
-                    text = text.replace(b'\r\n', b'\n')
+                    text = text.replace(b'\x0d\x00\x0a\x00', b'\x0a\x00')
                 contents = [text]
 
         if type_base == "apple":

base-commit: bbea4dcf42b28eb7ce64a6306cdde875ae5d09ca
-- 
gitgitgadget


* Re: [PATCH v2] git-p4: fix CR LF handling for utf16 files
  2022-07-20 18:17 ` [PATCH v2] git-p4: fix CR LF handling for utf16 files Moritz Baumann via GitGitGadget
@ 2022-07-20 18:42   ` Junio C Hamano
  0 siblings, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2022-07-20 18:42 UTC
  To: Moritz Baumann via GitGitGadget; +Cc: git, Tao Klerks, Baumann, Moritz

"Moritz Baumann via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Moritz Baumann <moritz.baumann@sap.com>
>
> Perforce silently replaces LF with CR LF for "utf16" files if the client
> is a native Windows client. Since git's autocrlf logic does not undo
> this transformation for UTF-16 encoded files, git-p4 replaces CR LF with
> LF during the sync if the file type "utf16" is detected and the Perforce
> client platform indicates that this conversion is performed.
>
> Windows only runs on little-endian architectures, therefore the encoding
> of the byte stream received from the Perforce client is UTF-16-LE and
> the relevant byte sequence is 0D 00 0A 00.
>
> Signed-off-by: Moritz Baumann <moritz.baumann@sap.com>
> ---

Will queue.  Thanks.
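
For reference, the byte sequence named in the second paragraph of the
commit message can be checked with Python's built-in codecs.  This is
only a sketch; the big-endian form is shown for contrast and does not
appear in the patch:

```python
# CR LF encoded with Python's built-in UTF-16 codecs.
crlf = "\r\n"

le = crlf.encode("utf-16-le")  # little-endian, as the commit message assumes
be = crlf.encode("utf-16-be")  # big-endian, shown only for comparison

print(le.hex(" "))  # 0d 00 0a 00
print(be.hex(" "))  # 00 0d 00 0a
```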


