git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] git-p4: encoding of data from perforce
@ 2021-04-12  8:52 Andrew Oakley
  2021-04-12  8:52 ` [PATCH 1/2] git-p4: avoid decoding more " Andrew Oakley
  2021-04-12  8:52 ` [PATCH 2/2] git-p4: do not decode data from perforce by default Andrew Oakley
  0 siblings, 2 replies; 13+ messages in thread
From: Andrew Oakley @ 2021-04-12  8:52 UTC (permalink / raw)
  To: git; +Cc: Luke Diamand, Feiyang Xue, Tzadik Vanderhoof

When using python3, git-p4 fails to handle data from perforce which is
not valid UTF-8.  In large repositories it's very likely that such data
will exist - perforce itself does no validation of the data by default.

Historically git-p4 has just passed whatever bytes it got from perforce
into git.  This seems like a sensible approach - git-p4 has no idea what
encoding may have been used and it seems likely that different encodings
are used within a repository.

I was trying to do a more thorough job, moving more of git-p4 over to
using bytes.  Unfortunately the changes end up being large and hard to
review.  In most cases it's probably sufficient to just avoid decoding
the commit messages.

There have been a couple of previous proposals around trying to decode
this data using a user-configured encoding:
http://public-inbox.org/git/CAE5ih7-F9efsiV5AQmw3ocjiy+BT6ZAT5fA0Lx0OSkVTO8Kqjg@mail.gmail.com/T/
http://public-inbox.org/git/20210409153815.7joohvmlnh6itczc@tb-raspi4/T/



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] git-p4: avoid decoding more data from perforce
  2021-04-12  8:52 [PATCH 0/2] git-p4: encoding of data from perforce Andrew Oakley
@ 2021-04-12  8:52 ` Andrew Oakley
  2021-04-12  8:52 ` [PATCH 2/2] git-p4: do not decode data from perforce by default Andrew Oakley
  1 sibling, 0 replies; 13+ messages in thread
From: Andrew Oakley @ 2021-04-12  8:52 UTC (permalink / raw)
  To: git; +Cc: Luke Diamand, Feiyang Xue, Tzadik Vanderhoof, Andrew Oakley

Perforce does not validate or store the encoding of user submitted data
by default (although this can be enabled).  In large repositories it is
therefore very likely that some data will not be valid UTF-8.

Historically (with python2) git-p4 did not attempt to decode the data
from the perforce server - it just passed bytes from perforce to git,
preserving whatever was stored in perforce.  This seems like a sensible
approach - it avoids any loss of data, and there is no way to determine
the intended encoding for any invalid data from perforce.

This change updates git-p4 to avoid decoding changelist descriptions,
user and time information.  The time data is almost certainly valid
unicode, but as they are processed with the user information it is more
convenient for them to be handled as bytes.

Signed-off-by: Andrew Oakley <andrew@adoakley.name>
---
 git-p4.py                          | 57 +++++++++++++++---------------
 t/t9835-git-p4-message-encoding.sh | 48 +++++++++++++++++++++++++
 2 files changed, 77 insertions(+), 28 deletions(-)
 create mode 100755 t/t9835-git-p4-message-encoding.sh

diff --git a/git-p4.py b/git-p4.py
index 09c9e93ac4..8407ec5c7a 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -764,13 +764,15 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
         while True:
             entry = marshal.load(p4.stdout)
             if bytes is not str:
-                # Decode unmarshalled dict to use str keys and values, except for:
-                #   - `data` which may contain arbitrary binary data
-                #   - `depotFile[0-9]*`, `path`, or `clientFile` which may contain non-UTF8 encoded text
+                # Decode unmarshalled dict to use str keys and values, except
+                # for cases where the values may not be valid UTF-8.
+                binary_keys = ('data', 'path', 'clientFile', 'Description',
+                               'desc', 'Email', 'FullName', 'Owner', 'time',
+                               'user', 'User')
                 decoded_entry = {}
                 for key, value in entry.items():
                     key = key.decode()
-                    if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
+                    if isinstance(value, bytes) and not (key in binary_keys or key.startswith('depotFile')):
                         value = value.decode()
                     decoded_entry[key] = value
                 # Parse out data if it's an error response
@@ -949,11 +951,11 @@ def gitConfigInt(key):
             _gitConfig[key] = None
     return _gitConfig[key]
 
-def gitConfigList(key):
+def gitConfigList(key, raw=False):
     if key not in _gitConfig:
-        s = read_pipe(["git", "config", "--get-all", key], ignore_error=True)
+        s = read_pipe(["git", "config", "--get-all", key], ignore_error=True, raw=raw)
         _gitConfig[key] = s.strip().splitlines()
-        if _gitConfig[key] == ['']:
+        if _gitConfig[key] == [''] or _gitConfig[key] == [b'']:
             _gitConfig[key] = []
     return _gitConfig[key]
 
@@ -1499,35 +1501,35 @@ def getUserMapFromPerforceServer(self):
         for output in p4CmdList("users"):
             if "User" not in output:
                 continue
-            self.users[output["User"]] = output["FullName"] + " <" + output["Email"] + ">"
+            self.users[output["User"]] = output["FullName"] + b" <" + output["Email"] + b">"
             self.emails[output["Email"]] = output["User"]
 
-        mapUserConfigRegex = re.compile(r"^\s*(\S+)\s*=\s*(.+)\s*<(\S+)>\s*$", re.VERBOSE)
-        for mapUserConfig in gitConfigList("git-p4.mapUser"):
+        mapUserConfigRegex = re.compile(br"^\s*(\S+)\s*=\s*(.+)\s*<(\S+)>\s*$", re.VERBOSE)
+        for mapUserConfig in gitConfigList("git-p4.mapUser", raw=True):
             mapUser = mapUserConfigRegex.findall(mapUserConfig)
             if mapUser and len(mapUser[0]) == 3:
                 user = mapUser[0][0]
                 fullname = mapUser[0][1]
                 email = mapUser[0][2]
-                self.users[user] = fullname + " <" + email + ">"
+                self.users[user] = fullname + b" <" + email + b">"
                 self.emails[email] = user
 
-        s = ''
+        s = b''
         for (key, val) in self.users.items():
-            s += "%s\t%s\n" % (key.expandtabs(1), val.expandtabs(1))
+            s += b"%s\t%s\n" % (key.expandtabs(1), val.expandtabs(1))
 
-        open(self.getUserCacheFilename(), 'w').write(s)
+        open(self.getUserCacheFilename(), 'wb').write(s)
         self.userMapFromPerforceServer = True
 
     def loadUserMapFromCache(self):
         self.users = {}
         self.userMapFromPerforceServer = False
         try:
-            cache = open(self.getUserCacheFilename(), 'r')
+            cache = open(self.getUserCacheFilename(), 'rb')
             lines = cache.readlines()
             cache.close()
             for line in lines:
-                entry = line.strip().split("\t")
+                entry = line.strip().split(b"\t")
                 self.users[entry[0]] = entry[1]
         except IOError:
             self.getUserMapFromPerforceServer()
@@ -1780,7 +1782,7 @@ def p4UserForCommit(self,id):
         # Return the tuple (perforce user,git email) for a given git commit id
         self.getUserMapFromPerforceServer()
         gitEmail = read_pipe(["git", "log", "--max-count=1",
-                              "--format=%ae", id])
+                              "--format=%ae", id], raw=True)
         gitEmail = gitEmail.strip()
         if gitEmail not in self.emails:
             return (None,gitEmail)
@@ -1911,7 +1913,7 @@ def prepareSubmitTemplate(self, changelist=None):
             template += key + ':'
             if key == 'Description':
                 template += '\n'
-            for field_line in change_entry[key].splitlines():
+            for field_line in decode_text_stream(change_entry[key]).splitlines():
                 template += '\t'+field_line+'\n'
         if len(files_list) > 0:
             template += '\n'
@@ -2163,7 +2165,7 @@ def applyCommit(self, id):
            submitTemplate += "\n######## Actual user %s, modified after commit\n" % p4User
 
         if self.checkAuthorship and not self.p4UserIsMe(p4User):
-            submitTemplate += "######## git author %s does not match your p4 account.\n" % gitEmail
+            submitTemplate += "######## git author %s does not match your p4 account.\n" % decode_text_stream(gitEmail)
             submitTemplate += "######## Use option --preserve-user to modify authorship.\n"
             submitTemplate += "######## Variable git-p4.skipUserNameCheck hides this message.\n"
 
@@ -2802,7 +2804,7 @@ def __init__(self):
         self.knownBranches = {}
         self.initialParents = {}
 
-        self.tz = "%+03d%02d" % (- time.timezone / 3600, ((- time.timezone % 3600) / 60))
+        self.tz = b"%+03d%02d" % (- time.timezone / 3600, ((- time.timezone % 3600) / 60))
         self.labels = {}
 
     # Force a checkpoint in fast-import and wait for it to finish
@@ -3161,7 +3163,7 @@ def make_email(self, userid):
         if userid in self.users:
             return self.users[userid]
         else:
-            return "%s <a@b>" % userid
+            return b"%s <a@b>" % userid
 
     def streamTag(self, gitStream, labelName, labelDetails, commit, epoch):
         """ Stream a p4 tag.
@@ -3184,9 +3186,9 @@ def streamTag(self, gitStream, labelName, labelDetails, commit, epoch):
             email = self.make_email(owner)
         else:
             email = self.make_email(self.p4UserId())
-        tagger = "%s %s %s" % (email, epoch, self.tz)
+        tagger = b"%s %s %s" % (email, epoch, self.tz)
 
-        gitStream.write("tagger %s\n" % tagger)
+        gitStream.write(b"tagger %s\n" % tagger)
 
         print("labelDetails=",labelDetails)
         if 'Description' in labelDetails:
@@ -3279,12 +3281,11 @@ def commit(self, details, files, branch, parent = "", allow_empty=False):
         self.gitStream.write("commit %s\n" % branch)
         self.gitStream.write("mark :%s\n" % details["change"])
         self.committedChanges.add(int(details["change"]))
-        committer = ""
         if author not in self.users:
             self.getUserMapFromPerforceServer()
-        committer = "%s %s %s" % (self.make_email(author), epoch, self.tz)
+        committer = b"%s %s %s" % (self.make_email(author), epoch, self.tz)
 
-        self.gitStream.write("committer %s\n" % committer)
+        self.gitStream.write(b"committer %s\n" % committer)
 
         self.gitStream.write("data <<EOT\n")
         self.gitStream.write(details["desc"])
@@ -3422,7 +3423,7 @@ def importP4Labels(self, stream, p4Labels):
                         print("Could not convert label time %s" % labelDetails['Update'])
                         tmwhen = 1
 
-                    when = int(time.mktime(tmwhen))
+                    when = b"%i" % int(time.mktime(tmwhen))
                     self.streamTag(stream, name, labelDetails, gitCommit, when)
                     if verbose:
                         print("p4 label %s mapped to git commit %s" % (name, gitCommit))
@@ -3708,7 +3709,7 @@ def importHeadRevision(self, revision):
         print("Doing initial import of %s from revision %s into %s" % (' '.join(self.depotPaths), revision, self.branch))
 
         details = {}
-        details["user"] = "git perforce import user"
+        details["user"] = b"git perforce import user"
         details["desc"] = ("Initial import of %s from the state at revision %s\n"
                            % (' '.join(self.depotPaths), revision))
         details["change"] = revision
diff --git a/t/t9835-git-p4-message-encoding.sh b/t/t9835-git-p4-message-encoding.sh
new file mode 100755
index 0000000000..93f24fe295
--- /dev/null
+++ b/t/t9835-git-p4-message-encoding.sh
@@ -0,0 +1,48 @@
+#!/bin/sh
+
+test_description='Clone repositories with non ASCII commit messages'
+
+. ./lib-git-p4.sh
+
+UTF8="$(printf "a-\303\244_o-\303\266_u-\303\274")"
+ISO8859="$(printf "a-\344_o-\366_u-\374")"
+
+test_expect_success 'start p4d' '
+	start_p4d
+'
+
+test_expect_success 'create commits in perforce' '
+	(
+		cd "$cli" &&
+
+		p4_add_user "${UTF8}" &&
+		p4_add_user "${ISO8859}" &&
+
+		>dummy-file1 &&
+		P4USER="${UTF8}" p4 add dummy-file1 &&
+		P4USER="${UTF8}" p4 submit -d "message ${UTF8}" &&
+
+		>dummy-file2 &&
+		P4USER="${ISO8859}" p4 add dummy-file2 &&
+		P4USER="${ISO8859}" p4 submit -d "message ${ISO8859}"
+	)
+'
+
+test_expect_success 'check UTF-8 commit' '
+	(
+		git p4 clone --destination="$git/1" //depot@1,1 &&
+		git -C "$git/1" cat-file commit HEAD | grep -q "^message ${UTF8}$" &&
+		git -C "$git/1" cat-file commit HEAD | grep -q "^author Dr. ${UTF8} <${UTF8}@example.com>"
+	)
+'
+
+test_expect_success 'check ISO-8859 commit' '
+	(
+		git p4 clone --destination="$git/2" //depot@2,2 &&
+		git -C "$git/2" cat-file commit HEAD > /tmp/dump.txt &&
+		git -C "$git/2" cat-file commit HEAD | grep -q "^message ${ISO8859}$" &&
+		git -C "$git/2" cat-file commit HEAD | grep -q "^author Dr. ${ISO8859} <${ISO8859}@example.com>"
+	)
+'
+
+test_done
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] git-p4: do not decode data from perforce by default
  2021-04-12  8:52 [PATCH 0/2] git-p4: encoding of data from perforce Andrew Oakley
  2021-04-12  8:52 ` [PATCH 1/2] git-p4: avoid decoding more " Andrew Oakley
@ 2021-04-12  8:52 ` Andrew Oakley
  2021-04-29 10:00   ` Tzadik Vanderhoof
  1 sibling, 1 reply; 13+ messages in thread
From: Andrew Oakley @ 2021-04-12  8:52 UTC (permalink / raw)
  To: git; +Cc: Luke Diamand, Feiyang Xue, Tzadik Vanderhoof, Andrew Oakley

This commit is not intended to change behaviour, any we still attempt to
decode values that might not be valid unicode.  It's not clear that all
of these values are safe to decode, but it's now more obvious which data
is decoded.
---
 git-p4.py | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/git-p4.py b/git-p4.py
index 8407ec5c7a..8a97ff3dd2 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -764,15 +764,19 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
         while True:
             entry = marshal.load(p4.stdout)
             if bytes is not str:
-                # Decode unmarshalled dict to use str keys and values, except
-                # for cases where the values may not be valid UTF-8.
-                binary_keys = ('data', 'path', 'clientFile', 'Description',
-                               'desc', 'Email', 'FullName', 'Owner', 'time',
-                               'user', 'User')
+                # Decode unmarshalled dict to use str keys and values where it
+                # is expected that the data is always valid UTF-8.
+                text_keys = ('action', 'change', 'Change', 'Client', 'code',
+                             'fileSize', 'headAction', 'headRev', 'headType',
+                             'Jobs', 'label', 'options', 'perm', 'rev', 'Root',
+                             'Status', 'type', 'Update')
+                text_key_prefixes = ('action', 'File', 'job', 'rev', 'type',
+                                     'View')
                 decoded_entry = {}
                 for key, value in entry.items():
                     key = key.decode()
-                    if isinstance(value, bytes) and not (key in binary_keys or key.startswith('depotFile')):
+                    if isinstance(value, bytes) and (key in text_keys or
+                            any(filter(key.startswith, text_key_prefixes))):
                         value = value.decode()
                     decoded_entry[key] = value
                 # Parse out data if it's an error response
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] git-p4: do not decode data from perforce by default
  2021-04-12  8:52 ` [PATCH 2/2] git-p4: do not decode data from perforce by default Andrew Oakley
@ 2021-04-29 10:00   ` Tzadik Vanderhoof
  2021-04-30  8:53     ` Andrew Oakley
  0 siblings, 1 reply; 13+ messages in thread
From: Tzadik Vanderhoof @ 2021-04-29 10:00 UTC (permalink / raw)
  To: Andrew Oakley; +Cc: Git List, Luke Diamand, Feiyang Xue

I checked out "seen" and ran the test script from this patch
(t9835-git-p4-message-encoding.sh) on my Windows machine and it fails.

I don't think the solution in this patch will solve the issue of non
UTF-8 descriptions on Windows. The interaction between git-p4.py and
p4 around non-ASCII descriptions is different on Linux and Windows (at
least with the default code page settings).  Unfortunately the CI on
gitlab does not include any Windows test environments that have p4
installed.

As far as I can tell, non-ASCII strings passed to "p4 submit -d" pass
unchanged to the Perforce database on Linux.  As well, such data also
passes unchanged in the other direction, when "p4" output is consumed
by git-p4.py.  Since this patch avoids decoding descriptions, and the
test script uses binary data for descriptions, the tests pass on
Linux.

However, on Windows, UTF-8 strings passed to "p4 submit -d" are
somehow converted to the default Windows code page by the time they
are stored in the Perforce database, probably as part of the process
of passing the command line arguments to the Windows p4 executable.
However, the "code page" data is *not* converted to UTF-8 on the way
back from p4 to git-p4.py.  The only way to get it into UTF-8 is to
call string.decode().  As a result, this patch, which takes out the
call to string.decode() will not work on Windows.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] git-p4: do not decode data from perforce by default
  2021-04-29 10:00   ` Tzadik Vanderhoof
@ 2021-04-30  8:53     ` Andrew Oakley
  2021-04-30 15:33       ` Luke Diamand
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Oakley @ 2021-04-30  8:53 UTC (permalink / raw)
  To: Tzadik Vanderhoof; +Cc: Git List, Luke Diamand, Feiyang Xue

On Thu, 29 Apr 2021 03:00:06 -0700
Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> wrote:
> However, on Windows, UTF-8 strings passed to "p4 submit -d" are
> somehow converted to the default Windows code page by the time they
> are stored in the Perforce database, probably as part of the process
> of passing the command line arguments to the Windows p4 executable.
> However, the "code page" data is *not* converted to UTF-8 on the way
> back from p4 to git-p4.py.  The only way to get it into UTF-8 is to
> call string.decode().  As a result, this patch, which takes out the
> call to string.decode() will not work on Windows.

Thanks for that explanation, the reencoding of the data on Windows is
not something I was expecting.  Given the behaviour you've described, I
suspect that there might be two different problems that we are trying
to solve.

The perforce depot I'm working with has a mixture of encodings, and
commits are created from a variety of different environments. The
majority of commits are ASCII or UTF-8, there are a small number that
are in some other encoding.  Any attempt to reencode the data is likely
to make the problem worse in at least some cases.

I suspect that other perforce depots are used primarily from Windows
machines, and have data that is encoded in a mostly consistent way but
the encoding is not UTF-8.  Re-encoding the data for git makes sense in
that case.  Is this the kind of repository you have?

If there are these two different cases then we probably need to come up
with a patch that solves both issues.

For my cases where we've got a repository containing all sorts of junk,
it sounds like it might be awkward to create a test case that works on
Windows.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] git-p4: do not decode data from perforce by default
  2021-04-30  8:53     ` Andrew Oakley
@ 2021-04-30 15:33       ` Luke Diamand
  2021-04-30 18:08         ` Tzadik Vanderhoof
  0 siblings, 1 reply; 13+ messages in thread
From: Luke Diamand @ 2021-04-30 15:33 UTC (permalink / raw)
  To: Andrew Oakley, Tzadik Vanderhoof; +Cc: Git List, Feiyang Xue



On 30/04/2021 08:53, Andrew Oakley wrote:
> On Thu, 29 Apr 2021 03:00:06 -0700
> Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> wrote:
>> However, on Windows, UTF-8 strings passed to "p4 submit -d" are
>> somehow converted to the default Windows code page by the time they
>> are stored in the Perforce database, probably as part of the process
>> of passing the command line arguments to the Windows p4 executable.
>> However, the "code page" data is *not* converted to UTF-8 on the way
>> back from p4 to git-p4.py.  The only way to get it into UTF-8 is to
>> call string.decode().  As a result, this patch, which takes out the
>> call to string.decode() will not work on Windows.
> 
> Thanks for that explanation, the reencoding of the data on Windows is
> not something I was expecting.  Given the behaviour you've described, I
> suspect that there might be two different problems that we are trying
> to solve.
> 
> The perforce depot I'm working with has a mixture of encodings, and
> commits are created from a variety of different environments. The
> majority of commits are ASCII or UTF-8, there are a small number that
> are in some other encoding.  Any attempt to reencode the data is likely
> to make the problem worse in at least some cases.
> 
> I suspect that other perforce depots are used primarily from Windows
> machines, and have data that is encoded in a mostly consistent way but
> the encoding is not UTF-8.  Re-encoding the data for git makes sense in
> that case.  Is this the kind of repository you have?
> 
> If there are these two different cases then we probably need to come up
> with a patch that solves both issues.
> 
> For my cases where we've got a repository containing all sorts of junk,
> it sounds like it might be awkward to create a test case that works on
> Windows.
> 


https://www.perforce.com/perforce/doc.current/user/i18nnotes.txt

Tzadik - is your server unicode enabled or not? That would be 
interesting to know:

     p4 counters | grep -i unicode

I suspect it is not. It's only if unicode is enabled that the server 
will convert to/from utf8 (at least that's my understanding). Without 
this setting, p4d and p4 are (probably) not doing any conversions.

I think it might be useful to clarify exactly what conversions are 
actually happening.

I wonder what encoding Perforce thinks you've got in place.






^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] git-p4: do not decode data from perforce by default
  2021-04-30 15:33       ` Luke Diamand
@ 2021-04-30 18:08         ` Tzadik Vanderhoof
  2021-05-04 21:01           ` Andrew Oakley
  0 siblings, 1 reply; 13+ messages in thread
From: Tzadik Vanderhoof @ 2021-04-30 18:08 UTC (permalink / raw)
  To: Luke Diamand; +Cc: Andrew Oakley, Git List, Feiyang Xue

On Fri, Apr 30, 2021 at 8:33 AM Luke Diamand <luke@diamand.org> wrote:
>
> Tzadik - is your server unicode enabled or not? That would be
> interesting to know:
>
>      p4 counters | grep -i unicode
>
> I suspect it is not. It's only if unicode is enabled that the server
> will convert to/from utf8 (at least that's my understanding). Without
> this setting, p4d and p4 are (probably) not doing any conversions.

My server is not unicode.

These conversions are happening even with a non-Unicode perforce db.
I don't think it's the p4d code per se that is doing the conversion, but
rather an interaction between the OS and the code, which is different
under Linux vs Windows.  If you create a trivial C program that dumps
the hex values of the bytes it receives in argv, you can see this
different behavior:

#include <stdio.h>

void main(int argc, char *argv[]) {
    int i, j;
    char *s;
    for (i = 1; i < argc; ++i) {
        s = argv[i];
        for (j = 0; s[j] != '\0'; ++j)
            printf(" %X", (unsigned char)s[j]);
        printf("\n");
        printf("[%s]\n\n", s);
    }
}

When built with Visual Studio and called from Cygwin, if you pass in
args with UTF-8 encoded characters, the program will spit them out in
cp1252. If you compile it on a Linux system using gcc, it will spit them out
in UTF-8 (unchanged).  I suspect that's what's happening with p4d on
Windows vs Linux.

In any event, if you look at my patch (v6 is the latest...
https://lore.kernel.org/git/20210429073905.837-1-tzadik.vanderhoof@gmail.com/ ),
you will see I have written tests that pass under both Linux and Windows.
(If you want to run them yourself, you need to base my patch off of "master",
not "seen").  The tests make clear what the different behavior is and
also show that p4d is not set to Unicode (since the tests do not change the
default setting).

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] git-p4: do not decode data from perforce by default
  2021-04-30 18:08         ` Tzadik Vanderhoof
@ 2021-05-04 21:01           ` Andrew Oakley
  2021-05-04 21:46             ` Tzadik Vanderhoof
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Oakley @ 2021-05-04 21:01 UTC (permalink / raw)
  To: Tzadik Vanderhoof; +Cc: Luke Diamand, Git List, Feiyang Xue

On Fri, 30 Apr 2021 11:08:57 -0700
Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> wrote:
> My server is not unicode.
> 
> These conversions are happening even with a non-Unicode perforce db.
> I don't think it's the p4d code per se that is doing the conversion,
> but rather an interaction between the OS and the code, which is
> different under Linux vs Windows.

It's not particularly obvious exactly what is happening here.  The
perforce command line client is written in a rather odd way - it uses
the unicode (UTF-16) wWinMainCRTStartup entry point but then calls an
undocumented API to get the "narrow" version of the command line.  The
code is here:

https://swarm.workshop.perforce.com/projects/perforce_software-p4/files/2016-1/client/clientmain.cc

I think that perforce will end up with the data in a code page that
depends on the configuration of the machine.  I don't think the exact
details matter here - just that it's some semi-arbitrary encoding that
isn't recorded in the commit.

The key thing that I'm trying to point out here is that the encoding is
not necessarily consistent between different commits.  The changes that
you have proposed force you to pick one encoding that will be used for
every commit.  If it's wrong then data will be corrupted, and there is
no option provided to avoid that.  The only way I can see to avoid this
issue is to not attempt to re-encode the data - just pass it directly
to git.

I think another way to solve the issue you have is the encoding header
on git commits.  We can pass the bytes through git-p4 unmodified, but
mark the commit message as being encoded using something that isn't
UTF-8.  That avoids any potentially lossy conversions when cloning the
repository, but should allow the data to be displayed correctly in git.

> In any event, if you look at my patch (v6 is the latest...
> https://lore.kernel.org/git/20210429073905.837-1-tzadik.vanderhoof@gmail.com/
> ), you will see I have written tests that pass under both Linux and
> Windows. (If you want to run them yourself, you need to base my patch
> off of "master", not "seen").  The tests make clear what the
> different behavior is and also show that p4d is not set to Unicode
> (since the tests do not change the default setting).

I don't think the tests are doing anything interesting on Linux - you
stick valid UTF-8 in, and valid UTF-8 data comes out.   I suspect the
tests will fail on Windows if the relevant code page is set to a value
that you're not expecting.

For the purposes of writing tests that work the same everywhere we can
use `p4 submit -i`.  The data written on stdin isn't reencoded, even on
Windows.

I can rework my test to use `p4 submit -i` on windows.  It should be
fairly simple to write another change which allows the encoding to be
set on commits created by git-p4.

Does that seem like a reasonable way forward?  I think it gets us:
- sensible handling for repositories with mixed encodings
- sensible handling for repositories with known encodings
- tests that work the same on Linux and Windows

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] git-p4: do not decode data from perforce by default
  2021-05-04 21:01           ` Andrew Oakley
@ 2021-05-04 21:46             ` Tzadik Vanderhoof
  2021-05-05  1:11               ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Tzadik Vanderhoof @ 2021-05-04 21:46 UTC (permalink / raw)
  To: Andrew Oakley; +Cc: Luke Diamand, Git List, Feiyang Xue

On Tue, May 4, 2021 at 2:01 PM Andrew Oakley <andrew@adoakley.name> wrote:
> The key thing that I'm trying to point out here is that the encoding is
> not necessarily consistent between different commits.  The changes that
> you have proposed force you to pick one encoding that will be used for
> every commit.  If it's wrong then data will be corrupted, and there is
> no option provided to avoid that.  The only way I can see to avoid this
> issue is to not attempt to re-encode the data - just pass it directly
> to git.

No, my "fallbackEndocing" setting is just that... a fallback.  My proposal
*always* tries to decode in UTF-8 first!  Only if that throws an exception
will my "fallbackEncoding" come into play, and it only comes into play
for the single changeset description that was invalid UTF-8.  After that,
subsequent descriptions will again be tried in UTF-8 first.

The design of the UTF-8 format makes it very unlikely that non UTF-8 text
will pass undetected through a UTF-8 decoder, so by attempting to decode
in UTF-8 first, there is very little risk of a lossy conversion.

As for passing data through "raw", that will *guarantee* bad encoding on
any descriptions that are not UTF-8, because git will interpret the data
as UTF-8 once it has been put into the commit (unless the encoding header
is used, as you mentioned) .  If that header is not used, and it was not in
UTF-8 in Perforce, it has zero chance of being correct in git unless
it is decoded.
At least "fallbackEncoding" gives it SOME chance of decoding it correctly.

> I think another way to solve the issue you have is the encoding header
> on git commits.  We can pass the bytes through git-p4 unmodified, but
> mark the commit message as being encoded using something that isn't
> UTF-8.  That avoids any potentially lossy conversions when cloning the
> repository, but should allow the data to be displayed correctly in git.

Yes, that could be a solution.  I will try that out.

> > In any event, if you look at my patch (v6 is the latest...
> > https://lore.kernel.org/git/20210429073905.837-1-tzadik.vanderhoof@gmail.com/
> > ), you will see I have written tests that pass under both Linux and
> > Windows. (If you want to run them yourself, you need to base my patch
> > off of "master", not "seen").  The tests make clear what the
> > different behavior is and also show that p4d is not set to Unicode
> > (since the tests do not change the default setting).
>
> I don't think the tests are doing anything interesting on Linux - you
> stick valid UTF-8 in, and valid UTF-8 data comes out.

Totally agree.... I only did that to get them to pass the Gitlab CI.
I submitted an earlier
patch that simply skipped the test file on Linux, but I got pushback
on that, so I
made them pass on Linux, even though they aren't useful.

> I suspect the tests will fail on Windows if the relevant code page is set to a value
> that you're not expecting.

It depends.  If the code page is set to UTF-8 (65001) I think the
tests would still work,
because as I said above, my code always tries to decode with UTF-8
first, no matter what
the "fallbackEncoding" setting is.

If the code page is set to something other than UTF-8 or the default,
then one of my tests would
fail, because it uses a hard-coded "fallbackEncoding" of "cp1252".

But the code would work for the user!  All they would need to do is
set "fallbackEncoding" to
the code page they're actually using, instead of "cp1252".

(More sophisticated tests could be developed that explicitly set the
code page and use the
corresponding "fallbackEncoding" setting.)

> For the purposes of writing tests that work the same everywhere we can
> use `p4 submit -i`.  The data written on stdin isn't reencoded, even on
> Windows.

I already have gone down the `p4 submit -i` road.  It behaves exactly
the same as
passing the description on the command line.

(One of several dead-ends I went down that I haven't mentioned in my emails)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] git-p4: do not decode data from perforce by default
  2021-05-04 21:46             ` Tzadik Vanderhoof
@ 2021-05-05  1:11               ` Junio C Hamano
  2021-05-05  4:02                 ` Tzadik Vanderhoof
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2021-05-05  1:11 UTC (permalink / raw)
  To: Tzadik Vanderhoof; +Cc: Andrew Oakley, Luke Diamand, Git List, Feiyang Xue

Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:

> On Tue, May 4, 2021 at 2:01 PM Andrew Oakley <andrew@adoakley.name> wrote:
>> The key thing that I'm trying to point out here is that the encoding is
>> not necessarily consistent between different commits.  The changes that
>> you have proposed force you to pick one encoding that will be used for
>> every commit.  If it's wrong then data will be corrupted, and there is
>> no option provided to avoid that.  The only way I can see to avoid this
>> issue is to not attempt to re-encode the data - just pass it directly
>> to git.
>
> No, my "fallbackEndocing" setting is just that... a fallback.  My proposal
> *always* tries to decode in UTF-8 first!  Only if that throws an exception
> will my "fallbackEncoding" come into play, and it only comes into play
> for the single changeset description that was invalid UTF-8.  After that,
> subsequent descriptions will again be tried in UTF-8 first.

Hmph, I do not quite see the need for "No" at the beginning of what
you said.  The fallbackEncoding can specify only one non UTF-8
encoding, so even if majority of commits were in UTF-8 but when you
need to import two commits with non UTF-8 encoding, there is no
suitable value to give to the fallbackEncoding setting.  One of
these two commits will fail to decode first in UTF-8 and then fail
to decode again with the fallback, and after that a corrupted
message remains.

>> I think another way to solve the issue you have is the encoding header
>> on git commits.  We can pass the bytes through git-p4 unmodified, but
>> mark the commit message as being encoded using something that isn't
>> UTF-8.  That avoids any potentially lossy conversions when cloning the
>> repository, but should allow the data to be displayed correctly in git.
>
> Yes, that could be a solution.  I will try that out.

If we can determine in what encoding the thing that came out of
Perforce is written in, we can put it on the encoding header of the
resulting commit.  But if that is possible to begin with, perhaps we
do not even need to do so---if you can determine what the original
encoding is, you can reencode with that encoding into UTF-8 inside
git-p4 while creating the commit, no?

And if the raw data that came from Perforce cannot be reencoded to
UTF-8 (i.e. iconv fails to process for some reason), then whether
the translation is done at the import time (i.e. where you would
have used the fallbackEncoding to reencode into UTF-8) or at the
display time (i.e. "git show" would notice the encoding header and
try to reencode the raw data from that encoding into UTF-8), it
would fail in the same way, so I do not see much advantage in
writing the encoding header into the resulting object (other than
shifting the blame to downstream and keeping the original data
intact, which is a good design principle).

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] git-p4: do not decode data from perforce by default
  2021-05-05  1:11               ` Junio C Hamano
@ 2021-05-05  4:02                 ` Tzadik Vanderhoof
  2021-05-05  4:06                   ` Tzadik Vanderhoof
  2021-05-05  4:34                   ` Junio C Hamano
  0 siblings, 2 replies; 13+ messages in thread
From: Tzadik Vanderhoof @ 2021-05-05  4:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Andrew Oakley, Luke Diamand, Git List, Feiyang Xue

On Tue, May 4, 2021 at 6:11 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:
>
> > On Tue, May 4, 2021 at 2:01 PM Andrew Oakley <andrew@adoakley.name> wrote:
> >> The key thing that I'm trying to point out here is that the encoding is
> >> not necessarily consistent between different commits.  The changes that
> >> you have proposed force you to pick one encoding that will be used for
> >> every commit.  If it's wrong then data will be corrupted, and there is
> >> no option provided to avoid that.  The only way I can see to avoid this
> >> issue is to not attempt to re-encode the data - just pass it directly
> >> to git.
> >
> > No, my "fallbackEndocing" setting is just that... a fallback.  My proposal
> > *always* tries to decode in UTF-8 first!  Only if that throws an exception
> > will my "fallbackEncoding" come into play, and it only comes into play
> > for the single changeset description that was invalid UTF-8.  After that,
> > subsequent descriptions will again be tried in UTF-8 first.
>
>  The fallbackEncoding can specify only one non UTF-8
> encoding, so even if majority of commits were in UTF-8 but when you
> need to import two commits with non UTF-8 encoding, there is no
> suitable value to give to the fallbackEncoding setting.  One of
> these two commits will fail to decode first in UTF-8 and then fail
> to decode again with the fallback, and after that a corrupted
> message remains.

I'm not sure I understand your scenario.  If the majority of commits
are in UTF-8, but there are 2 with the same UTF-8 encoding (say
"cp1252"), then just set "fallbackEndocing" to "cp1252" and all
the commits will display fine.

Are you talking about a scenario where most of the commits are UTF-8,
one is "cp1252" and another one is "cp1251", so a total of 3 encodings
are used in the Perforce depot?  I don't think that is a common scenario.

But you have a point that my patch does not address that scenario.

> If we can determine in what encoding the thing that came out of
> Perforce is written in, we can put it on the encoding header of the
> resulting commit.  But if that is possible to begin with, perhaps we
> do not even need to do so---if you can determine what the original
> encoding is, you can reencode with that encoding into UTF-8 inside
> git-p4 while creating the commit, no?
>
> And if the raw data that came from Perforce cannot be reencoded to
> UTF-8 (i.e. iconv fails to process for some reason), then whether
> the translation is done at the import time (i.e. where you would
> have used the fallbackEncoding to reencode into UTF-8) or at the
> display time (i.e. "git show" would notice the encoding header and
> try to reencode the raw data from that encoding into UTF-8), it
> would fail in the same way, so I do not see much advantage in
> writing the encoding header into the resulting object (other than
> shifting the blame to downstream and keeping the original data
> intact, which is a good design principle).

I agree with the idea that if you know what the encoding is, then
why not just use that knowledge to convert that to UTF-8, rather
than use the encoding header.

I'm not sure about how complete the support in "git" is for encoding
headers. Does *everything* that reads commit messages respect the
encoding header and handle it properly?  The documentation seems to
discourage their use altogether and in any event, the main use case
seems to be a project where everyone has settled on the same legacy
encoding to use everywhere, which is not really the case for our situation.

If we want to abandon the "fallbackEncoding" direction, then the only other
option I see is:

Step 1) try to decode in UTF-8.  If that succeeds then it is almost certain
that it really was in UTF-8 (due to the design of UTF-8).  Make the commit
in UTF-8.

Step 2) If it failed to decode in UTF-8, either use heuristics to detect the
encoding, or query the OS for the current code page and assume it's that.
Then use the selected encoding to convert to UTF-8.

Only the heuristics option will satisfy the case of more than 1 encoding
(in addition to UTF-8) being present in the Perforce depot, and in any event
the current code page may not help at all.

I'm not really familiar with what encoding-detection heuristics are available
for Python and how reliable, stable or performant they are.  It would also
involve taking on another library dependency.

I will take a look at the encoding-detection options out there.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] git-p4: do not decode data from perforce by default
  2021-05-05  4:02                 ` Tzadik Vanderhoof
@ 2021-05-05  4:06                   ` Tzadik Vanderhoof
  2021-05-05  4:34                   ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Tzadik Vanderhoof @ 2021-05-05  4:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Andrew Oakley, Luke Diamand, Git List, Feiyang Xue

Oops, noticed a typo.... the paragraph should read:

I'm not sure I understand your scenario.  If the majority of commits
are in UTF-8, but there are 2 with the same *non* UTF-8 encoding (say
"cp1252"), then just set "fallbackEndocing" to "cp1252" and all
the commits will display fine.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] git-p4: do not decode data from perforce by default
  2021-05-05  4:02                 ` Tzadik Vanderhoof
  2021-05-05  4:06                   ` Tzadik Vanderhoof
@ 2021-05-05  4:34                   ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2021-05-05  4:34 UTC (permalink / raw)
  To: Tzadik Vanderhoof; +Cc: Andrew Oakley, Luke Diamand, Git List, Feiyang Xue

Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:

> On Tue, May 4, 2021 at 6:11 PM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:
>>
>> > On Tue, May 4, 2021 at 2:01 PM Andrew Oakley <andrew@adoakley.name> wrote:
>> >> The key thing that I'm trying to point out here is that the encoding is
>> >> not necessarily consistent between different commits.  The changes that
>> >> you have proposed force you to pick one encoding that will be used for
>> >> every commit.  If it's wrong then data will be corrupted, and there is
>> >> no option provided to avoid that.  The only way I can see to avoid this
>> >> issue is to not attempt to re-encode the data - just pass it directly
>> >> to git.
>> > ...
> Are you talking about a scenario where most of the commits are UTF-8,
> one is "cp1252" and another one is "cp1251", so a total of 3 encodings
> are used in the Perforce depot?  I don't think that is a common scenario.

Yes.  I think that is where "not necessarily consistent between
different commits" leads us to---not limited only to two encodings.

> I agree with the idea that if you know what the encoding is, then
> why not just use that knowledge to convert that to UTF-8, rather
> than use the encoding header.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-05-05  4:34 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-12  8:52 [PATCH 0/2] git-p4: encoding of data from perforce Andrew Oakley
2021-04-12  8:52 ` [PATCH 1/2] git-p4: avoid decoding more " Andrew Oakley
2021-04-12  8:52 ` [PATCH 2/2] git-p4: do not decode data from perforce by default Andrew Oakley
2021-04-29 10:00   ` Tzadik Vanderhoof
2021-04-30  8:53     ` Andrew Oakley
2021-04-30 15:33       ` Luke Diamand
2021-04-30 18:08         ` Tzadik Vanderhoof
2021-05-04 21:01           ` Andrew Oakley
2021-05-04 21:46             ` Tzadik Vanderhoof
2021-05-05  1:11               ` Junio C Hamano
2021-05-05  4:02                 ` Tzadik Vanderhoof
2021-05-05  4:06                   ` Tzadik Vanderhoof
2021-05-05  4:34                   ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).