git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 1/3] git-p4: remove support for Python 2
@ 2021-12-13  0:30 Tzadik Vanderhoof
  0 siblings, 0 replies; 4+ messages in thread
From: Tzadik Vanderhoof @ 2021-12-13  0:30 UTC (permalink / raw)
  To: Git List

> On Sun, Dec 12, 2021, 5:39 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> This summary makes sense, i.e. if the original SCM doesn't have a
> declared or consistent encoding then having no "encoding" header etc. in
> git likewise makes sense, and we should be trying to handle it in our
> output layer.
>
> [Snipped from above]:
>
> > It's not clear to me how "attempt to detect the encoding somehow" would
> > work.  The first option therefore seems like the best choice.
>
> This really isn't possible to do in the general case, but you can get
> pretty far with heuristics.
>
> I already submitted a patch several months ago to introduce a "p4.fallbackEncoding" option. It got merged to at least the lowest branch, but I think it died at that point.

I did considerable research into the possible options at the time, and
I'm pretty sure the best approach would be:

Add an optional setting for the user to set the encoding.

When decoding, first try UTF-8. If that succeeds, then it's almost
certain that the encoding really is UTF-8. The nature of UTF-8 is that
non-UTF-8 text almost never just happens to be valid when decoded as
UTF-8.

If that fails, use the new setting if present.

This is what my patch does.

I think it would be better to go beyond that, and if it fails UTF-8,
and the new setting was not specified, then use some well- accepted
heuristic library to detect the encoding.

Frankly anything would be better than the current behavior, which is
to completely crash on the first non UTF-8 character encountered (at
least with Python 3).

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2 1/3] git-p4: remove support for Python 2
  2021-12-12 17:50   ` Andrew Oakley
@ 2021-12-12 22:01     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 4+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-12-12 22:01 UTC (permalink / raw)
  To: Andrew Oakley
  Cc: Joel Holdsworth, git, Tzadik Vanderhoof, Dorgon Chang,
	Joachim Kuebart, Daniel Levin, Johannes Schindelin, Luke Diamand,
	Ben Keene


On Sun, Dec 12 2021, Andrew Oakley wrote:

> On Fri, 10 Dec 2021 15:30:59 +0000
> Joel Holdsworth <jholdsworth@nvidia.com> wrote:
>> The motivation for this change is that there is a family of issues
>> with git-p4's handling of incoming text data when it contains bytes
>> which cannot be decoded into UTF-8 characters. For text files created
>> in Windows, CP1252 Smart Quote Characters (0x93 and 0x94) are seen
>> fairly frequently. These codes are invalid in UTF-8, so if the script
>> encounters any file or file name containing them, on Python 2 the
>> symbols will be corrupted, and on Python 3 the script will fail with
>> an exception.
>
> As I've pointed out previously, peforce fails to store the encoding of
> text like commit messages.  With Windows perforce clients, the encoding
> used seems to be based on the current code page on the client which
> made the commit.  If you're part of a global organisation with people
> in different locales making commits then you will find that there is
> not a consistent encoding for commit messages.
>
> Given that you don't know the encoding of the text, what's the best
> thing to do with the data?  Options I can see are:
>
> - Feed the raw bytes directly into git.  The i18n.commitEncoding config
>   option can be set by the user if they want to attempt to decode the
>   commit messages in something other than UTF-8.
> - Attempt to detect the encoding somehow, feed the raw bytes directly
>   into git and set the encoding on the commit.
> - Attempt to dedect the encoding somehow and reencode everything into
>   UTF-8.
>
> Right now, if you use python2 then you get the behaviour as described
> in the first of these options.  It doesn't "corrupt" anything, it just
> transfers the bytes from perforce into git.  If you use python3 then
> git-p4 is unusable because it throws exceptions trying to decode things.
>
> [...]
>
> I think that this is the problem which really needs solving.  Dropping
> support for python2 doesn't make the issue go away (although it might
> make it slightly easier to write the code).  I think that the python2
> compatibility should be maintained at least until the encoding problems
> have been solved for python3.
>
> I previously wrote some patches which attempt to move in what I think is
> the right direction, but unfortunately they never got upstreamed:
>
> https://lore.kernel.org/git/20210412085251.51475-1-andrew@adoakley.name/
>
> Your comments elsewhere that git-p4 could benifit from some clean-up
> seem accurate to me, and it would be good to see that kind of change.

This summary makes sense, i.e. if the original SCM doesn't have a
declared or consistent encoding then having no "encoding" header etc. in
git likewise makes sense, and we should be trying to handle it in our
output layer.

[Snipped from above]:

> It's not clear to me how "attempt to detect the encoding somehow" would
> work.  The first option therefore seems like the best choice.

This really isn't possible to do in the general case, but you can get
pretty far with heuristics.

The best way to do this that I'm aware of is Mozilla's character
detection library:

    https://www-archive.mozilla.org/projects/intl/chardetinterface

Here's an old (maybe/probably not up-to-date) copy of its sources that
I've worked with:

    https://metacpan.org/release/JGMYERS/Encode-Detect-1.01/source/src?sort=[[2,1]]

I.e. if you get arbitrary text the best you can do if you're going in
blind is to have various character/language frequency tables to try to
guess at what encoding is the most plausible, but even then you might
still be wrong.

I'd think for git-p4 (and git-svn, how do we handle this there?) a
sensible first approximation would be to use UTF-8, and if we encounter
data that doesn't conform die.

Then offer the user to manually configure a "fallback" encoding. I think
most real-world projects only have two of those, e.g. some old latin1
data before a UTF-8 migration.

And maybe start with a "don't even try" mode, which AFAICT is what
you're describing Python 2 doing.

If we change the Python 3 code to do what the Python 2 code does now,
will we pull the rug from under users who have only ever used the Python
3 code, and are relying on those semantics somehow?

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2 1/3] git-p4: remove support for Python 2
  2021-12-10 15:30 ` [PATCH v2 1/3] git-p4: remove support for Python 2 Joel Holdsworth
@ 2021-12-12 17:50   ` Andrew Oakley
  2021-12-12 22:01     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 4+ messages in thread
From: Andrew Oakley @ 2021-12-12 17:50 UTC (permalink / raw)
  To: Joel Holdsworth
  Cc: git, Tzadik Vanderhoof, Dorgon Chang, Joachim Kuebart,
	Daniel Levin, Johannes Schindelin, Luke Diamand, Ben Keene

On Fri, 10 Dec 2021 15:30:59 +0000
Joel Holdsworth <jholdsworth@nvidia.com> wrote:
> The motivation for this change is that there is a family of issues
> with git-p4's handling of incoming text data when it contains bytes
> which cannot be decoded into UTF-8 characters. For text files created
> in Windows, CP1252 Smart Quote Characters (0x93 and 0x94) are seen
> fairly frequently. These codes are invalid in UTF-8, so if the script
> encounters any file or file name containing them, on Python 2 the
> symbols will be corrupted, and on Python 3 the script will fail with
> an exception.

As I've pointed out previously, peforce fails to store the encoding of
text like commit messages.  With Windows perforce clients, the encoding
used seems to be based on the current code page on the client which
made the commit.  If you're part of a global organisation with people
in different locales making commits then you will find that there is
not a consistent encoding for commit messages.

Given that you don't know the encoding of the text, what's the best
thing to do with the data?  Options I can see are:

- Feed the raw bytes directly into git.  The i18n.commitEncoding config
  option can be set by the user if they want to attempt to decode the
  commit messages in something other than UTF-8.
- Attempt to detect the encoding somehow, feed the raw bytes directly
  into git and set the encoding on the commit.
- Attempt to dedect the encoding somehow and reencode everything into
  UTF-8.

Right now, if you use python2 then you get the behaviour as described
in the first of these options.  It doesn't "corrupt" anything, it just
transfers the bytes from perforce into git.  If you use python3 then
git-p4 is unusable because it throws exceptions trying to decode things.

It's not clear to me how "attempt to detect the encoding somehow" would
work.  The first option therefore seems like the best choice.

I think that this is the problem which really needs solving.  Dropping
support for python2 doesn't make the issue go away (although it might
make it slightly easier to write the code).  I think that the python2
compatibility should be maintained at least until the encoding problems
have been solved for python3.

I previously wrote some patches which attempt to move in what I think is
the right direction, but unfortunately they never got upstreamed:

https://lore.kernel.org/git/20210412085251.51475-1-andrew@adoakley.name/

Your comments elsewhere that git-p4 could benifit from some clean-up
seem accurate to me, and it would be good to see that kind of change.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH v2 1/3] git-p4: remove support for Python 2
  2021-12-10 15:30 [PATCH v2 0/3] Transition git-p4.py to support Python 3 only Joel Holdsworth
@ 2021-12-10 15:30 ` Joel Holdsworth
  2021-12-12 17:50   ` Andrew Oakley
  0 siblings, 1 reply; 4+ messages in thread
From: Joel Holdsworth @ 2021-12-10 15:30 UTC (permalink / raw)
  To: git
  Cc: Tzadik Vanderhoof, Dorgon Chang, Joachim Kuebart, Daniel Levin,
	Johannes Schindelin, Luke Diamand, Ben Keene, Andrew Oakley,
	Joel Holdsworth

git-p4 previously contained seperate code-paths for Python 2 and 3 to
abstract away the differences in string handling behaviour between the
two platforms.

This patch removes the Python 2 code-paths within this abstraction
without removing the abstractions themselves. These will be removed in
later patches to further modernise the script.

The motivation for this change is that there is a family of issues with
git-p4's handling of incoming text data when it contains bytes which
cannot be decoded into UTF-8 characters. For text files created in
Windows, CP1252 Smart Quote Characters (0x93 and 0x94) are seen fairly
frequently. These codes are invalid in UTF-8, so if the script
encounters any file or file name containing them, on Python 2 the
symbols will be corrupted, and on Python 3 the script will fail with an
exception.

In order to address these issues it will be necessary to overhaul
git-p4's handling of incoming data. Keeping a clean separation between
encoded bytes and decoded text is much easier to do in Python 3. If
Python 2 support must be maintained, this will require careful testing
of the separate code paths for each platform, which is unreasonable
given that Python 2 is now thoroughly deprecated.

The minimum supported Python version has been set to 3.6. This version
is no longer supported by the Python project, however at the current
time it is still available for use in RHEL 8. No features from newer
versions of Python are currently required.

Signed-off-by: Joel Holdsworth <jholdsworth@nvidia.com>
---
 git-p4.py | 90 ++++++++++++++++++-------------------------------------
 1 file changed, 29 insertions(+), 61 deletions(-)

diff --git a/git-p4.py b/git-p4.py
index 2b4500226a..e3fe86e4f2 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
 #
 # git-p4.py -- A tool for bidirectional operation between a Perforce depot and git.
 #
@@ -16,8 +16,9 @@
 # pylint: disable=too-many-branches,too-many-nested-blocks
 #
 import sys
-if sys.version_info.major < 3 and sys.version_info.minor < 7:
-    sys.stderr.write("git-p4: requires Python 2.7 or later.\n")
+if (sys.version_info.major < 3 or
+    (sys.version_info.major == 3 and sys.version_info.minor < 6)):
+    sys.stderr.write("git-p4: requires Python 3.6 or later.\n")
     sys.exit(1)
 import os
 import optparse
@@ -36,16 +37,6 @@
 import errno
 import glob
 
-# On python2.7 where raw_input() and input() are both availble,
-# we want raw_input's semantics, but aliased to input for python3
-# compatibility
-# support basestring in python3
-try:
-    if raw_input and input:
-        input = raw_input
-except:
-    pass
-
 verbose = False
 
 # Only labels/tags matching this will be imported/exported
@@ -176,35 +167,16 @@ def prompt(prompt_text):
         if response in choices:
             return response
 
-# We need different encoding/decoding strategies for text data being passed
-# around in pipes depending on python version
-if bytes is not str:
-    # For python3, always encode and decode as appropriate
-    def decode_text_stream(s):
-        return s.decode() if isinstance(s, bytes) else s
-    def encode_text_stream(s):
-        return s.encode() if isinstance(s, str) else s
-else:
-    # For python2.7, pass read strings as-is, but also allow writing unicode
-    def decode_text_stream(s):
-        return s
-    def encode_text_stream(s):
-        return s.encode('utf_8') if isinstance(s, unicode) else s
+def decode_text_stream(s):
+    return s.decode() if isinstance(s, bytes) else s
+def encode_text_stream(s):
+    return s.encode() if isinstance(s, str) else s
 
 def decode_path(path):
     """Decode a given string (bytes or otherwise) using configured path encoding options
     """
     encoding = gitConfig('git-p4.pathEncoding') or 'utf_8'
-    if bytes is not str:
-        return path.decode(encoding, errors='replace') if isinstance(path, bytes) else path
-    else:
-        try:
-            path.decode('ascii')
-        except:
-            path = path.decode(encoding, errors='replace')
-            if verbose:
-                print('Path with non-ASCII characters detected. Used {} to decode: {}'.format(encoding, path))
-        return path
+    return path.decode(encoding, errors='replace') if isinstance(path, bytes) else path
 
 def run_git_hook(cmd, param=[]):
     """Execute a hook if the hook exists."""
@@ -289,8 +261,8 @@ def write_pipe(c, stdin):
 
 def p4_write_pipe(c, stdin):
     real_cmd = p4_build_cmd(c)
-    if bytes is not str and isinstance(stdin, str):
-        stdin = encode_text_stream(stdin)
+    if isinstance(stdin, str):
+        stdin = stdin.encode()
     return write_pipe(real_cmd, stdin)
 
 def read_pipe_full(c):
@@ -762,21 +734,18 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
     result = []
     try:
         while True:
-            entry = marshal.load(p4.stdout)
-            if bytes is not str:
-                # Decode unmarshalled dict to use str keys and values, except for:
-                #   - `data` which may contain arbitrary binary data
-                #   - `depotFile[0-9]*`, `path`, or `clientFile` which may contain non-UTF8 encoded text
-                decoded_entry = {}
-                for key, value in entry.items():
-                    key = key.decode()
-                    if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
-                        value = value.decode()
-                    decoded_entry[key] = value
-                # Parse out data if it's an error response
-                if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
-                    decoded_entry['data'] = decoded_entry['data'].decode()
-                entry = decoded_entry
+            # Decode unmarshalled dict to use str keys and values, except for:
+            #   - `data` which may contain arbitrary binary data
+            #   - `depotFile[0-9]*`, `path`, or `clientFile` which may contain non-UTF8 encoded text
+            entry = {}
+            for key, value in marshal.load(p4.stdout).items():
+                key = key.decode()
+                if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
+                    value = value.decode()
+                entry[key] = value
+            # Parse out data if it's an error response
+            if entry.get('code') == 'error' and 'data' in entry:
+                entry['data'] = entry['data'].decode()
             if skip_info:
                 if 'code' in entry and entry['code'] == 'info':
                     continue
@@ -3840,14 +3809,13 @@ def openStreams(self):
         self.gitStream = self.importProcess.stdin
         self.gitError = self.importProcess.stderr
 
-        if bytes is not str:
-            # Wrap gitStream.write() so that it can be called using `str` arguments
-            def make_encoded_write(write):
-                def encoded_write(s):
-                    return write(s.encode() if isinstance(s, str) else s)
-                return encoded_write
+        # Wrap gitStream.write() so that it can be called using `str` arguments
+        def make_encoded_write(write):
+            def encoded_write(s):
+                return write(s.encode() if isinstance(s, str) else s)
+            return encoded_write
 
-            self.gitStream.write = make_encoded_write(self.gitStream.write)
+        self.gitStream.write = make_encoded_write(self.gitStream.write)
 
     def closeStreams(self):
         if self.gitStream is None:
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-12-13  0:30 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-13  0:30 [PATCH v2 1/3] git-p4: remove support for Python 2 Tzadik Vanderhoof
  -- strict thread matches above, loose matches on Subject: below --
2021-12-10 15:30 [PATCH v2 0/3] Transition git-p4.py to support Python 3 only Joel Holdsworth
2021-12-10 15:30 ` [PATCH v2 1/3] git-p4: remove support for Python 2 Joel Holdsworth
2021-12-12 17:50   ` Andrew Oakley
2021-12-12 22:01     ` Ævar Arnfjörð Bjarmason

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).