* git-p4 crashes on non UTF-8 output from p4 @ 2021-04-08 19:28 Tzadik Vanderhoof 2021-04-09 15:38 ` Torsten Bögershausen 0 siblings, 1 reply; 24+ messages in thread From: Tzadik Vanderhoof @ 2021-04-08 19:28 UTC (permalink / raw) To: git When git-p4 reads the output from a p4 command, it assumes it will be 100% UTF-8. If even one character in the output of one p4 command is not UTF-8, git-p4 crashes with: File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList value = value.decode() UnicodeDecodeError: 'utf-8' codec can't decode byte Ox93 in position 42: invalid start byte I'd like to make a pull request to have it try another encoding (eg cp1252) and/or use the Unicode replacement character, to prevent the whole program from crashing on such a minor problem. This is especially a problem on the "git p4 clone" command with @all, where git-p4 needs to read thousands of changeset descriptions, one of which may have a stray smart quote, causing the whole clone operation to fail. Sound ok? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: git-p4 crashes on non UTF-8 output from p4 2021-04-08 19:28 git-p4 crashes on non UTF-8 output from p4 Tzadik Vanderhoof @ 2021-04-09 15:38 ` Torsten Bögershausen 2021-04-11 7:16 ` Tzadik Vanderhoof 0 siblings, 1 reply; 24+ messages in thread From: Torsten Bögershausen @ 2021-04-09 15:38 UTC (permalink / raw) To: Tzadik Vanderhoof; +Cc: git On Thu, Apr 08, 2021 at 12:28:25PM -0700, Tzadik Vanderhoof wrote: > When git-p4 reads the output from a p4 command, it assumes it will be > 100% UTF-8. If even one character in the output of one p4 command is > not UTF-8, git-p4 crashes with: > > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList > value = value.decode() UnicodeDecodeError: 'utf-8' codec can't > decode byte Ox93 in position 42: invalid start byte > > I'd like to make a pull request to have it try another encoding (eg > cp1252) and/or use the Unicode replacement character, to prevent the > whole program from crashing on such a minor problem. > > This is especially a problem on the "git p4 clone" command with @all, > where git-p4 needs to read thousands of changeset descriptions, one of > which may have a stray smart quote, causing the whole clone operation > to fail. > > Sound ok? Welcome to the Git community. To start with: I am not a git-p4 expert as such, but seeing that a program is crashing is never a good thing. All efforts to prevent the crash are a step forward. As you mention cp1252 (which is more used under Windows), there are probably lots of system out there which use ISO-8859-15 (or ISO-8859-1) we may have the first whish: Make the encoding/fallback configurable. Let people choose if they want a crash (if things are broken), fallback to cp1252 or one of the other ISO-ISO-8859-x encodings. In that sense: we look forward to a pull-request. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: git-p4 crashes on non UTF-8 output from p4 2021-04-09 15:38 ` Torsten Bögershausen @ 2021-04-11 7:16 ` Tzadik Vanderhoof 2021-04-11 9:37 ` Torsten Bögershausen 0 siblings, 1 reply; 24+ messages in thread From: Tzadik Vanderhoof @ 2021-04-11 7:16 UTC (permalink / raw) To: Torsten Bögershausen; +Cc: git Here is the pull request: From 8d234af842223dceae76ce0affd3bbb3f17bb6d9 Mon Sep 17 00:00:00 2001 From: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> Date: Sat, 10 Apr 2021 22:41:39 -0700 Subject: [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions --- Documentation/git-p4.txt | 10 ++++++++++ git-p4.py | 11 ++++++++++- 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt index f89e68b..71f3487 100644 --- a/Documentation/git-p4.txt +++ b/Documentation/git-p4.txt @@ -638,6 +638,16 @@ git-p4.pathEncoding:: to transcode the paths to UTF-8. As an example, Perforce on Windows often uses "cp1252" to encode path names. +git-p4.fallbackEncoding:: + Perforce changeset descriptions can be in a mixture of encodings. Git-p4 + first tries to interpret each description as UTF-8. If that fails, this + config allows another encoding to be tried. The default is "cp1252". You + can set it to another encoding, for example, "iso-8859-5". If instead of + an encoding, you specify "replace", UTF-8 will be used, with invalid UTF-8 + characters replaced by the Unicode replacement character. If you specify + "none", there is no fallback, and any non UTF-8 character will cause + git-p4 to immediately fail. + git-p4.largeFileSystem:: Specify the system that is used for large (binary) files. Please note that large file systems do not support the 'git p4 submit' command. diff --git a/git-p4.py b/git-p4.py index 09c9e93..18d02b4 100755 --- a/git-p4.py +++ b/git-p4.py @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False, for key, value in entry.items(): key = key.decode() if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')): - value = value.decode() + try: + value = value.decode() + except: + fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'cp1252' + if fallbackEncoding == 'none': + raise + elif fallbackEncoding == 'replace': + value = value.decode(errors='replace') + else: + value = value.decode(encoding=fallbackEncoding) decoded_entry[key] = value # Parse out data if it's an error response if decoded_entry.get('code') == 'error' and 'data' in decoded_entry: -- 2.31.1.windows.1 On Fri, Apr 9, 2021 at 8:38 AM Torsten Bögershausen <tboegi@web.de> wrote: > > On Thu, Apr 08, 2021 at 12:28:25PM -0700, Tzadik Vanderhoof wrote: > > When git-p4 reads the output from a p4 command, it assumes it will be > > 100% UTF-8. If even one character in the output of one p4 command is > > not UTF-8, git-p4 crashes with: > > > > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList > > value = value.decode() UnicodeDecodeError: 'utf-8' codec can't > > decode byte Ox93 in position 42: invalid start byte > > > > I'd like to make a pull request to have it try another encoding (eg > > cp1252) and/or use the Unicode replacement character, to prevent the > > whole program from crashing on such a minor problem. > > > > This is especially a problem on the "git p4 clone" command with @all, > > where git-p4 needs to read thousands of changeset descriptions, one of > > which may have a stray smart quote, causing the whole clone operation > > to fail. > > > > Sound ok? > > Welcome to the Git community. > To start with: I am not a git-p4 expert as such, but seeing that a program is crashing > is never a good thing. > All efforts to prevent the crash are a step forward. > > As you mention cp1252 (which is more used under Windows), there are probably lots of > system out there which use ISO-8859-15 (or ISO-8859-1) we may have the first whish: > > Make the encoding/fallback configurable. > Let people choose if they want a crash (if things are broken), > fallback to cp1252 or one of the other ISO-ISO-8859-x encodings. > > In that sense: we look forward to a pull-request. -- Tzadik ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: git-p4 crashes on non UTF-8 output from p4 2021-04-11 7:16 ` Tzadik Vanderhoof @ 2021-04-11 9:37 ` Torsten Bögershausen 2021-04-11 20:21 ` Tzadik Vanderhoof 0 siblings, 1 reply; 24+ messages in thread From: Torsten Bögershausen @ 2021-04-11 9:37 UTC (permalink / raw) To: Tzadik Vanderhoof; +Cc: git On Sun, Apr 11, 2021 at 12:16:25AM -0700, Tzadik Vanderhoof wrote: > Here is the pull request: Thanks for the work. Some comments inline. > > From 8d234af842223dceae76ce0affd3bbb3f17bb6d9 Mon Sep 17 00:00:00 2001 > From: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> > Date: Sat, 10 Apr 2021 22:41:39 -0700 The subject should be one short line, highlighting what this is all about, followed by a blank line and a longer description about the problem and the solution. The original description was good, see below. > Subject: [PATCH] add git-p4.fallbackEncoding config variable, to prevent > git-p4 from crashing on non UTF-8 changeset descriptions In that sense I make a first trial here, subject for improvements: Subject: [PATCH] Add git-p4.fallbackEncoding config variable When git-p4 reads the output from a p4 command, it assumes it will be 100% UTF-8. If even one character in the output of one p4 command is not UTF-8, git-p4 crashes e.g. with: File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList value = value.decode() UnicodeDecodeError: 'utf-8' codec can't decode byte Ox93 in position 42: invalid start byte Allow to try another encoding (eg cp1252) and/or use the Unicode replacement character to prevent the whole program from crashing on such a "minor" problem. This is especially a problem on the "git p4 clone" command with @all, where git-p4 needs to read thousands of changeset descriptions, one of which may have a stray smart quote, causing the whole clone operation to fail. Introduce "git-p4.fallbackEncoding" to handle non UTF-8 encodings, if needed. > > --- > Documentation/git-p4.txt | 10 ++++++++++ > git-p4.py | 11 ++++++++++- > 2 files changed, 20 insertions(+), 1 deletion(-) > > diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt > index f89e68b..71f3487 100644 > --- a/Documentation/git-p4.txt > +++ b/Documentation/git-p4.txt > @@ -638,6 +638,16 @@ git-p4.pathEncoding:: > to transcode the paths to UTF-8. As an example, Perforce on Windows > often uses "cp1252" to encode path names. > > +git-p4.fallbackEncoding:: > + Perforce changeset descriptions can be in a mixture of encodings. Git-p4 > + first tries to interpret each description as UTF-8. If that fails, this > + config allows another encoding to be tried. The default is "cp1252". You I know that cp1252 is attractive to be used, especially for Windows installations that use Latin-based "characters". But: If we introduce a new config-variable into Git, the default tends to be "if not set to anything, behave as the old Git". > + can set it to another encoding, for example, "iso-8859-5". If instead of ISO-8859-5 may be more portable on the different i18 implementations than the lower-case spelling. > + an encoding, you specify "replace", UTF-8 will be used, with invalid UTF-8 > + characters replaced by the Unicode replacement character. If you specify > + "none", there is no fallback, and any non UTF-8 character will cause > + git-p4 to immediately fail. As said, before, many people may expect Git to fail, so that the default should be none to avoid surprises. When a "non-UTF-8-clean" repo is handled, they want to know it. > + > git-p4.largeFileSystem:: > Specify the system that is used for large (binary) files. Please note > that large file systems do not support the 'git p4 submit' command. > diff --git a/git-p4.py b/git-p4.py > index 09c9e93..18d02b4 100755 > --- a/git-p4.py > +++ b/git-p4.py > @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', > cb=None, skip_info=False, > for key, value in entry.items(): > key = key.decode() > if isinstance(value, bytes) and not (key in > ('data', 'path', 'clientFile') or key.startswith('depotFile')): > - value = value.decode() > + try: > + value = value.decode() > + except: > + fallbackEncoding = > gitConfig("git-p4.fallbackEncoding").lower() or 'cp1252' > + if fallbackEncoding == 'none': > + raise Would it make sense to tell the user about the new config value here? raise Exception("Non UTF-8 detected. See git-p4.fallbackEncoding" Or somewhat in that style ? > + elif fallbackEncoding == 'replace': > + value = value.decode(errors='replace') > + else: > + value = value.decode(encoding=fallbackEncoding) > decoded_entry[key] = value > # Parse out data if it's an error response > if decoded_entry.get('code') == 'error' and 'data' in > decoded_entry: Did I miss the Signed-off-by here? Please have a look here: https://git-scm.com/docs/SubmittingPatches (or look at Documentation/SubmittingPatches in your git source code) And finally: Thanks for the contribution. Is there any chance to add test-cases, to make sure that this feature is well-tested now and in the future ? > -- > 2.31.1.windows.1 > > On Fri, Apr 9, 2021 at 8:38 AM Torsten Bögershausen <tboegi@web.de> wrote: > > > > On Thu, Apr 08, 2021 at 12:28:25PM -0700, Tzadik Vanderhoof wrote: > > > When git-p4 reads the output from a p4 command, it assumes it will be > > > 100% UTF-8. If even one character in the output of one p4 command is > > > not UTF-8, git-p4 crashes with: > > > > > > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList > > > value = value.decode() UnicodeDecodeError: 'utf-8' codec can't > > > decode byte Ox93 in position 42: invalid start byte > > > > > > I'd like to make a pull request to have it try another encoding (eg > > > cp1252) and/or use the Unicode replacement character, to prevent the > > > whole program from crashing on such a minor problem. > > > > > > This is especially a problem on the "git p4 clone" command with @all, > > > where git-p4 needs to read thousands of changeset descriptions, one of > > > which may have a stray smart quote, causing the whole clone operation > > > to fail. > > > > > > Sound ok? > > > > Welcome to the Git community. > > To start with: I am not a git-p4 expert as such, but seeing that a program is crashing > > is never a good thing. > > All efforts to prevent the crash are a step forward. > > > > As you mention cp1252 (which is more used under Windows), there are probably lots of > > system out there which use ISO-8859-15 (or ISO-8859-1) we may have the first whish: > > > > Make the encoding/fallback configurable. > > Let people choose if they want a crash (if things are broken), > > fallback to cp1252 or one of the other ISO-ISO-8859-x encodings. > > > > In that sense: we look forward to a pull-request. > > > > -- > Tzadik ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: git-p4 crashes on non UTF-8 output from p4 2021-04-11 9:37 ` Torsten Bögershausen @ 2021-04-11 20:21 ` Tzadik Vanderhoof 2021-04-12 4:06 ` Torsten Bögershausen 0 siblings, 1 reply; 24+ messages in thread From: Tzadik Vanderhoof @ 2021-04-11 20:21 UTC (permalink / raw) To: Torsten Bögershausen; +Cc: git Thank you for your excellent and friendly feedback! I understand everything you said, but I have a question about the unit test you requested. The git-p4.py script currently does not have tests and is not written in a way that would be testable. (The Python function I modified calls into the shell and requires a valid Perforce installation.) Would you prefer I a) refactor the code to be testable and then write tests or b) skip the unit testing (not sure if there are any further options)? (For option a) I would break out the part of the function I modified into another function and then call my new function for my testing. (I guess it would be better to break the refactoring and my changes into 2 separate commits.) On Sun, Apr 11, 2021 at 2:37 AM Torsten Bögershausen <tboegi@web.de> wrote: > > On Sun, Apr 11, 2021 at 12:16:25AM -0700, Tzadik Vanderhoof wrote: > > Here is the pull request: > > Thanks for the work. Some comments inline. > > > > > From 8d234af842223dceae76ce0affd3bbb3f17bb6d9 Mon Sep 17 00:00:00 2001 > > From: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> > > Date: Sat, 10 Apr 2021 22:41:39 -0700 > > > The subject should be one short line, highlighting what this is all about, > followed by a blank line and a longer description about the problem and > the solution. The original description was good, see below. > > > Subject: [PATCH] add git-p4.fallbackEncoding config variable, to prevent > > git-p4 from crashing on non UTF-8 changeset descriptions > > In that sense I make a first trial here, subject for improvements: > > > Subject: [PATCH] Add git-p4.fallbackEncoding config variable > > When git-p4 reads the output from a p4 command, it assumes it will be > 100% UTF-8. If even one character in the output of one p4 command is > not UTF-8, git-p4 crashes e.g. with: > > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList > value = value.decode() UnicodeDecodeError: 'utf-8' codec can't > decode byte Ox93 in position 42: invalid start byte > > Allow to try another encoding (eg cp1252) and/or use the > Unicode replacement character to prevent the whole program from crashing > on such a "minor" problem. > > This is especially a problem on the "git p4 clone" command with @all, > where git-p4 needs to read thousands of changeset descriptions, one of > which may have a stray smart quote, causing the whole clone operation > to fail. > > Introduce "git-p4.fallbackEncoding" to handle non UTF-8 encodings, if needed. > > > > > --- > > Documentation/git-p4.txt | 10 ++++++++++ > > git-p4.py | 11 ++++++++++- > > 2 files changed, 20 insertions(+), 1 deletion(-) > > > > diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt > > index f89e68b..71f3487 100644 > > --- a/Documentation/git-p4.txt > > +++ b/Documentation/git-p4.txt > > @@ -638,6 +638,16 @@ git-p4.pathEncoding:: > > to transcode the paths to UTF-8. As an example, Perforce on Windows > > often uses "cp1252" to encode path names. > > > > +git-p4.fallbackEncoding:: > > + Perforce changeset descriptions can be in a mixture of encodings. Git-p4 > > + first tries to interpret each description as UTF-8. If that fails, this > > + config allows another encoding to be tried. The default is "cp1252". You > > I know that cp1252 is attractive to be used, especially for Windows installations that > use Latin-based "characters". > But: If we introduce a new config-variable into Git, the default tends to be > "if not set to anything, behave as the old Git". > > > + can set it to another encoding, for example, "iso-8859-5". If instead of > ISO-8859-5 may be more portable on the different i18 implementations > than the lower-case spelling. > > > + an encoding, you specify "replace", UTF-8 will be used, with invalid UTF-8 > > + characters replaced by the Unicode replacement character. If you specify > > + "none", there is no fallback, and any non UTF-8 character will cause > > + git-p4 to immediately fail. > > As said, before, many people may expect Git to fail, so that the default should be > none to avoid surprises. > When a "non-UTF-8-clean" repo is handled, they want to know it. > > > + > > git-p4.largeFileSystem:: > > Specify the system that is used for large (binary) files. Please note > > that large file systems do not support the 'git p4 submit' command. > > diff --git a/git-p4.py b/git-p4.py > > index 09c9e93..18d02b4 100755 > > --- a/git-p4.py > > +++ b/git-p4.py > > @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', > > cb=None, skip_info=False, > > for key, value in entry.items(): > > key = key.decode() > > if isinstance(value, bytes) and not (key in > > ('data', 'path', 'clientFile') or key.startswith('depotFile')): > > - value = value.decode() > > + try: > > + value = value.decode() > > + except: > > + fallbackEncoding = > > gitConfig("git-p4.fallbackEncoding").lower() or 'cp1252' > > + if fallbackEncoding == 'none': > > + raise > > Would it make sense to tell the user about the new config value here? > raise Exception("Non UTF-8 detected. See git-p4.fallbackEncoding" > Or somewhat in that style ? > > > + elif fallbackEncoding == 'replace': > > + value = value.decode(errors='replace') > > + else: > > + value = value.decode(encoding=fallbackEncoding) > > decoded_entry[key] = value > > # Parse out data if it's an error response > > if decoded_entry.get('code') == 'error' and 'data' in > > decoded_entry: > > > Did I miss the Signed-off-by here? > > Please have a look here: > https://git-scm.com/docs/SubmittingPatches > > (or look at Documentation/SubmittingPatches in your git source code) > > And finally: Thanks for the contribution. > Is there any chance to add test-cases, to make sure that this feature > is well-tested now and in the future ? > > > > -- > > 2.31.1.windows.1 > > > > On Fri, Apr 9, 2021 at 8:38 AM Torsten Bögershausen <tboegi@web.de> wrote: > > > > > > On Thu, Apr 08, 2021 at 12:28:25PM -0700, Tzadik Vanderhoof wrote: > > > > When git-p4 reads the output from a p4 command, it assumes it will be > > > > 100% UTF-8. If even one character in the output of one p4 command is > > > > not UTF-8, git-p4 crashes with: > > > > > > > > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList > > > > value = value.decode() UnicodeDecodeError: 'utf-8' codec can't > > > > decode byte Ox93 in position 42: invalid start byte > > > > > > > > I'd like to make a pull request to have it try another encoding (eg > > > > cp1252) and/or use the Unicode replacement character, to prevent the > > > > whole program from crashing on such a minor problem. > > > > > > > > This is especially a problem on the "git p4 clone" command with @all, > > > > where git-p4 needs to read thousands of changeset descriptions, one of > > > > which may have a stray smart quote, causing the whole clone operation > > > > to fail. > > > > > > > > Sound ok? > > > > > > Welcome to the Git community. > > > To start with: I am not a git-p4 expert as such, but seeing that a program is crashing > > > is never a good thing. > > > All efforts to prevent the crash are a step forward. > > > > > > As you mention cp1252 (which is more used under Windows), there are probably lots of > > > system out there which use ISO-8859-15 (or ISO-8859-1) we may have the first whish: > > > > > > Make the encoding/fallback configurable. > > > Let people choose if they want a crash (if things are broken), > > > fallback to cp1252 or one of the other ISO-ISO-8859-x encodings. > > > > > > In that sense: we look forward to a pull-request. > > > > > > > > -- > > Tzadik -- Tzadik ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: git-p4 crashes on non UTF-8 output from p4 2021-04-11 20:21 ` Tzadik Vanderhoof @ 2021-04-12 4:06 ` Torsten Bögershausen 2021-04-21 8:46 ` [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions Tzadik Vanderhoof 0 siblings, 1 reply; 24+ messages in thread From: Torsten Bögershausen @ 2021-04-12 4:06 UTC (permalink / raw) To: Tzadik Vanderhoof; +Cc: git Hej Tzadik, Let's start with a side note: This mailing list doesn't use top-posting, everything new is at the end, so I moved everything there. And removed some of the old stuff. > On Sun, Apr 11, 2021 at 2:37 AM Torsten Bögershausen <tboegi@web.de> wrote: > > > > On Sun, Apr 11, 2021 at 12:16:25AM -0700, Tzadik Vanderhoof wrote: > > > Here is the pull request: > > > > Thanks for the work. Some comments inline. > > > > > > > > From 8d234af842223dceae76ce0affd3bbb3f17bb6d9 Mon Sep 17 00:00:00 2001 > > > From: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> > > > Date: Sat, 10 Apr 2021 22:41:39 -0700 > > > > > > The subject should be one short line, highlighting what this is all about, > > followed by a blank line and a longer description about the problem and > > the solution. The original description was good, see below. > > > > > Subject: [PATCH] add git-p4.fallbackEncoding config variable, to prevent > > > git-p4 from crashing on non UTF-8 changeset descriptions > > > > In that sense I make a first trial here, subject for improvements: > > > > > > Subject: [PATCH] Add git-p4.fallbackEncoding config variable > > > > When git-p4 reads the output from a p4 command, it assumes it will be > > 100% UTF-8. If even one character in the output of one p4 command is > > not UTF-8, git-p4 crashes e.g. with: > > > > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList > > value = value.decode() UnicodeDecodeError: 'utf-8' codec can't > > decode byte Ox93 in position 42: invalid start byte > > > > Allow to try another encoding (eg cp1252) and/or use the > > Unicode replacement character to prevent the whole program from crashing > > on such a "minor" problem. > > > > This is especially a problem on the "git p4 clone" command with @all, > > where git-p4 needs to read thousands of changeset descriptions, one of > > which may have a stray smart quote, causing the whole clone operation > > to fail. > > > > Introduce "git-p4.fallbackEncoding" to handle non UTF-8 encodings, if needed. > > > > > > > > --- > > > Documentation/git-p4.txt | 10 ++++++++++ > > > git-p4.py | 11 ++++++++++- > > > 2 files changed, 20 insertions(+), 1 deletion(-) > > > > > > diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt > > > index f89e68b..71f3487 100644 > > > --- a/Documentation/git-p4.txt > > > +++ b/Documentation/git-p4.txt > > > @@ -638,6 +638,16 @@ git-p4.pathEncoding:: > > > to transcode the paths to UTF-8. As an example, Perforce on Windows > > > often uses "cp1252" to encode path names. > > > > > > +git-p4.fallbackEncoding:: > > > + Perforce changeset descriptions can be in a mixture of encodings. Git-p4 > > > + first tries to interpret each description as UTF-8. If that fails, this > > > + config allows another encoding to be tried. The default is "cp1252". You > > > > I know that cp1252 is attractive to be used, especially for Windows installations that > > use Latin-based "characters". > > But: If we introduce a new config-variable into Git, the default tends to be > > "if not set to anything, behave as the old Git". > > > > > + can set it to another encoding, for example, "iso-8859-5". If instead of > > ISO-8859-5 may be more portable on the different i18 implementations > > than the lower-case spelling. > > > > > + an encoding, you specify "replace", UTF-8 will be used, with invalid UTF-8 > > > + characters replaced by the Unicode replacement character. If you specify > > > + "none", there is no fallback, and any non UTF-8 character will cause > > > + git-p4 to immediately fail. > > > > As said, before, many people may expect Git to fail, so that the default should be > > none to avoid surprises. > > When a "non-UTF-8-clean" repo is handled, they want to know it. > > > > > + > > > git-p4.largeFileSystem:: > > > Specify the system that is used for large (binary) files. Please note > > > that large file systems do not support the 'git p4 submit' command. > > > diff --git a/git-p4.py b/git-p4.py > > > index 09c9e93..18d02b4 100755 > > > --- a/git-p4.py > > > +++ b/git-p4.py > > > @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', > > > cb=None, skip_info=False, > > > for key, value in entry.items(): > > > key = key.decode() > > > if isinstance(value, bytes) and not (key in > > > ('data', 'path', 'clientFile') or key.startswith('depotFile')): > > > - value = value.decode() > > > + try: > > > + value = value.decode() > > > + except: > > > + fallbackEncoding = > > > gitConfig("git-p4.fallbackEncoding").lower() or 'cp1252' > > > + if fallbackEncoding == 'none': > > > + raise > > > > Would it make sense to tell the user about the new config value here? > > raise Exception("Non UTF-8 detected. See git-p4.fallbackEncoding" > > Or somewhat in that style ? > > > > > + elif fallbackEncoding == 'replace': > > > + value = value.decode(errors='replace') > > > + else: > > > + value = value.decode(encoding=fallbackEncoding) > > > decoded_entry[key] = value > > > # Parse out data if it's an error response > > > if decoded_entry.get('code') == 'error' and 'data' in > > > decoded_entry: > > > > > > Did I miss the Signed-off-by here? > > > > Please have a look here: > > https://git-scm.com/docs/SubmittingPatches > > > > (or look at Documentation/SubmittingPatches in your git source code) > > > > And finally: Thanks for the contribution. > > Is there any chance to add test-cases, to make sure that this feature > > is well-tested now and in the future ? > > [snip] On Sun, Apr 11, 2021 at 01:21:47PM -0700, Tzadik Vanderhoof wrote: > Thank you for your excellent and friendly feedback! > > I understand everything you said, but I have a question about the unit > test you requested. The git-p4.py script currently does not have > tests and is not written in a way that would be testable. (The Python > function I modified calls into the shell and requires a valid Perforce > installation.) I am not really sure about that (there are no test cases). A valid Perforce installation is needed, yes. Otherwise the p4 tests are skipped. There are a lot of p4 tests under t/t98..p4....sh git show a8b05162e894b88aeb7d5064dab Tells me e.g. that both git-p4.py and, in this very commit, t/t9822-git-p4-path-encoding.sh have been changed. All the test are t98xx-git-p4-something.sh (fiund under t), and you new testcase may be named something like t9835-git-p4-fallbackEncoding.sh > > Would you prefer I a) refactor the code to be testable and then write > tests or b) skip the unit testing (not sure if there are any further > options)? > > (For option a) I would break out the part of the function I modified > into another function and then call my new function for my testing. > (I guess it would be better to break the refactoring and my changes > into 2 separate commits.) > Probably not. Typically we can construct a test case first, and see that ot fails (in you local test-running). After updating git-p4.py with your improvments the new test should pass. And of course all the other p4 tests. There is a whole bunch of "CI tests", run on Linux, MacOs, Windows. One example of a test run is here, and the "regular (linux-clang,...) is running the p4 tests: https://github.com/git/git/runs/2309995044?check_suite_focus=true ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions 2021-04-12 4:06 ` Torsten Bögershausen @ 2021-04-21 8:46 ` Tzadik Vanderhoof 2021-04-21 8:55 ` Tzadik Vanderhoof 0 siblings, 1 reply; 24+ messages in thread From: Tzadik Vanderhoof @ 2021-04-21 8:46 UTC (permalink / raw) To: git, tboegi, tzadik.vanderhoof --- Documentation/git-p4.txt | 10 ++++ git-p4.py | 11 +++- t/t9835-git-p4-config-fallback-encoding.sh | 65 ++++++++++++++++++++++ 3 files changed, 85 insertions(+), 1 deletion(-) create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt index f89e68b..e0131a9 100644 --- a/Documentation/git-p4.txt +++ b/Documentation/git-p4.txt @@ -638,6 +638,16 @@ git-p4.pathEncoding:: to transcode the paths to UTF-8. As an example, Perforce on Windows often uses "cp1252" to encode path names. +git-p4.fallbackEncoding:: + Perforce changeset descriptions can be in a mixture of encodings. + Git-p4 first tries to interpret each description as UTF-8. If that + fails, this config allows another encoding to be tried. You + can specify, for example, "cp1252". If instead of an encoding, + you specify "replace", UTF-8 will be used, with invalid UTF-8 + characters replaced by the Unicode replacement character. If you + specify "none" (the default), there is no fallback, and any non + UTF-8 character will cause git-p4 to immediately fail. + git-p4.largeFileSystem:: Specify the system that is used for large (binary) files. Please note that large file systems do not support the 'git p4 submit' command. diff --git a/git-p4.py b/git-p4.py index 09c9e93..173f78a 100755 --- a/git-p4.py +++ b/git-p4.py @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False, for key, value in entry.items(): key = key.decode() if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')): - value = value.decode() + try: + value = value.decode() + except UnicodeDecodeError as ex: + fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none' + if fallbackEncoding == 'none': + raise Exception("UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding") from ex + elif fallbackEncoding == 'replace': + value = value.decode(errors='replace') + else: + value = value.decode(encoding=fallbackEncoding) decoded_entry[key] = value # Parse out data if it's an error response if decoded_entry.get('code') == 'error' and 'data' in decoded_entry: diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh new file mode 100755 index 0000000..56a245e --- /dev/null +++ b/t/t9835-git-p4-config-fallback-encoding.sh @@ -0,0 +1,65 @@ +#!/bin/sh + +test_description='test git-p4.fallbackEncoding config' + +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME + +. ./lib-git-p4.sh + +if test_have_prereq !MINGW,!CYGWIN; then + skip_all='This system is not subject to encoding failures in "git p4 clone"' + test_done +fi + +test_expect_success 'start p4d' ' + start_p4d +' + +test_expect_success 'add cp1252 description' ' + cd "$cli" && + echo file1 >file1 && + p4 add file1 && + p4 submit -d documentación +' + +test_expect_success 'clone fails with git-p4.fallbackEncoding unset' ' + test_might_fail git config --global --unset git-p4.fallbackEncoding && + test_when_finished cleanup_git && + ( + test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual && + grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual + ) +' +test_expect_success 'clone fails with git-p4.fallbackEncoding set to "none"' ' + git config --global git-p4.fallbackEncoding none && + test_when_finished cleanup_git && + ( + test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual && + grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual + ) +' + +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' ' + git config --global git-p4.fallbackEncoding cp1252 && + test_when_finished cleanup_git && + ( + git p4 clone --dest="$git" //depot@all && + cd "$git" && + git log --oneline >log && + desc=$(head -1 log | awk '\''{print $2}'\'') && [ "$desc" = "documentación" ] + ) +' + +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "replace"' ' + git config --global git-p4.fallbackEncoding replace && + test_when_finished cleanup_git && + ( + git p4 clone --dest="$git" //depot@all && + cd "$git" && + git log --oneline >log && + desc=$(head -1 log | awk '\''{print $2}'\'') && [ "$desc" = "documentaci�n" ] + ) +' + +test_done -- 2.31.1.windows.1 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions 2021-04-21 8:46 ` [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions Tzadik Vanderhoof @ 2021-04-21 8:55 ` Tzadik Vanderhoof 2021-04-22 5:05 ` [PATCH v3] add git-p4.fallbackEncoding config setting, " Tzadik Vanderhoof 0 siblings, 1 reply; 24+ messages in thread From: Tzadik Vanderhoof @ 2021-04-21 8:55 UTC (permalink / raw) To: git, Torsten Bögershausen, Tzadik Vanderhoof Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH v3] add git-p4.fallbackEncoding config setting, to prevent git-p4 from crashing on non UTF-8 changeset descriptions 2021-04-21 8:55 ` Tzadik Vanderhoof @ 2021-04-22 5:05 ` Tzadik Vanderhoof 2021-04-22 15:50 ` Torsten Bögershausen 0 siblings, 1 reply; 24+ messages in thread From: Tzadik Vanderhoof @ 2021-04-22 5:05 UTC (permalink / raw) To: git, tboegi; +Cc: Tzadik Vanderhoof When git-p4 reads the output from a p4 command, it assumes it will be 100% UTF-8. If even one character in the output of one p4 command is not UTF-8, git-p4 crashes with: File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList value = value.decode() UnicodeDecodeError: 'utf-8' codec can't decode byte Ox93 in position 42: invalid start byte This is especially a problem on the "git p4 clone ... @all" command, where git-p4 needs to read thousands of changeset descriptions, one of which may have a stray smart quote, causing the whole clone operation to fail. This pull request adds a new config setting, allowing git-p4 to try another encoding (for example, "cp1252") and/or use the Unicode replacement character, to prevent the whole program from crashing on such a minor problem. See the documentation included in the patch for more details of how the new config setting works. Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> --- Documentation/git-p4.txt | 10 ++++ git-p4.py | 11 +++- t/t9835-git-p4-config-fallback-encoding.sh | 65 ++++++++++++++++++++++ 3 files changed, 85 insertions(+), 1 deletion(-) create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt index f89e68b..e0131a9 100644 --- a/Documentation/git-p4.txt +++ b/Documentation/git-p4.txt @@ -638,6 +638,16 @@ git-p4.pathEncoding:: to transcode the paths to UTF-8. As an example, Perforce on Windows often uses "cp1252" to encode path names. +git-p4.fallbackEncoding:: + Perforce changeset descriptions can be in a mixture of encodings. + Git-p4 first tries to interpret each description as UTF-8. If that + fails, this config allows another encoding to be tried. You + can specify, for example, "cp1252". If instead of an encoding, + you specify "replace", UTF-8 will be used, with invalid UTF-8 + characters replaced by the Unicode replacement character. If you + specify "none" (the default), there is no fallback, and any non + UTF-8 character will cause git-p4 to immediately fail. + git-p4.largeFileSystem:: Specify the system that is used for large (binary) files. Please note that large file systems do not support the 'git p4 submit' command. diff --git a/git-p4.py b/git-p4.py index 09c9e93..3364287 100755 --- a/git-p4.py +++ b/git-p4.py @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False, for key, value in entry.items(): key = key.decode() if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')): - value = value.decode() + try: + value = value.decode() + except UnicodeDecodeError as ex: + fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none' + if fallbackEncoding == 'none': + raise Exception("UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding") + elif fallbackEncoding == 'replace': + value = value.decode(errors='replace') + else: + value = value.decode(encoding=fallbackEncoding) decoded_entry[key] = value # Parse out data if it's an error response if decoded_entry.get('code') == 'error' and 'data' in decoded_entry: diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh new file mode 100755 index 0000000..56a245e --- /dev/null +++ b/t/t9835-git-p4-config-fallback-encoding.sh @@ -0,0 +1,65 @@ +#!/bin/sh + +test_description='test git-p4.fallbackEncoding config' + +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME + +. ./lib-git-p4.sh + +if test_have_prereq !MINGW,!CYGWIN; then + skip_all='This system is not subject to encoding failures in "git p4 clone"' + test_done +fi + +test_expect_success 'start p4d' ' + start_p4d +' + +test_expect_success 'add cp1252 description' ' + cd "$cli" && + echo file1 >file1 && + p4 add file1 && + p4 submit -d documentación +' + +test_expect_success 'clone fails with git-p4.fallbackEncoding unset' ' + test_might_fail git config --global --unset git-p4.fallbackEncoding && + test_when_finished cleanup_git && + ( + test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual && + grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual + ) +' +test_expect_success 'clone fails with git-p4.fallbackEncoding set to "none"' ' + git config --global git-p4.fallbackEncoding none && + test_when_finished cleanup_git && + ( + test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual && + grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual + ) +' + +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' ' + git config --global git-p4.fallbackEncoding cp1252 && + test_when_finished cleanup_git && + ( + git p4 clone --dest="$git" //depot@all && + cd "$git" && + git log --oneline >log && + desc=$(head -1 log | awk '\''{print $2}'\'') && [ "$desc" = "documentación" ] + ) +' + +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "replace"' ' + git config --global git-p4.fallbackEncoding replace && + test_when_finished cleanup_git && + ( + git p4 clone --dest="$git" //depot@all && + cd "$git" && + git log --oneline >log && + desc=$(head -1 log | awk '\''{print $2}'\'') && [ "$desc" = "documentaci�n" ] + ) +' + +test_done -- 2.31.1.windows.1 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH v3] add git-p4.fallbackEncoding config setting, to prevent git-p4 from crashing on non UTF-8 changeset descriptions 2021-04-22 5:05 ` [PATCH v3] add git-p4.fallbackEncoding config setting, " Tzadik Vanderhoof @ 2021-04-22 15:50 ` Torsten Bögershausen 2021-04-22 16:17 ` Eric Sunshine 0 siblings, 1 reply; 24+ messages in thread From: Torsten Bögershausen @ 2021-04-22 15:50 UTC (permalink / raw) To: Tzadik Vanderhoof; +Cc: git On Wed, Apr 21, 2021 at 10:05:04PM -0700, Tzadik Vanderhoof wrote: Thanks for V3, please see some comments inline. > When git-p4 reads the output from a p4 command, it assumes it will be > 100% UTF-8. If even one character in the output of one p4 command is > not UTF-8, git-p4 crashes with: > > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList > value = value.decode() UnicodeDecodeError: 'utf-8' codec can't > decode byte Ox93 in position 42: invalid start byte > > This is especially a problem on the "git p4 clone ... @all" command, > where git-p4 needs to read thousands of changeset descriptions, one of > which may have a stray smart quote, causing the whole clone operation > to fail. > > This pull request adds a new config setting, allowing git-p4 to try > another encoding (for example, "cp1252") and/or use the Unicode replacement > character, to prevent the whole program from crashing on such a minor problem. "This pull request" is somewhat superflous wording. How about: Add a new config setting, allowing git-p4 to try a fallback encoding (for example, "cp1252") and/or use the Unicode replacement character, to prevent the whole program from crashing on such a minor problem. Documentation is good (and needed, and neccessary). Probably this is then not needed: > See the documentation included in the patch for more details of how > the new config setting works. > > Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> > --- > Documentation/git-p4.txt | 10 ++++ > git-p4.py | 11 +++- > t/t9835-git-p4-config-fallback-encoding.sh | 65 ++++++++++++++++++++++ > 3 files changed, 85 insertions(+), 1 deletion(-) > create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh > > diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt > index f89e68b..e0131a9 100644 > --- a/Documentation/git-p4.txt > +++ b/Documentation/git-p4.txt > @@ -638,6 +638,16 @@ git-p4.pathEncoding:: > to transcode the paths to UTF-8. As an example, Perforce on Windows > often uses "cp1252" to encode path names. > > +git-p4.fallbackEncoding:: > + Perforce changeset descriptions can be in a mixture of encodings. > + Git-p4 first tries to interpret each description as UTF-8. If that > + fails, this config allows another encoding to be tried. You > + can specify, for example, "cp1252". That looks OK according to https://docs.python.org/3/library/codecs.html#standard-encodings > + If instead of an encoding, > + you specify "replace", UTF-8 will be used, with invalid UTF-8 > + characters replaced by the Unicode replacement character. If you > + specify "none" (the default), there is no fallback, and any non > + UTF-8 character will cause git-p4 to immediately fail. > + May be, that is a matter of taste: > + If git-p4.fallbackEncoding is "replace" ", UTF-8 will be used, with invalid UTF-8 > + characters replaced by the Unicode replacement character. > + The default is "none": there is no fallback, and any non > + UTF-8 character will cause git-p4 to immediately fail. > + > git-p4.largeFileSystem:: > Specify the system that is used for large (binary) files. Please note > that large file systems do not support the 'git p4 submit' command. > diff --git a/git-p4.py b/git-p4.py > index 09c9e93..3364287 100755 > --- a/git-p4.py > +++ b/git-p4.py > @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False, > for key, value in entry.items(): > key = key.decode() > if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')): > - value = value.decode() > + try: > + value = value.decode() > + except UnicodeDecodeError as ex: > + fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none' > + if fallbackEncoding == 'none': > + raise Exception("UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding") > + elif fallbackEncoding == 'replace': > + value = value.decode(errors='replace') > + else: > + value = value.decode(encoding=fallbackEncoding) > decoded_entry[key] = value > # Parse out data if it's an error response > if decoded_entry.get('code') == 'error' and 'data' in decoded_entry: > diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh > new file mode 100755 > index 0000000..56a245e > --- /dev/null > +++ b/t/t9835-git-p4-config-fallback-encoding.sh > @@ -0,0 +1,65 @@ > +#!/bin/sh > + > +test_description='test git-p4.fallbackEncoding config' > + > +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main > +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME > + > +. ./lib-git-p4.sh > + > +if test_have_prereq !MINGW,!CYGWIN; then > + skip_all='This system is not subject to encoding failures in "git p4 clone"' > + test_done > +fi Out of curiosity: Why are Windows versions (MINGW, CYGWIN) excluded ? > + > +test_expect_success 'start p4d' ' > + start_p4d > +' > + > +test_expect_success 'add cp1252 description' ' > + cd "$cli" && > + echo file1 >file1 && > + p4 add file1 && > + p4 submit -d documentación > +' > + > +test_expect_success 'clone fails with git-p4.fallbackEncoding unset' ' > + test_might_fail git config --global --unset git-p4.fallbackEncoding && > + test_when_finished cleanup_git && > + ( > + test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual && Should this be >actual ? And please no ' ' between the '>' and the filename. > + grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual > + ) > +' > +test_expect_success 'clone fails with git-p4.fallbackEncoding set to "none"' ' > + git config --global git-p4.fallbackEncoding none && > + test_when_finished cleanup_git && > + ( > + test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual && Same here 2 >actual > + grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual > + ) > +' > + > +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' ' > + git config --global git-p4.fallbackEncoding cp1252 && > + test_when_finished cleanup_git && > + ( > + git p4 clone --dest="$git" //depot@all && > + cd "$git" && > + git log --oneline >log && > + desc=$(head -1 log | awk '\''{print $2}'\'') && [ "$desc" = "documentación" ] Style nit: See Documentation/CodingGuidelines: - We prefer "test" over "[ ... ]". desc=$(head -1 log | awk '\''{print $2}'\'') && test "$desc" = "documentación" > + ) > +' > + > +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "replace"' ' > + git config --global git-p4.fallbackEncoding replace && > + test_when_finished cleanup_git && > + ( > + git p4 clone --dest="$git" //depot@all && > + cd "$git" && > + git log --oneline >log && > + desc=$(head -1 log | awk '\''{print $2}'\'') && [ "$desc" = "documentaci�n" ] > + ) > +' > + > +test_done > -- > 2.31.1.windows.1 > Are there any more comments from the p4 experts ? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v3] add git-p4.fallbackEncoding config setting, to prevent git-p4 from crashing on non UTF-8 changeset descriptions 2021-04-22 15:50 ` Torsten Bögershausen @ 2021-04-22 16:17 ` Eric Sunshine 2021-04-22 22:33 ` Eric Sunshine 0 siblings, 1 reply; 24+ messages in thread From: Eric Sunshine @ 2021-04-22 16:17 UTC (permalink / raw) To: Torsten Bögershausen; +Cc: Tzadik Vanderhoof, Git List On Thu, Apr 22, 2021 at 11:51 AM Torsten Bögershausen <tboegi@web.de> wrote: > On Wed, Apr 21, 2021 at 10:05:04PM -0700, Tzadik Vanderhoof wrote: > > +if test_have_prereq !MINGW,!CYGWIN; then > > + skip_all='This system is not subject to encoding failures in "git p4 clone"' > > + test_done > > +fi > > Out of curiosity: Why are Windows versions (MINGW, CYGWIN) excluded ? The answer to this question is probably worthy of recording as an in-code comment just above this conditional so that people coming upon this test script in the future don't have to ask the same question (which is especially important if the author is no longer reachable). If an in-code comment is overkill, then it would probably be a good idea for the commit message to explain the reason. > > +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' ' > > + git config --global git-p4.fallbackEncoding cp1252 && > > + test_when_finished cleanup_git && > > + ( > > + git p4 clone --dest="$git" //depot@all && > > + cd "$git" && > > + git log --oneline >log && > > + desc=$(head -1 log | awk '\''{print $2}'\'') && [ "$desc" = "documentación" ] > > Style nit: > See Documentation/CodingGuidelines: - We prefer "test" over "[ ... ]". > > desc=$(head -1 log | awk '\''{print $2}'\'') && test "$desc" = "documentación" Style also suggests splitting the line after the &&. We normally want to avoid using bare single-quotes inside the body of the test since the body itself is a single-quoted string. These single-quotes make it harder for a reader to reason about what is going on; especially with the $2 in there, one has to spend extra cycles wondering if $2 is correctly expanded when the test runs or when it is first defined. So, an easier-to-understand rewrite might be: desc=$(head -1 log | awk ''{print \$2}'') && test "$desc" = "documentación" Many existing tests in this project use `cut` for word-plucking, so an alternative would be: desc=$(head -1 log | cut -d" " -f2) && ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v3] add git-p4.fallbackEncoding config setting, to prevent git-p4 from crashing on non UTF-8 changeset descriptions 2021-04-22 16:17 ` Eric Sunshine @ 2021-04-22 22:33 ` Eric Sunshine 2021-04-23 6:36 ` [PATCH] add git-p4.fallbackEncoding config variable, " Tzadik Vanderhoof 0 siblings, 1 reply; 24+ messages in thread From: Eric Sunshine @ 2021-04-22 22:33 UTC (permalink / raw) To: Torsten Bögershausen; +Cc: Tzadik Vanderhoof, Git List On Thu, Apr 22, 2021 at 12:17 PM Eric Sunshine <sunshine@sunshineco.com> wrote: > We normally want to avoid using bare single-quotes inside the body of > the test since the body itself is a single-quoted string. These > single-quotes make it harder for a reader to reason about what is > going on; especially with the $2 in there, one has to spend extra > cycles wondering if $2 is correctly expanded when the test runs or > when it is first defined. So, an easier-to-understand rewrite might > be: > > desc=$(head -1 log | awk ''{print \$2}'') && Of course, the quotes surrounding the {print...} should be double-quotes, not pairs of single-quotes: desc=$(head -1 log | awk "{print \$2}") && (I didn't notice the problem when originally composing the email since the compose window wasn't using a fixed-width font, and only noticed it later when re-reading it in a mail reader which does use fixed-width. Sorry for any potential confusion.) > Many existing tests in this project use `cut` for word-plucking, so an > alternative would be: > > desc=$(head -1 log | cut -d" " -f2) && At any rate, using `cut` would be a good option since there's plenty of precedent in existing test scripts. ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions 2021-04-22 22:33 ` Eric Sunshine @ 2021-04-23 6:36 ` Tzadik Vanderhoof 2021-04-23 6:44 ` Tzadik Vanderhoof 0 siblings, 1 reply; 24+ messages in thread From: Tzadik Vanderhoof @ 2021-04-23 6:36 UTC (permalink / raw) To: Eric Sunshine, Torsten Bögershausen, Git List; +Cc: Tzadik Vanderhoof When git-p4 reads the output from a p4 command, it assumes it will be 100% UTF-8. If even one character in the output of one p4 command is not UTF-8, git-p4 crashes with: File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList value = value.decode() UnicodeDecodeError: 'utf-8' codec can't decode byte Ox93 in position 42: invalid start byte This is especially a problem for the "git p4 clone ... @all" command, where git-p4 needs to read thousands of changeset descriptions, one of which may have a stray smart quote, causing the whole clone operation to fail. Add a new config setting, allowing git-p4 to try a fallback encoding (for example, "cp1252") and/or use the Unicode replacement character, to prevent the whole program from crashing on such a minor problem. Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> --- Documentation/git-p4.txt | 9 +++ git-p4.py | 11 +++- t/t9835-git-p4-config-fallback-encoding.sh | 76 ++++++++++++++++++++++ 3 files changed, 95 insertions(+), 1 deletion(-) create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt index f89e68b424..86d3ffa644 100644 --- a/Documentation/git-p4.txt +++ b/Documentation/git-p4.txt @@ -638,6 +638,15 @@ git-p4.pathEncoding:: to transcode the paths to UTF-8. As an example, Perforce on Windows often uses "cp1252" to encode path names. +git-p4.fallbackEncoding:: + Perforce changeset descriptions can be stored in any encoding. + Git-p4 first tries to interpret each description as UTF-8. If that + fails, this config allows another encoding to be tried. You can specify, + for example, "cp1252". If git-p4.fallbackEncoding is "replace", UTF-8 will + be used, with invalid UTF-8 characters replaced by the Unicode replacement + character. The default is "none": there is no fallback, and any non UTF-8 + character will cause git-p4 to immediately fail. + git-p4.largeFileSystem:: Specify the system that is used for large (binary) files. Please note that large file systems do not support the 'git p4 submit' command. diff --git a/git-p4.py b/git-p4.py index 09c9e93ac4..202fb01bdf 100755 --- a/git-p4.py +++ b/git-p4.py @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False, for key, value in entry.items(): key = key.decode() if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')): - value = value.decode() + try: + value = value.decode() + except UnicodeDecodeError: + fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none' + if fallbackEncoding == 'none': + raise Exception("UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding") + elif fallbackEncoding == 'replace': + value = value.decode(errors='replace') + else: + value = value.decode(encoding=fallbackEncoding) decoded_entry[key] = value # Parse out data if it's an error response if decoded_entry.get('code') == 'error' and 'data' in decoded_entry: diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh new file mode 100755 index 0000000000..ce352c826b --- /dev/null +++ b/t/t9835-git-p4-config-fallback-encoding.sh @@ -0,0 +1,76 @@ +#!/bin/sh + +test_description='test git-p4.fallbackEncoding config' + +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME + +. ./lib-git-p4.sh + +# The Windows build of p4 encodes its command-line arguments according to the +# active code page (which defaults to "cp1252"). As a result, "p4 submit -d" causes +# Unicode changeset descriptions to be stored in the Perforce database as cp1252, +# and a subsequent "git p4 clone" attempting to decode these descriptions as UTF-8 +# will raise a UnicodeDecodeError, necessitating the use of the git-p4.fallbackEncoding config. +# +# The Linux build of p4 encodes its command-line arguments as UTF-8, so changeset descriptions +# are stored as UTF-8, and UnicodeDecodeError is never raised by "git p4 clone". + +if test_have_prereq !MINGW,!CYGWIN; then + skip_all='This system is not subject to encoding failures in "git p4 clone"' + test_done +fi + +test_expect_success 'start p4d' ' + start_p4d +' + +test_expect_success 'add cp1252 description' ' + cd "$cli" && + echo file1 >file1 && + p4 add file1 && + p4 submit -d documentación +' + +test_expect_success 'clone fails with git-p4.fallbackEncoding unset' ' + test_might_fail git config --global --unset git-p4.fallbackEncoding && + test_when_finished cleanup_git && + ( + test_must_fail git p4 clone --dest="$git" //depot@all 2>error && + grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error + ) +' +test_expect_success 'clone fails with git-p4.fallbackEncoding set to "none"' ' + git config --global git-p4.fallbackEncoding none && + test_when_finished cleanup_git && + ( + test_must_fail git p4 clone --dest="$git" //depot@all 2>error && + grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error + ) +' + +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' ' + git config --global git-p4.fallbackEncoding cp1252 && + test_when_finished cleanup_git && + ( + git p4 clone --dest="$git" //depot@all && + cd "$git" && + git log --oneline >log && + desc=$(head -1 log | cut -d" " -f2) && + test "$desc" = "documentación" + ) +' + +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "replace"' ' + git config --global git-p4.fallbackEncoding replace && + test_when_finished cleanup_git && + ( + git p4 clone --dest="$git" //depot@all && + cd "$git" && + git log --oneline >log && + desc=$(head -1 log | cut -d" " -f2) && + test "$desc" = "documentaci�n" + ) +' + +test_done -- 2.31.1 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions 2021-04-23 6:36 ` [PATCH] add git-p4.fallbackEncoding config variable, " Tzadik Vanderhoof @ 2021-04-23 6:44 ` Tzadik Vanderhoof 2021-04-23 19:08 ` Tzadik Vanderhoof 0 siblings, 1 reply; 24+ messages in thread From: Tzadik Vanderhoof @ 2021-04-23 6:44 UTC (permalink / raw) To: Eric Sunshine, Torsten Bögershausen, Git List (last patch should be labeled as v4... sorry) ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions 2021-04-23 6:44 ` Tzadik Vanderhoof @ 2021-04-23 19:08 ` Tzadik Vanderhoof 2021-04-24 8:14 ` Torsten Bögershausen 0 siblings, 1 reply; 24+ messages in thread From: Tzadik Vanderhoof @ 2021-04-23 19:08 UTC (permalink / raw) To: Eric Sunshine, Torsten Bögershausen, Git List To clarify.... The new config variable I am introducing addresses an issue that only occurs on Windows. This is because the behavior of the "p4" command differs on Windows vs Linux around Unicode in changeset descriptions. I don't have the source code for "p4", but I'm guessing it's written in C, and that this difference in behavior is simply a result of the fact that there is no defined standard of how "char *argv[]" in "main" should deal with non-ASCII characters being passed in from the command line. As a result, "git p4 clone" on Linux is not affected by this "p4" behavior. Since my tests assume the Windows behavior, they fail when run on Linux. For this reason, I added code to my tests to skip them on Linux. On a related note, I don't think there are any CI environments on github for git that are (a) on Windows, and (b) have Python and (c) have Perforce, so I don't think my tests are actually running on github CI. I'm not sure how that can be addressed. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions 2021-04-23 19:08 ` Tzadik Vanderhoof @ 2021-04-24 8:14 ` Torsten Bögershausen 2021-04-27 5:39 ` [PATCH v5] " Tzadik Vanderhoof 0 siblings, 1 reply; 24+ messages in thread From: Torsten Bögershausen @ 2021-04-24 8:14 UTC (permalink / raw) To: Tzadik Vanderhoof Cc: Eric Sunshine, Git List, Johannes Schindelin, Luke Diamand, Pete Wyckoff (Adding some of the p4 and Windows experts in cc) On Fri, Apr 23, 2021 at 12:08:17PM -0700, Tzadik Vanderhoof wrote: > To clarify.... Good. This is good information, and the important stuff could go into the commit message. And because the commit as such should be self-contained (as much as possible). Giving an overview about the problem. > > The new config variable I am introducing addresses an issue that only > occurs on Windows. This is because the behavior of the "p4" command > differs on Windows vs Linux around Unicode in changeset descriptions. What does Windows mean in this context ? Is p4 a "console application" ? In this case it may be possible to use CHCP to change to a different code page ? > > I don't have the source code for "p4", but I'm guessing it's written > in C, and that this difference in behavior is simply a result of the > fact that there is no defined standard of how "char *argv[]" in "main" > should deal with non-ASCII characters being passed in from the command > line. > > As a result, "git p4 clone" on Linux is not affected by this "p4" > behavior. Is it ? What happens if yoy have a p4 depot that was feed from Windows in CP-1252 and is now accessed from a Linux machine ? Doesthe Linux box face the same problems ? > Since my tests assume the Windows behavior, they fail when > run on Linux. For this reason, I added code to my tests to skip them > on Linux. That makes sense, but what is the "Windows behavior" more in detail ? My understanding is that when you press e.g. the key 'Ä' on the keybaord, it will give a different byte sequence once that 'Ä' is transferred across the wire (to the p4 server). This is dependent on what Linux calls a locale, and all major Linux installations use UTF-8 these days by default. But that was not always the case, since in old days they used ISO-8851-1 or something else, usable for your contry/region. So most Windows "console applications" are not run under UTF-8, but it may be possible that "CHCP 65000" (or so) works - more testing needed. > > On a related note, I don't think there are any CI environments on > github for git that are (a) on Windows, and (b) have Python and (c) > have Perforce, so I don't think my tests are actually running on > github CI. I'm not sure how that can be addressed. That are 3 different questions - (a) Yes, git is compiled under Windows, both gcc and MSVC (correct me if that is wrong) (b) Yes, we have python on the different CI. Github actions has python, I use it every day. (c) There are tests run for p4, but it seems if they are only run under Linux. It would be nice if your test can pass under Linux, why are they failing ? If I dig here: <https://github.com/git/git/runs/2420889332?check_suite_focus=true> We can see that the t98 test are run, and are passing. Just to pick one: [15:28:22] t9804-git-p4-label.sh .............................. ok 3969 Thanks for working on this. It would be good to have a v5 version of the patch with some more informations, like above, and may be :how is the p4 server configured ? (Unicode or not ?) ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH v5] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions 2021-04-24 8:14 ` Torsten Bögershausen @ 2021-04-27 5:39 ` Tzadik Vanderhoof 2021-04-27 5:45 ` Tzadik Vanderhoof 2021-04-28 4:39 ` Junio C Hamano 0 siblings, 2 replies; 24+ messages in thread From: Tzadik Vanderhoof @ 2021-04-27 5:39 UTC (permalink / raw) To: Eric Sunshine, Git List, Johannes Schindelin, Luke Diamand, Pete Wyckoff Cc: Tzadik Vanderhoof When git-p4 reads the output from a p4 command, it assumes it will be 100% UTF-8. If even one character in the output of one p4 command is not UTF-8, git-p4 crashes with: File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList value = value.decode() UnicodeDecodeError: 'utf-8' codec can't decode byte Ox93 in position 42: invalid start byte This is especially a problem for the "git p4 clone ... @all" command, where git-p4 needs to read thousands of changeset descriptions, one of which may have a stray smart quote, causing the whole clone operation to fail. Add a new config setting, allowing git-p4 to try a fallback encoding (for example, "cp1252") and/or use the Unicode replacement character, to prevent the whole program from crashing on such a minor problem. Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> --- Documentation/git-p4.txt | 9 +++ git-p4.py | 11 ++- t/t9835-git-p4-config-fallback-encoding.sh | 87 ++++++++++++++++++++++ 3 files changed, 106 insertions(+), 1 deletion(-) create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt index f89e68b424..86d3ffa644 100644 --- a/Documentation/git-p4.txt +++ b/Documentation/git-p4.txt @@ -638,6 +638,15 @@ git-p4.pathEncoding:: to transcode the paths to UTF-8. As an example, Perforce on Windows often uses "cp1252" to encode path names. +git-p4.fallbackEncoding:: + Perforce changeset descriptions can be stored in any encoding. + Git-p4 first tries to interpret each description as UTF-8. If that + fails, this config allows another encoding to be tried. You can specify, + for example, "cp1252". If git-p4.fallbackEncoding is "replace", UTF-8 will + be used, with invalid UTF-8 characters replaced by the Unicode replacement + character. The default is "none": there is no fallback, and any non UTF-8 + character will cause git-p4 to immediately fail. + git-p4.largeFileSystem:: Specify the system that is used for large (binary) files. Please note that large file systems do not support the 'git p4 submit' command. diff --git a/git-p4.py b/git-p4.py index 09c9e93ac4..202fb01bdf 100755 --- a/git-p4.py +++ b/git-p4.py @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False, for key, value in entry.items(): key = key.decode() if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')): - value = value.decode() + try: + value = value.decode() + except UnicodeDecodeError: + fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none' + if fallbackEncoding == 'none': + raise Exception("UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding") + elif fallbackEncoding == 'replace': + value = value.decode(errors='replace') + else: + value = value.decode(encoding=fallbackEncoding) decoded_entry[key] = value # Parse out data if it's an error response if decoded_entry.get('code') == 'error' and 'data' in decoded_entry: diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh new file mode 100755 index 0000000000..383507803e --- /dev/null +++ b/t/t9835-git-p4-config-fallback-encoding.sh @@ -0,0 +1,87 @@ +#!/bin/sh + +test_description='test git-p4.fallbackEncoding config' + +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME + +. ./lib-git-p4.sh + +test_expect_success 'start p4d' ' + start_p4d +' + +test_expect_success 'add Unicode description' ' + cd "$cli" && + echo file1 >file1 && + p4 add file1 && + p4 submit -d documentación +' + +# Unicode descriptions cause clone to throw in some environments. This test +# determines if that is the case in our environment. If so we create a file called "clone_fails". +# We check that file to in subsequent tests to determine what behavior to expect. + +clone_fails="$TRASH_DIRECTORY/clone_fails" + +test_expect_success 'clone with git-p4.fallbackEncoding unset' ' + test_might_fail git config --global --unset git-p4.fallbackEncoding && + test_when_finished cleanup_git && { + git p4 clone --dest="$git" //depot@all 2>error || ( + cp /dev/null "$clone_fails" && + grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error + ) + } +' + +test_expect_success 'clone with git-p4.fallbackEncoding set to "none"' ' + git config --global git-p4.fallbackEncoding none && + test_when_finished cleanup_git && { + ( + test -f "$clone_fails" && + test_must_fail git p4 clone --dest="$git" //depot@all 2>error && + grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error + ) || + ( + ! test -f "$clone_fails" && + git p4 clone --dest="$git" //depot@all 2>error + ) + } +' + +test_expect_success 'clone with git-p4.fallbackEncoding set to "cp1252"' ' + git config --global git-p4.fallbackEncoding cp1252 && + test_when_finished cleanup_git && + ( + git p4 clone --dest="$git" //depot@all && + cd "$git" && + git log --oneline >log && + desc=$(head -1 log | cut -d" " -f2) && + test "$desc" = "documentación" + ) +' + +test_expect_success 'clone with git-p4.fallbackEncoding set to "replace"' ' + git config --global git-p4.fallbackEncoding replace && + test_when_finished cleanup_git && + ( + git p4 clone --dest="$git" //depot@all && + cd "$git" && + git log --oneline >log && + desc=$(head -1 log | cut -d" " -f2) && + { + (test -f "$clone_fails" && + test "$desc" = "documentaci�n" + ) || + (! test -f "$clone_fails" && + test "$desc" = "documentación" + ) + } + ) +' + +test_expect_success 'unset git-p4.fallbackEncoding' ' + git config --global --unset git-p4.fallbackEncoding +' + +test_done -- 2.31.1 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH v5] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions 2021-04-27 5:39 ` [PATCH v5] " Tzadik Vanderhoof @ 2021-04-27 5:45 ` Tzadik Vanderhoof 2021-04-28 4:39 ` Junio C Hamano 1 sibling, 0 replies; 24+ messages in thread From: Tzadik Vanderhoof @ 2021-04-27 5:45 UTC (permalink / raw) To: Eric Sunshine, Git List, Johannes Schindelin, Luke Diamand, Pete Wyckoff I modified the test to work on both Linux and Windows. See the comments in the test. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v5] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions 2021-04-27 5:39 ` [PATCH v5] " Tzadik Vanderhoof 2021-04-27 5:45 ` Tzadik Vanderhoof @ 2021-04-28 4:39 ` Junio C Hamano 2021-04-28 14:58 ` Torsten Bögershausen 1 sibling, 1 reply; 24+ messages in thread From: Junio C Hamano @ 2021-04-28 4:39 UTC (permalink / raw) To: Tzadik Vanderhoof Cc: Eric Sunshine, Git List, Johannes Schindelin, Luke Diamand, Pete Wyckoff Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes: > t/t9835-git-p4-config-fallback-encoding.sh | 87 ++++++++++++++++++++++ > 3 files changed, 106 insertions(+), 1 deletion(-) > create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh 9835 is already taken (see 'seen'). ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v5] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions 2021-04-28 4:39 ` Junio C Hamano @ 2021-04-28 14:58 ` Torsten Bögershausen 2021-04-29 7:39 ` [PATCH v6] Add git-p4.fallbackEncoding Tzadik Vanderhoof [not found] ` <20210429074458.891-1-tzadik.vanderhoof@gmail.com> 0 siblings, 2 replies; 24+ messages in thread From: Torsten Bögershausen @ 2021-04-28 14:58 UTC (permalink / raw) To: Junio C Hamano Cc: Tzadik Vanderhoof, Eric Sunshine, Git List, Johannes Schindelin, Luke Diamand, Pete Wyckoff On Wed, Apr 28, 2021 at 01:39:38PM +0900, Junio C Hamano wrote: > Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes: > > > t/t9835-git-p4-config-fallback-encoding.sh | 87 ++++++++++++++++++++++ > > 3 files changed, 106 insertions(+), 1 deletion(-) > > create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh > > 9835 is already taken (see 'seen'). In general, this looks good to me. There are two minor nitpicks to make the patch more the git-way: > Subject: [PATCH v5] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptionsw The head line is somewhat too long. It should be much shorter, like 50-55 characters, if I recall it rigth. The first line of the commit message is what we see under PATCH in the email, followed by a blank line (that's what we have) and a detailed description (Which we have) How abut this ? git-p4: Add git-p4.fallbackEncoding Add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions. When git-p4 reads the output from a p4 command, it assumes it will be 100% UTF-8. If even one character in the output of one p4 command is not UTF-8, git-p4 crashes with: File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList value = value.decode() UnicodeDecodeError: 'utf-8' codec can't decode byte Ox93 in position 42: invalid start byte This is especially a problem for the "git p4 clone ... @all" command, where git-p4 needs to read thousands of changeset descriptions, one of which may have a stray smart quote, causing the whole clone operation to fail. Add a new config setting, allowing git-p4 to try a fallback encoding (for example, "cp1252") and/or use the Unicode replacement character, to prevent the whole program from crashing on such a minor problem. [] And then, somewhere in the test: cp /dev/null "$clone_fails" && This should create an empty file, right ? Then we can use a simple output-redirection: >"$clone_fails" && ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH v6] Add git-p4.fallbackEncoding 2021-04-28 14:58 ` Torsten Bögershausen @ 2021-04-29 7:39 ` Tzadik Vanderhoof 2021-04-29 8:36 ` Luke Diamand [not found] ` <20210429074458.891-1-tzadik.vanderhoof@gmail.com> 1 sibling, 1 reply; 24+ messages in thread From: Tzadik Vanderhoof @ 2021-04-29 7:39 UTC (permalink / raw) To: Junio C Hamano, Eric Sunshine, Git List, Johannes Schindelin, Luke Diamand, Pete Wyckoff Cc: Tzadik Vanderhoof Add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions. When git-p4 reads the output from a p4 command, it assumes it will be 100% UTF-8. If even one character in the output of one p4 command is not UTF-8, git-p4 crashes with: File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList value = value.decode() UnicodeDecodeError: 'utf-8' codec can't decode byte Ox93 in position 42: invalid start byte This is especially a problem for the "git p4 clone ... @all" command, where git-p4 needs to read thousands of changeset descriptions, one of which may have a stray smart quote, causing the whole clone operation to fail. Add a new config setting, allowing git-p4 to try a fallback encoding (for example, "cp1252") and/or use the Unicode replacement character, to prevent the whole program from crashing on such a minor problem. Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> --- Documentation/git-p4.txt | 9 ++ git-p4.py | 11 ++- t/t9836-git-p4-config-fallback-encoding.sh | 98 ++++++++++++++++++++++ 3 files changed, 117 insertions(+), 1 deletion(-) create mode 100755 t/t9836-git-p4-config-fallback-encoding.sh diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt index f89e68b424..86d3ffa644 100644 --- a/Documentation/git-p4.txt +++ b/Documentation/git-p4.txt @@ -638,6 +638,15 @@ git-p4.pathEncoding:: to transcode the paths to UTF-8. As an example, Perforce on Windows often uses "cp1252" to encode path names. +git-p4.fallbackEncoding:: + Perforce changeset descriptions can be stored in any encoding. + Git-p4 first tries to interpret each description as UTF-8. If that + fails, this config allows another encoding to be tried. You can specify, + for example, "cp1252". If git-p4.fallbackEncoding is "replace", UTF-8 will + be used, with invalid UTF-8 characters replaced by the Unicode replacement + character. The default is "none": there is no fallback, and any non UTF-8 + character will cause git-p4 to immediately fail. + git-p4.largeFileSystem:: Specify the system that is used for large (binary) files. Please note that large file systems do not support the 'git p4 submit' command. diff --git a/git-p4.py b/git-p4.py index 09c9e93ac4..202fb01bdf 100755 --- a/git-p4.py +++ b/git-p4.py @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False, for key, value in entry.items(): key = key.decode() if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')): - value = value.decode() + try: + value = value.decode() + except UnicodeDecodeError: + fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none' + if fallbackEncoding == 'none': + raise Exception("UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding") + elif fallbackEncoding == 'replace': + value = value.decode(errors='replace') + else: + value = value.decode(encoding=fallbackEncoding) decoded_entry[key] = value # Parse out data if it's an error response if decoded_entry.get('code') == 'error' and 'data' in decoded_entry: diff --git a/t/t9836-git-p4-config-fallback-encoding.sh b/t/t9836-git-p4-config-fallback-encoding.sh new file mode 100755 index 0000000000..901bb3759d --- /dev/null +++ b/t/t9836-git-p4-config-fallback-encoding.sh @@ -0,0 +1,98 @@ +#!/bin/sh + +test_description='test git-p4.fallbackEncoding config' + +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME + +. ./lib-git-p4.sh + +test_expect_success 'start p4d' ' + start_p4d +' + +test_expect_success 'add Unicode description' ' + cd "$cli" && + echo file1 >file1 && + p4 add file1 && + p4 submit -d documentación +' + +# Unicode descriptions cause "git p4 clone" to crash with a UnicodeDecodeError in some +# environments. This test determines if that is the case in our environment. If so, +# we create a file called "clone_fails". In subsequent tests, we check whether that +# file exists to determine what behavior to expect. + +clone_fails="$TRASH_DIRECTORY/clone_fails" + +# If clone fails with git-p4.fallbackEncoding set to "none", create the "clone_fails" file, +# and make sure the error message is correct + +test_expect_success 'clone with git-p4.fallbackEncoding set to "none"' ' + git config --global git-p4.fallbackEncoding none && + test_when_finished cleanup_git && { + git p4 clone --dest="$git" //depot@all 2>error || ( + >"$clone_fails" && + grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error + ) + } +' + +# If clone fails with git-p4.fallbackEncoding set to "none", it should also fail when it's unset, +# also with the correct error message. Otherwise the clone should succeed. + +test_expect_success 'clone with git-p4.fallbackEncoding unset' ' + git config --global --unset git-p4.fallbackEncoding && + test_when_finished cleanup_git && { + ( + test -f "$clone_fails" && + test_must_fail git p4 clone --dest="$git" //depot@all 2>error && + grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error + ) || + ( + ! test -f "$clone_fails" && + git p4 clone --dest="$git" //depot@all 2>error + ) + } +' + +# Whether or not "clone_fails" exists, setting git-p4.fallbackEncoding +# to "cp1252" should cause clone to succeed and get the right description + +test_expect_success 'clone with git-p4.fallbackEncoding set to "cp1252"' ' + git config --global git-p4.fallbackEncoding cp1252 && + test_when_finished cleanup_git && + ( + git p4 clone --dest="$git" //depot@all && + cd "$git" && + git log --oneline >log && + desc=$(head -1 log | cut -d" " -f2) && + test "$desc" = "documentación" + ) +' + +# Setting git-p4.fallbackEncoding to "replace" should always cause clone to succeed. +# If "clone_fails" exists, the description should contain the Unicode replacement +# character, otherwise the description should be correct (since we're on a system that +# doesn't have the Unicode issue) + +test_expect_success 'clone with git-p4.fallbackEncoding set to "replace"' ' + git config --global git-p4.fallbackEncoding replace && + test_when_finished cleanup_git && + ( + git p4 clone --dest="$git" //depot@all && + cd "$git" && + git log --oneline >log && + desc=$(head -1 log | cut -d" " -f2) && + { + (test -f "$clone_fails" && + test "$desc" = "documentaci�n" + ) || + (! test -f "$clone_fails" && + test "$desc" = "documentación" + ) + } + ) +' + +test_done -- 2.31.1 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH v6] Add git-p4.fallbackEncoding 2021-04-29 7:39 ` [PATCH v6] Add git-p4.fallbackEncoding Tzadik Vanderhoof @ 2021-04-29 8:36 ` Luke Diamand 2021-04-29 17:29 ` Tzadik Vanderhoof 0 siblings, 1 reply; 24+ messages in thread From: Luke Diamand @ 2021-04-29 8:36 UTC (permalink / raw) To: Tzadik Vanderhoof, Junio C Hamano, Eric Sunshine, Git List, Johannes Schindelin, Pete Wyckoff, Anders Bakken Cc: Andrew Oakley On 29/04/2021 07:39, Tzadik Vanderhoof wrote: > Add git-p4.fallbackEncoding config variable, to prevent git-p4 from > crashing on non UTF-8 changeset descriptions. > > When git-p4 reads the output from a p4 command, it assumes it will > be 100% UTF-8. If even one character in the output of one p4 command is > not UTF-8, git-p4 crashes with: > > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList > value = value.decode() UnicodeDecodeError: 'utf-8' codec can't > decode byte Ox93 in position 42: invalid start byte > > This is especially a problem for the "git p4 clone ... @all" command, > where git-p4 needs to read thousands of changeset descriptions, one of > which may have a stray smart quote, causing the whole clone operation > to fail. > > Add a new config setting, allowing git-p4 to try a fallback encoding > (for example, "cp1252") and/or use the Unicode replacement character, > to prevent the whole program from crashing on such a minor problem. I think Andrew Oakley pointed this out earlier - but in the days of Python2 this was (I think) never a problem. Python2 just took in the binary data, in whatever encoding, and passed it untouched on to git, which in turn just stored it. https://lore.kernel.org/git/20210412085251.51475-1-andrew@adoakley.name/ It was only whatever was trying to render the bytestream that needed to worry about the encoding. Now we're making the decision in git-p4 when we ingest it - did we consider just passing it along untouched? The problem at hand is that git-p4 is trying to store it internally as a `string' which now is unicode-aware, when perhaps it should not be. It's going to get very confusing if anyone ingests something from Perforce having set the encoding to one thing, and it turns out to be a different encoding, or worse, multiple encodings for the same repo. I also worry that if someone has connected to a Unicode-aware Perforce server, and then unwittingly set P4CHARSET, then are we going to end up silently scrambling everything? I'm not sure, encodings make my head hurt. Luke > > Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> > --- > Documentation/git-p4.txt | 9 ++ > git-p4.py | 11 ++- > t/t9836-git-p4-config-fallback-encoding.sh | 98 ++++++++++++++++++++++ > 3 files changed, 117 insertions(+), 1 deletion(-) > create mode 100755 t/t9836-git-p4-config-fallback-encoding.sh > > diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt > index f89e68b424..86d3ffa644 100644 > --- a/Documentation/git-p4.txt > +++ b/Documentation/git-p4.txt > @@ -638,6 +638,15 @@ git-p4.pathEncoding:: > to transcode the paths to UTF-8. As an example, Perforce on Windows > often uses "cp1252" to encode path names. > > +git-p4.fallbackEncoding:: > + Perforce changeset descriptions can be stored in any encoding. > + Git-p4 first tries to interpret each description as UTF-8. If that > + fails, this config allows another encoding to be tried. You can specify, > + for example, "cp1252". If git-p4.fallbackEncoding is "replace", UTF-8 will > + be used, with invalid UTF-8 characters replaced by the Unicode replacement > + character. The default is "none": there is no fallback, and any non UTF-8 > + character will cause git-p4 to immediately fail. > + > git-p4.largeFileSystem:: > Specify the system that is used for large (binary) files. Please note > that large file systems do not support the 'git p4 submit' command. > diff --git a/git-p4.py b/git-p4.py > index 09c9e93ac4..202fb01bdf 100755 > --- a/git-p4.py > +++ b/git-p4.py > @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False, > for key, value in entry.items(): > key = key.decode() > if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')): > - value = value.decode() > + try: > + value = value.decode() > + except UnicodeDecodeError: > + fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none' > + if fallbackEncoding == 'none': > + raise Exception("UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding") > + elif fallbackEncoding == 'replace': > + value = value.decode(errors='replace') > + else: > + value = value.decode(encoding=fallbackEncoding) > decoded_entry[key] = value > # Parse out data if it's an error response > if decoded_entry.get('code') == 'error' and 'data' in decoded_entry: > diff --git a/t/t9836-git-p4-config-fallback-encoding.sh b/t/t9836-git-p4-config-fallback-encoding.sh > new file mode 100755 > index 0000000000..901bb3759d > --- /dev/null > +++ b/t/t9836-git-p4-config-fallback-encoding.sh > @@ -0,0 +1,98 @@ > +#!/bin/sh > + > +test_description='test git-p4.fallbackEncoding config' > + > +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main > +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME > + > +. ./lib-git-p4.sh > + > +test_expect_success 'start p4d' ' > + start_p4d > +' > + > +test_expect_success 'add Unicode description' ' > + cd "$cli" && > + echo file1 >file1 && > + p4 add file1 && > + p4 submit -d documentación > +' > + > +# Unicode descriptions cause "git p4 clone" to crash with a UnicodeDecodeError in some > +# environments. This test determines if that is the case in our environment. If so, > +# we create a file called "clone_fails". In subsequent tests, we check whether that > +# file exists to determine what behavior to expect. > + > +clone_fails="$TRASH_DIRECTORY/clone_fails" > + > +# If clone fails with git-p4.fallbackEncoding set to "none", create the "clone_fails" file, > +# and make sure the error message is correct > + > +test_expect_success 'clone with git-p4.fallbackEncoding set to "none"' ' > + git config --global git-p4.fallbackEncoding none && > + test_when_finished cleanup_git && { > + git p4 clone --dest="$git" //depot@all 2>error || ( > + >"$clone_fails" && > + grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error > + ) > + } > +' > + > +# If clone fails with git-p4.fallbackEncoding set to "none", it should also fail when it's unset, > +# also with the correct error message. Otherwise the clone should succeed. > + > +test_expect_success 'clone with git-p4.fallbackEncoding unset' ' > + git config --global --unset git-p4.fallbackEncoding && > + test_when_finished cleanup_git && { > + ( > + test -f "$clone_fails" && > + test_must_fail git p4 clone --dest="$git" //depot@all 2>error && > + grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error > + ) || > + ( > + ! test -f "$clone_fails" && > + git p4 clone --dest="$git" //depot@all 2>error > + ) > + } > +' > + > +# Whether or not "clone_fails" exists, setting git-p4.fallbackEncoding > +# to "cp1252" should cause clone to succeed and get the right description > + > +test_expect_success 'clone with git-p4.fallbackEncoding set to "cp1252"' ' > + git config --global git-p4.fallbackEncoding cp1252 && > + test_when_finished cleanup_git && > + ( > + git p4 clone --dest="$git" //depot@all && > + cd "$git" && > + git log --oneline >log && > + desc=$(head -1 log | cut -d" " -f2) && > + test "$desc" = "documentación" > + ) > +' > + > +# Setting git-p4.fallbackEncoding to "replace" should always cause clone to succeed. > +# If "clone_fails" exists, the description should contain the Unicode replacement > +# character, otherwise the description should be correct (since we're on a system that > +# doesn't have the Unicode issue) > + > +test_expect_success 'clone with git-p4.fallbackEncoding set to "replace"' ' > + git config --global git-p4.fallbackEncoding replace && > + test_when_finished cleanup_git && > + ( > + git p4 clone --dest="$git" //depot@all && > + cd "$git" && > + git log --oneline >log && > + desc=$(head -1 log | cut -d" " -f2) && > + { > + (test -f "$clone_fails" && > + test "$desc" = "documentaci�n" > + ) || > + (! test -f "$clone_fails" && > + test "$desc" = "documentación" > + ) > + } > + ) > +' > + > +test_done > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v6] Add git-p4.fallbackEncoding 2021-04-29 8:36 ` Luke Diamand @ 2021-04-29 17:29 ` Tzadik Vanderhoof 0 siblings, 0 replies; 24+ messages in thread From: Tzadik Vanderhoof @ 2021-04-29 17:29 UTC (permalink / raw) To: Luke Diamand Cc: Junio C Hamano, Eric Sunshine, Git List, Johannes Schindelin, Pete Wyckoff, Anders Bakken, Andrew Oakley On Thu, Apr 29, 2021 at 1:36 AM Luke Diamand <luke@diamand.org> wrote: > > I think Andrew Oakley pointed this out earlier - but in the days of > Python2 this was (I think) never a problem. Python2 just took in the > binary data, in whatever encoding, and passed it untouched on to git, > which in turn just stored it. > Unfortunately, I just became aware yesterday that Andrew Oakley was also working on this issue (his CC to me 2 weeks ago somehow ended up in my Spam folder, and I only dug it out of there after finding out that we both created a test with the same number). When I first became aware of Andrew's work (yesterday), I thought it would make mine unnecessary, but upon further investigation, I don't think Andrew's work will solve this problem. Please see my reply yesterday to Andew's thread: https://lore.kernel.org/git/CAKu1iLXRrsB4mRsDfhBH5aahWzDjpfqLuWP9t47RMB=RdpL1iA@mail.gmail.com ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <20210429074458.891-1-tzadik.vanderhoof@gmail.com>]
[parent not found: <c4c48615-d1f4-fd37-0960-979535907f15@web.de>]
* Re: [PATCH v6] Add git-p4.fallbackEncoding [not found] ` <c4c48615-d1f4-fd37-0960-979535907f15@web.de> @ 2021-04-29 17:14 ` Tzadik Vanderhoof 0 siblings, 0 replies; 24+ messages in thread From: Tzadik Vanderhoof @ 2021-04-29 17:14 UTC (permalink / raw) To: Torsten Bögershausen, Git List On Thu, Apr 29, 2021 at 7:12 AM Torsten Bögershausen <tboegi@web.de> wrote: > > Hej Tzadik, > > This version went only to my email ? > v6 went to the list as well as your email. I just forgot to include you in the email to the list, so I sent you another copy with just you. > The test case number seems to be fixed, thanks. > > (Normally we don't have this collision, but right now > it seem as if there is much going on in the git-p4 area, > which is good) > > The "headline" is still overlong, it seams. > I did shorten the first line of my commit as you asked and used that commit to create the v6 path. That first (short) line goes into the Subject line of the patch. When you do "git am" it will use the subject (which is short) as the first line of the commit. The overlong summary will become the 3rd line of the commit (after a blank second line) ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2021-04-29 17:29 UTC | newest] Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-04-08 19:28 git-p4 crashes on non UTF-8 output from p4 Tzadik Vanderhoof 2021-04-09 15:38 ` Torsten Bögershausen 2021-04-11 7:16 ` Tzadik Vanderhoof 2021-04-11 9:37 ` Torsten Bögershausen 2021-04-11 20:21 ` Tzadik Vanderhoof 2021-04-12 4:06 ` Torsten Bögershausen 2021-04-21 8:46 ` [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions Tzadik Vanderhoof 2021-04-21 8:55 ` Tzadik Vanderhoof 2021-04-22 5:05 ` [PATCH v3] add git-p4.fallbackEncoding config setting, " Tzadik Vanderhoof 2021-04-22 15:50 ` Torsten Bögershausen 2021-04-22 16:17 ` Eric Sunshine 2021-04-22 22:33 ` Eric Sunshine 2021-04-23 6:36 ` [PATCH] add git-p4.fallbackEncoding config variable, " Tzadik Vanderhoof 2021-04-23 6:44 ` Tzadik Vanderhoof 2021-04-23 19:08 ` Tzadik Vanderhoof 2021-04-24 8:14 ` Torsten Bögershausen 2021-04-27 5:39 ` [PATCH v5] " Tzadik Vanderhoof 2021-04-27 5:45 ` Tzadik Vanderhoof 2021-04-28 4:39 ` Junio C Hamano 2021-04-28 14:58 ` Torsten Bögershausen 2021-04-29 7:39 ` [PATCH v6] Add git-p4.fallbackEncoding Tzadik Vanderhoof 2021-04-29 8:36 ` Luke Diamand 2021-04-29 17:29 ` Tzadik Vanderhoof [not found] ` <20210429074458.891-1-tzadik.vanderhoof@gmail.com> [not found] ` <c4c48615-d1f4-fd37-0960-979535907f15@web.de> 2021-04-29 17:14 ` Tzadik Vanderhoof
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.