All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
To: "Torsten Bögershausen" <tboegi@web.de>
Cc: git@vger.kernel.org
Subject: Re: git-p4 crashes on non UTF-8 output from p4
Date: Sun, 11 Apr 2021 13:21:47 -0700	[thread overview]
Message-ID: <CAKu1iLUNooP+FDMJKekH4b2Cq5BhFZwAb=d28iPv55C5+cQbCg@mail.gmail.com> (raw)
In-Reply-To: <20210411093746.ymqofe2uawclwu5i@tb-raspi4>

Thank you for your excellent and friendly feedback!

I understand everything you said, but I have a question about the unit
test you requested.  The git-p4.py script currently does not have
tests and is not written in a way that would be testable.  (The Python
function I modified calls into the shell and requires a valid Perforce
installation.)

Would you prefer I  a) refactor the code to be testable and then write
tests  or b) skip the unit testing (not sure if there are any further
options)?

(For option a) I would break out the part of the function I modified
into another function and then call my new function for my testing.
(I guess it would be better to break the refactoring and my changes
into 2 separate commits.)

On Sun, Apr 11, 2021 at 2:37 AM Torsten Bögershausen <tboegi@web.de> wrote:
>
> On Sun, Apr 11, 2021 at 12:16:25AM -0700, Tzadik Vanderhoof wrote:
> > Here is the pull request:
>
> Thanks for the work. Some comments inline.
>
> >
> > From 8d234af842223dceae76ce0affd3bbb3f17bb6d9 Mon Sep 17 00:00:00 2001
> > From: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
> > Date: Sat, 10 Apr 2021 22:41:39 -0700
>
>
> The subject should be one short line, highlighting what this is all about,
> followed by a blank line and a longer description about the problem and
> the solution. The original description was good, see below.
>
> > Subject: [PATCH] add git-p4.fallbackEncoding config variable, to prevent
> >  git-p4 from crashing on non UTF-8 changeset descriptions
>
> In that sense I make a first trial here, subject for improvements:
>
>
> Subject: [PATCH] Add git-p4.fallbackEncoding config variable
>
> When git-p4 reads the output from a p4 command, it assumes it will be
> 100% UTF-8. If even one character in the output of one p4 command is
> not UTF-8, git-p4 crashes e.g. with:
>
> File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
>     value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
> decode byte Ox93 in position 42: invalid start byte
>
> Allow to try another encoding (eg cp1252) and/or use the
> Unicode replacement character  to prevent the whole program from crashing
> on such a "minor" problem.
>
> This is especially a problem on the "git p4 clone" command with @all,
> where git-p4 needs to read thousands of changeset descriptions, one of
> which may have a stray smart quote, causing the whole clone operation
> to fail.
>
> Introduce "git-p4.fallbackEncoding" to handle non UTF-8 encodings, if needed.
>
> >
> > ---
> >  Documentation/git-p4.txt | 10 ++++++++++
> >  git-p4.py                | 11 ++++++++++-
> >  2 files changed, 20 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> > index f89e68b..71f3487 100644
> > --- a/Documentation/git-p4.txt
> > +++ b/Documentation/git-p4.txt
> > @@ -638,6 +638,16 @@ git-p4.pathEncoding::
> >   to transcode the paths to UTF-8. As an example, Perforce on Windows
> >   often uses "cp1252" to encode path names.
> >
> > +git-p4.fallbackEncoding::
> > +    Perforce changeset descriptions can be in a mixture of encodings. Git-p4
> > +    first tries to interpret each description as UTF-8. If that fails, this
> > +    config allows another encoding to be tried.  The default is "cp1252".  You
>
> I know that cp1252 is attractive to be used, especially for Windows installations that
> use Latin-based "characters".
> But: If we introduce a new config-variable into Git, the default tends to be
> "if not set to anything, behave as the old Git".
>
> > +    can set it to another encoding, for example, "iso-8859-5". If instead of
> ISO-8859-5 may be more portable on the different i18 implementations
> than the lower-case spelling.
>
> > +    an encoding, you specify "replace", UTF-8 will be used, with invalid UTF-8
> > +    characters replaced by the Unicode replacement character. If you specify
> > +    "none", there is no fallback, and any non UTF-8 character will cause
> > +    git-p4 to immediately fail.
>
> As said, before, many people may expect Git to fail, so that the default should be
> none to avoid surprises.
> When a "non-UTF-8-clean" repo is handled, they want to know it.
>
> > +
> >  git-p4.largeFileSystem::
> >   Specify the system that is used for large (binary) files. Please note
> >   that large file systems do not support the 'git p4 submit' command.
> > diff --git a/git-p4.py b/git-p4.py
> > index 09c9e93..18d02b4 100755
> > --- a/git-p4.py
> > +++ b/git-p4.py
> > @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b',
> > cb=None, skip_info=False,
> >                  for key, value in entry.items():
> >                      key = key.decode()
> >                      if isinstance(value, bytes) and not (key in
> > ('data', 'path', 'clientFile') or key.startswith('depotFile')):
> > -                        value = value.decode()
> > +                        try:
> > +                            value = value.decode()
> > +                        except:
> > +                            fallbackEncoding =
> > gitConfig("git-p4.fallbackEncoding").lower() or 'cp1252'
> > +                            if fallbackEncoding == 'none':
> > +                                raise
>
> Would it make sense to tell the user about the new config value here?
>  raise Exception("Non UTF-8 detected. See git-p4.fallbackEncoding"
> Or somewhat in that style ?
>
> > +                            elif fallbackEncoding == 'replace':
> > +                                value = value.decode(errors='replace')
> > +                            else:
> > +                                value = value.decode(encoding=fallbackEncoding)
> >                      decoded_entry[key] = value
> >                  # Parse out data if it's an error response
> >                  if decoded_entry.get('code') == 'error' and 'data' in
> > decoded_entry:
>
>
> Did I miss the Signed-off-by here?
>
> Please have a look here:
> https://git-scm.com/docs/SubmittingPatches
>
> (or look at Documentation/SubmittingPatches in your git source code)
>
> And finally: Thanks for the contribution.
> Is there any chance to add test-cases, to make sure that this feature
> is well-tested now and in the future ?
>
>
> > --
> > 2.31.1.windows.1
> >
> > On Fri, Apr 9, 2021 at 8:38 AM Torsten Bögershausen <tboegi@web.de> wrote:
> > >
> > > On Thu, Apr 08, 2021 at 12:28:25PM -0700, Tzadik Vanderhoof wrote:
> > > > When git-p4 reads the output from a p4 command, it assumes it will be
> > > > 100% UTF-8. If even one character in the output of one p4 command is
> > > > not UTF-8, git-p4 crashes with:
> > > >
> > > > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
> > > >     value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
> > > > decode byte Ox93 in position 42: invalid start byte
> > > >
> > > > I'd like to make a pull request to have it try another encoding (eg
> > > > cp1252) and/or use the Unicode replacement character, to prevent the
> > > > whole program from crashing on such a minor problem.
> > > >
> > > > This is especially a problem on the "git p4 clone" command with @all,
> > > > where git-p4 needs to read thousands of changeset descriptions, one of
> > > > which may have a stray smart quote, causing the whole clone operation
> > > > to fail.
> > > >
> > > > Sound ok?
> > >
> > > Welcome to the Git community.
> > > To start with: I am not a git-p4 expert as such, but seeing that a program is crashing
> > > is never a good thing.
> > > All efforts to prevent the crash are a step forward.
> > >
> > > As you mention cp1252 (which is more used under Windows), there are probably lots of
> > > system out there which use ISO-8859-15 (or ISO-8859-1) we may have the first whish:
> > >
> > > Make the encoding/fallback configurable.
> > > Let people choose if they want a crash (if things are broken),
> > > fallback to cp1252 or one of the other ISO-ISO-8859-x encodings.
> > >
> > > In that sense: we look forward to a pull-request.
> >
> >
> >
> > --
> > Tzadik



-- 
Tzadik

  reply	other threads:[~2021-04-11 20:22 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-08 19:28 git-p4 crashes on non UTF-8 output from p4 Tzadik Vanderhoof
2021-04-09 15:38 ` Torsten Bögershausen
2021-04-11  7:16   ` Tzadik Vanderhoof
2021-04-11  9:37     ` Torsten Bögershausen
2021-04-11 20:21       ` Tzadik Vanderhoof [this message]
2021-04-12  4:06         ` Torsten Bögershausen
2021-04-21  8:46           ` [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions Tzadik Vanderhoof
2021-04-21  8:55             ` Tzadik Vanderhoof
2021-04-22  5:05               ` [PATCH v3] add git-p4.fallbackEncoding config setting, " Tzadik Vanderhoof
2021-04-22 15:50                 ` Torsten Bögershausen
2021-04-22 16:17                   ` Eric Sunshine
2021-04-22 22:33                     ` Eric Sunshine
2021-04-23  6:36                       ` [PATCH] add git-p4.fallbackEncoding config variable, " Tzadik Vanderhoof
2021-04-23  6:44                         ` Tzadik Vanderhoof
2021-04-23 19:08                           ` Tzadik Vanderhoof
2021-04-24  8:14                             ` Torsten Bögershausen
2021-04-27  5:39                               ` [PATCH v5] " Tzadik Vanderhoof
2021-04-27  5:45                                 ` Tzadik Vanderhoof
2021-04-28  4:39                                 ` Junio C Hamano
2021-04-28 14:58                                   ` Torsten Bögershausen
2021-04-29  7:39                                     ` [PATCH v6] Add git-p4.fallbackEncoding Tzadik Vanderhoof
2021-04-29  8:36                                       ` Luke Diamand
2021-04-29 17:29                                         ` Tzadik Vanderhoof
     [not found]                                     ` <20210429074458.891-1-tzadik.vanderhoof@gmail.com>
     [not found]                                       ` <c4c48615-d1f4-fd37-0960-979535907f15@web.de>
2021-04-29 17:14                                         ` Tzadik Vanderhoof

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAKu1iLUNooP+FDMJKekH4b2Cq5BhFZwAb=d28iPv55C5+cQbCg@mail.gmail.com' \
    --to=tzadik.vanderhoof@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=tboegi@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.