Re: [PATCH 2/2] git-p4: do not decode data from perforce by default

From: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Andrew Oakley <andrew@adoakley.name>,
	Luke Diamand <luke@diamand.org>, Git List <git@vger.kernel.org>,
	Feiyang Xue <me@feiyangxue.com>
Subject: Re: [PATCH 2/2] git-p4: do not decode data from perforce by default
Date: Tue, 4 May 2021 21:02:54 -0700	[thread overview]
Message-ID: <CAKu1iLUaLuAZWqjNK4tfhhR=YaSt4MdQ+90ZY-JcEh_SeHyYCw@mail.gmail.com> (raw)
In-Reply-To: <xmqqczu656iv.fsf@gitster.g>

On Tue, May 4, 2021 at 6:11 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:
>
> > On Tue, May 4, 2021 at 2:01 PM Andrew Oakley <andrew@adoakley.name> wrote:
> >> The key thing that I'm trying to point out here is that the encoding is
> >> not necessarily consistent between different commits.  The changes that
> >> you have proposed force you to pick one encoding that will be used for
> >> every commit.  If it's wrong then data will be corrupted, and there is
> >> no option provided to avoid that.  The only way I can see to avoid this
> >> issue is to not attempt to re-encode the data - just pass it directly
> >> to git.
> >
> > No, my "fallbackEndocing" setting is just that... a fallback.  My proposal
> > *always* tries to decode in UTF-8 first!  Only if that throws an exception
> > will my "fallbackEncoding" come into play, and it only comes into play
> > for the single changeset description that was invalid UTF-8.  After that,
> > subsequent descriptions will again be tried in UTF-8 first.
>
>  The fallbackEncoding can specify only one non UTF-8
> encoding, so even if majority of commits were in UTF-8 but when you
> need to import two commits with non UTF-8 encoding, there is no
> suitable value to give to the fallbackEncoding setting.  One of
> these two commits will fail to decode first in UTF-8 and then fail
> to decode again with the fallback, and after that a corrupted
> message remains.

I'm not sure I understand your scenario.  If the majority of commits
are in UTF-8, but there are 2 with the same UTF-8 encoding (say
"cp1252"), then just set "fallbackEndocing" to "cp1252" and all
the commits will display fine.

Are you talking about a scenario where most of the commits are UTF-8,
one is "cp1252" and another one is "cp1251", so a total of 3 encodings
are used in the Perforce depot?  I don't think that is a common scenario.

But you have a point that my patch does not address that scenario.

> If we can determine in what encoding the thing that came out of
> Perforce is written in, we can put it on the encoding header of the
> resulting commit.  But if that is possible to begin with, perhaps we
> do not even need to do so---if you can determine what the original
> encoding is, you can reencode with that encoding into UTF-8 inside
> git-p4 while creating the commit, no?
>
> And if the raw data that came from Perforce cannot be reencoded to
> UTF-8 (i.e. iconv fails to process for some reason), then whether
> the translation is done at the import time (i.e. where you would
> have used the fallbackEncoding to reencode into UTF-8) or at the
> display time (i.e. "git show" would notice the encoding header and
> try to reencode the raw data from that encoding into UTF-8), it
> would fail in the same way, so I do not see much advantage in
> writing the encoding header into the resulting object (other than
> shifting the blame to downstream and keeping the original data
> intact, which is a good design principle).

I agree with the idea that if you know what the encoding is, then
why not just use that knowledge to convert that to UTF-8, rather
than use the encoding header.

I'm not sure about how complete the support in "git" is for encoding
headers. Does *everything* that reads commit messages respect the
encoding header and handle it properly?  The documentation seems to
discourage their use altogether and in any event, the main use case
seems to be a project where everyone has settled on the same legacy
encoding to use everywhere, which is not really the case for our situation.

If we want to abandon the "fallbackEncoding" direction, then the only other
option I see is:

Step 1) try to decode in UTF-8.  If that succeeds then it is almost certain
that it really was in UTF-8 (due to the design of UTF-8).  Make the commit
in UTF-8.

Step 2) If it failed to decode in UTF-8, either use heuristics to detect the
encoding, or query the OS for the current code page and assume it's that.
Then use the selected encoding to convert to UTF-8.

Only the heuristics option will satisfy the case of more than 1 encoding
(in addition to UTF-8) being present in the Perforce depot, and in any event
the current code page may not help at all.

I'm not really familiar with what encoding-detection heuristics are available
for Python and how reliable, stable or performant they are.  It would also
involve taking on another library dependency.

I will take a look at the encoding-detection options out there.