Re: [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git

From: Tao Klerks <tao@klerks.biz>
To: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Git List" <git@vger.kernel.org>,
	"Luke Diamand" <luke@diamand.org>,
	"Eric Sunshine" <sunshine@sunshineco.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Dorgon Chang" <dorgonman@hotmail.com>,
	"Joachim Kuebart" <joachim.kuebart@gmail.com>,
	"Daniel Levin" <dendy.ua@gmail.com>,
	"Johannes Schindelin" <Johannes.Schindelin@gmx.de>,
	"Ben Keene" <seraphire@gmail.com>,
	"Andrew Oakley" <andrew@adoakley.name>
Subject: Re: [PATCH] git-p4: preserve utf8 BOM when importing from p4 to git
Date: Thu, 22 Dec 2022 09:26:19 +0100	[thread overview]
Message-ID: <CAPMMpohHoPwxGTrwUi-WNvhU_+CnKP-rKPcDE-7qTUVBuRWA1g@mail.gmail.com> (raw)
In-Reply-To: <CAPMMpoiLYhjjK73Le1u71e-nzU1cf_dA=i7zYYz3t8+AujgbCA@mail.gmail.com>

On Mon, Dec 19, 2022 at 10:09 AM Tao Klerks <tao@klerks.biz> wrote:
>
> I definitely need to keep testing around this to understand what the
> right thing to do for Tzadik (and others like him of course) might be.
>
> Tzadik, could you provide any more detail about the failing situation?
> One piece of info that might be particularly helpful is *what is the
> exact/full p4 FileType of the problem file?*

Hi Tzadik, over the past few days I've been making progress towards
understanding the behavior of unicode-enabled perforce servers (and p4
client tooling against such servers):
* Perforce always stores/adds new files containing a utf8 bom as type
"utf8", and always syncs them to the workspace with a BOM, regardless
of the P4CHARSET env variable. Same goes for utf16 with bom being
stored as type "utf16".
* The only filetype that gets synced to the workspace differently
depending on P4CHARSET is the "unicode" type - a type exclusive to
*specific files* on unicode-enabled servers
* The P4CHARSET env var is used in three situations that I've
detected/understood so far:
  1. It is used when "p4 add"ing new files - if the content of the
file is valid/parseable according to the P4CHARSET, and is not a
built-in more-specific type like utf8-with-bom (perforce type "utf8")
or utf16-with-bom (perforce type "utf16"), then it gets parsed
according to that charset and stored as type "unicode". If the content
is not parseable according to that charset, nor the more specific
types, then it gets stored as "text" or "binary".
  2. It is used when "p4 sync"ing files of type "unicode" to the
workspace. All other types are synced in a "standard" way, only type
"unicode" is sensitive to the charset config during "sync".
  3. It is used when "p4 print"ing files of type "unicode", "utf8",
and "utf16" (unless P4COMMANDCHARSET overrides, or a "-o" output file
is specified). This last behavior spells trouble for git-p4...
* When a file is stored as type "unicode", the way it gets
synchronized to the workspace always depends on the P4CHARSET env
variable. When a "with-bom" charset value is configured, like
"utf8-bom" or "utf16", the resulting files on disk / in the workspace
end up indistinguishable from the corresponding perforce native
with-bom type "utf8" or "utf16".
* If an output file (-o) is specified, "p4 print" behaves like "p4
sync" - using P4CHARSET only for file type "unicode".
* If no output file (-o) is specified, the output of "p4 print"
depends on the P4CHARSET env variable or the more-specific
P4COMMANDCHARSET if specified, but with a number of provisos:
** "p4 print" will refuse to print using a "wide" charset like
"utf16"; it will fall back to utf8.
** If you configure P4CHARSET (and/or P4COMMANDCHARSET) to a charset
value with bom ("utf8-bom" specifically, I guess), you don't in fact
get the BOM in the "p4 print" output.
** If you configure P4CHARSET (and/or P4COMMANDCHARSET) to a charset
value that cannot express the specific contents of a utf8, utf16 or
unicode file, "p4 print" will fail.
* The behavior of "p4 print" matters in git-p4 because this is the p4
command used to import files; for "utf16" files a "file-targeting"
form is used (-o), and for all other file types, the regular form.
* One other "weirdness" in all this is that if you remove the BOM from
a "utf8" file and submit, this change is silently swallowed; your
workspace copy of the file remains without BOM, but if you force-sync
the file, the BOM magically reappears.

This is all further complicated by the fact that older Perforce server
(and client?) versions behaved differently! (eg see
https://stackoverflow.com/questions/38320814/bom-issue-in-unicode-perforce-server).

So far in my testing of git-p4 against a unicode server, I've
reproduced two concrete "problem workflows" specific to this server
configuration:

1: If you have P4CHARSET configured to "utf8-bom"
* "git p4 clone" imports your "unicode" files without BOM (because "p4
print" always omits the BOM)
* "p4 sync" syncs your "unicode" files *with* BOM - because that's
what P4CHARSET is telling it to do
* "git p4 submit" fails to apply (some) git changes (to "unicode"
files) because the git-workspace files do not have a BOM, and the p4
workspace files do.

2: If you have P4CHARSET configured to a non-utf8 encoding like "winansi":
* "git p4 clone" imports your "utf8" files as winansi-encoded (because
"p4 print" respects the configured P4CHARSET), and adds a utf8 bom
(because of the fix earlier this year)
* "p4 sync" syncs your "utf8" files encoded as utf8-with-BOM - because
that's the only and correct behavior
* Some of these files in the git repo may not open correctly in any
given editor, if they have a utf8 bom but are not actually
utf8-encoded (if they contain characters in the 128+ codepoint range)
* "git p4 submit" fails to apply (some) git changes (to these "utf8"
files) because these git-workspace files can have different bytes to
the utf8-encoded files in the workspace.

There seem to be two corresponding "fundamental bugs" in "git-p4 clone":
* Files of type "unicode" are imported according to the
client-configured charset (with some inconsistency around wide
charsets like utf16) - correct behavior is hard to define, as perforce
treats these files as having no "fundamental" encoding, but current
behavior is clearly buggy. It's unclear whether properly/consistently
respecting the p4-client-configured encoding makes more sense (to
support local-git-repo workflows transparently) by default, or whether
it makes more to use "utf8" by default, to have a consistent and
correct representation of the Perforce repository contents in git by
default (and then the equivalent of P4CHARSET behavior in
individual/personal git repos would need to be implemented using git
smudge and clean filters, I assume, for "git p4 submit" to work
against non-utf8-configured p4 workspaces).
* If unicode mode is enabled and P4CHARSET is not set to "utf8" or
"utf8-bom", files of perforce type "utf8" will be imported according
to the client-configured charset instead of the actual/internal (and
p4-workspace-respected) encoding - and additionally, with "my fix",
they will still get a utf8 BOM prepended, which can result in corrupt
files in the git repo.

Perplexingly, most of the issues I've identified with unicode-mode
perforce servers are not caused by my changes: the only *new* problem
this year is that, if P4CHARSET is set to something other than "utf8"
or "utf8-bom", files of p4 type "utf8" will now not only be imported
into the git repo with the *wrong* encoding (prior behavior), but will
additionally have a utf8 BOM added, resulting in *corrupted* files
(if/when the charset encoding diverges, eg if the content includes
characters in the 128+ codepoint range for P4CHARSET=winansi).

It's worth noting there's also an existing workflow failure that
applies equally to unicode-mode and regular servers, when users delete
the BOM from existing "utf8" files in their git repo and submit that
change into perforce:
* "git p4 clone" imports your "utf8" files correctly, as utf8-with-BOM
(in the case of a unicode-mode perforce server, assume "P4CHARSET" is
set to "utf8" or "utf8-bom")
* "p4 sync" syncs your "utf8" files encoded as utf8-with-BOM - because
that's the only and correct behavior
* The user accidentally or intentionally removes the BOM as part of
some edits to the file in the git workspace
* "git p4 submit" applies the change to the p4 workspace successfully,
it is committed to perforce (but perforce ignores the BOM removal bit)
* "git p4 submit" does its rebase of the git branch on top of the
now-submitted-and-reimported p4 changes, but the rebase fails because
the changes are different: in the p4 history there was no BOM-removal.

Tzadik, could you please confirm:
* Your p4 client and server versions ("p4 info")
* The perforce "types" of files you are concerned with (check with "p4
files <PATH>" or in the p4v GUI)
* Your OS
* Your P4CHARSET value, if set explicitly to something other than "auto"

I will keep exploring how this all works, and proposing fixes to
git-p4's vulnerable use of "p4 print". The correct outcome/behavior is
clear (and achievable) for files of perforce type "utf8", but I'm not
confident as to what the best default behavior will be for type
"unicode" - I imagine it will be to keep current client-specific
local-workflow-supporting behavior, fixing it to support wide
encodings and BOM-specifying encodings, and recommending that
P4CHARSET be set to "utf8" explicitly for canonical git-p4 clone
operations / for shared repos.

I have not yet looked into filepath-handling, commit-message-handling,
username-handling etc on unicode-enabled servers, but I suspect
similar issues will apply there: we'll have to figure out what the
correct behavior would be for "canonical" git repos, whether it should
be any different for "local-workflow-only" git repos, and how either
desirable behavior compares to current behavior for various values of
P4CHARSET. Everything probably already works correctly when
P4CHARSET=utf8.

For the "reappearing BOM" problem/behavior of perforce, I'm also not
sure what the right approach is, even manually. I've tried asking p4
to reevaluate the "type" of the file upon edit, but can't find any way
to get it to *auto*-detect the new file type. Given that I've never
seen this issue described before, it doesn't feel hugely important
though. I will at least add a failing test case to cover this, when I
get to it.

For now, based on the original problem statement/bug report, I cannot
understand what specific problem is affecting Tzadik, and whether or
how it was caused by the 2022 utf8 bom-preserving fix to "git-p4
clone".

Thanks,
Tao