git.vger.kernel.org archive mirror
From: Tao Klerks <tao@klerks.biz>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: Tao Klerks via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org
Subject: Re: [PATCH v2] [RFC] git-p4: improve encoding handling to support inconsistent encodings
Date: Wed, 13 Apr 2022 17:18:51 +0200
Message-ID: <CAPMMpoj3xZfKnH456AbiHatbBx98yXuj=yWBA8tdHhHdqn_H3Q@mail.gmail.com>
In-Reply-To: <220413.86zgkpf5g7.gmgdl@evledraar.gmail.com>

On Wed, Apr 13, 2022 at 4:01 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Wed, Apr 13 2022, Tao Klerks via GitGitGadget wrote:
>
> > Under python2, git-p4 "naively" writes the Perforce bytestream into git
> > metadata (and does not set an "encoding" header on the commits); this
> > means that any non-utf-8 byte sequences end up creating invalidly-encoded
> > commit metadata in git.
>
> If it doesn't have an "encoding" header isn't any sequence of bytes OK
> with git, so how does it create invalid metadata in git?

Just because git lets you shove any sequence of bytes into a commit
header doesn't mean the resulting text is "valid" metadata text for
all or most purposes. The correct way to encode text in git commit
metadata is utf-8 (OR to tell any readers of this data, via the
encoding header, that it's something other than utf-8) - it's just
that git itself, the official client, is tolerant of bad byte
sequences. Other clients are less tolerant. "Sublime Merge", for
example, will fail to display the commit text at all in some contexts
if the bytes are not valid utf-8 (or noted as being something else).
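
To make that rule concrete, here is a minimal sketch of the check a
stricter client might apply. It is only an illustration: the function
name is hypothetical, it is not taken from git or git-p4, and it
assumes the value of the "encoding" header is a name Python's codecs
recognise.

```python
from typing import Optional


def is_validly_encoded(message: bytes, encoding_header: Optional[str] = None) -> bool:
    """Return True if the bytes decode under the declared encoding.

    Hypothetical helper: with no "encoding" header, the bytes are
    expected to be utf-8; otherwise they must decode as whatever the
    header names.
    """
    try:
        message.decode(encoding_header or "utf-8")
    except (UnicodeDecodeError, LookupError):
        # LookupError covers an encoding name Python does not know.
        return False
    return True


if __name__ == "__main__":
    print(is_validly_encoded("é".encode("utf-8")))             # utf-8, no header
    print(is_validly_encoded("é".encode("cp1252")))            # cp1252, no header
    print(is_validly_encoded("é".encode("cp1252"), "cp1252"))  # cp1252, declared
```

The second call is the case I'm describing: bytes that git stores
happily but that a stricter reader has no correct way to interpret.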

>
> Do you mean that something on the Python side gets confused and doesn't
> correctly encode it in that case, or that it's e.g. valid UTF-8 but
> we're lacking the metadata?

In git-p4 under python2, the bytes are simply copied from the Perforce
commit metadata into the git commit metadata verbatim; if those bytes
happen to be valid utf-8, then they will be interpreted as such in git
and everything is great. If that is *not* the case, e.g. the bytes are
actually Windows cp1252 (with bytes/characters in the x8a+ range),
then "git log", for example, will output the raw bytes, and anything
looking at those bytes (a terminal, or a process that called git) will
get those unexpected bytes and need to deal with them accordingly. A
terminal will probably display "unprintable character" glyphs, python3
will blow up by default, python2 will be perfectly happy by default,
etc.
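
A small python3 sketch of that failure mode, under stated assumptions:
the sample description and the helper names are made up for
illustration, and the bytes are built in-process rather than read from
Perforce or "git log".

```python
# Hypothetical changelist description as cp1252 bytes; "Š" is the
# single byte 0x8a in cp1252.
RAW = "Šimon fixed the build".encode("cp1252")


def decode_strict(raw: bytes) -> str:
    """Decode as utf-8 with python3's default strict error handling."""
    return raw.decode("utf-8")


def decode_lenient(raw: bytes) -> str:
    """Decode as utf-8, substituting U+FFFD for undecodable bytes."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    try:
        decode_strict(RAW)
    except UnicodeDecodeError as exc:
        print("strict decode failed:", exc)

    # Roughly what a tolerant reader ends up showing.
    print(decode_lenient(RAW))

    # Only a reader that knows the real encoding recovers the text.
    print(RAW.decode("cp1252"))
```

The lenient variant corresponds to the "unprintable character" glyph
case; the strict one is python3's default behaviour.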

I summarize this "non-utf-8 bytes in a git commit message without a
qualifying 'encoding' header" situation as "invalidly-encoded commit
metadata in git", due to the impact on downstream consumers of git
metadata. Is there a better characterization?

Thanks,
Tao
