git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: "brian m. carlson" <sandals@crustytoothpaste.net>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 1/1] mailmap: support hashed entries in mailmaps
Date: Mon, 14 Dec 2020 12:54:13 +0100	[thread overview]
Message-ID: <87eejswql6.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <20201213010539.544101-2-sandals@crustytoothpaste.net>


On Sun, Dec 13 2020, brian m. carlson wrote:

> Many people, through the course of their lives, will change either a
> name or an email address.  For this reason, we have the mailmap, to map
> from a user's former name or email address to their current, canonical
> forms.  Normally, this works well as it is.
>
> However, sometimes people change a name or an email address and wish to
> wholly disassociate themselves from that former name or email address.
> For example, a person may have left a company which engaged in a deeply
> unethical act with which the person does not want to be associated, or
> they may have changed their name to disassociate themselves from an
> abusive family or partner.  In such a case, using the former name or
> address in any way may be undesirable and the person may wish to replace
> it as completely as possible.
>
> [...]
>
> Note that it is, of course, possible to perform a lookup on all commit
> objects to determine the actual entry which matches the hashed form of
> the data.

The commit message & cover letter are subtly different in a way that I
didn't even notice at first glance. E.g. I assume based on the cover
letter that one part of this this is a proposed solution do the whole
"deadname" problem. It would be nice if v2 were more explicit and
attempted to explicitly summarize the use-cases in the commit message.

But for now I'll attempt to read between the lines from having read
both.

I don't understand why either the problem of "I don't want to see my old
name again" or "I want to hide from other abusive people" (as an aside:
but not so much that you'd still take the risk of submitting a patch to
.mailmap?) require a hashing solution, as opposed to just some encoding
in the .mailmap file such as base64.

You can still trivially get the same information in the end, on git.git
running --pretty=format:"%aN %aE %an %ae" takes under a second. A part
of your commit message seems to address this:

> However, a project for which this feature is valuable may
> simply insert entries for many contributors in order to make discovery
> of "interesting" entries significantly less convenient.

But I don't get how that's helped at all by a sha256 hash. Since you can
trivially re-expand these again using log/check-mailmap the hashing
offers no extra protection beyond a trivial layer of obscurity in those
cases. You'd get the same safety in numbers by having everything a large
un-hashed .mailmap file, would you not?

I think the underlying use-case is legitimate, but I read it as
primarily a social signaling feature by a trivial addition of
obscurity. Someone called X would like not to be called Y anymore, or
not be found in a search engine or "git grep" when searching for "Y".

So I'd think purely from the perspective of the feature's appearance to
users matching its underlying security we'd be better served with
support for encoding of some sort. E.g. URL encoding, Base64, or even
just string_reverse() (ROT13 is out as not working for non-ASCII names).

The encoding versions of this have the added bonus of expanding the
use-case beyond what you're suggesting. If you're trying to map e.g. a
non-UTF-8 E-Mail address (in your project due to some encoding error)
you'd be able to put it into .mailmap without making the project
maintainers deal with invalid non-UTF-8 encoding in the file (the
existing support is sufficient to map names in most such cases).

Another reason I'd prefer some encoding solution is because .mailmap
isn't just used by git itself. Since the format got added it's become
how a lot of downstream systems do this mapping. E.g. I worked once on a
change management system that mapped lots of user actions across
different systems, and piggy-backed on .mailmap files in git to resolve
E-Mail addresses even in cases where the originating data wasn't within
git.

Now because of the trivialness of the format it's easy to e.g. import it
into a DB table and do a JOIN against it (or the same after converting
it from some trivial encoding). Use-cases like that would become a full
history walk for each project to extract the real E-Mails (or a re
implementation of the SHA256 trick in some sub-SELECT in the database).

Those are all solvable problems that are rather trivial in the end. I
just wonder if we're not making things needlessly hard to achieve the
stated aims. And to be fair, most of those aims I inferred (and might
have incorrectly inferred), since as noted above the patch itself
doesn't discuss the tradeoffs of potential alternate solutions).

> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>  mailmap.c          | 39 +++++++++++++++++++++++++++++++++++++--
>  t/t4203-mailmap.sh | 35 +++++++++++++++++++++++++++++++++++
>  2 files changed, 72 insertions(+), 2 deletions(-)
>
> [...]
>
>  int map_user(struct string_list *map,
>  	     const char **email, size_t *emaillen,
>  	     const char **name, size_t *namelen)
> @@ -324,7 +359,7 @@ int map_user(struct string_list *map,
>  		 (int)*namelen, debug_str(*name),
>  		 (int)*emaillen, debug_str(*email));
>  
> -	item = lookup_prefix(map, *email, *emaillen);
> +	item = lookup_one(map, *email, *emaillen);
>  	if (item != NULL) {
>  		me = (struct mailmap_entry *)item->util;
>  		if (me->namemap.nr) {
> @@ -334,7 +369,7 @@ int map_user(struct string_list *map,
>  			 * simple entry.
>  			 */
>  			struct string_list_item *subitem;
> -			subitem = lookup_prefix(&me->namemap, *name, *namelen);
> +			subitem = lookup_one(&me->namemap, *name, *namelen);
>  			if (subitem)
>  				item = subitem;
>  		}

If you turn on DEBUG_MAILMAP=1 at the top of the file and run e.g. an
unbounded --pretty=format=:%aE you can see we'll call map_user() in a
loop for each commit shown. What I'm suggesting above can be read as
"can't we have some solution that achieves the same aims, but which we
can handle purely in add_mapping()?". Both for our case, and for
external parsers/re-implementations.

In any case it would be interesting if v2 amended
t/perf/p4205-log-pretty-formats.sh to test e.g. the impact of linux.git
with all-sha256 entries to see what the cost in the tight loop could be.

  parent reply	other threads:[~2020-12-14 12:41 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-13  1:05 [PATCH 0/1] Hashed mailmap support brian m. carlson
2020-12-13  1:05 ` [PATCH 1/1] mailmap: support hashed entries in mailmaps brian m. carlson
2020-12-13  9:34   ` Johannes Sixt
2020-12-13  9:45     ` Johannes Sixt
2020-12-13 20:38       ` brian m. carlson
2020-12-14  0:09   ` Junio C Hamano
2020-12-16  0:50     ` brian m. carlson
2020-12-14 11:54   ` Ævar Arnfjörð Bjarmason [this message]
2020-12-15 11:13   ` Phillip Wood
2020-12-15  1:48 ` [PATCH 0/1] Hashed mailmap support Jeff King
2020-12-15  2:40   ` Jeff King
2020-12-15 11:15   ` Phillip Wood
2020-12-18  2:29   ` brian m. carlson
2020-12-18  5:56     ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87eejswql6.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).