git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "brian m. carlson" <sandals@crustytoothpaste.net>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 0/1] Hashed mailmap support
Date: Fri, 18 Dec 2020 02:29:45 +0000	[thread overview]
Message-ID: <X9wUGaR3IXcpV0nT@camp.crustytoothpaste.net> (raw)
In-Reply-To: <X9gV3mKwGrHL7PzV@coredump.intra.peff.net>

[-- Attachment #1: Type: text/plain, Size: 3706 bytes --]

On 2020-12-15 at 01:48:14, Jeff King wrote:
> On Sun, Dec 13, 2020 at 01:05:38AM +0000, brian m. carlson wrote:
> 
> > Note that this is not perfect, because a user can simply look up all the
> > hashed values and find out the old values.  However, for projects which
> > wish to adopt the feature, it can be somewhat effective to hash all
> > existing mailmap entries and include some no-op entries from other
> > contributors as well, so as to make this process less convenient.
> 
> I remain unconvinced of the value of any noop entries. Ultimately it's
> easy to invert a one-way hash that comes from a small known set of
> inputs. And that's true whether there are extra noops or not.
> 
> The interesting argument IMHO is that somebody has to _bother_ to invert
> the hash. So it means that the old and new identities do not show up
> next to each other in a file indexed by search engines, etc. That drops
> the low-hanging fruit.
> 
> And from that argument, I think the obvious question becomes: is it
> worth using a real one-way function, as opposed to just obscuring the
> raw bytes (which Ævar went into in more detail). I don't have a strong
> opinion either way (the obvious one in favor is that it's less expensive
> to do so; and something like "git log" will have to either compute a lot
> of these hashes, or cache the hash computations internally).

I don't disagree that it's easy to invert.  The question is, is somebody
going to look at a large set of (e.g., a couple hundred) hashed entries
and be able to easily find ones of people they'd like to make life
difficult for or into whose business they'd like to pry or is it going
to be too inconvenient?  I think base64 makes the job too easy and if it
were me in that situation, I'd prefer a little more effort.

I think there's also the benefit, at least for email addresses, in that
people can map a "private" email address that they used accidentally
into one with more robust filtering without letting bad actors invert it
trivially.  That doesn't mean spammers can't run through the log, but it
does mean that they can't write a simple tool to invert base64 email
addresses they've harvested out of Git repositories.  And we know that
spammers and recruiters (which, in this case, are also spammers) do
indeed scrape repositories via the repository web interfaces.

And as someone who had to download all 21 GB of the Chromium repository
for testing purposes recently, I can tell you that absent a very
compelling use case, nobody's going to want to download that entire
repository just to extract some personal information, especially since
the git index-pack operation is essentially guaranteed to take at least
7 minutes at maximum speed.  So by hashing, we've guaranteed significant
inconvenience unless you have the repository, whereas that's not the
case for base64.  And making abuse even slightly harder can often deter
a surprising amount of it[0].

So I think I'm firmly in favor of hashing.  If that means my patch needs
to implement caching, then I'll reroll with that change.  I think by
switching to a hash table I may be able to actually improve total
performance overall, at least in some cases.

> I think somebody also mentioned that there's value in the social
> signaling here, and I agree with that. But that is true even for a
> reversible encoding, I think.

That's true, I agree.  And for many projects, that will be sufficient.
If I saw a hashed mailmap entry, I would assume that it was intended to
be private and would respect that.

[0] See, for example, greylisting.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

  parent reply	other threads:[~2020-12-18  2:31 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-13  1:05 [PATCH 0/1] Hashed mailmap support brian m. carlson
2020-12-13  1:05 ` [PATCH 1/1] mailmap: support hashed entries in mailmaps brian m. carlson
2020-12-13  9:34   ` Johannes Sixt
2020-12-13  9:45     ` Johannes Sixt
2020-12-13 20:38       ` brian m. carlson
2020-12-14  0:09   ` Junio C Hamano
2020-12-16  0:50     ` brian m. carlson
2020-12-14 11:54   ` Ævar Arnfjörð Bjarmason
2020-12-15 11:13   ` Phillip Wood
2020-12-15  1:48 ` [PATCH 0/1] Hashed mailmap support Jeff King
2020-12-15  2:40   ` Jeff King
2020-12-15 11:15   ` Phillip Wood
2020-12-18  2:29   ` brian m. carlson [this message]
2020-12-18  5:56     ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=X9wUGaR3IXcpV0nT@camp.crustytoothpaste.net \
    --to=sandals@crustytoothpaste.net \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).