git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: "brian m. carlson" <sandals@crustytoothpaste.net>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 0/1] Hashed mailmap support
Date: Fri, 18 Dec 2020 00:56:46 -0500	[thread overview]
Message-ID: <X9xEnpLeZ4mCjwWF@coredump.intra.peff.net> (raw)
In-Reply-To: <X9wUGaR3IXcpV0nT@camp.crustytoothpaste.net>

On Fri, Dec 18, 2020 at 02:29:45AM +0000, brian m. carlson wrote:

> > And from that argument, I think the obvious question becomes: is it
> > worth using a real one-way function, as opposed to just obscuring the
> > raw bytes (which Ævar went into in more detail). I don't have a strong
> > opinion either way (the obvious one in favor is that it's less expensive
> > to do so; and something like "git log" will have to either compute a lot
> > of these hashes, or cache the hash computations internally).
> [...]
> So I think I'm firmly in favor of hashing.  If that means my patch needs
> to implement caching, then I'll reroll with that change.  I think by
> switching to a hash table I may be able to actually improve total
> performance overall, at least in some cases.

OK. I agree it raises the bar a little bit. Whether that matters or not
depends on your threat model (e.g., casual spammers versus dedicated
information seekers). I don't have a particularly strong opinion on
what's realistic, but I don't mind erring on the side of caution here.

It might be worth making a short argument along those lines in the
commit message.

As far as caching goes, my main concern is mostly that people who are
not using the feature do not pay a performance penalty. So:

  - if the feature is not used in the repository's mailmap, it should
    have zero cost (i.e., we do not bother hashing lookup entries if
    there are no hashed entries in the map)

  - as soon as there is one hashed entry, we need to hash the key for
    every lookup in the map. I'm not sure what the overhead is like. It
    might be negligible. But I think we should confirm that before
    proceeding.

> And as someone who had to download all 21 GB of the Chromium repository
> for testing purposes recently, I can tell you that absent a very
> compelling use case, nobody's going to want to download that entire
> repository just to extract some personal information, especially since
> the git index-pack operation is essentially guaranteed to take at least
> 7 minutes at maximum speed.  So by hashing, we've guaranteed significant
> inconvenience unless you have the repository, whereas that's not the
> case for base64.  And making abuse even slightly harder can often deter
> a surprising amount of it[0].

They just need the objects that have ident lines in them, so:

  $ time git clone --bare --filter=tree:0 https://github.com/chromium/chromium
  Cloning into bare repository 'chromium.git'...
  remote: Enumerating objects: 202, done.
  remote: Counting objects: 100% (202/202), done.
  remote: Compressing objects: 100% (161/161), done.
  remote: Total 1105453 (delta 49), reused 194 (delta 41), pack-reused 1105251
  Receiving objects: 100% (1105453/1105453), 462.14 MiB | 11.13 MiB/s, done.
  Resolving deltas: 100% (99790/99790), done.
  
  real	0m49.304s
  user	0m21.330s
  sys	0m4.727s

gets you there much quicker. I don't think that negates your point about
raising the bar, but my guess is that the threat model of "casual
spammer" would probably be deterred, but "troll who wants to annoy
specific person" would probably not be.

-Peff

      reply	other threads:[~2020-12-18  5:57 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-13  1:05 [PATCH 0/1] Hashed mailmap support brian m. carlson
2020-12-13  1:05 ` [PATCH 1/1] mailmap: support hashed entries in mailmaps brian m. carlson
2020-12-13  9:34   ` Johannes Sixt
2020-12-13  9:45     ` Johannes Sixt
2020-12-13 20:38       ` brian m. carlson
2020-12-14  0:09   ` Junio C Hamano
2020-12-16  0:50     ` brian m. carlson
2020-12-14 11:54   ` Ævar Arnfjörð Bjarmason
2020-12-15 11:13   ` Phillip Wood
2020-12-15  1:48 ` [PATCH 0/1] Hashed mailmap support Jeff King
2020-12-15  2:40   ` Jeff King
2020-12-15 11:15   ` Phillip Wood
2020-12-18  2:29   ` brian m. carlson
2020-12-18  5:56     ` Jeff King [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=X9xEnpLeZ4mCjwWF@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).