From: Junio C Hamano <gitster@pobox.com>
To: "Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
"brian m. carlson" <sandals@crustytoothpaste.net>,
"Georgios Kontaxis" <geko1702+commits@99rst.org>
Subject: Re: [PATCH v4] gitweb: redacted e-mail addresses feature.
Date: Mon, 22 Mar 2021 11:32:46 -0700 [thread overview]
Message-ID: <xmqqlfaf6nu9.fsf@gitster.g> (raw)
In-Reply-To: <pull.910.v4.git.1616396267010.gitgitgadget@gmail.com> (Georgios Kontaxis via GitGitGadget's message of "Mon, 22 Mar 2021 06:57:46 +0000")
"Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
writes:
[note to other reviewers. input from those who are more familiar
with gitweb and Perl is very much appreciated on this patch].
> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>
> Gitweb extracts content from the Git log and makes it accessible
> over HTTP. As a result, e-mail addresses found in commits are
> exposed to web crawlers and they may not respect robots.txt.
> This may result in unsolicited messages.
"... are exposed to web crawlers, which spammers may use." would be
sufficient as a problem description.
After giving a problem description, it is customery to describe the
solution as if you are ordering the codebase to "be like so", so
instead of this ...
> This is a feature for redacting e-mail addresses
> from the generated HTML, etc. content.
... we may say something like
Introduce an 'email-privacy' feature, which defaults to false,
that redacts e-mail addresses that appear as author/committer
info and in log messages from the generated HTML content.
> This feature does not prevent someone from downloading the
> unredacted commit log, e.g., by cloning the repository, and
> extracting information from it.
> It aims to hinder the low-effort bulk collection of e-mail
> addresses by web crawlers.
And this is a good thing to add. Overall, nicely written.
> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
> ---
> gitweb: redacted e-mail addresses feature.
>
> Gitweb extracts content from the Git log and makes it accessible over
> HTTP. As a result, e-mail addresses found in commits are exposed to web
> crawlers and they may not respect robots.txt. This may result in
> unsolicited messages. This is a feature for redacting e-mail addresses
> from the generated HTML, etc. content.
>
> This feature does not prevent someone from downloading the unredacted
> commit log, e.g., by cloning the repository, and extracting information
> from it. It aims to hinder the low-effort bulk collection of e-mail
> addresses by web crawlers.
You do not need to repeat the above, which is in the log message above.
> Changes since v1:
>
> * Turned off the feature by default.
> * Removed duplicate code.
> * Added note about Gitweb consumers receiving redacted logs.
>
> Changes since v2:
>
> * The feature can be set on a per-project basis. ('override' => 1)
>
> Changes since v3:
>
> * Renamed feature to "email-privacy" and improved documentation.
> * Removed UI elements for git-format-patch since it won't be redacted.
> * Simplified calls to the address redaction logic.
> * Mail::Address is now used to reduce false-positive redactions.
Having these under the --- line like this is very helpful.
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 0959a782eccb..6630c76d92fd 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -21,6 +21,7 @@
> use File::Basename qw(basename);
> use Time::HiRes qw(gettimeofday tv_interval);
> use Digest::MD5 qw(md5_hex);
> +use Git::LoadCPAN::Mail::Address;
I'll defer to others who are more familiar with gitweb and Perl
ecosystem if this is warranted, but I have a feeling that importing
and using Mail::Address->parse() only because we want to see if a
given "<string>" is an address is a bit overkill and it might be
sufficient to do something as crude as m/^<[^@>]+@[a-z0-9-.]+>$/i
> + # Redact e-mail addresses.
> +
> + # To enable system wide have in $GITWEB_CONFIG
> + # $feature{'email-privacy'}{'default'} = [1];
> + 'email-privacy' => {
> + 'sub' => sub { feature_bool('email-privacy', @_) },
> + 'override' => 1,
> + 'default' => [0]},
> );
>
> sub gitweb_get_feature {
> @@ -3449,6 +3459,32 @@ sub parse_date {
> return %date;
> }
>
> +sub is_mailaddr {
> + my @addrs = Mail::Address->parse(shift);
> + if (!@addrs || !$addrs[0]->host || !$addrs[0]->user) {
> + return 0;
> + }
> + return 1;
> +}
> +
> +sub hide_mailaddrs_if_private {
> + my $line = shift;
> + return $line unless gitweb_check_feature('email-privacy');
> + while ($line =~ m/(<[^>]+>)/g) {
> + my $match = $1;
> + if (!is_mailaddr($match)) {
> + next;
> + }
> + my $offset = pos $line;
> + my $head = substr $line, 0, $offset - length($match);
> + my $redaction = "<redacted>";
> + my $tail = substr $line, $offset;
> + $line = $head . $redaction . $tail;
> + pos $line = length($head) + length($redaction);
Hmmmm, Perl suggestions from others? It looks quite strange to see
that s/// operator is not used and replacement is done manually with
byte position in a Perl script.
> sub parse_tag {
> my $tag_id = shift;
> my %tag;
> @@ -3465,7 +3501,7 @@ sub parse_tag {
> } elsif ($line =~ m/^tag (.+)$/) {
> $tag{'name'} = $1;
> } elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) {
> - $tag{'author'} = $1;
> + $tag{'author'} = hide_mailaddrs_if_private($1);
> $tag{'author_epoch'} = $2;
> $tag{'author_tz'} = $3;
> if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
This (and the others that follow the same pattern) looks sensible.
> @@ -7489,7 +7526,8 @@ sub git_log_generic {
> -accesskey => "n", -title => "Alt-n"}, "next");
> }
> my $patch_max = gitweb_get_feature('patches');
> - if ($patch_max && !defined $file_name) {
> + if ($patch_max && !defined $file_name &&
> + !gitweb_check_feature('email-privacy')) {
> if ($patch_max < 0 || @commitlist <= $patch_max) {
> $paging_nav .= " ⋅ " .
> $cgi->a({-href => href(action=>"patches", -replay=>1)},
> @@ -7550,7 +7588,8 @@ sub git_commit {
> } @$parents ) .
> ')';
> }
> - if (gitweb_check_feature('patches') && @$parents <= 1) {
> + if (gitweb_check_feature('patches') && @$parents <= 1 &&
> + !gitweb_check_feature('email-privacy')) {
> $formats_nav .= " | " .
> $cgi->a({-href => href(action=>"patch", -replay=>1)},
> "patch");
> @@ -7863,7 +7902,8 @@ sub git_commitdiff {
> $formats_nav =
> $cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)},
> "raw");
> - if ($patch_max && @{$co{'parents'}} <= 1) {
> + if ($patch_max && @{$co{'parents'}} <= 1 &&
> + !gitweb_check_feature('email-privacy')) {
> $formats_nav .= " | " .
> $cgi->a({-href => href(action=>"patch", -replay=>1)},
> "patch");
I wouldn't have expected to see the above three hunks in the
"patch" codepath. Rather, I was hoping that you'd do something
like this at startup when you find out that the privacy feature
is enabled:
$feature{'patches'}{'default'} = [0]
if gitweb_get_feature('email-privacy');
so that nothing related to the 'patches' has to be modified.
That way, even if there were fourth place that can leak an e-mail
address in the 'patch' codepath that above three hunks in this patch
missed, crawlers won't be able to get at it, no?
> diff --git a/t/lib-gitweb.sh b/t/lib-gitweb.sh
> index 1f32ca66ea51..77fc1298d4c6 100644
> --- a/t/lib-gitweb.sh
> +++ b/t/lib-gitweb.sh
> @@ -67,6 +67,9 @@ gitweb_run () {
> GITWEB_CONFIG=$(pwd)/gitweb_config.perl
> export GITWEB_CONFIG
>
> + PERL5LIB="$GIT_BUILD_DIR/perl:$GIT_BUILD_DIR/perl/FromCPAN"
> + export PERL5LIB
> +
Why is this change suddenly become necessary? This addition is only
about tests---do we need to do something similar in the runtime
environment when the updated gitweb that requires Mail::Address gets
deployed, or is that covered already somewhere else?
Thanks.
next prev parent reply other threads:[~2021-03-22 18:33 UTC|newest]
Thread overview: 41+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-03-20 23:42 [PATCH] gitweb: redacted e-mail addresses feature Georgios Kontaxis via GitGitGadget
2021-03-21 0:42 ` Ævar Arnfjörð Bjarmason
2021-03-21 1:27 ` brian m. carlson
2021-03-21 3:30 ` Georgios Kontaxis
2021-03-21 3:32 ` [PATCH v2] " Georgios Kontaxis via GitGitGadget
2021-03-21 17:28 ` [PATCH v3] " Georgios Kontaxis via GitGitGadget
2021-03-21 18:26 ` Ævar Arnfjörð Bjarmason
2021-03-21 18:48 ` Junio C Hamano
2021-03-21 19:48 ` Georgios Kontaxis
2021-03-21 18:42 ` Junio C Hamano
2021-03-21 18:57 ` Junio C Hamano
2021-03-21 19:05 ` Junio C Hamano
2021-03-21 20:07 ` Georgios Kontaxis
2021-03-21 22:17 ` Junio C Hamano
2021-03-21 23:14 ` Georgios Kontaxis
2021-03-22 4:25 ` Junio C Hamano
2021-03-22 6:57 ` [PATCH v4] " Georgios Kontaxis via GitGitGadget
2021-03-22 18:32 ` Junio C Hamano [this message]
2021-03-22 18:58 ` Georgios Kontaxis
2021-03-28 1:41 ` Junio C Hamano
2021-03-28 21:43 ` Georgios Kontaxis
2021-03-28 22:35 ` Junio C Hamano
2021-03-23 4:27 ` Georgios Kontaxis
2021-03-27 3:56 ` [PATCH v5] " Georgios Kontaxis via GitGitGadget
2021-03-28 23:26 ` [PATCH v6] " Georgios Kontaxis via GitGitGadget
2021-03-29 20:00 ` Junio C Hamano
2021-03-31 21:14 ` Junio C Hamano
2021-04-06 0:56 ` Junio C Hamano
2021-04-08 22:43 ` Ævar Arnfjörð Bjarmason
2021-04-08 22:51 ` Junio C Hamano
2021-03-29 1:47 ` [PATCH v5] " Eric Wong
2021-03-29 3:17 ` Georgios Kontaxis
2021-04-08 17:16 ` Eric Wong
2021-04-08 21:04 ` Junio C Hamano
2021-04-08 21:19 ` Eric Wong
2021-04-08 22:45 ` Ævar Arnfjörð Bjarmason
2021-04-08 22:54 ` Junio C Hamano
2021-03-21 6:00 ` [PATCH] " Junio C Hamano
2021-03-21 6:18 ` Junio C Hamano
2021-03-21 6:43 ` Georgios Kontaxis
2021-03-21 16:55 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=xmqqlfaf6nu9.fsf@gitster.g \
--to=gitster@pobox.com \
--cc=avarab@gmail.com \
--cc=geko1702+commits@99rst.org \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=sandals@crustytoothpaste.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).