git.vger.kernel.org archive mirror
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Georgios Kontaxis via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, Georgios Kontaxis <geko1702+commits@99rst.org>
Subject: Re: [PATCH] gitweb: redacted e-mail addresses feature.
Date: Sun, 21 Mar 2021 01:42:58 +0100
Message-ID: <8735wpz699.fsf@evledraar.gmail.com>
In-Reply-To: <pull.910.git.1616283780358.gitgitgadget@gmail.com>


On Sun, Mar 21 2021, Georgios Kontaxis via GitGitGadget wrote:

> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>
> Gitweb extracts content from the Git log and makes it accessible
> over HTTP. As a result, e-mail addresses found in commits are
> exposed to web crawlers. This may result in unsolicited messages.
> This is a feature for redacting e-mail addresses from the generated
> HTML content.
>
> This feature does not prevent someone from downloading the
> unredacted commit log and extracting information from it.
> It aims to hinder the low-effort bulk collection of e-mail
> addresses by web crawlers.

So web crawlers that aren't going to obey robots.txt?

I'm not opposed to this feature, but a glance at gitweb's documentation
seems to show that we don't discuss how to set robots.txt up for it at
all.

Perhaps having that in the docs or otherwise in the default setup would
get us most of the win of this feature?
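
For illustration only, a minimal robots.txt along these lines might look
as follows. The /gitweb/ path is an assumption about where a site serves
gitweb, not something taken from gitweb's documentation, and the file
only helps with crawlers that choose to honor it:

```
User-agent: *
Disallow: /gitweb/
```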

> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
> ---

Odd:

>     gitweb: Redacted e-mail addresses feature.
>     
>     Gitweb extracts content from the Git log and makes it accessible over
>     HTTP. As a result, e-mail addresses found in commits are exposed to web
>     crawlers. This may result in unsolicited messages. This is a feature for
>     redacting e-mail addresses from the generated HTML content.
>     
>     This feature does not prevent someone from downloading the unredacted
>     commit log and extracting information from it. It aims to hinder the
>     low-effort bulk collection of e-mail addresses by web crawlers.
>     
>     Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org

That is, the commit message is duplicated here below the "---". Is this
some GGG (GitGitGadget) feature gone awry?

> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/910
>
>  Documentation/gitweb.conf.txt | 12 ++++++++++++
>  gitweb/gitweb.perl            | 36 ++++++++++++++++++++++++++++++++---
>  2 files changed, 45 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
> index 7963a79ba98b..10653d8670a8 100644
> --- a/Documentation/gitweb.conf.txt
> +++ b/Documentation/gitweb.conf.txt
> @@ -896,6 +896,18 @@ same as of the snippet above:
>  It is an error to specify a ref that does not pass "git check-ref-format"
>  scrutiny. Duplicated values are filtered.
>  
> +email_privacy::
> +    Redact e-mail addresses from the generated HTML, etc. content.
> +    This hides e-mail addresses found in the commit log from web crawlers.
> +    Enabled by default.
> ++
> +It is highly recommended to keep this feature enabled unless web crawlers
> +are hindered in some other way. You can disable this feature as shown below:
> ++
> +---------------------------------------------------------------------------
> +$feature{'email_privacy'}{'default'} = [0];
> +---------------------------------------------------------------------------

I think there are plenty of gitweb users who rely on the current
behavior, so doesn't it make more sense for this to be opt-in rather
than opt-out?
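
As a sketch of what opt-in could look like, assuming the patch's default
were flipped to [0]: a site that wants redaction would then enable it in
its $GITWEB_CONFIG. The %feature hash below is a cut-down stand-in for
the one in gitweb.perl so that the snippet runs on its own:

```perl
use strict;
use warnings;

# Cut-down stand-in for gitweb's %feature hash, with the hypothetical
# opt-in default of [0] (the patch as posted uses [1]).
our %feature = (
    'email_privacy' => {
        'override' => 0,
        'default' => [0]},
);

# What a site administrator would put in $GITWEB_CONFIG to opt in:
$feature{'email_privacy'}{'default'} = [1];
```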

>  
>  EXAMPLES
>  --------
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 0959a782eccb..9d21c2583e18 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -569,6 +569,15 @@ sub evaluate_uri {
>  		'sub' => \&feature_extra_branch_refs,
>  		'override' => 0,
>  		'default' => []},
> +
> +    # Redact e-mail addresses.
> +
> +    # To disable system wide have in $GITWEB_CONFIG
> +    # $feature{'email_privacy'}{'default'} = [0];
> +	'email_privacy' => {
> +		'sub' => sub { feature_bool('email_privacy', @_) },
> +		'override' => 0,
> +		'default' => [1]},
>  );
> [...]
>  sub gitweb_get_feature {
> @@ -3471,6 +3480,10 @@ sub parse_tag {
>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$tag{'author_name'}  = $1;
>  				$tag{'author_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$tag{'author_email'} = "private";
> +					$tag{'author'} =~ s/<([^>]+)>/<private>/;
> +				}
>  			} else {
>  				$tag{'author_name'} = $tag{'author'};
>  			}
> @@ -3519,6 +3532,10 @@ sub parse_commit_text {
>  			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$co{'author_name'}  = $1;
>  				$co{'author_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$co{'author_email'} = "private";
> +					$co{'author'} =~ s/<([^>]+)>/<private>/;
> +				}
>  			} else {
>  				$co{'author_name'} = $co{'author'};
>  			}
> @@ -3529,6 +3546,10 @@ sub parse_commit_text {
>  			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$co{'committer_name'}  = $1;
>  				$co{'committer_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$co{'committer_email'} = "private";
> +					$co{'committer'} =~ s/<([^>]+)>/<private>/;
> +				}
>  			} else {
>  				$co{'committer_name'} = $co{'committer'};
>  			}
> @@ -3568,9 +3589,13 @@ sub parse_commit_text {
>  	if (! defined $co{'title'} || $co{'title'} eq "") {
>  		$co{'title'} = $co{'title_short'} = '(no commit message)';
>  	}
> -	# remove added spaces
> +	# remove added spaces, redact e-mail addresses if applicable.
>  	foreach my $line (@commit_lines) {
>  		$line =~ s/^    //;
> +		if (gitweb_check_feature('email_privacy') &&
> +			$line =~ m/^([^<]+) <([^>]*)>/) {
> +			$line =~ s/<([^>]+)>/<private>/;
> +		}
>  	}
>  	$co{'comment'} = \@commit_lines;

All of these hunks (and the ones below) should use a new function that
does the feature check and the sanitizing, instead of copy/pasting
mostly the same code N times. E.g.:

    sub maybe_hide_email {
        my $email = shift;
        return $email unless gitweb_check_feature('email_privacy');
        return hide_email($email);
    }

then:

    $tag{author_email} = maybe_hide_email($2);
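
To make that concrete, here is a minimal runnable sketch. The body of
hide_email() is an assumption (the function is left elided in this
reply), maybe_hide_ident() is a hypothetical companion for whole
"Name <addr>" strings, and gitweb_check_feature() is stubbed so the
snippet runs outside gitweb:

```perl
use strict;
use warnings;

# Stand-in for gitweb's feature lookup so the sketch runs on its own;
# inside gitweb.perl the real gitweb_check_feature() would be used.
our %enabled = (email_privacy => 1);
sub gitweb_check_feature { return $enabled{$_[0]}; }

# Assumption: always return a fixed placeholder, mirroring the
# "private" string used in the patch.
sub hide_email {
    my $email = shift;
    return "private";
}

sub maybe_hide_email {
    my $email = shift;
    return $email unless gitweb_check_feature('email_privacy');
    return hide_email($email);
}

# Hypothetical helper: redact the <...> part of a "Name <addr>" string.
sub maybe_hide_ident {
    my $ident = shift;
    return $ident unless gitweb_check_feature('email_privacy');
    $ident =~ s/<([^>]*)>/'<' . hide_email($1) . '>'/e;
    return $ident;
}
```

With helpers like these, each hunk would shrink to one call per field,
and the placeholder text would live in a single place.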

Also, this doesn't look like a new issue, but does gitweb need to
implement its own e-mail parser? We ship with Mail::Address for
git-send-email. Can gitweb (and the elided hide_email() function above)
use that too?
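
A rough sketch of what that could look like, assuming Mail::Address can
be loaded. split_ident() is a hypothetical name, and the fallback regex
is the one gitweb already uses in the hunks quoted above:

```perl
use strict;
use warnings;

# Mail::Address is optional here: load it if present, otherwise fall
# back to the regex gitweb uses today.
my $have_mail_address = eval { require Mail::Address; 1 };

# Hypothetical helper: split "Name <addr>" into (name, address).
sub split_ident {
    my $ident = shift;
    if ($have_mail_address) {
        my ($addr) = Mail::Address->parse($ident);
        return ($addr->phrase, $addr->address) if defined $addr;
    }
    return ($1, $2) if $ident =~ m/^([^<]+) <([^>]*)>/;
    return ($ident, undef);
}

my ($name, $email) = split_ident('A U Thor <author@example.com>');
print "$name / $email\n";
```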


> @@ -8060,8 +8085,13 @@ sub git_commitdiff {
>  		close $fd
>  			or print "Reading git-diff-tree failed\n";
>  	} elsif ($format eq 'patch') {
> -		local $/ = undef;
> -		print <$fd>;
> +		while (my $line = <$fd>) {
> +			if (gitweb_check_feature('email_privacy') &&
> +				$line =~ m/^([^<]+) <([^>]*)>/) {
> +				$line =~ s/<([^>]+)>/<private>/;
> +			}
> +			print $line;
> +		}
>  		close $fd
>  			or print "Reading git-format-patch failed\n";

Is that "patch" output meant for "git am"? Won't this severely break
that use-case if so?
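
To illustrate the concern, the snippet below applies the same per-line
match and substitution as the hunk above to a few made-up lines of the
kind found in "git format-patch" output. redact_line() is a name chosen
for this sketch, and the feature check is dropped so it runs on its own:

```perl
use strict;
use warnings;

# The per-line redaction from the git_commitdiff() hunk, minus the
# gitweb_check_feature() call.
sub redact_line {
    my $line = shift;
    if ($line =~ m/^([^<]+) <([^>]*)>/) {
        $line =~ s/<([^>]+)>/<private>/;
    }
    return $line;
}

# Made-up sample lines: a mail header, a trailer, and an added diff line.
my @patch = (
    "From: A U Thor <author\@example.com>\n",
    "Signed-off-by: A U Thor <author\@example.com>\n",
    "+#include <stdio.h>\n",
);
print redact_line($_) for @patch;
# Prints:
#   From: A U Thor <private>
#   Signed-off-by: A U Thor <private>
#   +#include <private>
```

The From: header no longer carries the author's address, and the last
line shows that the pattern is not limited to e-mail addresses: it also
rewrites diff content, so the patch would no longer carry the original
change.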
