All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/3] gitweb: Fix an ancient typo in v1.7.7-rc1-1-g0866786
@ 2016-09-21 11:44 Ævar Arnfjörð Bjarmason
  2016-09-21 11:44 ` [PATCH 2/3] gitweb: Link to 7-character SHA1SUMS in commit messages Ævar Arnfjörð Bjarmason
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2016-09-21 11:44 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jakub Narebski, Kay Sievers,
	Ævar Arnfjörð Bjarmason

The Content-Type is application/xhtml+xml, not application/xhtm+xml.
---
 gitweb/gitweb.perl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 33d701d..9473daf 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1616,7 +1616,7 @@ sub esc_path {
 	return $str;
 }
 
-# Sanitize for use in XHTML + application/xml+xhtm (valid XML 1.0)
+# Sanitize for use in XHTML + application/xml+xhtml (valid XML 1.0)
 sub sanitize {
 	my $str = shift;
 
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/3] gitweb: Link to 7-character SHA1SUMS in commit messages
  2016-09-21 11:44 [PATCH 1/3] gitweb: Fix an ancient typo in v1.7.7-rc1-1-g0866786 Ævar Arnfjörð Bjarmason
@ 2016-09-21 11:44 ` Ævar Arnfjörð Bjarmason
  2016-09-21 16:26   ` Jakub Narębski
  2016-09-21 11:44 ` [PATCH 3/3] gitweb: Link to "git describe"'d commits in log messages Ævar Arnfjörð Bjarmason
  2016-09-21 13:33 ` [PATCH 1/3] gitweb: Fix an ancient typo in v1.7.7-rc1-1-g0866786 Jakub Narębski
  2 siblings, 1 reply; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2016-09-21 11:44 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jakub Narebski, Kay Sievers,
	Ævar Arnfjörð Bjarmason

Change the minimum length of a commit we'll link to from 8 to 7.

This arbitrary minimum length of 8 was introduced in
v1.4.4.2-151-gbfe2191, but as seen in e.g. v1.7.4-1-gdce9648 the
default abbreviation length is 7.

It's still possible to reference SHA1s down to 4 characters in length,
see v1.7.4-1-gdce9648's MINIMUM_ABBREV, but I can't see how to make
git actually produce that, so I doubt anyone is putting that into log
messages in practice, but people definitely do put 7 character SHA1s
into log messages.

I think it's fairly dubious to link to things matching [0-9a-fA-F]
here as opposed to just [0-9a-f], that dates back to the initial
version of gitweb from 161332a. Git will accept all-caps SHA1s, but
didn't ever produce them as far as I can tell.
---
 gitweb/gitweb.perl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 9473daf..101dbc0 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -2036,7 +2036,7 @@ sub format_log_line_html {
 	my $line = shift;
 
 	$line = esc_html($line, -nbsp=>1);
-	$line =~ s{\b([0-9a-fA-F]{8,40})\b}{
+	$line =~ s{\b([0-9a-fA-F]{7,40})\b}{
 		$cgi->a({-href => href(action=>"object", hash=>$1),
 					-class => "text"}, $1);
 	}eg;
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/3] gitweb: Link to "git describe"'d commits in log messages
  2016-09-21 11:44 [PATCH 1/3] gitweb: Fix an ancient typo in v1.7.7-rc1-1-g0866786 Ævar Arnfjörð Bjarmason
  2016-09-21 11:44 ` [PATCH 2/3] gitweb: Link to 7-character SHA1SUMS in commit messages Ævar Arnfjörð Bjarmason
@ 2016-09-21 11:44 ` Ævar Arnfjörð Bjarmason
  2016-09-21 16:50   ` Junio C Hamano
  2016-09-21 17:09   ` Jakub Narębski
  2016-09-21 13:33 ` [PATCH 1/3] gitweb: Fix an ancient typo in v1.7.7-rc1-1-g0866786 Jakub Narębski
  2 siblings, 2 replies; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2016-09-21 11:44 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jakub Narebski, Kay Sievers,
	Ævar Arnfjörð Bjarmason

Change the log formatting function to know about "git describe" output
like v2.8.0-4-g867ad08 in addition to just plain 867ad08.

This also fixes a micro-regression in my change of the minimum SHA1
length from 8 to 7, which is that dated tags like
hadoop-20160921-113441-20-g094fb7d would start thinking the "20160921"
part was a commit.

There are still many valid refnames that we don't link to
e.g. v2.10.0-rc1~2^2~1 is also a valid way to refer to
v2.8.0-4-g867ad08, but I'm not supporting that with this commit,
similarly it's trivially possible to create some refnames like
"æ/var-gf6727b0" or whatever which won't be picked up by this regex.

There's surely room for improvement here, but I just wanted to address
the very common case of sticking "git describe" output into commit
messages without trying to link to all possible refnames, that's going
to be a rather futile exercise given that this is free text, and it
would be prohibitively expensive to look up whether the references in
question exist in our repository.
---
 gitweb/gitweb.perl | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 101dbc0..3a52bc7 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -2036,10 +2036,24 @@ sub format_log_line_html {
 	my $line = shift;
 
 	$line = esc_html($line, -nbsp=>1);
-	$line =~ s{\b([0-9a-fA-F]{7,40})\b}{
+	$line =~ s{
+        \b
+        (
+            # The output of "git describe", e.g. v2.10.0-297-gf6727b0
+            # or hadoop-20160921-113441-20-g094fb7d
+            (?<!-) # see strbuf_check_tag_ref(). Tags can't start with -
+            [A-Za-z0-9.-]+
+            (?!\.) # refs can't end with ".", see check_refname_format()
+            -g[0-9a-fA-F]{7,40}
+            |
+            # Just a normal looking Git SHA1
+            [0-9a-fA-F]{7,40}
+        )
+        \b
+    }{
 		$cgi->a({-href => href(action=>"object", hash=>$1),
 					-class => "text"}, $1);
-	}eg;
+	}egx;
 
 	return $line;
 }
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] gitweb: Fix an ancient typo in v1.7.7-rc1-1-g0866786
  2016-09-21 11:44 [PATCH 1/3] gitweb: Fix an ancient typo in v1.7.7-rc1-1-g0866786 Ævar Arnfjörð Bjarmason
  2016-09-21 11:44 ` [PATCH 2/3] gitweb: Link to 7-character SHA1SUMS in commit messages Ævar Arnfjörð Bjarmason
  2016-09-21 11:44 ` [PATCH 3/3] gitweb: Link to "git describe"'d commits in log messages Ævar Arnfjörð Bjarmason
@ 2016-09-21 13:33 ` Jakub Narębski
  2016-09-21 14:17   ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 17+ messages in thread
From: Jakub Narębski @ 2016-09-21 13:33 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, git; +Cc: Junio C Hamano, Kay Sievers

W dniu 21.09.2016 o 13:44, Ævar Arnfjörð Bjarmason napisał:

> Subject: [PATCH 1/3] gitweb: Fix an ancient typo in v1.7.7-rc1-1-g0866786

It is more "ancient typo from v1.7.7-rc1-1-g0866786", but perhaps more
important is "ancient typo in a comment"

>
> The Content-Type is application/xhtml+xml, not application/xhtm+xml.

Right.  Thanks for the patch.

Signoff?

> ---
>  gitweb/gitweb.perl | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 33d701d..9473daf 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -1616,7 +1616,7 @@ sub esc_path {
>  	return $str;
>  }
>  
> -# Sanitize for use in XHTML + application/xml+xhtm (valid XML 1.0)
> +# Sanitize for use in XHTML + application/xml+xhtml (valid XML 1.0)

Nb. I wonder how common is use of XHTML nowadays, with HTML5 as standard...

>  sub sanitize {
>  	my $str = shift;
>  
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] gitweb: Fix an ancient typo in v1.7.7-rc1-1-g0866786
  2016-09-21 13:33 ` [PATCH 1/3] gitweb: Fix an ancient typo in v1.7.7-rc1-1-g0866786 Jakub Narębski
@ 2016-09-21 14:17   ` Ævar Arnfjörð Bjarmason
  2016-09-21 17:14     ` Jakub Narębski
  0 siblings, 1 reply; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2016-09-21 14:17 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Git, Junio C Hamano, Kay Sievers

On Wed, Sep 21, 2016 at 3:33 PM, Jakub Narębski <jnareb@gmail.com> wrote:
> W dniu 21.09.2016 o 13:44, Ævar Arnfjörð Bjarmason napisał:
>
>> Subject: [PATCH 1/3] gitweb: Fix an ancient typo in v1.7.7-rc1-1-g0866786
>
> It is more "ancient typo from v1.7.7-rc1-1-g0866786", but perhaps more
> important is "ancient typo in a comment"

Yeah, will rephrase.

>>
>> The Content-Type is application/xhtml+xml, not application/xhtm+xml.
>
> Right.  Thanks for the patch.
>
> Signoff?

Blast! I forgot that for these 3x patches. I'll re-submit pending
further comments on the rest of the code changes in the series.

>> ---
>>  gitweb/gitweb.perl | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>> index 33d701d..9473daf 100755
>> --- a/gitweb/gitweb.perl
>> +++ b/gitweb/gitweb.perl
>> @@ -1616,7 +1616,7 @@ sub esc_path {
>>       return $str;
>>  }
>>
>> -# Sanitize for use in XHTML + application/xml+xhtm (valid XML 1.0)
>> +# Sanitize for use in XHTML + application/xml+xhtml (valid XML 1.0)
>
> Nb. I wonder how common is use of XHTML nowadays, with HTML5 as standard...

It's sent to modern browsers, I noticed it because when doing the rest
of the patches in the series the slightest mistake in the HTML syntax
would cause the page not to render in Chrome, because
application/xml+xhtm activates its anal parsing mode.

>>  sub sanitize {
>>       my $str = shift;
>>
>>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] gitweb: Link to 7-character SHA1SUMS in commit messages
  2016-09-21 11:44 ` [PATCH 2/3] gitweb: Link to 7-character SHA1SUMS in commit messages Ævar Arnfjörð Bjarmason
@ 2016-09-21 16:26   ` Jakub Narębski
  2016-09-21 18:04     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 17+ messages in thread
From: Jakub Narębski @ 2016-09-21 16:26 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, git; +Cc: Junio C Hamano

W dniu 21.09.2016 o 13:44, Ævar Arnfjörð Bjarmason napisał:

> Subject: [PATCH 2/3] gitweb: Link to 7-character SHA1SUMS in commit messages

This is modification of a feature, not a new feature it sounds like.
I think the following title / subject would be better:

  Subject: [PATCH 2/3] gitweb: Link to 7-char+ SHA1s, not only 8-char+

>
> Change the minimum length of a commit we'll link to from 8 to 7.

I think it would read better as:

  Change the minimum length of an abbreviated object identifier in the
  commit message gitweb tries to turn into link from 8 hexchars to 7.

> 
> This arbitrary minimum length of 8 was introduced in
> v1.4.4.2-151-gbfe2191, but as seen in e.g. v1.7.4-1-gdce9648 the
> default abbreviation length is 7.

Right. I wonder why it was 8 in gitweb...

> 
> It's still possible to reference SHA1s down to 4 characters in length,
> see v1.7.4-1-gdce9648's MINIMUM_ABBREV, but I can't see how to make
> git actually produce that, so I doubt anyone is putting that into log
> messages in practice, but people definitely do put 7 character SHA1s
> into log messages.

There is an additional problem: the shorter SHA1 abbrev we try to
match, the more possibility of false positives, words that only look
like (shortened SHA-1).

For 7 characters there is at last one word that can be mistaken
for SHA1 abbrev, namely 'deedeed' (hopefully rare in commit messages).
For 6 characters we have 'accede', 'beaded', 'decade' (!), 'deface',
'facade' (!!), and possibly more (and of course all 7 character
hexdigit words).

Also, the number of digits provided as an optional parameter to
--abbrev or --abbrev-commit options is only a minimal number of 
hexdigits: Git would use as many as needed for the abbreviated SHA-1
to be unambiguous, at current time.


I think allowing 7-character shortened SHA-1, which is what Git
produces for smaller repositories by default is (might be?) a good
idea.  Thanks for the patch.

> 
> I think it's fairly dubious to link to things matching [0-9a-fA-F]
> here as opposed to just [0-9a-f], that dates back to the initial
> version of gitweb from 161332a. Git will accept all-caps SHA1s, but
> didn't ever produce them as far as I can tell.

All right, thanks for reminder.

Signoff?

> ---
>  gitweb/gitweb.perl | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 9473daf..101dbc0 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -2036,7 +2036,7 @@ sub format_log_line_html {
>  	my $line = shift;
>  
>  	$line = esc_html($line, -nbsp=>1);
> -	$line =~ s{\b([0-9a-fA-F]{8,40})\b}{
> +	$line =~ s{\b([0-9a-fA-F]{7,40})\b}{
>  		$cgi->a({-href => href(action=>"object", hash=>$1),
>  					-class => "text"}, $1);
>  	}eg;
> 

Nice and simple.

P.S. I have reworking of commit message parsing and enhancement in my
long, long and dated gitweb TODO list :-(

P.P.S. Kay Sievers no longer works on gitweb, and I think no longer
works at SuSE but at RedHat.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3] gitweb: Link to "git describe"'d commits in log messages
  2016-09-21 11:44 ` [PATCH 3/3] gitweb: Link to "git describe"'d commits in log messages Ævar Arnfjörð Bjarmason
@ 2016-09-21 16:50   ` Junio C Hamano
  2016-09-21 17:49     ` Jakub Narębski
  2016-09-21 17:09   ` Jakub Narębski
  1 sibling, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2016-09-21 16:50 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, Jakub Narebski, Kay Sievers

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> There's surely room for improvement here, but I just wanted to address
> the very common case of sticking "git describe" output into commit
> messages without trying to link to all possible refnames, that's going
> to be a rather futile exercise given that this is free text, and it
> would be prohibitively expensive to look up whether the references in
> question exist in our repository.

When I saw 2/3 I wondered about one thing and 3/3 shares the same,
which is that we only use regex match and do not validate for a
false match.  Would it be too expensive to pick up what _looks_ like
a rev (e.g. hex or g(refname regexp)-hex) then validate it with
"rev-parse --verify --quiet" to make sure it is a rev, before
actually making it a link?  Even if are we trying to account for
people referring to commits that do not exist in this repository
(e.g. some other project, in a submodule repository, or just an
earlier incarnation of rebasing that has since been lost), it seems
to me that it does not help to mark them with a link that won't
resolve.

> ---
>  gitweb/gitweb.perl | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 101dbc0..3a52bc7 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -2036,10 +2036,24 @@ sub format_log_line_html {
>  	my $line = shift;
>  
>  	$line = esc_html($line, -nbsp=>1);
> -	$line =~ s{\b([0-9a-fA-F]{7,40})\b}{
> +	$line =~ s{
> +        \b
> +        (
> +            # The output of "git describe", e.g. v2.10.0-297-gf6727b0
> +            # or hadoop-20160921-113441-20-g094fb7d
> +            (?<!-) # see strbuf_check_tag_ref(). Tags can't start with -
> +            [A-Za-z0-9.-]+
> +            (?!\.) # refs can't end with ".", see check_refname_format()
> +            -g[0-9a-fA-F]{7,40}
> +            |
> +            # Just a normal looking Git SHA1
> +            [0-9a-fA-F]{7,40}
> +        )
> +        \b
> +    }{
>  		$cgi->a({-href => href(action=>"object", hash=>$1),
>  					-class => "text"}, $1);
> -	}eg;
> +	}egx;
>  
>  	return $line;
>  }

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3] gitweb: Link to "git describe"'d commits in log messages
  2016-09-21 11:44 ` [PATCH 3/3] gitweb: Link to "git describe"'d commits in log messages Ævar Arnfjörð Bjarmason
  2016-09-21 16:50   ` Junio C Hamano
@ 2016-09-21 17:09   ` Jakub Narębski
  2016-09-21 17:58     ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 17+ messages in thread
From: Jakub Narębski @ 2016-09-21 17:09 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, git; +Cc: Junio C Hamano

W dniu 21.09.2016 o 13:44, Ævar Arnfjörð Bjarmason napisał:

> Change the log formatting function to know about "git describe" output
> like v2.8.0-4-g867ad08 in addition to just plain 867ad08.

All right, that is a good plan.

> 
> This also fixes a micro-regression in my change of the minimum SHA1
> length from 8 to 7, which is that dated tags like
> hadoop-20160921-113441-20-g094fb7d would start thinking the "20160921"
> part was a commit.

Actually 20160921 is 8 characters, so assuming that '-' is treated
as word boundary by Perl, it is not a regression; this false positive
was there.  The new feature would help, instead of linking false match
it links whole git-describe output.

So this paragraph needs to be changed wrt. the above.

Note that there are quite a bit of shortened SHA-1 that are composed
entirely from digits, without a-f characters.

> 
> There are still many valid refnames that we don't link to
> e.g. v2.10.0-rc1~2^2~1 is also a valid way to refer to
> v2.8.0-4-g867ad08, but I'm not supporting that with this commit,
> similarly it's trivially possible to create some refnames like
> "æ/var-gf6727b0" or whatever which won't be picked up by this regex.

Hopefully hierarchical tags are rare.  We need to reduce false
positives.

> 
> There's surely room for improvement here, but I just wanted to address
> the very common case of sticking "git describe" output into commit
> messages without trying to link to all possible refnames, that's going
> to be a rather futile exercise given that this is free text, and it
> would be prohibitively expensive to look up whether the references in
> question exist in our repository.

Note that we do not ask Git at the time of displaying commit message
if the link is valid for performance reasons; we link it, and the link
may be invalid if it was a false positive.

Note that recommended way to refer to other commit in commit mesages
is (see Documentation/SubmittingPatches):

  If you want to reference a previous commit in the history of a stable
  branch, use the format "abbreviated sha1 (subject, date)",
  with the subject enclosed in a pair of double-quotes, like this:

     Commit f86a374 ("pack-bitmap.c: fix a memleak", 2015-03-30)
     noticed that ...

Hmmm... this makes previous commit even more important.

> ---
>  gitweb/gitweb.perl | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 101dbc0..3a52bc7 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -2036,10 +2036,24 @@ sub format_log_line_html {
>  	my $line = shift;
>  
>  	$line = esc_html($line, -nbsp=>1);
> -	$line =~ s{\b([0-9a-fA-F]{7,40})\b}{
> +	$line =~ s{
> +        \b
> +        (
> +            # The output of "git describe", e.g. v2.10.0-297-gf6727b0
> +            # or hadoop-20160921-113441-20-g094fb7d

All right, for more complex regular expressions using in-line comments
(extended regexp in Perl) is a good idea.

> +            (?<!-) # see strbuf_check_tag_ref(). Tags can't start with -
> +            [A-Za-z0-9.-]+
> +            (?!\.) # refs can't end with ".", see check_refname_format()

If we can assume that tag name is at least two characters (instead of
at least one character), we could get rid of those extended regexp
lookaround assertions:

  (?<!pattern) - zero-width negative lookbehind assertion
  (?!pattern)  - zero-width negative lookahead  assertion

That is:

  +            [A-Za-z0-9.]   # see strbuf_check_tag_ref(). Tags can't start with -
  +            [A-Za-z0-9.-]*
  +            [A-Za-z0-9-]   # refs can't end with ".", see check_refname_format()

Also, the canonical documentation for what is allowed in refnames
is git-check-ref-format(1)... though it does not look like it includes
"tags cannot start with '-'".

Anyway, perhaps 'is it valid refname' could be passed to a subroutine,
or a named regexp (which might be more involved, like disallowing two
consecutive dots, e.g. "(?!.*\.{2})" at beginning).

> +            -g[0-9a-fA-F]{7,40}

If we are limiting to git-describe output, we can get rid of A-F here.

> +            |
> +            # Just a normal looking Git SHA1
> +            [0-9a-fA-F]{7,40}
> +        )
> +        \b
> +    }{
>  		$cgi->a({-href => href(action=>"object", hash=>$1),
>  					-class => "text"}, $1);
> -	}eg;
> +	}egx;
>  
>  	return $line;
>  }
> 

Good work. 

I assume that you are using git-describe output in commit messages
a lot, isn't it?
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] gitweb: Fix an ancient typo in v1.7.7-rc1-1-g0866786
  2016-09-21 14:17   ` Ævar Arnfjörð Bjarmason
@ 2016-09-21 17:14     ` Jakub Narębski
  2016-09-21 17:17       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 17+ messages in thread
From: Jakub Narębski @ 2016-09-21 17:14 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Git, Junio C Hamano

W dniu 21.09.2016 o 16:17, Ævar Arnfjörð Bjarmason napisał:
> On Wed, Sep 21, 2016 at 3:33 PM, Jakub Narębski <jnareb@gmail.com> wrote:
>> W dniu 21.09.2016 o 13:44, Ævar Arnfjörð Bjarmason napisał:
[...]

>>> -# Sanitize for use in XHTML + application/xml+xhtm (valid XML 1.0)
>>> +# Sanitize for use in XHTML + application/xml+xhtml (valid XML 1.0)
>>
>> Nb. I wonder how common is use of XHTML nowadays, with HTML5 as standard...
> 
> It's sent to modern browsers, I noticed it because when doing the rest
> of the patches in the series the slightest mistake in the HTML syntax
> would cause the page not to render in Chrome, because
> application/xml+xhtml activates its anal parsing mode.

What I wanted to say is if we should support XHTML mimetype at all;
the future is HTML5 and perhaps gitweb should always use 'text/html'.

But this is unrelated to this change...
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] gitweb: Fix an ancient typo in v1.7.7-rc1-1-g0866786
  2016-09-21 17:14     ` Jakub Narębski
@ 2016-09-21 17:17       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2016-09-21 17:17 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Git, Junio C Hamano

On Wed, Sep 21, 2016 at 7:14 PM, Jakub Narębski <jnareb@gmail.com> wrote:
> W dniu 21.09.2016 o 16:17, Ævar Arnfjörð Bjarmason napisał:
>> On Wed, Sep 21, 2016 at 3:33 PM, Jakub Narębski <jnareb@gmail.com> wrote:
>>> W dniu 21.09.2016 o 13:44, Ævar Arnfjörð Bjarmason napisał:
> [...]
>
>>>> -# Sanitize for use in XHTML + application/xml+xhtm (valid XML 1.0)
>>>> +# Sanitize for use in XHTML + application/xml+xhtml (valid XML 1.0)
>>>
>>> Nb. I wonder how common is use of XHTML nowadays, with HTML5 as standard...
>>
>> It's sent to modern browsers, I noticed it because when doing the rest
>> of the patches in the series the slightest mistake in the HTML syntax
>> would cause the page not to render in Chrome, because
>> application/xml+xhtml activates its anal parsing mode.
>
> What I wanted to say is if we should support XHTML mimetype at all;
> the future is HTML5 and perhaps gitweb should always use 'text/html'.

Regardless of what MIME type we'd normally use, as long as browsers
support application/xml+xhtml developing with it is very handy,
because you get the instant equivalent of compile errors for your
HTML, as opposed to the usual behavior of "oh this doesn't parse, but
let's try to make sense of it anyway", which often leads to fruitless
debugging sessions just because you forgot to close some tag or
quotation.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3] gitweb: Link to "git describe"'d commits in log messages
  2016-09-21 16:50   ` Junio C Hamano
@ 2016-09-21 17:49     ` Jakub Narębski
  2016-09-21 18:01       ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: Jakub Narębski @ 2016-09-21 17:49 UTC (permalink / raw)
  To: Junio C Hamano, Ævar Arnfjörð Bjarmason; +Cc: git

W dniu 21.09.2016 o 18:50, Junio C Hamano pisze:
> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
> 
>> There's surely room for improvement here, but I just wanted to address
>> the very common case of sticking "git describe" output into commit
>> messages without trying to link to all possible refnames, that's going
>> to be a rather futile exercise given that this is free text, and it
>> would be prohibitively expensive to look up whether the references in
>> question exist in our repository.
> 
> When I saw 2/3 I wondered about one thing and 3/3 shares the same,
> which is that we only use regex match and do not validate for a
> false match.  Would it be too expensive to pick up what _looks_ like
> a rev (e.g. hex or g(refname regexp)-hex) then validate it with
> "rev-parse --verify --quiet" to make sure it is a rev, before
> actually making it a link?  Even if are we trying to account for
> people referring to commits that do not exist in this repository
> (e.g. some other project, in a submodule repository, or just an
> earlier incarnation of rebasing that has since been lost), it seems
> to me that it does not help to mark them with a link that won't
> resolve.

I think it could be a good *option*, but revision verification
could be costly, for example in the 'log' view with multiple commits
and multiple revision-like looking candidates, even if we were able
to do it with one command.

Also, "git rev-parse --verify [--quiet]" can verify only one
revision at once, isn't it? Maybe something like 'git cat-file
--batch-check' would be better (one fork)?

It's a matter of balance between false positives (and unresolving
links) and performance...

Best,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3] gitweb: Link to "git describe"'d commits in log messages
  2016-09-21 17:09   ` Jakub Narębski
@ 2016-09-21 17:58     ` Ævar Arnfjörð Bjarmason
  2016-09-21 19:13       ` Jakub Narębski
  0 siblings, 1 reply; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2016-09-21 17:58 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Git, Junio C Hamano

On Wed, Sep 21, 2016 at 7:09 PM, Jakub Narębski <jnareb@gmail.com> wrote:
> W dniu 21.09.2016 o 13:44, Ævar Arnfjörð Bjarmason napisał:
>
>> Change the log formatting function to know about "git describe" output
>> like v2.8.0-4-g867ad08 in addition to just plain 867ad08.
>
> All right, that is a good plan.
>
>>
>> This also fixes a micro-regression in my change of the minimum SHA1
>> length from 8 to 7, which is that dated tags like
>> hadoop-20160921-113441-20-g094fb7d would start thinking the "20160921"
>> part was a commit.
>
> Actually 20160921 is 8 characters, so assuming that '-' is treated
> as word boundary by Perl, it is not a regression; this false positive
> was there.  The new feature would help, instead of linking false match
> it links whole git-describe output.
>
> So this paragraph needs to be changed wrt. the above.

Yeah I just miscounted there, will change that.

> Note that there are quite a bit of shortened SHA-1 that are composed
> entirely from digits, without a-f characters.

*nod*

>>
>> There are still many valid refnames that we don't link to
>> e.g. v2.10.0-rc1~2^2~1 is also a valid way to refer to
>> v2.8.0-4-g867ad08, but I'm not supporting that with this commit,
>> similarly it's trivially possible to create some refnames like
>> "æ/var-gf6727b0" or whatever which won't be picked up by this regex.
>
> Hopefully hierarchical tags are rare.  We need to reduce false
> positives.
>
>>
>> There's surely room for improvement here, but I just wanted to address
>> the very common case of sticking "git describe" output into commit
>> messages without trying to link to all possible refnames, that's going
>> to be a rather futile exercise given that this is free text, and it
>> would be prohibitively expensive to look up whether the references in
>> question exist in our repository.
>
> Note that we do not ask Git at the time of displaying commit message
> if the link is valid for performance reasons; we link it, and the link
> may be invalid if it was a false positive.
>
> Note that recommended way to refer to other commit in commit mesages
> is (see Documentation/SubmittingPatches):
>
>   If you want to reference a previous commit in the history of a stable
>   branch, use the format "abbreviated sha1 (subject, date)",
>   with the subject enclosed in a pair of double-quotes, like this:
>
>      Commit f86a374 ("pack-bitmap.c: fix a memleak", 2015-03-30)
>      noticed that ...
>
> Hmmm... this makes previous commit even more important.
>
>> ---
>>  gitweb/gitweb.perl | 18 ++++++++++++++++--
>>  1 file changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>> index 101dbc0..3a52bc7 100755
>> --- a/gitweb/gitweb.perl
>> +++ b/gitweb/gitweb.perl
>> @@ -2036,10 +2036,24 @@ sub format_log_line_html {
>>       my $line = shift;
>>
>>       $line = esc_html($line, -nbsp=>1);
>> -     $line =~ s{\b([0-9a-fA-F]{7,40})\b}{
>> +     $line =~ s{
>> +        \b
>> +        (
>> +            # The output of "git describe", e.g. v2.10.0-297-gf6727b0
>> +            # or hadoop-20160921-113441-20-g094fb7d
>
> All right, for more complex regular expressions using in-line comments
> (extended regexp in Perl) is a good idea.
>
>> +            (?<!-) # see strbuf_check_tag_ref(). Tags can't start with -
>> +            [A-Za-z0-9.-]+
>> +            (?!\.) # refs can't end with ".", see check_refname_format()
>
> If we can assume that tag name is at least two characters (instead of
> at least one character), we could get rid of those extended regexp
> lookaround assertions:
>
>   (?<!pattern) - zero-width negative lookbehind assertion
>   (?!pattern)  - zero-width negative lookahead  assertion
>
> That is:
>
>   +            [A-Za-z0-9.]   # see strbuf_check_tag_ref(). Tags can't start with -
>   +            [A-Za-z0-9.-]*
>   +            [A-Za-z0-9-]   # refs can't end with ".", see check_refname_format()

Why get rid of them? I'm all for improving the regex, there's bound to
be lots of bugs in it, but since it's perl we can freely use its
extended features.

> Also, the canonical documentation for what is allowed in refnames
> is git-check-ref-format(1)... though it does not look like it includes
> "tags cannot start with '-'".

Yeah, looks like that manpage needs to be patched.

> Anyway, perhaps 'is it valid refname' could be passed to a subroutine,
> or a named regexp (which might be more involved, like disallowing two
> consecutive dots, e.g. "(?!.*\.{2})" at beginning).
>
>> +            -g[0-9a-fA-F]{7,40}
>
> If we are limiting to git-describe output, we can get rid of A-F here.

Indeed.

>> +            |
>> +            # Just a normal looking Git SHA1
>> +            [0-9a-fA-F]{7,40}
>> +        )
>> +        \b
>> +    }{
>>               $cgi->a({-href => href(action=>"object", hash=>$1),
>>                                       -class => "text"}, $1);
>> -     }eg;
>> +     }egx;
>>
>>       return $line;
>>  }
>>
>
> Good work.
>
> I assume that you are using git-describe output in commit messages
> a lot, isn't it?

Yeah, and I got tired of gitweb not linking to any of the commits I
was referencing.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3] gitweb: Link to "git describe"'d commits in log messages
  2016-09-21 17:49     ` Jakub Narębski
@ 2016-09-21 18:01       ` Junio C Hamano
  0 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2016-09-21 18:01 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Ævar Arnfjörð Bjarmason, git

Jakub Narębski <jnareb@gmail.com> writes:

>> When I saw 2/3 I wondered about one thing and 3/3 shares the same,
>> which is that we only use regex match and do not validate for a
>> false match.  Would it be too expensive...
>
> It's a matter of balance between false positives (and unresolving
> links) and performance...

Yes, and that is why I asked a simple yes-or-no question.  Would it
be too expensive?  Your answer seems to be yes.

Have we measured?  Is that really a bottleneck?  Would it help to
update parse_commits to call a new command "gitweb--helper" that
produces the result of what git_print_log would have done to its
$log argument, for example?


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] gitweb: Link to 7-character SHA1SUMS in commit messages
  2016-09-21 16:26   ` Jakub Narębski
@ 2016-09-21 18:04     ` Ævar Arnfjörð Bjarmason
  2016-09-21 18:28       ` Jakub Narębski
  0 siblings, 1 reply; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2016-09-21 18:04 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Git, Junio C Hamano

On Wed, Sep 21, 2016 at 6:26 PM, Jakub Narębski <jnareb@gmail.com> wrote:

> P.S. I have reworking of commit message parsing and enhancement in my
> long, long and dated gitweb TODO list :-(

Anything specific you could share?

One thing that would be a lot faster in Perl is if we didn't have to
pass the log around as split-up lines and could just operate on it as
one big string.

It would make some code like git_print_log() a bit more complex /
fragile, since it would have to work on multi-line strings, but
anything that needed to do a regex match / replacement would be much
faster.

But OTOH I think perhaps we're worrying about nothing when it comes to
the performance. I haven't been able to make gitweb display more than
a 100 or so commits at a time (haven't found where exactly in the code
these limits are), any munging we do on the log messages would have to
be pretty damn slow to matter.

> P.P.S. Kay Sievers no longer works on gitweb, and I think no longer
> works at SuSE but at RedHat.

Yup, been getting bounces from his address.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] gitweb: Link to 7-character SHA1SUMS in commit messages
  2016-09-21 18:04     ` Ævar Arnfjörð Bjarmason
@ 2016-09-21 18:28       ` Jakub Narębski
  2016-09-21 20:09         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 17+ messages in thread
From: Jakub Narębski @ 2016-09-21 18:28 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Git, Junio C Hamano

W dniu 21.09.2016 o 20:04, Ævar Arnfjörð Bjarmason pisze:
> On Wed, Sep 21, 2016 at 6:26 PM, Jakub Narębski <jnareb@gmail.com> wrote:
> 
>> P.S. I have reworking of commit message parsing and enhancement in my
>> long, long and dated gitweb TODO list :-(
> 
> Anything specific you could share?

Some of TODO I would have to bring from backups, as the computer on
which I did majority of gitweb development has since died (from old
age).

The list includes:
- implement caching of gitweb output
- revamp handling of encoding (UTF-8 with fallback encoding)
- split gitweb into modules, while maintaining ease of install
- refactor handling of diffs
- better handling of config files
- document URI structure, perhaps revamp URI parsing and generation
- make commit message transformation generic
  (see below)

> 
> One thing that would be a lot faster in Perl is if we didn't have to
> pass the log around as split-up lines and could just operate on it as
> one big string.

Well, there are a few transformations that commit message undergoes
in gitweb, including linking SHA1, optional linking of bug numbers
to bug tracker, and syntax highlighting of signoff lines (trailer
lines).  

I would like to have this cleaned up, and refactored.  With all
those transformations we would need to keep account which parts
are HTML, and which not and need escaping (note: URI escape !=
HTML escape).

> 
> It would make some code like git_print_log() a bit more complex /
> fragile, since it would have to work on multi-line strings, but
> anything that needed to do a regex match / replacement would be much
> faster.

Would it?  Did you perform any synthetic micro-benchmark?

> 
> But OTOH I think perhaps we're worrying about nothing when it comes to
> the performance. I haven't been able to make gitweb display more than
> a 100 or so commits at a time (haven't found where exactly in the code
> these limits are), any munging we do on the log messages would have to
> be pretty damn slow to matter.

sub git_log_generic {

	# [...]

	my @commitlist =
		parse_commits($commit_hash, 101, (100 * $page),
		              defined $file_name ? ($file_name, "--full-history") : ());

Here you have it (it probably should be a constant; this number can be
found in a few other places).

Best,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3] gitweb: Link to "git describe"'d commits in log messages
  2016-09-21 17:58     ` Ævar Arnfjörð Bjarmason
@ 2016-09-21 19:13       ` Jakub Narębski
  0 siblings, 0 replies; 17+ messages in thread
From: Jakub Narębski @ 2016-09-21 19:13 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Git, Junio C Hamano

W dniu 21.09.2016 o 19:58, Ævar Arnfjörð Bjarmason pisze:
> On Wed, Sep 21, 2016 at 7:09 PM, Jakub Narębski <jnareb@gmail.com> wrote:
>> W dniu 21.09.2016 o 13:44, Ævar Arnfjörð Bjarmason napisał:

>>> +            (?<!-) # see strbuf_check_tag_ref(). Tags can't start with -
>>> +            [A-Za-z0-9.-]+
>>> +            (?!\.) # refs can't end with ".", see check_refname_format()
>>
>> If we can assume that tag name is at least two characters (instead of
>> at least one character), we could get rid of those extended regexp
>> lookaround assertions:
>>
>>   (?<!pattern) - zero-width negative lookbehind assertion
>>   (?!pattern)  - zero-width negative lookahead  assertion
>>
>> That is:
>>
>>   +            [A-Za-z0-9.]   # see strbuf_check_tag_ref(). Tags can't start with -
>>   +            [A-Za-z0-9.-]*
>>   +            [A-Za-z0-9-]   # refs can't end with ".", see check_refname_format()
> 
> Why get rid of them? I'm all for improving the regex, there's bound to
> be lots of bugs in it, but since it's perl we can freely use its
> extended features.

Ah, all right. I was wondering how zero width assertions / patterns
interact with each other, but zero-width negative lookaround assertions
are really quite simple.

> 
>> Also, the canonical documentation for what is allowed in refnames
>> is git-check-ref-format(1)... though it does not look like it includes
>> "tags cannot start with '-'".
> 
> Yeah, looks like that manpage needs to be patched.

Right.

> 
>> Anyway, perhaps 'is it valid refname' could be passed to a subroutine,
>> or a named regexp (which might be more involved, like disallowing two
>> consecutive dots, e.g. "(?!.*\.{2})" at beginning).

I wonder if rules for valid tag name can be described in extended
regexp, and if it is, how readable would it be.

Regards,
-- 
Jakub Narębski


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] gitweb: Link to 7-character SHA1SUMS in commit messages
  2016-09-21 18:28       ` Jakub Narębski
@ 2016-09-21 20:09         ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2016-09-21 20:09 UTC (permalink / raw)
  To: Jakub Narębski; +Cc: Git, Junio C Hamano

On Wed, Sep 21, 2016 at 8:28 PM, Jakub Narębski <jnareb@gmail.com> wrote:
> W dniu 21.09.2016 o 20:04, Ævar Arnfjörð Bjarmason pisze:
>> It would make some code like git_print_log() a bit more complex /
>> fragile, since it would have to work on multi-line strings, but
>> anything that needed to do a regex match / replacement would be much
>> faster.
>
> Would it?  Did you perform any synthetic micro-benchmark?

No, just experience. With the caveat that this may not matter at all
in this context, C-like code in Perl is slow, if you can offload
things to one big regex operation it's usually faster.

>>
>> But OTOH I think perhaps we're worrying about nothing when it comes to
>> the performance. I haven't been able to make gitweb display more than
>> a 100 or so commits at a time (haven't found where exactly in the code
>> these limits are), any munging we do on the log messages would have to
>> be pretty damn slow to matter.
>
> sub git_log_generic {
>
>         # [...]
>
>         my @commitlist =
>                 parse_commits($commit_hash, 101, (100 * $page),
>                               defined $file_name ? ($file_name, "--full-history") : ());
>
> Here you have it (it probably should be a constant; this number can be
> found in a few other places).

Thanks!

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-09-21 20:10 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-21 11:44 [PATCH 1/3] gitweb: Fix an ancient typo in v1.7.7-rc1-1-g0866786 Ævar Arnfjörð Bjarmason
2016-09-21 11:44 ` [PATCH 2/3] gitweb: Link to 7-character SHA1SUMS in commit messages Ævar Arnfjörð Bjarmason
2016-09-21 16:26   ` Jakub Narębski
2016-09-21 18:04     ` Ævar Arnfjörð Bjarmason
2016-09-21 18:28       ` Jakub Narębski
2016-09-21 20:09         ` Ævar Arnfjörð Bjarmason
2016-09-21 11:44 ` [PATCH 3/3] gitweb: Link to "git describe"'d commits in log messages Ævar Arnfjörð Bjarmason
2016-09-21 16:50   ` Junio C Hamano
2016-09-21 17:49     ` Jakub Narębski
2016-09-21 18:01       ` Junio C Hamano
2016-09-21 17:09   ` Jakub Narębski
2016-09-21 17:58     ` Ævar Arnfjörð Bjarmason
2016-09-21 19:13       ` Jakub Narębski
2016-09-21 13:33 ` [PATCH 1/3] gitweb: Fix an ancient typo in v1.7.7-rc1-1-g0866786 Jakub Narębski
2016-09-21 14:17   ` Ævar Arnfjörð Bjarmason
2016-09-21 17:14     ` Jakub Narębski
2016-09-21 17:17       ` Ævar Arnfjörð Bjarmason

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.